One-Line Summary
A novel Reasoning Feedback-based Learning (RFL) framework that harnesses detailed reasoning feedback from an external model to progressively refine training data, achieving 95.04% accuracy on conversational context inference — a 7.93%p gain over the baseline and 1.32%p over standard fine-tuning.
Background & Motivation
Conversational context inference — the task of understanding implicit information, speaker intentions, and situational dynamics from dialogue — is a fundamental challenge in building robust dialogue systems. While large language models (LLMs) have advanced dialogue understanding, they still falter on complex cases that require multi-step reasoning over conversation history.
Key Challenges with Existing Approaches:
- Performance plateau on hard examples: Standard fine-tuning methods improve overall accuracy but stall on difficult inference cases where surface-level pattern matching is insufficient.
- Limited learning signal: Multiple-choice question (MCQ) fine-tuning teaches the model what the correct answer is, but not why — the reasoning path remains opaque.
- No targeted error correction: Conventional training treats all examples equally rather than concentrating effort on the most challenging instances where the model repeatedly fails.
These limitations motivate RFL: a framework that supplements standard training with structured reasoning feedback from a stronger external model, explicitly teaching the target model why its incorrect predictions are wrong and guiding it toward the correct reasoning path.
Proposed Method: Reasoning Feedback-Based Learning (RFL)
RFL is a three-stage iterative framework that leverages an external, more capable model to generate detailed reasoning feedback. This feedback is used to progressively refine training data so the target model can overcome its most persistent errors.
1. Initial Fine-Tuning & Error Collection
The target model is first fine-tuned on the conversational context inference task using standard MCQ training. After training, it is evaluated on the training set to identify incorrectly predicted instances — the hard cases where the model fails despite standard supervision.
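As a rough illustration, the error-collection step could look like the following Python sketch. This is a minimal sketch, not the paper's implementation; `predict_fn` stands in for the fine-tuned target model's MCQ predictor and the field names are assumptions.

```python
from typing import Callable, Dict, List


def collect_hard_cases(
    predict_fn: Callable[[Dict], str],  # fine-tuned target model's MCQ predictor (placeholder)
    train_set: List[Dict],
) -> List[Dict]:
    """Re-run the fine-tuned model on its own training set and keep only the
    instances it still predicts incorrectly, along with the wrong prediction."""
    hard_cases = []
    for ex in train_set:
        pred = predict_fn(ex)  # predicted option label, e.g. "A".."D"
        if pred != ex["answer"]:
            # Keep the wrong prediction so the external model can diagnose it later.
            hard_cases.append({**ex, "wrong_prediction": pred})
    return hard_cases
```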
2. Reasoning Feedback Generation
An external, more powerful model receives each incorrectly predicted instance along with the target model's wrong answer. It generates detailed reasoning feedback explaining: (a) why the chosen answer is incorrect, (b) what the correct reasoning chain should be, and (c) the correct answer with justification. This transforms bare labels into rich, explanatory training signal.
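A possible prompt for this step is sketched below. The paper does not publish its exact prompt or the external model's API, so the wording is illustrative and `call_external_model` is a placeholder for whatever interface the stronger model exposes.

```python
# Illustrative only: prompt wording and the external model's API are assumptions.
FEEDBACK_PROMPT = """You are given a dialogue, its answer options, the correct answer,
and a smaller model's wrong prediction.

Dialogue:
{dialogue}

Options:
{options}

Wrong prediction: {wrong_prediction}
Correct answer: {answer}

Explain (a) why the wrong prediction is incorrect, (b) the step-by-step reasoning
that leads from the dialogue to the correct answer, and (c) the correct answer with
a short justification grounded in the conversation."""


def generate_feedback(hard_case: dict, call_external_model) -> str:
    """Query the stronger external model for structured reasoning feedback."""
    return call_external_model(FEEDBACK_PROMPT.format(**hard_case))
```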
3. Progressive Data Refinement & Re-Training
The reasoning feedback is incorporated into the training data, replacing or augmenting the original instances for the hard cases. The target model is then re-trained on this progressively refined dataset, concentrating learning effort on the most challenging examples. This cycle can be repeated to achieve iterative improvement.
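One plausible way to realize the refinement step is sketched below, assuming each instance carries an `id` field and that the feedback replaces the original target for hard cases; the paper leaves the replace-vs-augment choice open.

```python
def refine_training_data(
    train_set: list, hard_cases: list, feedback_by_id: dict
) -> list:
    """Build the refined dataset: hard cases get the reasoning feedback as their
    training target, easy cases keep the plain answer label.

    Sketch under assumptions: each instance has an "id" field, and feedback
    replaces (rather than augments) the original target for hard cases."""
    hard_ids = {ex["id"] for ex in hard_cases}
    refined = []
    for ex in train_set:
        if ex["id"] in hard_ids:
            # Hard case: the target becomes the full reasoning trace
            # (error diagnosis + correct reasoning chain + justified answer).
            refined.append({**ex, "target": feedback_by_id[ex["id"]]})
        else:
            # Easy case: the target stays the bare answer label.
            refined.append({**ex, "target": f"Answer: {ex['answer']}"})
    return refined
```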
How Reasoning Feedback Differs from Standard Labels:
In standard MCQ fine-tuning, the training signal is simply the correct answer label (e.g., "Answer: B"). RFL replaces this with a structured reasoning trace that includes three components:
- Error diagnosis: An explicit explanation of why the target model's predicted answer is wrong, pinpointing the faulty reasoning step.
- Correct reasoning chain: A step-by-step reasoning path that connects dialogue context clues to the correct answer, making the implicit inference process explicit.
- Justified answer: The correct answer accompanied by a natural language justification grounded in the conversation, reinforcing the causal link between evidence and conclusion.
This rich signal transforms each error into a targeted learning opportunity, teaching the model not just to memorize answers but to internalize the reasoning patterns behind them.
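A hypothetical refined instance might look as follows; the dialogue, options, and wording are invented purely to illustrate the three-part structure of the feedback target.

```python
# Invented example for illustration; not taken from the paper's data.
refined_example = {
    "dialogue": (
        "A: I waited an hour at the station.\n"
        "B: Oh no, did the 6pm train get cancelled again?"
    ),
    "options": {
        "A": "B caused the delay",
        "B": "The train was likely delayed or cancelled",
    },
    "target": (
        "Error diagnosis: Option A is wrong because nothing in the dialogue "
        "suggests B caused the delay; B is only reacting to A's long wait.\n"
        "Reasoning chain: A waited a long time at the station -> B asks whether "
        "the 6pm train was cancelled *again* -> a recurring cancellation is the "
        "most plausible cause of the wait.\n"
        "Answer: B. The conversation implies the train was likely delayed or "
        "cancelled, which explains A's long wait."
    ),
}
```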
Experimental Setup
RFL is evaluated on the conversational context inference task, which frames dialogue understanding as a multiple-choice problem: given a conversation history, the model must select the correct inference about implicit information, speaker intent, or situational context from a set of candidates.
| Component | Details |
| --- | --- |
| Task | Conversational Context Inference (MCQ format) |
| Target Model | LLM fine-tuned on dialogue inference data |
| External Feedback Model | Larger, more capable LLM used to generate reasoning feedback |
| Metric | Accuracy (%) |
| Training Strategy | Iterative: fine-tune → collect errors → generate feedback → retrain |
The key design choice is the progressive refinement loop: after each training round, only the instances the model still gets wrong are sent to the external model for feedback, ensuring that computational resources are concentrated where they matter most.
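Putting the stages together, the overall loop could be sketched as below, reusing the helper functions from the earlier sketches; `fine_tune` and `predict` are placeholders for whatever training and inference routines are actually used, and the number of rounds is a hyperparameter rather than a value reported in the paper.

```python
def rfl_training_loop(
    base_model,
    train_set: list,
    fine_tune,            # supervised fine-tuning routine (placeholder)
    predict,              # MCQ predictor: (model, example) -> option label (placeholder)
    call_external_model,  # interface to the stronger feedback model (placeholder)
    num_rounds: int = 2,  # refinement rounds; assumed hyperparameter
):
    """End-to-end sketch of the RFL refinement loop, reusing the helpers above."""
    model = fine_tune(base_model, train_set)              # Stage 1: initial MCQ fine-tuning
    for _ in range(num_rounds):
        hard_cases = collect_hard_cases(lambda ex: predict(model, ex), train_set)
        if not hard_cases:                                # error pool exhausted
            break
        feedback = {                                      # Stage 2: reasoning feedback
            ex["id"]: generate_feedback(ex, call_external_model) for ex in hard_cases
        }
        train_set = refine_training_data(train_set, hard_cases, feedback)
        model = fine_tune(model, train_set)               # Stage 3: retrain on refined data
    return model
```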
Experimental Results
Main Results
RFL is compared against a baseline model (without fine-tuning) and standard MCQ fine-tuning (without reasoning feedback).
| Method | Accuracy (%) | Improvement over Baseline |
| --- | --- | --- |
| Baseline | 87.11 | — |
| MCQ Fine-Tuning | 93.72 | +6.61%p |
| RFL (Proposed) | 95.04 | +7.93%p |
Where Does the Gain Come From?
Decomposing the RFL Improvement:
- Standard fine-tuning contribution (+6.61%p): MCQ-based training captures the majority of learnable patterns, lifting accuracy from 87.11% to 93.72%. This addresses the "easy" and "medium" difficulty examples where pattern matching suffices.
- Reasoning feedback contribution (+1.32%p): The additional gain from 93.72% to 95.04% comes entirely from hard cases that standard fine-tuning fails to solve. Though smaller in absolute terms, this gain is especially significant because it targets the most challenging tail of the distribution — cases that resist conventional training.
Analysis: Impact on Error Categories
- 7.93%p over baseline: RFL achieves 95.04% accuracy, substantially outperforming the baseline model by nearly 8 percentage points.
- 1.32%p over MCQ fine-tuning: Beyond standard fine-tuning, the additional reasoning feedback provides a meaningful further improvement, demonstrating the value of teaching models why answers are correct, not just what is correct.
- Effective on hard cases: The largest gains come from previously intractable examples where standard training repeatedly failed — detailed external feedback successfully addresses these challenging inference instances.
- Progressive refinement works: By iteratively focusing on error cases and enriching the training signal, RFL breaks through the performance ceiling that conventional methods hit on difficult examples.
- Diminishing error pool: Each refinement iteration reduces the number of error instances available for feedback generation, indicating genuine learning rather than overfitting — the model progressively internalizes the reasoning patterns taught by the external model.
Why It Matters
This work demonstrates that structured reasoning feedback from external models can serve as a powerful training signal for improving dialogue understanding. The contributions extend beyond the specific task:
- Breaking the fine-tuning plateau: RFL provides a principled method to move past the performance ceiling that standard fine-tuning hits on hard examples, by explicitly teaching reasoning rather than just answers.
- General framework: The progressive refinement strategy — identify errors, generate reasoning feedback, retrain — is task-agnostic and applicable to any NLP task where reasoning quality matters, including commonsense reasoning, reading comprehension, and relation extraction.
- Efficient use of stronger models: Rather than deploying a large model at inference time, RFL uses it only during training to generate feedback, keeping inference costs low while capturing the reasoning capabilities of stronger models in a smaller target model.
- Knowledge distillation through reasoning: Unlike traditional knowledge distillation which transfers soft probability distributions, RFL distills reasoning processes from a stronger teacher to a weaker student, offering a more interpretable and targeted form of knowledge transfer.