A training-free framework that prompts RL-trained reasoning models to self-assess their confidence at intermediate steps, enabling adaptive early stopping that substantially reduces reasoning length without sacrificing accuracy.
Large reasoning models (LRMs) such as OpenAI's o1 and DeepSeek-R1, trained with reinforcement learning, produce long chains of thought involving reflection, verification, and backtracking. These extended traces enable strong performance on complex tasks, but poorly calibrated reasoning length causes two failure modes: underthinking (terminating prematurely, before reaching a correct answer) and overthinking (continuing past the point of correctness, which wastes computation and can even degrade accuracy).
Prior approaches to controlling reasoning length rely on internal logit-derived signals (e.g., DEER), batches of sampled candidate answers, or other memory-intensive techniques, all of which demand internal access or extra computation that closed-source models do not expose. Can we instead leverage the model's own self-assessed confidence to adaptively control reasoning length?
Key Insight: RL-trained reasoning models are capable of introspectively evaluating their own confidence with reasonable reliability. By prompting them to explicitly express confidence at intermediate steps, we can dynamically decide when to stop reasoning — without any additional training, access to internal logits, or external sampling.
The method introduces a self-assessment mechanism that prompts the model to evaluate its own confidence at natural decision points during reasoning. A system prompt instructs the model to output a structured confidence label on a 10-class scale ranging from "Almost no chance" (0.0–0.1) to "Almost certain" (0.9–1.0).
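As a concrete illustration, a system prompt in this spirit might look as follows. This is a reconstruction: the two endpoint labels and their probability bins come from the description above, while the surrounding wording is a plausible stand-in rather than the paper's exact prompt.

```python
# Illustrative system prompt for eliciting a structured confidence label.
# Only the two endpoint labels are given in the paper; the rest of the
# wording here is a hypothetical reconstruction.
SELF_ASSESS_PROMPT = (
    "During your reasoning, when asked, rate how confident you are that "
    "your current line of reasoning leads to the correct final answer. "
    "Respond with a single label of the form \\confidence{<level>}, where "
    "<level> is one of ten classes ranging from 'Almost no chance' "
    "(0.0-0.1) to 'Almost certain' (0.9-1.0)."
)
```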
At each assessment point, decoding is constrained to produce only the \confidence{...} label, preventing further reasoning during estimation. If the expressed confidence clears a preset threshold, a </think> tag is inserted to terminate reasoning and proceed to the final answer; otherwise, a continuation cue ("Wait") is appended and reasoning briefly continues.
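A minimal sketch of this control loop is shown below. It assumes a hypothetical `model.generate_until` helper that decodes from a given context until a stop string is emitted; the threshold value, the decision-point heuristic, and the label-to-probability mapping are illustrative, not the paper's exact settings.

```python
# Map verbal confidence labels to numeric midpoints. Only the two endpoint
# labels appear above; a full 10-entry table would fill in the
# intermediate classes analogously.
LABEL_TO_PROB = {
    "Almost no chance": 0.05,
    "Almost certain": 0.95,
}

CONF_THRESHOLD = 0.85  # illustrative value; not the paper's tuned setting


def reason_with_early_stop(model, question, max_rounds=8):
    """Sketch of the adaptive early-stopping loop. `model.generate_until`
    is a hypothetical helper that continues decoding from the given
    context until one of the stop strings is produced."""
    trace = "<think>\n"
    for _ in range(max_rounds):
        # Reason until a natural decision point (here: a blank line).
        trace += model.generate_until(question, trace, stop=["\n\n"])
        # Constrain the next tokens to a \confidence{...} label only,
        # so no extra reasoning happens during estimation.
        label = model.generate_until(
            question, trace + "\\confidence{", stop=["}"]
        ).strip()
        if LABEL_TO_PROB.get(label, 0.5) >= CONF_THRESHOLD:
            break  # confident enough: stop reasoning early
        trace += "\nWait"  # continuation cue: keep thinking
    # Close the reasoning block and produce the final answer.
    trace += "\n</think>\n"
    return trace + model.generate_until(question, trace, stop=None)
```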
The method is evaluated on three RL-trained LRMs (QwQ-32B, Qwen3-32B, R1-Distill-Qwen-32B) across five benchmarks: MATH-500, AIME25, AIME24, AMC23, and GPQA Diamond. Baselines include Vanilla (no early stopping) and DEER (logit-based confidence).
| Model / Method | Avg. Accuracy (↑) | Avg. Length (↓) |
|---|---|---|
| **QwQ-32B** | | |
| Vanilla | 76.75 | 9,260 |
| DEER | 77.48 | 7,529 |
| Ours | 79.17 | 7,512 |
| **Qwen3-32B** | | |
| Vanilla | 80.32 | 8,148 |
| DEER | 78.78 | 5,630 |
| Ours | 81.25 | 6,624 |
| **R1-Distill-32B** | | |
| Vanilla | 66.75 | 7,057 |
| DEER | 68.24 | 5,778 |
| Ours | 67.37 | 5,387 |
This work demonstrates that RL-trained reasoning models possess a meaningful capacity for introspective confidence estimation that can be harnessed through simple prompting. Unlike methods requiring access to internal logits or multiple sampled trajectories, this approach is training-free, model-agnostic, and compatible with closed-source APIs. The finding that self-assessed confidence aligns with internal signals further supports the hypothesis that reasoning-oriented models develop genuine self-monitoring capabilities through RL optimization — a property largely absent in standard instruction-tuned LLMs. As reasoning models become central to production systems, this lightweight mechanism offers a practical path toward reducing inference costs while preserving or even improving accuracy.