A training-free framework that prompts RL-trained reasoning models to self-assess their confidence at intermediate steps, enabling adaptive early stopping that substantially reduces reasoning length without sacrificing accuracy.
Large reasoning models (LRMs) such as OpenAI's o1 and DeepSeek-R1, trained with reinforcement learning, produce long chains of thought involving reflection, verification, and backtracking. These extended traces enable strong performance on complex tasks, but poorly calibrated reasoning length causes two failure modes: underthinking (terminating prematurely, before reaching a correct answer) and overthinking (continuing past the point of correctness, which wastes computation and can even degrade accuracy).
Prior approaches to controlling reasoning length rely on internal logit-derived signals (e.g., DEER), batches of sampled candidate answers, or other memory-intensive techniques, all of which demand internal access or extra computation that closed-source models do not expose. Can we instead leverage the model's own self-assessed confidence to adaptively control reasoning length?
Key Insight: RL-trained reasoning models are capable of introspectively evaluating their own confidence with reasonable reliability. By prompting them to explicitly express confidence at intermediate steps, we can dynamically decide when to stop reasoning — without any additional training, access to internal logits, or external sampling.
The method introduces a self-assessment mechanism that prompts the model to evaluate its own confidence at natural decision points during reasoning. A system prompt instructs the model to output a structured confidence label on a 10-class scale ranging from "Almost no chance" (0.0–0.1) to "Almost certain" (0.9–1.0).
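As a concrete illustration, a system prompt in this spirit might look as follows. This is a reconstruction: the two endpoint labels and their probability bins come from the description above, while the surrounding wording is a plausible stand-in rather than the paper's exact prompt.

```python
# Illustrative system prompt for eliciting a structured confidence label.
# Only the two endpoint labels are given in the paper; the rest of the
# wording here is a hypothetical reconstruction.
SELF_ASSESS_PROMPT = (
    "During your reasoning, when asked, rate how confident you are that "
    "your current line of reasoning leads to the correct final answer. "
    "Respond with a single label of the form \\confidence{<level>}, where "
    "<level> is one of ten classes ranging from 'Almost no chance' "
    "(0.0-0.1) to 'Almost certain' (0.9-1.0)."
)
```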
At each assessment point, decoding is constrained to produce only the \confidence{...} label, preventing further reasoning during estimation. If the expressed confidence clears a preset threshold, a </think> tag is inserted to terminate reasoning and proceed to the final answer; otherwise, a continuation cue ("Wait") is appended and reasoning briefly continues.
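A minimal sketch of this control loop is shown below. It assumes a hypothetical `model.generate_until` helper that decodes from a given context until a stop string is emitted; the threshold value, the decision-point heuristic, and the label-to-probability mapping are illustrative, not the paper's exact settings.

```python
# Map verbal confidence labels to numeric midpoints. Only the two endpoint
# labels appear above; a full 10-entry table would fill in the
# intermediate classes analogously.
LABEL_TO_PROB = {
    "Almost no chance": 0.05,
    "Almost certain": 0.95,
}

CONF_THRESHOLD = 0.85  # illustrative value; not the paper's tuned setting


def reason_with_early_stop(model, question, max_rounds=8):
    """Sketch of the adaptive early-stopping loop. `model.generate_until`
    is a hypothetical helper that continues decoding from the given
    context until one of the stop strings is produced."""
    trace = "<think>\n"
    for _ in range(max_rounds):
        # Reason until a natural decision point (here: a blank line).
        trace += model.generate_until(question, trace, stop=["\n\n"])
        # Constrain the next tokens to a \confidence{...} label only,
        # so no extra reasoning happens during estimation.
        label = model.generate_until(
            question, trace + "\\confidence{", stop=["}"]
        ).strip()
        if LABEL_TO_PROB.get(label, 0.5) >= CONF_THRESHOLD:
            break  # confident enough: stop reasoning early
        trace += "\nWait"  # continuation cue: keep thinking
    # Close the reasoning block and produce the final answer.
    trace += "\n</think>\n"
    return trace + model.generate_until(question, trace, stop=None)
```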
The method is evaluated on three RL-trained LRMs (QwQ-32B, Qwen3-32B, R1-Distill-Qwen-32B) across five benchmarks: MATH-500, AIME25, AIME24, AMC23, and GPQA Diamond. Baselines include Vanilla (no early stopping) and DEER (logit-based confidence).
| Model / Method | Avg. Accuracy (↑) | Avg. Length (↓) |
|---|---|---|
| **QwQ-32B** | | |
| Vanilla | 76.75 | 9,260 |
| DEER | 77.48 | 7,529 |
| Ours | 79.17 | 7,512 |
| **Qwen3-32B** | | |
| Vanilla | 80.32 | 8,148 |
| DEER | 78.78 | 5,630 |
| Ours | 81.25 | 6,624 |
| **R1-Distill-32B** | | |
| Vanilla | 66.75 | 7,057 |
| DEER | 68.24 | 5,778 |
| Ours | 67.37 | 5,387 |
This work demonstrates that RL-trained reasoning models possess a meaningful capacity for introspective confidence estimation that can be harnessed through simple prompting. Unlike methods requiring access to internal logits or multiple sampled trajectories, this approach is training-free, model-agnostic, and compatible with closed-source APIs. The finding that self-assessed confidence aligns with internal signals further supports the hypothesis that reasoning-oriented models develop genuine self-monitoring capabilities through RL optimization — a property largely absent in standard instruction-tuned LLMs. As reasoning models become central to production systems, this lightweight mechanism offers a practical path toward reducing inference costs while preserving or even improving accuracy.