
ADVICE: Answer-Dependent Verbalized Confidence Estimation

ACL 2026
Ki Jung Seo, Sehun Lim, Taeuk Kim

One-Line Summary

We discover that LLM overconfidence stems from "answer-independence" -- where confidence verbalization is internally decoupled from the model's own answer -- and propose ADVICE, a contrastive fine-tuning framework that grounds confidence estimation in the actual answer, achieving substantial calibration improvements (e.g., ECE from 16.9 to 10.4 on Llama-3.1-8B) with strong out-of-distribution generalization.

Figure 1. LLMs express excessive confidence regardless of whether their answers are correct or incorrect (left). ADVICE learns to generate appropriate confidence scores grounded in the actual answer (right).

Background & Motivation

Large language models inevitably generate factually inaccurate content (hallucinations), and while eliminating such errors entirely may be theoretically impossible, a promising mitigation strategy is to have LLMs provide confidence estimates alongside their answers. Verbalized confidence -- where models express confidence levels in natural language -- is particularly attractive because it is universally applicable and user-friendly, requiring no access to internal model states.

However, a well-known and critical issue hinders broader application: systematic overconfidence, where models assign high confidence irrespective of output quality. Rather than simply mitigating overconfidence after the fact, this work asks a deeper question: why does overconfidence arise in the first place?

Key Finding: Through analysis of the model's intermediate computations, we find that LLM-generated answers and confidence verbalization are internally decoupled -- a phenomenon we term "answer-independence." Specifically:

  • Distribution analysis: The Jensen-Shannon divergence (JSD) between confidence distributions conditioned on correct vs. incorrect answers concentrates strongly near zero (JSD ≤ 0.1 for the vast majority of samples), showing that models produce nearly identical confidence regardless of the answer.
  • Attention Rollout analysis: Attention flow from confidence tokens to answer tokens is significantly lower than attention to question tokens, indicating models rely less on answer-specific information when generating confidence.
  • Integrated Gradients analysis: Token attribution reveals that answer tokens are consistently under-weighted compared to tokens in other components (e.g., question, instruction), further confirming the decoupling.
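The distributional analysis above can be reproduced in miniature. A self-contained sketch (the binned histograms below are hypothetical, not the paper's data):

```python
import math

def jsd(p, q, base=2.0):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi, base) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical binned confidence histograms, conditioned on whether the
# model's answer was correct or incorrect.  An answer-independent model
# produces (near-)identical histograms, so the JSD collapses to zero.
p_correct   = [0.0, 0.0, 0.05, 0.05, 0.1, 0.2, 0.3, 0.3]
p_incorrect = [0.0, 0.0, 0.05, 0.05, 0.1, 0.2, 0.3, 0.3]
print(jsd(p_correct, p_incorrect))  # -> 0.0
```

A well-calibrated, answer-dependent model would instead yield clearly separated histograms and hence a JSD bounded away from zero.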
Figure 2. JSD scores between confidence distributions for correct vs. incorrect answers concentrate near zero with long right tails, demonstrating that models express confidence independently of their answers.
Figure 3. Attention Rollout analysis shows that confidence tokens pay significantly less attention to answer tokens compared to question tokens, confirming the internal decoupling.
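Attention Rollout scores of the kind plotted above are typically computed by propagating attention across layers; a minimal sketch following the standard rollout recipe (not the authors' code):

```python
import numpy as np

def attention_rollout(attentions):
    """Standard attention-rollout recipe: average attention over heads, add
    an identity term for the residual connection, renormalize rows, and
    multiply across layers to estimate end-to-end attention flow."""
    rollout = None
    for layer_attn in attentions:              # each: (heads, seq, seq)
        a = layer_attn.mean(axis=0)            # average over heads
        a = a + np.eye(a.shape[0])             # account for residual stream
        a = a / a.sum(axis=-1, keepdims=True)  # rows sum to 1 again
        rollout = a if rollout is None else a @ rollout
    return rollout

# Flow from a confidence token (its row of the rollout matrix) can then be
# summed separately over answer-token and question-token columns and compared.
```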

Proposed Method: ADVICE Framework

ADVICE (Answer-Dependent VerbalIzed Confidence Estimation) is a lightweight fine-tuning framework that explicitly promotes answer-grounded confidence estimation. The key insight is to teach the model, through contrastive training on correct/incorrect answer pairs, that confidence should fundamentally differ depending on the answer's correctness.

Figure 4. Overall architecture of the ADVICE framework. For each question, the model processes both a correct and an incorrect answer, and the four loss objectives work together to separate and calibrate the resulting confidence distributions.
Step 1: Training Data Construction
Source 4,000 instances from the TriviaQA training split, retaining only those where the model generates the correct answer under greedy decoding. For each instance, construct a triplet: (question, correct answer, randomly sampled incorrect answer). Two variants per instance are created to train fluent expression across multiple confidence formats (ScoreLetter and ScoreNumber).
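The triplet construction can be sketched as follows (a hypothetical illustration: the data schema, helper names, and distractor source are assumptions, not the authors' pipeline):

```python
import random

def build_triplets(instances, model_answer, n=4000, seed=0):
    """Construct (question, correct answer, incorrect answer) triplets.

    `instances` is an iterable of dicts with a 'question', gold 'answers',
    and candidate 'distractors'; `model_answer` returns the model's
    greedy-decoded answer.  All field and function names are illustrative.
    """
    rng = random.Random(seed)
    triplets = []
    for ex in instances:
        pred = model_answer(ex["question"])
        if pred not in ex["answers"]:          # keep only greedily-correct cases
            continue
        wrong = rng.choice(ex["distractors"])  # randomly sampled incorrect answer
        triplets.append((ex["question"], pred, wrong))
        if len(triplets) >= n:
            break
    return triplets
```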
Step 2: Multi-Objective Contrastive Training
Fine-tune the model using four complementary loss objectives that jointly enforce answer-dependent confidence. The model processes the same question with both the correct and incorrect answer, and the losses drive the confidence distributions apart in a principled manner.
Step 3: Answer-Grounded Inference
At inference time, the model generates both an answer and a confidence score in a single pass. Because training has instilled answer-awareness, the model now naturally conditions its confidence on the generated answer, producing well-calibrated estimates.
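Since training bakes the confidence format into the generation itself, inference reduces to parsing answer and score out of a single pass. A minimal sketch (the output format and regex are assumptions, loosely matching a ScoreNumber-style response):

```python
import re

def parse_answer_and_confidence(generation):
    """Parse a single-pass generation of the assumed form
    'Answer: <text> Confidence: <0-100>' into (answer, confidence in [0, 1]).
    The format is illustrative; the paper also trains a letter-grade variant."""
    m = re.search(r"Answer:\s*(.+?)\s*Confidence:\s*(\d+)", generation, re.S)
    if not m:
        return None, None
    return m.group(1), int(m.group(2)) / 100.0
```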

The four loss objectives play complementary roles; the ablation study below isolates three of them -- the language-modeling loss L_LM, the distribution-separation loss L_JSD, and the margin loss L_Margin -- showing that each contributes to calibration.
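As a rough sketch of how a contrastive objective can penalize answer-independent confidence (an illustrative hinge-style term, not the paper's exact formulation; the values are made up):

```python
def margin_loss(conf_correct, conf_incorrect, margin=0.2):
    """Hinge-style contrastive term (illustrative): zero loss once confidence
    on the correct answer exceeds confidence on the incorrect answer by at
    least `margin`, and a linear penalty otherwise."""
    return max(0.0, margin - (conf_correct - conf_incorrect))

# Well-separated confidences incur no loss...
print(margin_loss(0.9, 0.4))   # -> 0.0
# ...while near-identical (answer-independent) confidences are penalized.
print(margin_loss(0.9, 0.85))
```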

Experimental Results

Experiments are conducted on three models (Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, Gemma-2-9b-it) across in-domain (TriviaQA) and out-of-distribution (MMLU, LogiQA) benchmarks, using four metrics: Expected Calibration Error (ECE), Absolute Net Calibration Error (|NCE|), Brier Score (BS), and AUROC.
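Of these metrics, ECE can be sketched in a few lines (an illustrative equal-width-binning implementation; the paper's exact binning choices are not reproduced here):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: partition predictions into equal-width confidence bins, then take
    the weighted average of |bin accuracy - bin mean confidence|."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / total) * abs(acc - avg_conf)
    return ece

# A model that claims 95% confidence but is always wrong has ECE near 0.95.
print(expected_calibration_error([0.95, 0.95], [0, 0]))
```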

In-Domain Results (TriviaQA)

Model          Method              ECE ↓   |NCE| ↓   BS ↓    AUROC ↑
Llama-3.1-8B   Default             16.9    16.6      21.2    56.2
Llama-3.1-8B   Self-Consistency    15.7    –         –       58.6
Llama-3.1-8B   ConfTuner           5.2     1.1       15.3    66.3
Llama-3.1-8B   ADVICE              10.4    9.8       14.8    77.0
Llama-3.1-8B   ADVICE + ConfTuner  9.4     –         –       77.9

Out-of-Distribution Results

Dataset   Method      ECE ↓   AUROC ↑
MMLU      Default     26.9    –
MMLU      ConfTuner   13.9    –
MMLU      ADVICE      8.6     69.2
LogiQA    Default     53.8    –
LogiQA    ConfTuner   28.6    –
LogiQA    ADVICE      23.0    57.9
Figure 5. Reliability diagram: After applying ADVICE, model confidence is calibrated much closer to actual accuracy, shifting the bars toward the ideal diagonal.

Ablation Study (Gemma-2-9b on TriviaQA)

Configuration                  ECE ↓
L_LM only                      23.0
L_LM + L_JSD                   8.6
L_LM + L_Margin                16.8
Full ADVICE (all objectives)   6.2

Why It Matters

Knowing whether an AI is truly confident when it says "I'm sure" is critically important. In high-stakes domains such as healthcare, law, and finance, LLM overconfidence can lead to serious real-world consequences -- users may trust incorrect outputs without questioning them.

This work makes three key contributions beyond prior approaches: it identifies answer-independence as a root cause of verbalized overconfidence through distributional and attribution analyses; it introduces ADVICE, a lightweight contrastive fine-tuning framework that grounds confidence in the generated answer; and it demonstrates calibration improvements that generalize to out-of-distribution benchmarks.
