When to Speak, When to Abstain: Contrastive Decoding with Abstention

ACL 2025
Hyuhng Joon Kim, Youna Kim, Sang-goo Lee, Taeuk Kim

One-Line Summary

A training-free contrastive decoding method that dynamically blends parametric, contextual, and abstention distributions so that LLMs answer when they have relevant knowledge and gracefully abstain when they do not.

Figure 1. Four knowledge-access scenarios: depending on whether parametric knowledge (from pre-training) and contextual knowledge (from retrieval) are available, the model should either answer or abstain.

Background & Motivation

LLMs acquire broad parametric knowledge during pre-training, yet inevitably lack information about underrepresented or rapidly evolving topics. Retrieval-Augmented Generation (RAG) supplements this with external contextual knowledge, but sometimes neither source contains the answer. Forcing the model to respond in such cases produces confident-sounding hallucinations, which is especially dangerous in high-stakes domains.

The Missing Scenario: Prior work on contrastive decoding (e.g., Context-Aware Decoding, Adaptive Contrastive Decoding) only handles cases where at least one knowledge source is relevant. None of these methods address the critical fourth scenario, in which both parametric and contextual knowledge are absent; as a result they never abstain and achieve near-zero F1_abs scores.

The authors identify four distinct scenarios: (1) only parametric knowledge available, (2) only contextual knowledge available, (3) both available, and (4) neither available. They first construct a controlled testbed that explicitly labels each scenario, then propose Contrastive Decoding with Abstention (CDA)—a training-free method that robustly handles all four.

Proposed Method: Contrastive Decoding with Abstention (CDA)

Figure 2. Testbed construction pipeline: starting from MRQA datasets, parametric and contextual knowledge availability is assessed to create balanced evaluation sets across all four scenarios.
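To make the scenario taxonomy concrete, the following minimal Python sketch routes a testbed instance to one of the four scenarios. The two boolean checks (closed-book correctness as a proxy for parametric knowledge, answer presence in the retrieved passage as a proxy for contextual knowledge) are illustrative assumptions, not the paper's exact availability tests.

def label_scenario(closed_book_correct: bool, context_has_answer: bool) -> str:
    """Map knowledge availability to one of the four scenarios in Figure 1.

    closed_book_correct: hypothetical check that the model answers correctly
        without any retrieved context (parametric knowledge available).
    context_has_answer: hypothetical check that the retrieved passage
        contains the gold answer (contextual knowledge available).
    """
    if closed_book_correct and context_has_answer:
        return "scenario 3: both available -> answer"
    if closed_book_correct:
        return "scenario 1: parametric only -> answer"
    if context_has_answer:
        return "scenario 2: contextual only -> answer"
    return "scenario 4: neither available -> abstain"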
1. Three-Way Distribution Blending. Standard two-way contrastive decoding (parametric + contextual) is extended to a three-way formulation: d_o = w_p·d_p + w_c·d_c + (1 − w_p − w_c)·d_a, where d_a is an explicit abstention distribution obtained by prompting the model to decline to answer.

2. Calibrated Uncertainty Estimation. Raw entropy values are not directly comparable across different prompts. CDA therefore computes "content-free" null distributions from placeholder inputs and calibrates confidence as the relative entropy reduction r_p = max(H̄_p − H_p, 0) / H̄_p, where H_p is the entropy of the parametric distribution and H̄_p that of its null counterpart. This puts the parametric and contextual knowledge signals on a common scale.

3. Dynamic Weight Normalization. The calibrated confidence ratios are normalized so that w_p + w_c ≤ 1. When both confidences are low, the residual weight 1 − w_p − w_c naturally flows to the abstention distribution, triggering the model to decline.

4. Momentum Stabilization (CDA-m). Previously decoded tokens can unintentionally steer the decoding weights. CDA-m applies an exponential moving average, w_t ← α·w_{t−1} + (1 − α)·w_t, smoothing abrupt fluctuations across decoding steps. A minimal end-to-end sketch of one decoding step follows this list.
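To tie the four components together, here is a minimal NumPy sketch of one CDA-m decoding step. The three next-token distributions d_p, d_c, d_a (from the parametric-only, context-augmented, and abstention prompts) and the null entropies h_null_p, h_null_c (from the content-free runs) are assumed to be given; the normalization scheme is one plausible reading of step 3, not the authors' reference implementation.

import numpy as np

def entropy(p):
    """Shannon entropy of a next-token distribution (natural log)."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def calibrated_confidence(d, h_null):
    # Relative entropy reduction versus the content-free null run:
    # r = max(H_null - H(d), 0) / H_null, in [0, 1].
    return max(h_null - entropy(d), 0.0) / h_null

def cda_step(d_p, d_c, d_a, h_null_p, h_null_c, w_prev=None, alpha=0.5):
    """One CDA-m decoding step over three pre-computed distributions."""
    r_p = calibrated_confidence(d_p, h_null_p)
    r_c = calibrated_confidence(d_c, h_null_c)

    # One plausible normalization keeping w_p + w_c <= 1 (an assumption;
    # the paper's exact scheme may differ): scale down only when needed.
    denom = max(r_p + r_c, 1.0)
    w_p, w_c = r_p / denom, r_c / denom

    if w_prev is not None:
        # Momentum stabilization (CDA-m): EMA across decoding steps.
        w_p = alpha * w_prev[0] + (1.0 - alpha) * w_p
        w_c = alpha * w_prev[1] + (1.0 - alpha) * w_c

    # Three-way blend; residual mass goes to the abstention distribution.
    d_o = w_p * d_p + w_c * d_c + (1.0 - w_p - w_c) * d_a
    return d_o / d_o.sum(), (w_p, w_c)

Note the limiting behavior: when both entropies sit at their null levels, r_p = r_c = 0 and the blend reduces to d_a, so the model abstains; when the contextual signal is confident, mass shifts toward d_c.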

Experimental Results

Evaluated on three QA benchmarks (Natural Questions, HotpotQA, TriviaQA) with four LLMs (Llama3-8B, Llama2-7B/13B, Mistral-7B). Metrics: F1_ans (F1 on answerable questions), F1_abs (F1 on questions requiring abstention), and the overall Reliability Score (RS).

Dataset    Method                F1_ans   F1_abs   RS
NQ         FSB (best baseline)   69.27    54.94    59.64
NQ         CDA                   72.06    55.49    62.95
NQ         CDA-m                 73.15    55.47    63.72
HotpotQA   FSB (best baseline)   74.89    58.51    66.21
HotpotQA   CDA                   78.71    62.50    70.20
HotpotQA   CDA-m                 79.32    62.59    70.64
TriviaQA   FSB (best baseline)   77.02    59.84    68.55
TriviaQA   CDA                   80.39    65.67    72.35
TriviaQA   CDA-m                 80.93    65.66    72.74
Figure 7. Reliability scores in a practical RAG setting using the Contriever-msmarco retriever over Wikipedia. CDA-m achieves the highest RS (~68.7) among all compared methods.

Why It Matters

For trustworthy AI deployment, knowing when not to answer is just as important as answering correctly. CDA is the first training-free decoding approach that integrates abstention directly into the contrastive decoding framework, handling all four knowledge-access scenarios without any parameter updates. Its calibrated uncertainty estimation ensures robust performance across diverse models and datasets, while momentum stabilization prevents error propagation during autoregressive generation. The method is immediately applicable to any instruction-tuned LLM with RAG, making it a practical step toward more reliable question-answering systems.
