
Adaptive Contrastive Decoding in Retrieval-Augmented Generation for Handling Noisy Contexts

EMNLP 2024 Findings
Youna Kim, Hyuhng Joon Kim, Cheonbok Park, Choonghyun Park, Hyunsoo Cho, Junyeob Kim, Kang Min Yoo, Sang-goo Lee, Taeuk Kim

One-Line Summary

An entropy-based adaptive contrastive decoding method that dynamically adjusts how much an LLM relies on retrieved context during RAG, achieving robust open-domain QA performance even when retrieval returns noisy or irrelevant passages.

Figure 1. Overview of Adaptive Contrastive Decoding (ACD). ACD adapts the contrastive weight based on context relevance, enabling robust RAG generation even with noisy retrieved contexts.

Background & Motivation

When using LLMs for knowledge-intensive tasks such as open-domain question answering, Retrieval-Augmented Generation (RAG) bridges the gap between external knowledge and the model's parametric knowledge by feeding retrieved documents as context. However, real-world retrieval is far from perfect -- passages are often noisy, irrelevant, or even contradictory to the correct answer. When a model blindly conditions on such low-quality contexts, generation quality degrades significantly.

Recent contrastive decoding approaches (e.g., Context-Aware Decoding, CAD) attempt to amplify contextual knowledge over parametric knowledge by contrasting output distributions with and without context. While effective when relevant context is provided, these methods use a fixed contrastive weight, making them vulnerable to noisy retrieval scenarios. Another line of work, Multi-Input Contrastive Decoding (MICD), introduces a dynamic weight based on maximum token probability, but its relevance estimation remains coarse-grained and unreliable.

Limitation of Fixed Contrastive Decoding: Methods like CAD apply a constant weight (α = 0.5) to suppress or amplify context influence. This is fundamentally suboptimal: a fixed weight over-corrects when the retrieved context is actually helpful (degrading performance on gold contexts and dropping overall TriviaQA EM from 60.23 to 49.02 on Llama2-7B) and under-corrects when the context is noisy or misleading (failing to suppress harmful influence). Similarly, MICD-F uses a fixed α = 1.0, which aggressively amplifies context regardless of quality.

Key Insight: The optimal decoding strategy should adapt dynamically based on how much the retrieved context actually helps the model. By measuring the entropy change between context-free and context-augmented output distributions, the model itself can signal whether context is reducing uncertainty (helpful) or not (noisy) -- and the contrastive weight should respond accordingly. Unlike MICD-D's max-probability heuristic, entropy captures the full shape of the output distribution, providing a more principled and reliable relevance signal.

Proposed Method: Adaptive Contrastive Decoding (ACD)

ACD extends standard contrastive decoding with an entropy-based adaptive weight mechanism. The core decoding formula is:

Decoding Formula: P(Y_t | x, y_<t) = softmax( z_t + αACD · (z_tc - z_t) )

where z_t are the logits without context, z_tc are the logits with context, and αACD is the adaptive weight.

Adaptive Weight: αACD = H(Y_t) / ( H(Y_t) + H(Y_tc) )

where H(Y_t) is the entropy of the context-free distribution and H(Y_tc) is the entropy of the context-augmented distribution.

The intuition is simple: when retrieved context reduces model uncertainty (H(Y_tc) < H(Y_t)), αACD approaches 1, amplifying context influence. When context adds confusion (H(Y_tc) > H(Y_t)), αACD decreases toward 0, suppressing context influence and relying more on parametric knowledge. When both entropies are equal, αACD = 0.5, providing a neutral balance.
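Under the definitions above, the adaptive weight takes only a few lines to compute. Below is a minimal NumPy sketch (the function names are ours, not from the paper's released code):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return float(-np.sum(p * np.log(p + 1e-12)))

def acd_weight(z_t, z_tc):
    """alpha_ACD = H(Y_t) / (H(Y_t) + H(Y_tc)).

    z_t:  next-token logits without context (closed-book)
    z_tc: next-token logits with retrieved context (open-book)
    """
    h_t = entropy(softmax(z_t))
    h_tc = entropy(softmax(z_tc))
    return h_t / (h_t + h_tc)
```

When the two distributions are identical the weight is exactly 0.5; a sharply peaked open-book distribution pushes it toward 1, and a sharply peaked closed-book one pushes it toward 0.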

1. Dual Forward Pass
At each generation step, compute two sets of logits: z_t (from the model without retrieved context, i.e., closed-book) and z_tc (from the model conditioned on retrieved context, i.e., open-book). This provides the two output distributions needed for both contrastive decoding and relevance estimation. The closed-book prompt follows the format "Answer the following questions: [few-shots] Question: [q] Answer:" while the open-book version prepends "Context: [retrieved passage]".
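As a concrete illustration of the two prompt variants (the exact spacing and the helper name `build_prompts` are our assumptions, not taken from the paper's code):

```python
def build_prompts(question, few_shots, passage):
    """Return (closed_book, open_book) prompts in the paper's stated format.

    few_shots: pre-formatted string of in-context QA examples.
    passage:   the retrieved passage for the open-book variant.
    """
    closed_book = (
        f"Answer the following questions: {few_shots} "
        f"Question: {question} Answer:"
    )
    # The open-book prompt simply prepends the retrieved passage.
    open_book = f"Context: {passage} {closed_book}"
    return closed_book, open_book
```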
2. Entropy-Based Relevance Estimation
Compute the entropies H(Y_t) and H(Y_tc) of both distributions. The ratio αACD = H(Y_t) / (H(Y_t) + H(Y_tc)) automatically captures context quality: high α when context helps (reduces uncertainty), low α when context hurts (increases uncertainty). Unlike MICD-D's max-probability heuristic, entropy considers the entire probability distribution, yielding a far more reliable relevance signal (first-token AUROC 73-80% vs. 54-69%).
3. Adaptive Contrastive Token Generation
Apply the adaptive weight to the contrastive decoding formula: the final logits are z_t + αACD · (z_tc - z_t). This smoothly interpolates between fully relying on parametric knowledge (α=0) and fully leveraging context (α=1), with the balance determined per-token by the entropy signal. The computational overhead is exactly 2x standard greedy decoding (same as CAD), since only two forward passes are needed per step.
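Putting the three steps together, one ACD decoding step can be sketched as follows. This is a minimal NumPy mock: in practice z_t and z_tc come from the two LLM forward passes, not from hand-written arrays.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return float(-np.sum(p * np.log(p + 1e-12)))

def acd_step(z_t, z_tc):
    """One ACD step: entropy-based weight, contrastive mix, greedy token.

    Returns (alpha, next_token_id).
    """
    h_t = entropy(softmax(z_t))
    h_tc = entropy(softmax(z_tc))
    alpha = h_t / (h_t + h_tc)                     # step 2: relevance estimate
    p_final = softmax(z_t + alpha * (z_tc - z_t))  # step 3: adaptive contrast
    return alpha, int(np.argmax(p_final))

# Mock logits: the closed-book model is unsure, the context is confident.
alpha, tok = acd_step(np.array([1.0, 0.9, 0.8]), np.array([0.0, 8.0, 0.0]))
```

Because the closed-book distribution is near-uniform (high entropy) while the open-book one is sharply peaked, alpha lands close to 1 and the confident context token wins.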

Experimental Setup

ACD is evaluated on three open-domain QA benchmarks (Natural Questions (NQ), TriviaQA, and PopQA), using the December 2018 Wikipedia dump as the retrieval corpus.

Retrieval uses Contriever-msmarco with top-1 passage selection. Four LLMs are tested: Llama2-7B, Llama2-13B, Llama3-8B, and Mistral-7B. All experiments use 5-shot prompting and Exact Match (EM) as the evaluation metric. To systematically analyze robustness, each dataset is split into SubsetGold (examples whose retrieved passage contains the gold answer) and SubsetNoisy (examples whose passage does not).

Baselines include: RegCls (closed-book, no context), RegOpn / Standard RAG (open-book, standard greedy decoding), CAD (fixed α=0.5), MICD-F (fixed α=1.0), and MICD-D (dynamic α based on max token probability).

Experimental Results

Main Results with Gold/Noisy Breakdown (Llama2-7B, EM Accuracy)

| Method | TriviaQA (All) | TriviaQA (Gold) | TriviaQA (Noisy) | NQ (All) | NQ (Gold) | NQ (Noisy) | PopQA (All) | PopQA (Gold) | PopQA (Noisy) |
|---|---|---|---|---|---|---|---|---|---|
| Standard RAG | 60.23 | 87.40 | 33.50 | 31.39 | 61.31 | 12.40 | 38.49 | 81.21 | 7.77 |
| CAD (α=0.5) | 49.02 | 73.69 | 24.75 | 25.57 | 51.61 | 9.05 | 33.70 | 72.18 | 6.03 |
| MICD-F (α=1.0) | 60.36 | 85.72 | 35.39 | 29.45 | 56.10 | 12.54 | 35.73 | 74.25 | 8.03 |
| MICD-D | 63.23 | 86.03 | 40.79 | 30.36 | 52.18 | 16.52 | 39.01 | 77.39 | 11.42 |
| ACD (Ours) | 64.85 | 88.01 | 42.06 | 32.91 | 56.60 | 17.88 | 41.29 | 82.77 | 11.46 |

ACD achieves the best overall performance across all three datasets. Critically, it simultaneously excels on both SubsetGold (preserving retrieval benefits) and SubsetNoisy (suppressing noise). CAD, with its fixed weight, catastrophically degrades on both subsets -- dropping TriviaQA from 87.40 to 73.69 on gold contexts, showing that indiscriminate contrastive amplification actively hurts when contexts are helpful.

Cross-Model Generalization (All Data, EM Accuracy)

| Model | Method | TriviaQA | NQ | PopQA |
|---|---|---|---|---|
| Llama2-13B | MICD-D | 66.52 | 34.38 | 41.65 |
| Llama2-13B | ACD | 67.37 | 36.12 | 43.35 |
| Llama3-8B | MICD-D | 64.01 | 30.72 | 41.35 |
| Llama3-8B | ACD | 66.32 | 35.48 | 43.25 |
| Mistral-7B | MICD-D | 66.97 | 33.24 | 39.87 |
| Mistral-7B | ACD | 67.82 | 35.37 | 41.47 |

ACD consistently outperforms the strongest baseline (MICD-D) across all four model architectures, with particularly large gains on NQ (+4.76 on Llama3-8B) where retrieval noise is most prevalent.

Context Quality Discrimination (AUROC, Llama2-7B)

To verify that the adaptive weight truly reflects context quality, the paper measures how well α values separate gold from noisy contexts using AUROC. Three aggregation strategies are compared: max, average, and first-token α.

| Aggregation | Method | NQ | TriviaQA | PopQA |
|---|---|---|---|---|
| Max | MICD-D | 51.53 | 59.76 | 65.49 |
| Max | ACD | 65.78 | 73.37 | 74.84 |
| Average | MICD-D | 54.18 | 63.78 | 72.64 |
| Average | ACD | 68.80 | 72.32 | 78.90 |
| First Token | MICD-D | 53.92 | 62.95 | 68.81 |
| First Token | ACD | 73.27 | 80.45 | 80.08 |

ACD's entropy-based α achieves 65-80% AUROC across all settings, far surpassing MICD-D's max-probability heuristic (51-73%). The first-token α is especially discriminative, suggesting that the model's initial reaction to context is already highly informative about passage quality.
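This kind of AUROC can be computed from per-example alpha values with the standard rank-based (Mann-Whitney) estimator; a dependency-free sketch:

```python
def auroc(alphas_gold, alphas_noisy):
    """P(alpha_gold > alpha_noisy), ties counted as 0.5.

    A perfect relevance signal gives 1.0; a random one gives about 0.5.
    """
    wins = 0.0
    for g in alphas_gold:
        for n in alphas_noisy:
            if g > n:
                wins += 1.0
            elif g == n:
                wins += 0.5
    return wins / (len(alphas_gold) * len(alphas_noisy))
```

An AUROC of 73-80% says that a randomly chosen gold-context alpha exceeds a randomly chosen noisy-context alpha 73-80% of the time.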

Comparison with Oracle Upper Bound (Llama2-7B)

The oracle sets α=1.0 for gold contexts and α=0.0 for noisy contexts -- a perfect relevance estimator using ground-truth labels:

| Dataset | ACD | Oracle | Gap |
|---|---|---|---|
| NQ | 32.91 | 35.35 | +2.44 |
| TriviaQA | 64.85 | 65.31 | +0.46 |
| PopQA | 41.29 | 44.10 | +2.81 |

ACD comes within 0.5-2.8 EM points of the oracle without any ground-truth labels, demonstrating that entropy is an excellent proxy for true context relevance.

Analysis

Ablation: Adaptive vs. Fixed α

The paper ablates performance across fixed α values from 0.0 to 1.0 to confirm that dynamic weighting is strictly superior. At α=0.0 (parametric only), overall EM is low. As α increases, gold-context performance initially improves but noisy-context performance worsens. No single fixed α can optimize both subsets simultaneously. ACD's adaptive α consistently outperforms every fixed α by 1-3 EM points overall, validating that per-token dynamic adjustment is essential.

Knowledge Conflict Scenario (NQ-swap)

A particularly revealing analysis uses the NQ-swap dataset (3,650 samples), where gold answer spans in retrieved passages are replaced with random entities of the same type. This creates a scenario where the context is structurally relevant but factually wrong -- directly conflicting with the model's parametric knowledge.

Key Finding: ACD achieves 60-75% EM on NQ-swap across models, substantially outperforming Standard RAG (41-50%) which blindly follows the manipulated context. Critically, ACD also outperforms MICD-D, which tends to over-reject contexts and fails to leverage the structural cues. ACD's entropy-based mechanism correctly identifies that the swapped context increases model uncertainty (since the injected entity conflicts with parametric knowledge), leading to lower α and appropriate reliance on internal knowledge.

Case Study: Entropy in Action

Two illustrative examples demonstrate how ACD's entropy-based weight works in practice:

Known-Noisy Example: "Who does the voice of Nala in the Lion King?"

The model already knows the answer (Moira Kelly) with low entropy: H(Y_t) = 2.92. When given a noisy passage suggesting Whoopi Goldberg, context-augmented entropy spikes: H(Y_tc) = 5.46. The resulting αACD = 0.35 (low), correctly suppressing the misleading context. ACD answers "Moira Kelly" -- correct.

Unknown-Gold Example: "Who played Ben Stone on Law and Order?"

The model is uncertain (guesses Michael Tucker) with high entropy: H(Y_t) = 6.67. Given a gold passage mentioning Michael Moriarty, uncertainty drops sharply: H(Y_tc) = 1.56. The resulting αACD = 0.81 (high), correctly amplifying the helpful context. ACD answers "Michael Moriarty" -- correct.

These cases illustrate the core mechanism: αACD naturally tracks whether context resolves or creates uncertainty, without any explicit relevance classifier.
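The two weights follow directly from the reported entropies; a quick check of the arithmetic:

```python
def acd_alpha(h_t, h_tc):
    """alpha_ACD = H(Y_t) / (H(Y_t) + H(Y_tc))."""
    return h_t / (h_t + h_tc)

# Known-noisy: low closed-book entropy, context adds confusion -> low alpha.
print(round(acd_alpha(2.92, 5.46), 2))  # 0.35
# Unknown-gold: high closed-book entropy, context resolves it -> high alpha.
print(round(acd_alpha(6.67, 1.56), 2))  # 0.81
```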


Why It Matters

RAG systems are increasingly deployed in production, but retrieval quality is inherently unpredictable. A system that works well with perfect retrieval can fail catastrophically with noisy results. ACD addresses this fundamental brittleness with three key contributions: an entropy-based, per-token estimate of context relevance; an adaptive contrastive weight that preserves the benefits of gold contexts while suppressing noisy ones; and near-oracle robustness without ground-truth relevance labels or additional training.
