Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations
EMNLP 2022
Kang Min Yoo, Junyeob Kim, Hyuhng Joon Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-Woo Lee, Sang-goo Lee, Taeuk Kim
One-Line Summary
Through two novel metrics -- Label-Correctness Sensitivity and Ground-truth Label Effect Ratio (GLER) -- this paper demonstrates that correct input-label mappings in demonstrations have a far greater impact on in-context learning than previously reported, with the effect modulated by prompt verbosity and model scale, overturning the influential claim by Min et al. (2022) that "ground-truth labels barely matter."
Figure 1. Overview of the experimental framework: The paper systematically varies label correctness in ICL demonstrations and measures downstream performance through the proposed Label-Correctness Sensitivity and GLER metrics across diverse configurations of models, tasks, and prompt templates.
Background & Motivation
In-context learning (ICL) enables large language models to perform tasks by conditioning on a few input-label demonstrations, without any parameter updates. In a surprising and influential finding, Min et al. (2022, "Rethinking the Role of Demonstrations") reported that the correctness of labels in demonstrations barely matters -- models performed nearly as well even with randomly assigned labels. This counterintuitive result raised fundamental questions about what ICL actually learns from demonstrations and quickly became one of the most widely cited findings in the ICL literature.
The Core Puzzle This Paper Addresses:
Counterintuitive prior finding: Min et al. (2022) reported that replacing ground-truth labels with random labels in ICL demonstrations had minimal effect on downstream performance, suggesting that demonstrations primarily serve as a formatting template rather than a source of input-label mappings.
Limited experimental scope: The prior conclusion was drawn from a narrow set of models (mainly GPT-3 Davinci 175B), a small number of tasks, and specific verbose prompt templates -- raising questions about generalizability across broader settings.
Lack of quantitative tools: No formal metrics existed to precisely measure and decompose the impact of label correctness on ICL performance, making rigorous comparison across different experimental settings difficult.
Practical stakes: If labels truly do not matter, it would fundamentally change how practitioners construct demonstration sets for real-world ICL applications -- potentially allowing the use of cheap, noisy labels without performance cost.
Decomposition gap: ICL performance gains come from multiple factors -- input distribution, label space, demonstration format, and input-label mappings -- but no prior work cleanly isolated and quantified the contribution of each component.
Intrigued by this counterintuitive observation, the authors conduct an extensive re-examination using the GPT-3 family (Ada 350M, Babbage 1.3B, Curie 6.7B, Davinci 175B) and diverse classification benchmarks (SST-2, SST-5, MR, CR, AGNews, TREC, DBPedia, RTE, CB, and others). They introduce new quantitative metrics and reveal that ground-truth labels do matter significantly -- and that the prior conclusion was an artifact of specific, limited experimental configurations.
Proposed Method: Quantifying the Impact of Ground-Truth Labels
The paper introduces two novel metrics and conducts a systematic multi-factor analysis to quantify how much ground-truth labels contribute to ICL performance. The key insight is to decompose the overall ICL gain into distinct components -- format, label space, input distribution, and ground-truth label mapping -- and measure the relative contribution of each.
1. Decomposing ICL Performance
The authors decompose the total performance gain from in-context demonstrations into four additive components: (a) format effect -- the gain from seeing the input-label structure; (b) label space effect -- the gain from exposure to the set of possible labels; (c) input distribution effect -- the gain from seeing task-relevant inputs; and (d) ground-truth label effect -- the additional gain from correct input-label mappings. This decomposition enables precise isolation of each factor's contribution.
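Written as an equation (our notation, paraphrasing the decomposition above; the paper may denote the components differently):

```latex
\Delta_{\text{ICL}}
  = \underbrace{\Delta_{\text{format}}}_{(a)}
  + \underbrace{\Delta_{\text{label space}}}_{(b)}
  + \underbrace{\Delta_{\text{input dist.}}}_{(c)}
  + \underbrace{\Delta_{\text{GT labels}}}_{(d)},
\qquad
\Delta_{\text{ICL}} = \mathrm{acc}_{\text{gold demos}} - \mathrm{acc}_{\text{zero-shot}}
```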
2. Label-Correctness Sensitivity
A metric that measures how sensitive a model's ICL performance is to the correctness of demonstration labels. By systematically replacing ground-truth labels with random labels at varying proportions (0%, 25%, 50%, 75%, 100% random), this metric captures the degree to which a model relies on correct input-label mappings versus other demonstration components. A steep performance decline with increasing label noise indicates high sensitivity -- the model is genuinely learning from correct demonstrations.
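A minimal sketch of such a noise sweep, assuming a black-box `model_accuracy(demos, test_set)` evaluation callable (hypothetical, not an API from the paper) and estimating sensitivity as the negated slope of accuracy against the corruption rate -- one plausible reading of the metric, not necessarily the paper's exact formula:

```python
import random
import numpy as np

def corrupt_labels(demos, label_set, frac_random, seed=0):
    """Replace a given fraction of demonstration labels with random ones."""
    rng = random.Random(seed)
    demos = list(demos)  # demos: list of (input_text, label) pairs
    n_corrupt = round(frac_random * len(demos))
    for i in rng.sample(range(len(demos)), n_corrupt):
        text, _ = demos[i]
        demos[i] = (text, rng.choice(label_set))
    return demos

def label_correctness_sensitivity(demos, label_set, test_set, model_accuracy):
    """Negated slope of accuracy vs. label-noise level over a 0-100% sweep."""
    noise_levels = [0.0, 0.25, 0.5, 0.75, 1.0]
    accs = [model_accuracy(corrupt_labels(demos, label_set, p), test_set)
            for p in noise_levels]
    slope = np.polyfit(noise_levels, accs, deg=1)[0]
    return -slope  # larger value = stronger reliance on correct labels
```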
3. Ground-truth Label Effect Ratio (GLER)
A complementary metric that quantifies the relative contribution of ground-truth labels to the total ICL performance gain. GLER is defined as the ratio of the ground-truth label effect to the total ICL gain (performance with correct labels minus zero-shot performance). A GLER of 0 means labels contribute nothing beyond format/distribution effects; a GLER of 1 means all ICL gains come from correct label mappings. This enables direct, normalized comparison across models, tasks, and prompt designs.
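In code, GLER reduces to a single ratio; a sketch under the definition just given (variable names are illustrative):

```python
def gler(acc_gold: float, acc_random: float, acc_zero_shot: float) -> float:
    """Ground-truth Label Effect Ratio.

    acc_gold:      accuracy with correct demonstration labels
    acc_random:    accuracy with randomly assigned demonstration labels
    acc_zero_shot: accuracy with no demonstrations

    Returns the share of the total ICL gain attributable to correct
    input-label mappings: 0 = labels add nothing, 1 = all gain from labels.
    """
    total_gain = acc_gold - acc_zero_shot    # assumed positive
    label_effect = acc_gold - acc_random
    return label_effect / total_gain

# Example: gold 85%, random labels 70%, zero-shot 60%
# -> 15 of the 25 points of total gain come from correct labels
print(gler(0.85, 0.70, 0.60))  # ~0.6
```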
4. Multi-Factor Experimental Analysis
Using these metrics, the authors conduct experiments across multiple dimensions: (a) 4 model scales in the GPT-3 family (350M to 175B parameters), (b) 12+ classification tasks spanning binary, ternary, and multi-class settings (SST-2, SST-5, MR, CR, AGNews, TREC, DBPedia, RTE, CB, etc.), (c) 3 prompt template designs ranging from minimal (input-label pairs only) to verbose (with full task descriptions and instructions), and (d) varying numbers of demonstrations (4 to 32 shots). This systematic coverage reveals the conditions under which ground-truth labels become critical.
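The resulting experiment grid is a plain cross product over these factors; a sketch of how it might be enumerated (the middle template name and the shot spacing are our illustrative choices -- the text only specifies the endpoints):

```python
from itertools import product

models = ["ada-350M", "babbage-1.3B", "curie-6.7B", "davinci-175B"]
tasks = ["SST-2", "SST-5", "MR", "CR", "AGNews", "TREC", "DBPedia", "RTE", "CB"]
templates = ["minimal", "intermediate", "verbose"]
shots = [4, 8, 16, 32]

for model, task, template, k in product(models, tasks, templates, shots):
    # For each configuration: sample k demonstrations, sweep the label-noise
    # levels, and record accuracy to derive sensitivity and GLER as above.
    ...
```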
5. Identifying Key Controlling Factors
Through ablation studies and controlled experiments, the authors identify two key factors that modulate the importance of ground-truth labels: prompt template verbosity (verbose templates with task descriptions effectively encode label semantics in the template itself, reducing reliance on demonstration labels) and model scale (larger models such as Davinci 175B are significantly better at leveraging correct demonstrations, while smaller models like Ada 350M show weak sensitivity). These factors explain why prior work, which used verbose prompts with a single large model, reached a different conclusion.
Experimental Results
The authors evaluate across the GPT-3 family (Ada 350M, Babbage 1.3B, Curie 6.7B, Davinci 175B) and a comprehensive set of text classification benchmarks covering sentiment analysis, topic classification, question classification, and natural language inference. The two proposed metrics reveal clear patterns that were obscured in prior work.
Models and Benchmarks
| Model | Parameters |
| --- | --- |
| GPT-3 Ada | 350M |
| GPT-3 Babbage | 1.3B |
| GPT-3 Curie | 6.7B |
| GPT-3 Davinci | 175B |

All four models are evaluated on the same benchmark suite: SST-2, SST-5, MR, CR, AGNews, TREC, DBPedia, RTE, CB, and others (12+ tasks spanning binary to 14-class classification).
GLER Across Settings

| Setting | GLER | Interpretation |
| --- | --- | --- |
| Minimal prompt + large model (e.g., Davinci 175B) | High | Ground-truth labels contribute substantially to ICL performance; the model actively learns from correct mappings |
| Verbose prompt + Ada 350M | Low | Other factors (formatting, task description in the template) dominate; labels add little beyond what the template provides |
| Multi-class tasks (AGNews, TREC) | Higher than on binary tasks | More label options increase reliance on correct mappings, since the label space cannot be inferred from the format alone |
| Increasing demonstration count | Tends to increase | More examples amplify the learning signal from correct labels, compounding the benefit of accurate demonstrations |
| Min et al. (2022) setup | Low | The verbose prompt templates used by Min et al. suppressed the measured ground-truth label effect |
Performance Decomposition: What Drives ICL Gains?
| Component | Minimal Prompt | Verbose Prompt |
| --- | --- | --- |
| Format effect | Small contribution | Moderate contribution |
| Label space effect | Moderate contribution | Large contribution (template encodes label semantics) |
| Input distribution effect | Moderate contribution | Moderate contribution |
| Ground-truth label effect | Large contribution (dominant factor) | Small contribution (masked by template) |
Ground-truth labels do matter -- significantly: Across a broad range of configurations, correct input-label mappings yield substantial performance gains over random labels. On multi-class tasks with minimal prompts, the gap between correct and random labels can reach 10-20 absolute points.
Prompt verbosity is a key confound: Verbose prompt templates that include task descriptions effectively encode label information in the template itself, reducing the model's reliance on demonstration labels (see the template sketch after this list). This explains why Min et al. (2022), who used verbose templates, found labels to be unimportant -- the template was already "telling" the model what each label means.
Larger models leverage labels more effectively: As model scale increases from Ada (350M) to Davinci (175B), models become significantly more sensitive to label correctness, extracting stronger learning signals from correct input-label mappings. Smaller models lack the capacity to fully utilize the information in correct demonstrations.
Task granularity amplifies the effect: For fine-grained classification tasks with many classes (e.g., TREC with 6 classes, AGNews with 4 classes, DBPedia with 14 classes), the gap between correct and random labels widens considerably, as the label space becomes harder to infer from format alone.
Performance degrades proportionally with noise: Systematically increasing the fraction of incorrect labels (0% to 100% random) leads to roughly proportional performance degradation, confirming that each correct demonstration provides a meaningful and additive learning signal.
Prior conclusions were configuration-specific: The finding that "labels don't matter" holds only under specific configurations -- chiefly verbose prompts, which encode label semantics in the template, or small models, which lack the sensitivity to exploit correct mappings -- and does not generalize to the broader ICL landscape. When the experimental scope is broadened, the importance of ground-truth labels becomes unmistakable.
GLER reveals hidden structure: The decomposition via GLER shows that in the settings studied by Min et al. (2022), the ground-truth label effect was genuinely small -- but not because labels are inherently unimportant; rather, because the verbose prompt template was already providing equivalent information through its task description.
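To make the verbosity confound concrete, here is a hypothetical pair of templates for SST-2-style sentiment classification (illustrative only, not the exact templates from either paper). In the verbose template the instruction already defines the label space and semantics, so corrupting the demonstration labels costs comparatively little:

```python
# Minimal template: the model must infer both the task and the label
# semantics entirely from the demonstrations ("..." = repeated demo lines).
MINIMAL = (
    "{demo_input} -> {demo_label}\n"
    "...\n"
    "{test_input} -> "
)

# Verbose template: the task description spells out what each label means,
# so demonstration labels carry less marginal information.
VERBOSE = (
    "Classify the sentiment of each movie review as 'positive' or 'negative'.\n\n"
    "Review: {demo_input}\nSentiment: {demo_label}\n"
    "...\n"
    "Review: {test_input}\nSentiment: "
)
```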
Why It Matters
This work makes four important contributions to the understanding and practice of in-context learning:
Corrects an influential misconception: The claim that ground-truth labels are unimportant for ICL had become widely accepted and cited across the NLP community. This paper provides rigorous evidence that correct demonstrations do teach models meaningful input-label mappings, restoring a key intuition about how ICL works and prompting the community to re-evaluate its assumptions.
Introduces principled evaluation metrics: Label-Correctness Sensitivity and GLER provide the first formal, quantitative tools for measuring and decomposing the impact of demonstration quality on ICL. These metrics enable standardized comparison across future studies and have been adopted by subsequent work on ICL analysis.
Offers actionable guidance for practitioners: The identification of prompt verbosity and model scale as key controlling factors gives practitioners clear design guidelines: when using larger models with minimal prompts, prioritizing label accuracy in demonstrations is especially critical for strong ICL performance. Conversely, when label noise is unavoidable, using more verbose templates can help compensate.
Advances mechanistic understanding of ICL: The decomposition of ICL performance into format, label space, input distribution, and ground-truth label components provides a principled framework for understanding what models learn from demonstrations, contributing to the broader effort to demystify the mechanisms underlying in-context learning in large language models.