
Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations

EMNLP 2022
Kang Min Yoo, Junyeob Kim, Hyuhng Joon Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-Woo Lee, Sang-goo Lee, Taeuk Kim

One-Line Summary

Through two novel metrics -- Label-Correctness Sensitivity and Ground-truth Label Effect Ratio (GLER) -- this paper demonstrates that correct input-label mappings in demonstrations have a far greater impact on in-context learning than previously reported, with the effect modulated by prompt verbosity and model scale, overturning the influential claim by Min et al. (2022) that "ground-truth labels barely matter."

Paper overview
Figure 1. Overview of the experimental framework: The paper systematically varies label correctness in ICL demonstrations and measures downstream performance through the proposed Label-Correctness Sensitivity and GLER metrics across diverse configurations of models, tasks, and prompt templates.

Background & Motivation

In-context learning (ICL) enables large language models to perform tasks by conditioning on a few input-label demonstrations, without any parameter updates. A surprising and influential finding by Min et al. (2022, "Rethinking the Role of Demonstrations") claimed that the correctness of labels in demonstrations barely matters -- models performed nearly as well even with randomly assigned labels. This counterintuitive result raised fundamental questions about what ICL actually learns from demonstrations and quickly became one of the most widely cited findings in the ICL literature.

The Core Puzzle This Paper Addresses:

  • Counterintuitive prior finding: Min et al. (2022) reported that replacing ground-truth labels with random labels in ICL demonstrations had minimal effect on downstream performance, suggesting that demonstrations primarily serve as a formatting template rather than a source of input-label mappings.
  • Limited experimental scope: The prior conclusion was drawn from a narrow set of models (mainly GPT-3 Davinci 175B), a small number of tasks, and specific verbose prompt templates -- raising questions about generalizability across broader settings.
  • Lack of quantitative tools: No formal metrics existed to precisely measure and decompose the impact of label correctness on ICL performance, making rigorous comparison across different experimental settings difficult.
  • Practical stakes: If labels truly do not matter, it would fundamentally change how practitioners construct demonstration sets for real-world ICL applications -- potentially allowing the use of cheap, noisy labels without performance cost.
  • Decomposition gap: ICL performance gains come from multiple factors -- input distribution, label space, demonstration format, and input-label mappings -- but no prior work cleanly isolated and quantified the contribution of each component.

Intrigued by this counterintuitive observation, the authors conduct an extensive re-examination using the GPT-3 family (Ada 350M, Babbage 1.3B, Curie 6.7B, Davinci 175B) and diverse classification benchmarks (SST-2, SST-5, MR, CR, AGNews, TREC, DBPedia, RTE, CB, and others). They introduce new quantitative metrics and reveal that ground-truth labels do matter significantly -- and that the prior conclusion was an artifact of specific, limited experimental configurations.

Proposed Method: Quantifying the Impact of Ground-Truth Labels

The paper introduces two novel metrics and conducts a systematic multi-factor analysis to quantify how much ground-truth labels contribute to ICL performance. The key insight is to decompose the overall ICL gain into distinct components -- format, label space, input distribution, and ground-truth label mapping -- and measure the relative contribution of each.

1. Decomposing ICL Performance
The authors decompose the total performance gain from in-context demonstrations into four additive components: (a) format effect -- the gain from seeing the input-label structure; (b) label space effect -- the gain from exposure to the set of possible labels; (c) input distribution effect -- the gain from seeing task-relevant inputs; and (d) ground-truth label effect -- the additional gain from correct input-label mappings. This decomposition enables precise isolation of each factor's contribution.
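This additive decomposition can be sketched with hypothetical accuracy numbers. The condition names below are illustrative labels for the five evaluation settings implied by the decomposition, not the paper's exact notation, and the numbers are invented for the example.

```python
# Hypothetical accuracies from five ICL conditions (illustrative numbers,
# not results from the paper).
acc = {
    "zero_shot":      0.50,  # no demonstrations at all
    "format_only":    0.55,  # demo structure with out-of-distribution inputs/labels
    "plus_label_set": 0.62,  # demo labels drawn from the true label space
    "plus_inputs":    0.70,  # real task inputs, but randomly assigned labels
    "gold_labels":    0.82,  # correct input-label mappings
}

# Each effect is the marginal gain from adding one more component.
format_effect      = acc["format_only"]    - acc["zero_shot"]
label_space_effect = acc["plus_label_set"] - acc["format_only"]
input_dist_effect  = acc["plus_inputs"]    - acc["plus_label_set"]
gold_label_effect  = acc["gold_labels"]    - acc["plus_inputs"]

# By construction the four effects sum to the total ICL gain.
total_gain = acc["gold_labels"] - acc["zero_shot"]
assert abs(format_effect + label_space_effect + input_dist_effect
           + gold_label_effect - total_gain) < 1e-9
```

Because the effects are defined as successive marginal gains, they sum to the total gain exactly, which is what makes the per-component attribution well defined.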
2. Label-Correctness Sensitivity
A metric that measures how sensitive a model's ICL performance is to the correctness of demonstration labels. By systematically replacing ground-truth labels with random labels at varying proportions (0%, 25%, 50%, 75%, 100% random), this metric captures the degree to which a model relies on correct input-label mappings versus other demonstration components. A steep performance decline with increasing label noise indicates high sensitivity -- the model is genuinely learning from correct demonstrations.
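One plausible formalization of this idea, assuming sensitivity is summarized as the slope of accuracy against the fraction of correct demonstration labels (the accuracy numbers below are invented for illustration):

```python
def label_correctness_sensitivity(correct_fracs, accuracies):
    """Least-squares slope of accuracy vs. fraction of correct demo labels.
    A steeper positive slope = stronger reliance on correct mappings."""
    n = len(correct_fracs)
    mx = sum(correct_fracs) / n
    my = sum(accuracies) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(correct_fracs, accuracies))
    var = sum((x - mx) ** 2 for x in correct_fracs)
    return cov / var

# Sweep from 100% random labels (0.0 correct) to 0% random (1.0 correct),
# matching the 0/25/50/75/100% replacement schedule described above.
fracs = [0.0, 0.25, 0.50, 0.75, 1.0]
accs  = [0.52, 0.58, 0.66, 0.74, 0.82]   # hypothetical accuracies
sensitivity = label_correctness_sensitivity(fracs, accs)
```

A near-zero slope reproduces the "labels barely matter" regime; a large positive slope indicates the model is genuinely exploiting correct input-label mappings.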
3. Ground-truth Label Effect Ratio (GLER)
A complementary metric that quantifies the relative contribution of ground-truth labels to the total ICL performance gain. GLER is defined as the ratio of the ground-truth label effect to the total ICL gain (performance with correct labels minus zero-shot performance). A GLER of 0 means labels contribute nothing beyond format/distribution effects; a GLER of 1 means all ICL gains come from correct label mappings. This enables direct, normalized comparison across models, tasks, and prompt designs.
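Following the definition above, GLER can be computed from three accuracies: with gold labels, with random labels, and zero-shot. The numbers in the example are hypothetical.

```python
def gler(acc_gold, acc_random, acc_zero_shot):
    """Ground-truth Label Effect Ratio: the share of the total ICL gain
    attributable to correct input-label mappings."""
    label_effect = acc_gold - acc_random     # gain from correct vs. random labels
    total_gain = acc_gold - acc_zero_shot    # gain from demos vs. zero-shot
    return label_effect / total_gain

# Hypothetical numbers, e.g. a minimal prompt with a large model:
ratio = gler(acc_gold=0.82, acc_random=0.60, acc_zero_shot=0.50)
```

Here most of the ICL gain comes from correct labels (ratio near 0.7). When random labels perform as well as gold labels, the numerator vanishes and GLER is 0, recovering the Min et al. (2022) regime.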
4. Multi-Factor Experimental Analysis
Using these metrics, the authors conduct experiments across multiple dimensions: (a) 4 model scales in the GPT-3 family (350M to 175B parameters), (b) 12+ classification tasks spanning binary, ternary, and multi-class settings (SST-2, SST-5, MR, CR, AGNews, TREC, DBPedia, RTE, CB, etc.), (c) 3 prompt template designs ranging from minimal (input-label pairs only) to verbose (with full task descriptions and instructions), and (d) varying numbers of demonstrations (4 to 32 shots). This systematic coverage reveals the conditions under which ground-truth labels become critical.
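The experimental grid described above can be sketched as a Cartesian product of the four factors. The factor values are abbreviated, and `run_icl_eval` is a hypothetical evaluation function standing in for the authors' pipeline.

```python
import itertools

# The four experimental factors (abbreviated; task list is not exhaustive).
models    = ["ada-350m", "babbage-1.3b", "curie-6.7b", "davinci-175b"]
tasks     = ["sst2", "sst5", "mr", "cr", "agnews", "trec", "dbpedia", "rte", "cb"]
templates = ["minimal", "moderate", "verbose"]
shots     = [4, 8, 16, 32]

# Every combination of model x task x template x shot count.
grid = list(itertools.product(models, tasks, templates, shots))

# for model, task, template, k in grid:
#     acc = run_icl_eval(model, task, template, n_shots=k,
#                        correct_label_frac=1.0)  # then repeat at 0.75, 0.5, ...
```

Even this abbreviated grid yields hundreds of configurations per label-correctness level, which is what lets the paper separate the effects of scale, task, and template rather than reporting a single aggregate.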
5. Identifying Key Controlling Factors
Through ablation studies and controlled experiments, the authors identify two key factors that modulate the importance of ground-truth labels: prompt template verbosity (verbose templates with task descriptions effectively encode label semantics in the template itself, reducing reliance on demonstration labels) and model scale (larger models such as Davinci 175B are significantly better at leveraging correct demonstrations, while smaller models like Ada 350M show weak sensitivity). These factors explain why prior work, which used verbose prompts with a single large model, reached a different conclusion.

Experimental Results

The authors evaluate across the GPT-3 family (Ada 350M, Babbage 1.3B, Curie 6.7B, Davinci 175B) and a comprehensive set of text classification benchmarks covering sentiment analysis, topic classification, question classification, and natural language inference. The two proposed metrics reveal clear patterns that were obscured in prior work.

Models and Benchmarks

Model         | Parameters
GPT-3 Ada     | 350M
GPT-3 Babbage | 1.3B
GPT-3 Curie   | 6.7B
GPT-3 Davinci | 175B

Tasks evaluated (all models): SST-2, SST-5, MR, CR, AGNews, TREC, DBPedia, RTE, CB, and others (12+ tasks spanning binary to 14-class classification)

Key Findings: Label-Correctness Sensitivity

Factor                   | Low Sensitivity (labels seem unimportant)         | High Sensitivity (labels clearly matter)
Prompt Verbosity         | Verbose templates with detailed task descriptions | Minimal templates with no task instructions
Model Scale              | Smaller models (Ada 350M, Babbage 1.3B)           | Larger models (Curie 6.7B, Davinci 175B)
Task Complexity          | Simple binary classification (e.g., SST-2, MR)    | Fine-grained multi-class classification (e.g., AGNews, TREC, DBPedia)
Number of Demonstrations | Very few demonstrations (4-shot)                  | More demonstrations (16- to 32-shot)

Ground-truth Label Effect Ratio (GLER) Analysis

Experimental Configuration       | GLER Trend                    | Interpretation
Minimal prompt + Davinci 175B    | High GLER                     | Ground-truth labels contribute substantially to ICL performance; the model actively learns from correct mappings
Verbose prompt + Ada 350M        | Low GLER                      | Other factors (formatting, task description in the template) dominate; labels add little beyond what the template provides
Multi-class tasks (AGNews, TREC) | Higher GLER than binary tasks | More label options increase reliance on correct mappings, since the label space cannot be inferred from the format alone
Increasing demonstration count   | GLER tends to increase        | More examples amplify the learning signal from correct labels, compounding the benefit of accurate demonstrations
Min et al. (2022) setup          | Low GLER                      | The verbose prompt templates used by Min et al. artificially suppressed the measured ground-truth label effect

Performance Decomposition: What Drives ICL Gains?

Component                 | Minimal Prompt                       | Verbose Prompt
Format effect             | Small contribution                   | Moderate contribution
Label space effect        | Moderate contribution                | Large contribution (template encodes label semantics)
Input distribution effect | Moderate contribution                | Moderate contribution
Ground-truth label effect | Large contribution (dominant factor) | Small contribution (masked by template)

Why It Matters

This work makes four important contributions to the understanding and practice of in-context learning:
