
Prompt-Augmented Linear Probing: Scaling Beyond The Limit of Few-shot In-Context Learners

AAAI 2023
Hyunsoo Cho, Hyuhng Joon Kim, Junyeob Kim, Sang-Woo Lee, Sang-goo Lee, Kang Min Yoo, Taeuk Kim

One-Line Summary

PALP combines prompt-augmented representations with lightweight linear classifiers on frozen LLMs, closing the gap between few-shot in-context learning and full fine-tuning while scaling to arbitrary amounts of labeled data with minimal training overhead.

Paper overview
Figure 1. Overview of the Prompt-Augmented Linear Probing (PALP) framework. Instead of feeding demonstrations into the context window (ICL) or training on raw representations (linear probing), PALP first transforms inputs with task-specific prompts, then trains a linear classifier on the resulting enriched hidden states.

Background & Motivation

Large language models (LLMs) have shown remarkable in-context learning (ICL) ability -- performing tasks with just a handful of demonstrations placed in the input prompt. However, ICL faces a fundamental scaling bottleneck: the fixed context window limits the number of demonstrations, and performance saturates or even degrades as more examples are added. Meanwhile, linear probing -- training a simple linear classifier on frozen LLM representations -- can leverage unlimited labeled data but often underperforms ICL because it extracts representations from raw inputs that lack task-specific conditioning.

The Core Dilemma:

  • In-Context Learning: Strong few-shot performance thanks to prompt conditioning, but cannot scale beyond the context window (typically 2K--4K tokens at the time). Adding more demonstrations beyond this limit leads to performance degradation.
  • Linear Probing: Can use all available labeled data with no context-length constraint, but representations extracted from raw (unprompted) inputs are suboptimal for downstream tasks, leading to weaker performance.
  • Fine-Tuning: Achieves the best performance by updating all model parameters, but is computationally expensive and requires full white-box access to the model -- impractical for many black-box API scenarios.

This paper asks: can we get the best of both worlds? By augmenting the inputs with task-specific prompts before extracting representations, PALP enables linear probing to rival ICL and approach fine-tuning performance -- all while treating the LLM as a frozen, black-box feature extractor.

A key empirical observation motivates PALP: when examining the hidden-state geometry of GPT-style models, the authors find that representations extracted from prompted inputs occupy a distinct, task-aligned subspace compared to representations from raw inputs. This geometric separation explains why prompt-augmented features are far more linearly separable -- and why a simple linear classifier suffices to achieve strong performance on top of them.

Proposed Method: Prompt-Augmented Linear Probing (PALP)

PALP bridges the gap between prompting and probing through a simple yet effective three-component framework that uses language models as black-box feature extractors:

1. Prompt-Augmented Representation Extraction
Instead of feeding raw inputs to the LLM, PALP prepends task-specific prompts -- including instruction templates and verbalizers -- to each input before passing it through the frozen model. For example, for sentiment analysis, the input "This movie was great" is transformed into "Review: This movie was great. Sentiment:" before being fed to the LLM. The hidden state at the final token position is then extracted as the representation. This transforms generic hidden states into task-aware representations that encode the model's understanding of the target task. The key insight is that prompts guide the LLM to produce representations in a task-relevant subspace, making them far more discriminative for downstream classification.
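The transformation step can be sketched in a few lines. This is a minimal illustration, not the paper's exact templates: the `augment_with_prompt` helper and its default template string are hypothetical stand-ins for the hand-written instruction templates and verbalizers the authors use.

```python
def augment_with_prompt(text: str, template: str = "Review: {x} Sentiment:") -> str:
    """Wrap a raw input in a task-specific prompt before feature extraction.

    The sentiment-analysis template here is a hypothetical example;
    PALP uses several hand-crafted templates per task.
    """
    return template.format(x=text)

# The prompted string, not the raw text, is passed through the frozen LLM;
# the hidden state at the final token position serves as the feature vector.
prompted = augment_with_prompt("This movie was great.")
print(prompted)  # Review: This movie was great. Sentiment:
```

Because the transformation is pure string manipulation, it works identically whether the LLM is loaded locally or only reachable through an API that returns hidden states.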
2. Linear Classification on Augmented Features
A lightweight linear classifier (single-layer logistic regression) is trained on the prompt-augmented hidden states using all available labeled data. Specifically, the extracted hidden states h are mapped to class probabilities via a weight matrix W and bias b: p(y|x) = softmax(Wh + b). Since classification happens outside the context window, PALP is free from the input-length bottleneck of ICL. Training requires only the extraction of fixed-size hidden states and a simple linear optimization -- orders of magnitude cheaper than fine-tuning the entire LLM. The number of trainable parameters is merely (d x C + C), where d is the hidden dimension and C is the number of classes -- compared to billions of parameters in the base LLM.
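The classifier itself is ordinary multinomial logistic regression over the extracted features. A minimal sketch of the forward pass, with toy numbers standing in for real hidden states (the weights and dimensions below are illustrative, not from the paper):

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def linear_probe(h, W, b):
    """p(y|x) = softmax(W h + b) on a prompt-augmented hidden state h.

    W is a C x d matrix and b a length-C bias, so only (d*C + C)
    parameters are trained; the LLM itself stays frozen.
    """
    logits = [sum(w_i * h_i for w_i, h_i in zip(row, h)) + b_c
              for row, b_c in zip(W, b)]
    return softmax(logits)

# Toy example: d = 3 hidden dimensions, C = 2 classes.
h = [0.5, -1.0, 2.0]
W = [[0.1, 0.0, 0.3], [-0.2, 0.4, 0.0]]
b = [0.0, 0.1]
probs = linear_probe(h, W, b)
num_params = len(W) * len(W[0]) + len(b)  # d*C + C = 8 trainable parameters
```

In practice the hidden states are extracted once and cached, after which training the probe (e.g. with scikit-learn's `LogisticRegression`) takes seconds even on a CPU.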
3. Multi-Prompt Ensembling
Different prompt templates produce complementary representations that capture different aspects of task knowledge. PALP generates predictions from multiple diverse prompts (e.g., varying instruction wordings, different verbalizer choices) and ensembles them via majority voting or probability averaging to improve accuracy and reduce variance. This also mitigates sensitivity to individual prompt choices -- a well-known fragility of prompt-based methods. The ensemble strategy is particularly effective because each prompt template steers the LLM to encode slightly different task-relevant features, and combining them yields a more robust decision boundary.
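Both ensembling strategies mentioned above are simple to implement. A sketch, with hypothetical per-prompt outputs (the label names and probability values are illustrative only):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine hard labels from probes trained on different prompt templates."""
    return Counter(predictions).most_common(1)[0][0]

def average_probs(prob_lists):
    """Average class-probability vectors across prompt-specific probes."""
    n = len(prob_lists)
    num_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n for c in range(num_classes)]

# Hypothetical outputs from three prompt variants for a single input:
votes = ["positive", "positive", "negative"]
probs = [[0.9, 0.1], [0.6, 0.4], [0.4, 0.6]]

label = majority_vote(votes)          # "positive"
avg = average_probs(probs)            # class 0 wins on averaged probability
```

Probability averaging is generally preferable when the probes are well calibrated, since it retains confidence information that hard voting discards.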

Why does prompting help probing? The authors provide a geometric analysis showing that prompt augmentation causes the LLM's internal representations to shift into a task-specific subspace. In this subspace, examples from different classes become more linearly separable. Without prompting, representations of different classes overlap significantly in the high-dimensional hidden space, making linear classification difficult. With task-specific prompts, the LLM effectively "pre-processes" the input into a representation that already encodes task-relevant distinctions, dramatically simplifying the classification problem.

Notably, PALP operates entirely in a black-box setting: it requires only forward-pass access to the LLM's hidden states, making it compatible with API-based models where gradient-based fine-tuning is impossible. Unlike methods such as prompt tuning or adapter training, PALP does not backpropagate through the LLM at all.

Experimental Results

PALP is evaluated across 13 NLU benchmarks spanning diverse task types: sentiment analysis (SST-2, SST-5, MR, CR, Amazon), natural language inference (RTE, CB), topic classification (AGNews, DBPedia), subjectivity detection (Subj, MPQA), and question classification (TREC). Experiments use GPT-style autoregressive language models across multiple scales (GPT-2 Large 774M, GPT-J 6B, and others), compared against standard ICL, vanilla linear probing, and full fine-tuning baselines.

PALP vs. ICL Scaling Behavior

Method                  | Few-shot (k=4) | Mid-range (k=32) | Full Data           | Scales with Data?
In-Context Learning     | Competitive    | Saturates        | N/A (context limit) | No
Vanilla Linear Probing  | Weak           | Moderate         | Moderate            | Yes, but plateaus
PALP                    | Competitive    | Strong           | Near fine-tuning    | Yes, consistently
Full Fine-Tuning        | Overfits       | Strong           | Best                | Yes

Performance Comparison on Selected Benchmarks (GPT-J 6B, Full Data)

Method                  | SST-2 | AGNews | DBPedia | RTE  | Avg.
Zero-shot ICL           | 82.0  | 71.2   | 64.5    | 52.7 | 67.6
Few-shot ICL (k=4)      | 91.5  | 80.3   | 78.8    | 57.4 | 77.0
Vanilla Linear Probing  | 83.7  | 85.6   | 93.2    | 55.2 | 79.4
PALP (single prompt)    | 92.8  | 89.4   | 96.1    | 63.5 | 85.5
PALP (ensemble)         | 93.5  | 90.7   | 97.0    | 65.3 | 86.6
Full Fine-Tuning        | 95.0  | 92.5   | 98.8    | 72.6 | 89.7

Key Findings

Why It Matters

PALP demonstrates that the power of prompting and the scalability of classical probing are not mutually exclusive -- an insight with significant practical implications for deploying frozen or API-only LLMs.
