
Enhanced Prompting for Multi-Label Classification

Korea Computer Congress 2024 (KCC 2024)
Jungyeon Lee, Youngwoo Shin, Yejin Yoon, Taeuk Kim

One-Line Summary

An enhanced prompting framework that significantly improves the multi-label classification performance of LLMs by jointly optimizing prompt structure, demonstration selection, and label verbalization for in-context learning.

Background & Motivation

Multi-label classification — where a single input can be assigned multiple labels simultaneously — is a common requirement in real-world NLP tasks such as topic tagging, emotion detection, and intent recognition. While large language models (LLMs) have shown strong performance on single-label tasks via in-context learning (ICL), their effectiveness on multi-label scenarios remains limited. This gap is particularly problematic because multi-label classification is arguably more prevalent in practice: a news article may cover politics, economy, and technology simultaneously; a customer review may express satisfaction, frustration, and surprise at the same time.

Key Challenges in Multi-Label ICL:

  • Single-label bias: Standard prompting formats implicitly encourage models to predict a single label, making it difficult for LLMs to output multiple labels simultaneously. Most existing ICL research targets single-label classification, reinforcing this bias in prompt design conventions.
  • Combinatorial label space: With n possible labels, the number of valid label subsets grows exponentially (2^n), making it hard for few-shot demonstrations to cover the space of possible label combinations. For a task with just 10 labels, there are 1,024 possible label subsets.
  • Demonstration inefficiency: Random selection of in-context examples often results in skewed label distributions that fail to represent the multi-label nature of the task. Rare labels and uncommon label co-occurrences are systematically underrepresented.
  • Ambiguous label semantics: Without clear label descriptions, models may confuse semantically similar labels (e.g., "anger" vs. "frustration") or miss co-occurring label patterns that are frequent in the data.

These challenges motivate the need for a principled prompting framework specifically designed for multi-label classification, one that addresses the structural, exemplar, and semantic dimensions of prompt design simultaneously. Unlike prior work that addresses these factors in isolation, this study investigates all three dimensions and their interactions within a unified framework.

Proposed Method

The proposed framework enhances in-context learning for multi-label classification through three complementary strategies that can be applied independently or in combination. Each strategy targets a distinct source of difficulty in multi-label ICL, and together they provide a comprehensive solution.

1. Prompt Restructuring
Redesigns the prompt format to explicitly convey the multi-label nature of the task. Instead of a standard single-answer format, the prompt instructs the model to output a set of labels and uses structured formatting (e.g., comma-separated lists or enumerated outputs) that naturally encourages multiple predictions. Task instructions are augmented with explicit cues such as "select all applicable labels" to overcome single-label bias. The restructured prompt also includes the complete label inventory upfront, making the full set of available labels visible to the model before it encounters any demonstrations.
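
To make this concrete, below is a minimal sketch of what such a restructured prompt could look like. The label inventory, instruction wording, and demonstration texts are illustrative assumptions, not the paper's exact template.

```python
# Illustrative sketch of a restructured multi-label prompt.
# LABELS, the instruction wording, and the demonstrations are
# assumptions for illustration, not the paper's exact template.

LABELS = ["politics", "economy", "technology", "sports", "culture"]

def build_prompt(demonstrations, query):
    """Assemble a prompt that makes the multi-label nature of the task explicit."""
    lines = [
        # Full label inventory shown upfront, before any demonstrations.
        "Available labels: " + ", ".join(LABELS),
        # Explicit cue that more than one label may apply.
        "Select ALL labels that apply to the text. "
        "Answer with a comma-separated list of labels.",
        "",
    ]
    for text, labels in demonstrations:
        lines.append(f"Text: {text}")
        # Structured, comma-separated outputs encourage multiple predictions.
        lines.append("Labels: " + ", ".join(labels))
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Labels:")
    return "\n".join(lines)

demos = [
    ("The central bank's rate decision shook tech stocks.",
     ["economy", "technology"]),
    ("The election result dominated the evening news.", ["politics"]),
]
print(build_prompt(demos, "New chip export rules spark a diplomatic row."))
```
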
2. Demonstration Selection
Proposes diversity-aware strategies for selecting in-context examples that maximize coverage of the label space. Rather than random sampling, demonstrations are chosen to ensure (a) a variety of label cardinalities (examples with different numbers of assigned labels), and (b) balanced representation across individual labels and label co-occurrence patterns. The selection algorithm prioritizes examples that introduce new labels or label combinations not yet covered by previously selected demonstrations, greedily building a maximally informative example set.
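
The paper's exact selection criterion is not reproduced here; the sketch below is one plausible reading of the greedy procedure described above, where the gain function (rewarding unseen labels, unseen co-occurrence pairs, and unseen cardinalities) is an assumption.

```python
# A minimal sketch of diversity-aware greedy demonstration selection.
# The gain function below (rewarding unseen labels, unseen label
# co-occurrence pairs, and unseen cardinalities) is an assumed
# instantiation of the coverage criteria described in the text.
from itertools import combinations

def greedy_select(pool, k):
    """pool: list of (text, labels) pairs; returns up to k diverse demonstrations."""
    seen_labels, seen_pairs, seen_cards = set(), set(), set()
    selected, candidates = [], list(pool)
    for _ in range(min(k, len(candidates))):
        def gain(example):
            _, labels = example
            pairs = set(combinations(sorted(labels), 2))
            # How much new coverage would this example add?
            return (len(set(labels) - seen_labels)
                    + len(pairs - seen_pairs)
                    + (len(labels) not in seen_cards))
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break  # nothing new left to cover; stop early
        selected.append(best)
        candidates.remove(best)
        seen_labels.update(best[1])
        seen_pairs.update(combinations(sorted(best[1]), 2))
        seen_cards.add(len(best[1]))
    return selected
```
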
3. Label Verbalization
Enhances how label information is presented within the prompt by providing natural language descriptions or definitions alongside label names. This helps the model better distinguish between semantically similar labels and understand the scope of each category, leading to more accurate multi-label predictions. Three verbalization strategies are compared: (a) label name only, (b) name + natural language definition describing the label's scope, and (c) name + illustrative example sentence for each label. The definition-based approach generally performs best, as it provides the clearest semantic boundaries between related labels.
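
The sketch below illustrates the three verbalization variants side by side; the emotion labels, definitions, and example sentences are invented for demonstration and may differ from those used in the paper.

```python
# Illustrative label verbalization variants. The definitions and
# example sentences here are invented for demonstration purposes.
LABEL_INFO = {
    "anger":       ("A strong feeling of displeasure or hostility.",
                    "He slammed the door and shouted at everyone."),
    "frustration": ("Annoyance at being blocked from a goal or outcome.",
                    "The form rejected my input for the fifth time."),
}

def verbalize(label, strategy):
    definition, example = LABEL_INFO[label]
    if strategy == "name":        # (a) label name only
        return label
    if strategy == "definition":  # (b) name + scope definition
        return f"{label}: {definition}"
    if strategy == "example":     # (c) name + illustrative sentence
        return f'{label} (e.g., "{example}")'
    raise ValueError(f"unknown strategy: {strategy}")

for s in ("name", "definition", "example"):
    print(verbalize("frustration", s))
```
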

Strategy Interaction Overview

Strategy | Target Problem | Mechanism
Prompt Restructuring | Single-label bias | Explicit multi-label instructions, structured output format, full label inventory
Demonstration Selection | Label space coverage | Diversity-aware greedy selection maximizing label and cardinality variety
Label Verbalization | Label semantic ambiguity | Natural language definitions clarifying boundaries between similar labels

Experimental Results

The proposed enhanced prompting strategies were evaluated on multiple multi-label classification benchmarks, comparing against naïve prompting baselines across different LLM architectures. Evaluation metrics include micro-F1, macro-F1, and subset accuracy, capturing different aspects of multi-label prediction quality.
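
For reference, all three metrics can be computed from binary label-indicator matrices. The sketch below uses scikit-learn with toy matrices invented for illustration; for multilabel inputs, accuracy_score is exact-match (subset) accuracy.

```python
# Sketch: the three reported metrics, computed from binary indicator
# matrices (rows = instances, columns = labels). The toy matrices are
# invented for illustration.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 1]])

micro_f1 = f1_score(y_true, y_pred, average="micro")  # pools all label decisions
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-label F1
subset_acc = accuracy_score(y_true, y_pred)           # exact-match accuracy for multilabel input
print(micro_f1, macro_f1, subset_acc)
```
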

Impact of Individual Strategies

Strategy | Primary Effect | Secondary Effect
Prompt Restructuring | Reduces single-label bias; increases the average number of predicted labels to better match the ground truth | Improves recall by encouraging the model to output more labels per instance
Diversity-Aware Demonstration Selection | Significantly outperforms random selection by improving label coverage and co-occurrence awareness | Stabilizes predictions across runs by reducing sensitivity to specific example choices
Label Verbalization | Improves discrimination between similar labels; reduces label confusion errors | Reduces false positives for semantically adjacent label pairs
Combined Framework | Achieves the best results when all three strategies are applied together | Improvements are largely additive across strategies

Detailed Analysis

Error Analysis by Category (a small categorization sketch follows the list):

  • Missing labels (false negatives): The most common error type with naïve prompting. Prompt restructuring reduces this error by 30-40% on average by explicitly cueing multi-label output.
  • Label confusion (false positives): The second most common error, where the model predicts a semantically related but incorrect label. Label verbalization is most effective at reducing this error type.
  • Cardinality mismatch: Naïve prompts tend to under-predict label count. Diversity-aware demonstration selection calibrates the model's expectations about how many labels to assign.
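
A minimal sketch of how predictions could be bucketed into these three categories, assuming gold and predicted labels are represented as Python sets; the helper name categorize_errors is hypothetical.

```python
# Sketch: bucketing multi-label prediction errors into the three
# categories above. Gold and predicted labels are Python sets; the
# helper name is hypothetical.
def categorize_errors(gold, pred):
    stats = {"missing": 0, "spurious": 0, "cardinality_mismatch": 0}
    for g, p in zip(gold, pred):
        stats["missing"] += len(g - p)               # false negatives
        stats["spurious"] += len(p - g)              # false positives
        stats["cardinality_mismatch"] += (len(g) != len(p))
    return stats

gold = [{"anger", "frustration"}, {"joy"}]
pred = [{"anger"}, {"joy", "surprise"}]
print(categorize_errors(gold, pred))
# -> {'missing': 1, 'spurious': 1, 'cardinality_mismatch': 2}
```
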

Why It Matters

This work demonstrates that thoughtful prompt engineering can substantially close the performance gap between LLMs and supervised models on multi-label tasks. The actionable insight for practitioners is threefold: make the multi-label nature of the task explicit in the prompt, select demonstrations for label-space coverage, and verbalize label semantics with clear definitions.
