One-Line Summary
A systematic enhanced prompting framework that significantly improves LLMs' multi-label classification performance by jointly optimizing prompt structure, demonstration selection, and label verbalization strategies for in-context learning.
Background & Motivation
Multi-label classification — where a single input can be assigned multiple labels simultaneously — is a common requirement in real-world NLP tasks such as topic tagging, emotion detection, and intent recognition. While large language models (LLMs) have shown strong performance on single-label tasks via in-context learning (ICL), their effectiveness on multi-label scenarios remains limited. This gap is particularly problematic because multi-label classification is arguably more prevalent in practice: a news article may cover politics, economy, and technology simultaneously; a customer review may express satisfaction, frustration, and surprise at the same time.
Key Challenges in Multi-Label ICL:
- Single-label bias: Standard prompting formats implicitly encourage models to predict a single label, making it difficult for LLMs to output multiple labels simultaneously. Most existing ICL research targets single-label classification, reinforcing this bias in prompt design conventions.
- Combinatorial label space: With n possible labels, the number of valid label subsets grows exponentially (2^n), making it hard for few-shot demonstrations to cover the space of possible label combinations. For a task with just 10 labels, there are 2^10 = 1,024 possible label subsets.
- Demonstration inefficiency: Random selection of in-context examples often results in skewed label distributions that fail to represent the multi-label nature of the task. Rare labels and uncommon label co-occurrences are systematically underrepresented.
- Ambiguous label semantics: Without clear label descriptions, models may confuse semantically similar labels (e.g., "anger" vs. "frustration") or miss co-occurring label patterns that are frequent in the data.
These challenges motivate the need for a principled prompting framework specifically designed for multi-label classification, one that addresses the structural, exemplar, and semantic dimensions of prompt design simultaneously. Unlike prior work that addresses these factors in isolation, this study investigates all three dimensions and their interactions within a unified framework.
Proposed Method
The proposed framework enhances in-context learning for multi-label classification through three complementary strategies that can be applied independently or in combination. Each strategy targets a distinct source of difficulty in multi-label ICL, and together they provide a comprehensive solution.
1. Prompt Restructuring
Redesigns the prompt format to explicitly convey the multi-label nature of the task. Instead of a standard single-answer format, the prompt instructs the model to output a set of labels and uses structured formatting (e.g., comma-separated lists or enumerated outputs) that naturally encourages multiple predictions. Task instructions are augmented with explicit cues such as "select all applicable labels" to overcome single-label bias. The restructured prompt also includes the complete label inventory upfront, making the full set of available labels visible to the model before it encounters any demonstrations.
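To make the restructuring concrete, here is a minimal sketch of how such a prompt could be assembled. The instruction wording, delimiter choice, and label inventory are illustrative assumptions, not the paper's exact template.

```python
# Hypothetical label inventory for illustration only.
LABELS = ["politics", "economy", "technology", "sports", "health"]

def build_prompt(demonstrations, query_text):
    """Assemble a prompt that states the task is multi-label, lists every
    candidate label up front, and asks for a comma-separated set of labels."""
    lines = [
        "Classify the text. Select ALL applicable labels from the list below.",
        "Candidate labels: " + ", ".join(LABELS),
        "Answer with a comma-separated list of labels.",
        "",
    ]
    for demo_text, demo_labels in demonstrations:
        lines.append(f"Text: {demo_text}")
        lines.append(f"Labels: {', '.join(demo_labels)}")
        lines.append("")
    lines.append(f"Text: {query_text}")
    lines.append("Labels:")
    return "\n".join(lines)

demos = [("Parliament passed a new budget boosting chip manufacturing.",
          ["politics", "economy", "technology"])]
print(build_prompt(demos, "The central bank raised rates amid election season."))
```

Note how the full label inventory and the "select ALL applicable labels" cue appear before any demonstrations, mirroring the restructuring described above.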
2. Demonstration Selection
Proposes diversity-aware strategies for selecting in-context examples that maximize coverage of the label space. Rather than random sampling, demonstrations are chosen to ensure (a) a variety of label cardinalities (examples with different numbers of assigned labels), and (b) balanced representation across individual labels and label co-occurrence patterns. The selection algorithm prioritizes examples that introduce new labels or label combinations not yet covered by previously selected demonstrations, greedily building a maximally informative example set.
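The following is a minimal sketch of a greedy, diversity-aware selector over a candidate pool of (text, label set) pairs. The scoring heuristic (rewarding new labels, new label co-occurrence pairs, and unseen cardinalities) is one plausible reading of the strategy, not the paper's exact algorithm.

```python
from itertools import combinations

def select_demonstrations(pool, k):
    """Greedily pick k examples that maximize coverage of labels,
    label co-occurrence pairs, and label-count (cardinality) variety."""
    selected = []
    covered_labels, covered_pairs, covered_cards = set(), set(), set()

    def gain(labels):
        labels = set(labels)
        pairs = set(combinations(sorted(labels), 2))
        return (len(labels - covered_labels)          # new individual labels
                + len(pairs - covered_pairs)          # new co-occurrence pairs
                + (0 if len(labels) in covered_cards else 1))  # new cardinality

    remaining = list(pool)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda ex: gain(ex[1]))
        remaining.remove(best)
        selected.append(best)
        labels = set(best[1])
        covered_labels |= labels
        covered_pairs |= set(combinations(sorted(labels), 2))
        covered_cards.add(len(labels))
    return selected

pool = [
    ("Stocks fell after the policy vote.", ["economy", "politics"]),
    ("New GPU released this week.", ["technology"]),
    ("Hospital funding bill debated.", ["politics", "health"]),
    ("Startup IPO surges on AI chips.", ["economy", "technology"]),
]
print(select_demonstrations(pool, k=2))
```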
3. Label Verbalization
Enhances how label information is presented within the prompt by providing natural language descriptions or definitions alongside label names. This helps the model better distinguish between semantically similar labels and understand the scope of each category, leading to more accurate multi-label predictions. Three verbalization strategies are compared: (a) label name only, (b) name + natural language definition describing the label's scope, and (c) name + illustrative example sentence for each label. The definition-based approach generally performs best, as it provides the clearest semantic boundaries between related labels.
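A minimal sketch of the three verbalization variants follows; the label definitions and example sentences are hypothetical placeholders used only to show the format.

```python
# Hypothetical label descriptions for illustration.
LABEL_INFO = {
    "anger":       {"definition": "Strong displeasure directed at a person or situation.",
                    "example": "I can't believe they cancelled my order again!"},
    "frustration": {"definition": "Feeling blocked or hindered from reaching a goal.",
                    "example": "I've reset the router five times and it still drops."},
}

def verbalize(labels, strategy="definition"):
    """Render the label inventory block for the prompt.
    strategy: 'name' | 'definition' | 'example'."""
    lines = []
    for name, info in labels.items():
        if strategy == "name":
            lines.append(f"- {name}")
        elif strategy == "definition":   # the variant reported to work best
            lines.append(f"- {name}: {info['definition']}")
        elif strategy == "example":
            lines.append(f'- {name} (e.g., "{info["example"]}")')
    return "\n".join(lines)

print(verbalize(LABEL_INFO, strategy="definition"))
```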
Strategy Interaction Overview
| Strategy | Target Problem | Mechanism |
| --- | --- | --- |
| Prompt Restructuring | Single-label bias | Explicit multi-label instructions, structured output format, full label inventory |
| Demonstration Selection | Label space coverage | Diversity-aware greedy selection maximizing label and cardinality variety |
| Label Verbalization | Label semantic ambiguity | Natural language definitions clarifying boundaries between similar labels |
Experimental Results
The proposed enhanced prompting strategies were evaluated on multiple multi-label classification benchmarks, comparing against naïve prompting baselines across different LLM architectures. Evaluation metrics include micro-F1, macro-F1, and subset accuracy, capturing different aspects of multi-label prediction quality.
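For reference, all three metrics can be computed with scikit-learn on multi-label indicator arrays; the labels and predictions below are made-up placeholders, not results from the study.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score, accuracy_score

labels = ["politics", "economy", "technology"]
y_true = [{"politics", "economy"}, {"technology"}, {"economy", "technology"}]
y_pred = [{"politics"},            {"technology"}, {"economy", "technology"}]

mlb = MultiLabelBinarizer(classes=labels)
T, P = mlb.fit_transform(y_true), mlb.transform(y_pred)

print("micro-F1:       ", f1_score(T, P, average="micro"))
print("macro-F1:       ", f1_score(T, P, average="macro"))
# For multi-label indicator arrays, accuracy_score is exact-match (subset) accuracy.
print("subset accuracy:", accuracy_score(T, P))
```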
Impact of Individual Strategies
| Strategy | Primary Effect | Secondary Effect |
| --- | --- | --- |
| Prompt Restructuring | Reduces single-label bias; increases average number of predicted labels to better match ground truth | Improves recall by encouraging the model to output more labels per instance |
| Diversity-Aware Demonstration Selection | Significantly outperforms random selection by improving label coverage and co-occurrence awareness | Stabilizes predictions across runs by reducing sensitivity to specific example choices |
| Label Verbalization | Improves discrimination between similar labels; reduces label confusion errors | Reduces false positives for semantically adjacent label pairs |
| Combined Framework | Achieves the best results when all three strategies are applied together | Improvements are largely additive across strategies |
Detailed Analysis
Error Analysis by Category:
- Missing labels (false negatives): The most common error type with naïve prompting. Prompt restructuring reduces this error by 30-40% on average by explicitly cueing multi-label output.
- Label confusion (false positives): The second most common error, where the model predicts a semantically related but incorrect label. Label verbalization is most effective at reducing this error type.
- Cardinality mismatch: Naïve prompts tend to under-predict label count. Diversity-aware demonstration selection calibrates the model's expectations about how many labels to assign.
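These three error categories can be tallied directly from gold and predicted label sets. The sketch below shows one way such a breakdown could be reproduced; the categorization logic and example data are illustrative assumptions, not the paper's reported analysis.

```python
def error_breakdown(y_true, y_pred):
    """Count missing labels (false negatives), spurious labels (false positives),
    and instances whose predicted label count differs from the gold count."""
    missing = spurious = cardinality_mismatch = 0
    for gold, pred in zip(y_true, y_pred):
        missing += len(gold - pred)                      # labels the model failed to predict
        spurious += len(pred - gold)                     # labels predicted but not in gold
        cardinality_mismatch += int(len(pred) != len(gold))
    return {"missing_labels": missing,
            "spurious_labels": spurious,
            "cardinality_mismatches": cardinality_mismatch}

y_true = [{"politics", "economy"}, {"technology"}, {"economy", "technology"}]
y_pred = [{"politics"},            {"technology"}, {"economy"}]
print(error_breakdown(y_true, y_pred))
```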
Key Findings:
- Consistent improvements: The enhanced prompting strategies consistently outperform naïve prompting baselines across all tested multi-label classification benchmarks, with the combined framework showing the largest gains.
- Demonstration diversity is critical: Carefully selected demonstrations with diverse label combinations yield significantly better predictions than randomly chosen examples, confirming that label coverage in few-shot examples is a key factor for multi-label ICL.
- No fine-tuning required: The approach is effective across different LLM architectures without any parameter updates, making it immediately applicable to off-the-shelf models in production settings.
- Complementary strategies: Each of the three strategies addresses a different aspect of multi-label ICL difficulty, and their combination produces additive improvements, suggesting minimal redundancy between the approaches.
- Label verbalization helps ambiguous cases: Performance gains from verbalization are especially pronounced on datasets with large, semantically overlapping label sets, where disambiguation is most needed.
- Recall vs. precision trade-off: Prompt restructuring primarily boosts recall (more labels predicted), while label verbalization primarily boosts precision (fewer incorrect labels). The combined framework achieves a favorable balance on both axes.
- Scalability with label count: The benefit of the enhanced framework grows as the number of candidate labels increases, making it especially valuable for real-world tasks with large label taxonomies.
Why It Matters
This work demonstrates that thoughtful prompt engineering can substantially close the performance gap between LLMs and supervised models on multi-label tasks, and provides actionable insights for practitioners:
- Practical prompting framework: The three-component framework (restructuring, demonstration selection, verbalization) offers a systematic approach that practitioners can adopt for any multi-label classification task using LLMs, without requiring specialized training data or model modifications.
- Broadens ICL applicability: By extending in-context learning from single-label to multi-label settings, this work expands the range of real-world classification problems that can be tackled without fine-tuning, including document tagging, multi-aspect sentiment analysis, and clinical coding.
- Design principles for multi-label prompts: The findings highlight concrete principles — such as the importance of label diversity in demonstrations and explicit multi-label cues in instructions — that generalize beyond specific datasets or models.
- Cost-effective deployment: Since the framework operates purely at the prompt level with no parameter updates, it can be applied to any API-accessible LLM, reducing the barrier to deploying multi-label classification systems in resource-constrained settings.