Aligning Language Models to Explicitly Handle Ambiguity
EMNLP 2024
Hyuhng Joon Kim, Youna Kim, Cheonbok Park, Junyeob Kim, Choonghyun Park, Kang Min Yoo, Sang-goo Lee, Taeuk Kim
One-Line Summary
Alignment with Perceived Ambiguity (APA) teaches language models to explicitly detect and handle ambiguous queries by leveraging each model's own assessment of ambiguity, outperforming gold-standard label training especially in out-of-distribution scenarios.
Figure 1. An example of an ambiguous query from AmbigQA. The phrase "national championship" admits diverse denotations, causing ambiguity. A model with relevant background knowledge may perceive the query as ambiguous (left), while a model without sufficient knowledge may not (right).
Figure 2. The overall four-stage alignment pipeline. Samples that the model cannot explicitly handle are filtered (Stage 1), self-disambiguated to measure information gain (Stage 2), and those with high information gain are used for supervised fine-tuning (Stages 3 & 4).
Background & Motivation
In interactions between users and language model agents, user utterances frequently exhibit ellipsis (omission of words or phrases) or imprecision (lack of exactness) to prioritize efficiency. For instance, a question like "Who won the national championship?" can refer to many different championships across different sports and years, leading to varying interpretations based on different assumptions or background knowledge. Despite this, large language models (LLMs) typically pick a single interpretation and answer confidently, ignoring the inherent ambiguity.
Limitation 1 — No Explicit Training: Existing LLMs are not explicitly trained to deal with ambiguous utterances. They tend to produce a single answer even when a question is genuinely ambiguous, failing to surface alternative valid interpretations that the user may have intended.
Limitation 2 — Model-Dependent Ambiguity: The degree of perceived ambiguity is model-dependent — a model with broader knowledge recognizes more possible interpretations than one with limited knowledge. Using fixed gold-standard ambiguity labels for all models ignores this fundamental variation, leading to suboptimal alignment.
Key Insight: Rather than relying on externally annotated ambiguity labels, alignment should be tailored to each model's own knowledge boundary. A model should learn to flag ambiguity precisely at the frontier of its own knowledge, not at some universal threshold.
Proposed Method: Alignment with Perceived Ambiguity (APA)
APA is a four-stage alignment pipeline that teaches an LLM to detect ambiguity and respond with disambiguating clarifications, using the model's own perception of ambiguity rather than external gold labels:
Stage 1: Explicit Prediction & Filtering
The model processes all samples: predictions that already handle ambiguity correctly form Dcorrect, while incorrectly handled ones (where the model gives a single answer despite genuine ambiguity) form Dincorrect. This isolates the gap between the clarifications the model should produce and its current behavior.
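The filtering step can be sketched as a simple partition. This is an illustrative sketch, not the paper's implementation: the field names (`is_ambiguous`, `asked_clarification`) and the helper itself are hypothetical, and it assumes each prediction records whether the model flagged ambiguity.

```python
def split_by_explicit_handling(samples, model_answers):
    """Partition samples into Dcorrect / Dincorrect.

    A sample is handled correctly when the model's decision to ask for
    clarification matches the sample's gold ambiguity flag; ambiguous
    queries answered with a single confident guess land in Dincorrect.
    """
    d_correct, d_incorrect = [], []
    for sample, answer in zip(samples, model_answers):
        handled = answer["asked_clarification"] == sample["is_ambiguous"]
        (d_correct if handled else d_incorrect).append(sample)
    return d_correct, d_incorrect
```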
Stage 2: Self-Disambiguation & Information Gain
For each sample in Dincorrect, the model generates its own disambiguations. Information gain is computed as the token-level entropy reduction, InfoGain = H(x) − H(x_disambig); samples whose gain exceeds a threshold (ε = 0.1) are classified as truly ambiguous from the model's perspective.
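A minimal sketch of the information-gain criterion, assuming access to the model's per-token next-token distributions before and after disambiguation (how entropy is aggregated over tokens is an assumption here; the paper's exact formulation may differ):

```python
import math

def token_entropy(token_dists):
    """Mean Shannon entropy (nats) over a sequence of next-token distributions."""
    ents = [-sum(p * math.log(p) for p in dist if p > 0) for dist in token_dists]
    return sum(ents) / len(ents)

def info_gain(dists_original, dists_disambiguated):
    """InfoGain = H(x) - H(x_disambig): entropy drop after self-disambiguation."""
    return token_entropy(dists_original) - token_entropy(dists_disambiguated)

EPSILON = 0.1  # threshold used in the paper

def is_perceived_ambiguous(dists_original, dists_disambiguated, eps=EPSILON):
    """A sample counts as ambiguous (to this model) if disambiguation
    reduces its token-level uncertainty by more than eps."""
    return info_gain(dists_original, dists_disambiguated) > eps
```

The intuition: if rewriting the query to pin down one interpretation makes the model markedly more certain about its answer tokens, the original query was ambiguous from that model's perspective.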
Stage 3: Data Construction & Supervised Fine-Tuning (SFT)
Perceived-ambiguous samples receive clarification labels (e.g., "The question is ambiguous because...") and are balanced with Dcorrect to form the training dataset D. The model is then fine-tuned with standard next-token prediction on this balanced dataset.
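The dataset construction can be sketched as follows. The label template, field names, and the equal-size subsampling used for balancing are all assumptions for illustration; the paper's exact balancing recipe may differ:

```python
import random

# Hypothetical abbreviation of the clarification label format
CLARIFY_TEMPLATE = "The question is ambiguous because {reason}"

def build_sft_dataset(ambiguous_samples, d_correct, seed=0):
    """Attach clarification targets to perceived-ambiguous samples and
    balance them with an equal-sized subsample of correctly handled ones."""
    rng = random.Random(seed)
    labeled = [
        {"query": s["query"],
         "target": CLARIFY_TEMPLATE.format(reason=s["disambiguation"])}
        for s in ambiguous_samples
    ]
    k = min(len(labeled), len(d_correct))
    kept = [{"query": s["query"], "target": s["answer"]}
            for s in rng.sample(d_correct, k)]
    data = labeled + kept
    rng.shuffle(data)
    return data
```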
Stage 4: Preference Optimization
The model is further aligned with preference-based training (DPO), reinforcing explicit ambiguity handling over single-answer responses so that it consistently surfaces multiple interpretations when appropriate.
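For the DPO stage, preference pairs can be assembled in the standard prompt/chosen/rejected format; the field names below are hypothetical, and the pairing (clarification preferred over the model's original single answer) is an assumption consistent with the stage's goal:

```python
def build_dpo_pairs(ambiguous_samples):
    """One preference pair per perceived-ambiguous query: the clarifying
    response is preferred over the original single-answer response."""
    return [
        {"prompt": s["query"],
         "chosen": s["clarification"],    # surfaces the interpretations
         "rejected": s["single_answer"]}  # confident single guess
        for s in ambiguous_samples
    ]
```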
Experimental Results
APA is evaluated on four question-answering datasets using Llama-2 (7B, 13B) and Mistral (7B) as base models. Models are trained on AmbigQA and tested both in-distribution and out-of-distribution (SituatedQA, AmbigTriviaQA). Baselines include inference-only methods (naive prompting, ambiguity-aware instructions, sample repetition, self-ask) and trained methods (honesty-tuned, full-set, random subset).
| Dataset | Metric | Llama-2 7B | Mistral 7B | Llama-2 13B |
|---|---|---|---|---|
| AmbigQA | Unambig. Acc. | 27.23% | 37.23% | 37.83% |
| AmbigQA | Ambig. F1 | 63.69 | 50.31 | 58.15 |
| SituatedQA (Geo) | Unambig. Acc. | 24.51% | 32.21% | 24.51% |
| SituatedQA (Geo) | Ambig. F1 | 42.05 | 42.18 | 41.59 |
| SituatedQA (Temp) | Unambig. Acc. | 21.90% | 35.74% | 24.36% |
| SituatedQA (Temp) | Ambig. F1 | 40.77 | 40.17 | 41.09 |
| AmbigTriviaQA | Unambig. Acc. | 53.41% | 58.14% | 63.74% |
| AmbigTriviaQA | Ambig. F1 | 61.34 | 58.93 | 55.23 |
Outperforms Gold Labels on OOD: APA surpasses the full-set baseline (trained on all gold-standard labels) by up to 17 F1 points on out-of-distribution datasets, despite using only ~32% (Llama-2) or ~13% (Mistral) of the training data
Information Gain > Random Selection: Ablation studies show that information gain-based sample selection outperforms random subset selection by ~10 F1 points on OOD datasets, validating the model-aware curation strategy
Balanced Alignment: APA achieves the best Overall Alignment Performance (23.87% on AmbigTriviaQA), combining a high Valid Alignment Rate with a low Misaligned Clarification Rate — meaning the model avoids false ambiguity signals on clear questions
Maintained Clarity: On unambiguous questions, aligned models continue to provide direct, single answers without unnecessary hedging, preserving normal QA performance
Consistent Across Scales: The approach works effectively across different model families (Llama-2, Mistral) and scales (7B, 13B), demonstrating broad applicability
Why It Matters
Real-world user queries are often underspecified or ambiguous, yet most LLMs are trained to give a single decisive answer. This work provides a principled framework for teaching models when to ask for clarification and how to present multiple valid interpretations. Three aspects make APA particularly significant:
Model-Aware Alignment: By grounding the alignment process in each model's own knowledge, APA avoids both over-hedging (flagging clear questions as ambiguous) and under-hedging (ignoring genuine ambiguity), making LLM interactions more trustworthy
Data Efficiency: APA achieves superior OOD performance using only a fraction of available training data, demonstrating that intelligent sample selection outperforms brute-force training on all data
Practical Reliability: As LLMs are increasingly deployed in customer-facing applications (search, assistants, chatbots), the ability to explicitly handle ambiguity rather than silently guessing is critical for user trust and safety