Aligning Language Models to Explicitly Handle Ambiguity
EMNLP 2024
Hyuhng Joon Kim, Youna Kim, Cheonbok Park, Junyeob Kim, Choonghyun Park, Kang Min Yoo, Sang-goo Lee, Taeuk Kim
One-Line Summary
Alignment with Perceived Ambiguity (APA) teaches language models to explicitly detect and handle ambiguous queries by leveraging each model's own assessment of ambiguity, outperforming gold-standard label training especially in out-of-distribution scenarios.
Figure 1. An example of an ambiguous query from AmbigQA. The phrase "national championship" admits diverse denotations, causing ambiguity. A model with relevant background knowledge may perceive the query as ambiguous (left), while a model without sufficient knowledge may not (right).
Figure 2. The overall four-stage alignment pipeline. Samples that the model cannot explicitly handle are filtered (Stage 1), self-disambiguated to measure information gain (Stage 2), and those with high information gain are used for supervised fine-tuning (Stages 3 & 4).
Background & Motivation
In interactions between users and language model agents, user utterances frequently exhibit ellipsis (omission of words or phrases) or imprecision (lack of exactness) to prioritize efficiency. For instance, a question like "Who won the national championship?" can refer to many different championships across different sports and years, leading to varying interpretations based on different assumptions or background knowledge. Despite this, large language models (LLMs) typically pick a single interpretation and answer confidently, ignoring the inherent ambiguity.
Limitation 1 — No Explicit Training: Existing LLMs are not explicitly trained to deal with ambiguous utterances. They tend to produce a single answer even when a question is genuinely ambiguous, failing to surface alternative valid interpretations that the user may have intended.
Limitation 2 — Model-Dependent Ambiguity: The degree of perceived ambiguity is model-dependent — a model with broader knowledge recognizes more possible interpretations than one with limited knowledge. Using fixed gold-standard ambiguity labels for all models ignores this fundamental variation, leading to suboptimal alignment.
Key Insight: Rather than relying on externally annotated ambiguity labels, alignment should be tailored to each model's own knowledge boundary. A model should learn to flag ambiguity precisely at the frontier of its own knowledge, not at some universal threshold.
Proposed Method: Alignment with Perceived Ambiguity (APA)
APA is a four-stage alignment pipeline that teaches an LLM to detect ambiguity and respond with disambiguating clarifications, using the model's own perception of ambiguity rather than external gold labels:
Stage 1: Explicit Prediction & Filtering
The model processes all samples: predictions that already handle ambiguity correctly form Dcorrect, while incorrectly handled ones (where the model gives a single answer despite genuine ambiguity) form Dincorrect. This isolates the gap between the clarifications the model should produce and its current behavior.
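The filtering step can be sketched as a simple partition. This is an illustrative sketch, not the paper's implementation: the field names (`is_ambiguous`, `asked_clarification`) and the helper itself are hypothetical, and it assumes each prediction records whether the model flagged ambiguity.

```python
def split_by_explicit_handling(samples, model_answers):
    """Partition samples into Dcorrect / Dincorrect.

    A sample is handled correctly when the model's decision to ask for
    clarification matches the sample's gold ambiguity flag; ambiguous
    queries answered with a single confident guess land in Dincorrect.
    """
    d_correct, d_incorrect = [], []
    for sample, answer in zip(samples, model_answers):
        handled = answer["asked_clarification"] == sample["is_ambiguous"]
        (d_correct if handled else d_incorrect).append(sample)
    return d_correct, d_incorrect
```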
Stage 2: Self-Disambiguation & Information Gain
For each sample in Dincorrect, the model generates its own disambiguations. Information gain is computed as the token-level entropy reduction, InfoGain = H(x) − H(x_disambig); samples whose gain exceeds a threshold (ε = 0.1) are classified as truly ambiguous from the model's perspective.
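A minimal sketch of the information-gain criterion, assuming access to the model's per-token next-token distributions before and after disambiguation (how entropy is aggregated over tokens is an assumption here; the paper's exact formulation may differ):

```python
import math

def token_entropy(token_dists):
    """Mean Shannon entropy (nats) over a sequence of next-token distributions."""
    ents = [-sum(p * math.log(p) for p in dist if p > 0) for dist in token_dists]
    return sum(ents) / len(ents)

def info_gain(dists_original, dists_disambiguated):
    """InfoGain = H(x) - H(x_disambig): entropy drop after self-disambiguation."""
    return token_entropy(dists_original) - token_entropy(dists_disambiguated)

EPSILON = 0.1  # threshold used in the paper

def is_perceived_ambiguous(dists_original, dists_disambiguated, eps=EPSILON):
    """A sample counts as ambiguous (to this model) if disambiguation
    reduces its token-level uncertainty by more than eps."""
    return info_gain(dists_original, dists_disambiguated) > eps
```

The intuition: if rewriting the query to pin down one interpretation makes the model markedly more certain about its answer tokens, the original query was ambiguous from that model's perspective.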
Stage 3: Data Construction & Supervised Fine-Tuning (SFT)
Perceived-ambiguous samples receive clarification labels (e.g., "The question is ambiguous because...") and are balanced with Dcorrect to form the training dataset D. The model is then fine-tuned with standard next-token prediction on this balanced dataset.
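The dataset construction can be sketched as follows. The label template, field names, and the equal-size subsampling used for balancing are all assumptions for illustration; the paper's exact balancing recipe may differ:

```python
import random

# Hypothetical abbreviation of the clarification label format
CLARIFY_TEMPLATE = "The question is ambiguous because {reason}"

def build_sft_dataset(ambiguous_samples, d_correct, seed=0):
    """Attach clarification targets to perceived-ambiguous samples and
    balance them with an equal-sized subsample of correctly handled ones."""
    rng = random.Random(seed)
    labeled = [
        {"query": s["query"],
         "target": CLARIFY_TEMPLATE.format(reason=s["disambiguation"])}
        for s in ambiguous_samples
    ]
    k = min(len(labeled), len(d_correct))
    kept = [{"query": s["query"], "target": s["answer"]}
            for s in rng.sample(d_correct, k)]
    data = labeled + kept
    rng.shuffle(data)
    return data
```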
Stage 4: Preference Optimization
The model is further aligned with preference-based training (DPO), reinforcing explicit ambiguity handling over single-answer responses so that it consistently surfaces multiple interpretations when appropriate.
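For the DPO stage, preference pairs can be assembled in the standard prompt/chosen/rejected format; the field names below are hypothetical, and the pairing (clarification preferred over the model's original single answer) is an assumption consistent with the stage's goal:

```python
def build_dpo_pairs(ambiguous_samples):
    """One preference pair per perceived-ambiguous query: the clarifying
    response is preferred over the original single-answer response."""
    return [
        {"prompt": s["query"],
         "chosen": s["clarification"],    # surfaces the interpretations
         "rejected": s["single_answer"]}  # confident single guess
        for s in ambiguous_samples
    ]
```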
Experimental Results
APA is evaluated on four question-answering datasets using Llama-2 (7B, 13B) and Mistral (7B) as base models. Models are trained on AmbigQA and tested both in-distribution and out-of-distribution (SituatedQA, AmbigTriviaQA). Baselines include inference-only methods (naive prompting, ambiguity-aware instructions, sample repetition, self-ask) and trained methods (honesty-tuned, full-set, random subset).
| Dataset | Metric | Llama-2 7B | Mistral 7B | Llama-2 13B |
|---|---|---|---|---|
| AmbigQA | Unambig. Acc. | 27.23% | 37.23% | 37.83% |
| AmbigQA | Ambig. F1 | 63.69 | 50.31 | 58.15 |
| SituatedQA (Geo) | Unambig. Acc. | 24.51% | 32.21% | 24.51% |
| SituatedQA (Geo) | Ambig. F1 | 42.05 | 42.18 | 41.59 |
| SituatedQA (Temp) | Unambig. Acc. | 21.90% | 35.74% | 24.36% |
| SituatedQA (Temp) | Ambig. F1 | 40.77 | 40.17 | 41.09 |
| AmbigTriviaQA | Unambig. Acc. | 53.41% | 58.14% | 63.74% |
| AmbigTriviaQA | Ambig. F1 | 61.34 | 58.93 | 55.23 |
Outperforms Gold Labels on OOD: APA surpasses the full-set baseline (trained on all gold-standard labels) by up to 17 F1 points on out-of-distribution datasets, despite using only ~32% (Llama-2) or ~13% (Mistral) of the training data
Information Gain > Random Selection: Ablation studies show that information gain-based sample selection outperforms random subset selection by ~10 F1 points on OOD datasets, validating the model-aware curation strategy
Balanced Alignment: APA achieves the best Overall Alignment Performance (23.87% on AmbigTriviaQA), combining a high Valid Alignment Rate with a low Misaligned Clarification Rate — meaning the model avoids false ambiguity signals on clear questions
Maintained Clarity: On unambiguous questions, aligned models continue to provide direct, single answers without unnecessary hedging, preserving normal QA performance
Consistent Across Scales: The approach works effectively across different model families (Llama-2, Mistral) and scales (7B, 13B), demonstrating broad applicability
Why It Matters
Real-world user queries are often underspecified or ambiguous, yet most LLMs are trained to give a single decisive answer. This work provides a principled framework for teaching models when to ask for clarification and how to present multiple valid interpretations. Three aspects make APA particularly significant:
Model-Aware Alignment: By grounding the alignment process in each model's own knowledge, APA avoids both over-hedging (flagging clear questions as ambiguous) and under-hedging (ignoring genuine ambiguity), making LLM interactions more trustworthy
Data Efficiency: APA achieves superior OOD performance using only a fraction of available training data, demonstrating that intelligent sample selection outperforms brute-force training on all data
Practical Reliability: As LLMs are increasingly deployed in customer-facing applications (search, assistants, chatbots), the ability to explicitly handle ambiguity rather than silently guessing is critical for user trust and safety