UniKnow: A Unified Framework for Reliable Language Model Behavior across Parametric and External Knowledge
arXiv 2025
Youna Kim, Hyuhng Joon Kim, Minjoon Choi, Sungmin Cho, Hyunsoo Cho, Sang-goo Lee, Taeuk Kim
One-Line Summary
UniKnow is a unified evaluation framework and training methodology that addresses all four knowledge scenarios (conflict, external-only, parametric-only, unknown) arising when LLMs combine parametric and external knowledge, demonstrating that scenario-comprehensive supervision with just 4,000 training instances significantly outperforms fragmented approaches on reliability.
Figure 1. UniKnow's 4 knowledge scenarios: Classified as Conflict, External-Only, Parametric-Only, and Unknown based on the presence of parametric knowledge (PK) and external knowledge (EK).
Background & Motivation
Language models rely on two knowledge sources: parametric knowledge (PK) encoded during pretraining, which remains static, and external knowledge (EK) provided at inference time via retrieval-augmented generation (RAG). However, real-world deployment exposes models to complex interactions between these sources that existing work addresses only in isolation:
Conflict (C): Both PK and EK are present but contradict each other — the model must prioritize external knowledge for its recency and task-specificity
External-Only (E-Only): The model lacks relevant parametric knowledge and must rely entirely on the provided context
Parametric-Only (P-Only): External context is irrelevant noise (random or misleadingly retrieved) and only internal knowledge yields correct answers
Unknown (U): Neither knowledge source contains the answer — the model should abstain rather than hallucinate
Core Problem: Prior methods such as COIECD (conflict-only), RetRobust (robustness-only), and KAFT (conflict + robustness) each address only a subset of the four scenarios. When evaluated across all four, these fragmented approaches exhibit severe scenario-specific biases: methods that excel at answerability hallucinate in the Unknown scenario, while abstention-capable methods over-refuse when answers are available. UniKnow is the first framework to define, evaluate, and train for all four scenarios jointly.
Proposed Method
Figure 2. Overall architecture of the UniKnow framework: Systematically evaluates 4 scenarios across 7 QA datasets with controlled knowledge conditions.
1. Parametric Knowledge Estimation
For each question, sample n = 10 responses from the model without any context. If 70% or more are correct (threshold τ = 0.7), the question is classified as ∃PK (parametric knowledge present); if none are correct, as ∅PK (absent). Questions falling between the two thresholds are excluded as ambiguous. This enables controlled scenario assignment per question.
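A minimal sketch of this estimation step, assuming a HuggingFace-style `generate` interface; `model`, `tokenizer`, and the lenient `normalize` matcher are placeholders rather than the paper's code:

```python
import re

def normalize(text):
    """Lowercase and strip articles/punctuation for lenient answer matching."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    return re.sub(r"[^a-z0-9 ]", "", text).strip()

def estimate_pk(model, tokenizer, question, gold_answers, n=10, tau=0.7):
    """Sample n closed-book answers; classify PK presence per the paper's thresholds."""
    inputs = tokenizer(f"Question: {question}\nAnswer:", return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, do_sample=True, max_new_tokens=32, num_return_sequences=n
    )
    prompt_len = inputs["input_ids"].shape[1]
    answers = [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]
    correct = sum(
        any(normalize(g) in normalize(a) for g in gold_answers) for a in answers
    )
    if correct >= tau * n:
        return "PK_PRESENT"   # ∃PK: parametric knowledge available
    if correct == 0:
        return "PK_ABSENT"    # ∅PK: no parametric knowledge
    return None               # ambiguous; excluded from scenario assignment
```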
2. External Knowledge Construction
Each question is paired with four context types: (a) original context containing the answer, (b) conflicting context generated via LLM instruction to contradict the answer while maintaining part-of-speech consistency, (c) random context — topically unrelated passages from the same dataset, and (d) incorrectly retrieved context — highest-ranked passages from Contriever-msmarco that do not contain the answer, simulating realistic retrieval failures.
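A sketch of how the four variants might be assembled per question; `llm_rewrite` and `retrieve` stand in for an instruction-following LLM and Contriever-msmarco respectively, and are placeholders, not the paper's pipeline:

```python
import random

def build_contexts(question, answer, original_ctx, corpus, retrieve, llm_rewrite):
    """Return the four context types paired with a single question."""
    conflicting = llm_rewrite(
        f"Rewrite the passage so it contradicts the answer '{answer}' "
        f"while keeping part-of-speech consistency:\n{original_ctx}"
    )
    # Topically unrelated passage drawn from the same dataset.
    random_ctx = random.choice([p for p in corpus if answer not in p])
    # Highest-ranked retrieved passage that does not contain the answer,
    # simulating a realistic retrieval failure.
    retrieved_wrong = next(p for p in retrieve(question) if answer not in p)
    return {
        "original": original_ctx,            # supports the gold answer
        "conflicting": conflicting,          # contradicts the gold answer
        "random": random_ctx,                # unrelated noise
        "retrieved_wrong": retrieved_wrong,  # plausible but answerless
    }
```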
3. Scenario-Aligned Evaluation
Crossing PK presence (∃PK/∅PK) with EK type (relevant/conflicting/irrelevant) produces the four scenarios. Models are evaluated on 7 QA datasets: NaturalQuestions (3,994 test), TriviaQA (7,712), HotpotQA (4,760), SQuAD (7,918), BioASQ (697), TextbookQA (1,056), and RelationExtraction (1,974), with all contexts limited to ~100 words.
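The crossing itself is a small lookup; a sketch with my own labels and function name, where combinations outside the four target scenarios are discarded:

```python
def assign_scenario(pk_present: bool, ek_type: str) -> str | None:
    """ek_type: 'original' | 'conflicting' | 'random' | 'retrieved_wrong'."""
    irrelevant = ek_type in ("random", "retrieved_wrong")
    if pk_present and ek_type == "conflicting":
        return "Conflict"         # PK contradicts EK; EK should take priority
    if not pk_present and ek_type == "original":
        return "External-Only"    # answer recoverable only from the context
    if pk_present and irrelevant:
        return "Parametric-Only"  # context is noise; rely on PK
    if not pk_present and irrelevant:
        return "Unknown"          # neither source helps; abstain
    return None                   # remaining crossings fall outside the four scenarios
```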
4. LMUniKnow Training
The proposed training method uses balanced sampling: 250 questions each from ∃PK and ∅PK categories from NaturalQuestions and TriviaQA, paired with all 4 context types, yielding 4,000 training instances total. For answerable scenarios (C, E-Only, P-Only), the model is trained to produce scenario-specific expected answers. For the Unknown scenario, it is trained to output an "unknown" abstention token. Efficient fine-tuning is done via QLoRA.
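A hedged sketch of this setup using the HuggingFace peft/bitsandbytes stack; the LoRA hyperparameters, target modules, and base checkpoint are illustrative assumptions, not the paper's reported configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantized base model for QLoRA fine-tuning.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))

# Supervision target per scenario: answerable scenarios get the
# scenario-specific expected answer; Unknown gets an abstention token.
def build_target(scenario, context_answer=None, parametric_answer=None):
    if scenario in ("Conflict", "External-Only"):
        return context_answer      # follow the provided context
    if scenario == "Parametric-Only":
        return parametric_answer   # ignore noisy context, use PK
    return "unknown"               # abstain when neither source suffices
```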
The paper also proposes COIECDPrompt, an inference-only variant that extends the COIECD decoding strategy with explicit prompting for all four scenarios, enabling complete coverage without any training.
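The summary does not reproduce the exact prompt; the following only gestures at the flavor of scenario-explicit instruction such a variant would layer on top of COIECD decoding, and is not the authors' wording:

```python
# Hypothetical scenario-aware instruction template (my paraphrase, not the paper's).
UNIKNOW_PROMPT = """Answer the question using the context and your own knowledge.
- If the context contradicts what you know, trust the context.
- If only the context contains the answer, answer from the context.
- If the context is irrelevant but you know the answer, answer from memory.
- If neither the context nor your knowledge contains the answer, reply "unknown".

Context: {context}
Question: {question}
Answer:"""
```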
Experimental Results
Evaluated on 8 LLMs (Llama 2 7B/13B, Llama 3 8B, Mistral 7B v0.3, Qwen 2.5 1.5B/3B/7B/14B) across 7 datasets, using Exact Match (EM) and a Reliability score that balances correctness with appropriate abstention.
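The exact Reliability formula is not reproduced here; one plausible formulation, assuming abstentions are marked by an "unknown" string, credits correct answers on answerable instances and correct refusals on Unknown instances:

```python
def reliability(predictions, golds):
    """predictions/golds: parallel lists of strings; 'unknown' marks abstention targets.

    An assumption for illustration; the paper's definition may differ.
    """
    credit = 0
    for pred, gold in zip(predictions, golds):
        if gold == "unknown":
            credit += pred.strip().lower() == "unknown"     # appropriate abstention
        else:
            credit += pred.strip().lower() == gold.lower()  # exact-match correctness
    return credit / len(golds)
```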
Figure 4. EM scores by scenario (Llama 3 8B): Confirms that each method is biased toward specific scenarios, while LMUniKnow achieves the best overall performance.
Figure 6. Error type distributions: Parametric errors dominate in the Conflict scenario (40–45%), while hallucinations are most frequent in the Unknown scenario for methods without abstention training.
Scenario-Specific Findings
LMUniKnow achieves highest overall reliability (~0.82–0.85): Training that covers all four scenarios yields the best combined performance, confirming that broader scenario coverage leads to better generalization
Conflict is the hardest scenario: EM scores cluster tightly across methods, indicating that overcoming parametric knowledge interference is fundamentally harder than incorporating new information
Answerability vs. reliability trade-off: COIECD, RetRobust, and KAFT excel at P-Only but generate hallucinations in Unknown; abstention-capable methods (Prompting, COIECDPrompt, LMUniKnow) show the inverse pattern, with LMUniKnow achieving the best balance
Scale effects are scenario-dependent: Testing on Qwen 1.5B–14B reveals that E-Only performance is stable across sizes (~65–70% EM), while Conflict and P-Only improve significantly with scale when explicitly trained, and Unknown abstention consistently improves with larger models
Ablation: Context Diversity (Llama 3 8B)
| Training Configuration | TriviaQA (Rely) | NQ (Rely) |
|---|---|---|
| LMUniKnow (full) | 0.842 | 0.740 |
| w/o conflicting contexts | 0.733 | 0.623 |
| w/o incorrectly retrieved contexts | 0.733 | 0.623 |
| w/o both | 0.708 | 0.621 |
Ablation: Abstention Data Proportion (Llama 3 8B)
| Unknown Proportion | Accuracy | Truth | Reliability |
|---|---|---|---|
| 0% (no abstention) | Highest | Lowest | ~0.75 |
| 25% (LMUniKnow default) | 0.69 | 0.88 | ~0.84 |
| 50% | Lowest | Highest | ~0.79 |
Figure 7. Accuracy (Acc) vs. Reliability (Rely): LMUniKnow achieves the highest reliability score while maintaining competitive accuracy. Methods on the diagonal sacrifice answerability for safety.
Diverse context types are essential: Removing conflicting or incorrectly retrieved contexts from training drops reliability by 10+ points, confirming each context type contributes meaningfully
Equal scenario allocation is optimal: 25% abstention data (equal across 4 scenarios) achieves the best reliability balance between correct answering and appropriate refusal
Over-reliance depends on PK presence: Error analysis reveals that parametric errors concentrate in Conflict (40–45% of errors), a finding obscured in prior work that treats conflict in isolation
Why It Matters
In real-world RAG systems, knowledge conflicts, irrelevant retrievals, and unanswerable queries occur simultaneously and unpredictably. UniKnow makes three key contributions to building reliable AI across all knowledge states:
First unified evaluation: Defines and operationalizes four exhaustive knowledge scenarios, enabling systematic diagnosis of model weaknesses that fragmented benchmarks miss
Efficient training recipe: LMUniKnow achieves state-of-the-art reliability with only 4,000 training instances via QLoRA, making it practical to apply to any open-weight LLM
Actionable insights: Reveals that parametric knowledge interactions fundamentally alter model behavior — a critical finding for designing future knowledge-grounded systems that must handle conflict, distraction, and absence gracefully