
UniKnow: A Unified Framework for Reliable Language Model Behavior across Parametric and External Knowledge

arXiv 2025
Youna Kim, Hyuhng Joon Kim, Minjoon Choi, Sungmin Cho, Hyunsoo Cho, Sang-goo Lee, Taeuk Kim

One-Line Summary

UniKnow is a unified evaluation framework and training methodology that addresses all four knowledge scenarios (Conflict, External-Only, Parametric-Only, Unknown) arising when LLMs combine parametric and external knowledge, demonstrating that scenario-comprehensive supervision with just 4,000 training instances significantly outperforms fragmented approaches on reliability.

Four knowledge scenarios
Figure 1. UniKnow's 4 knowledge scenarios: Classified as Conflict, External-Only, Parametric-Only, and Unknown based on the presence of parametric knowledge (PK) and external knowledge (EK).

Background & Motivation

Language models rely on two knowledge sources: parametric knowledge (PK) encoded during pretraining, which remains static, and external knowledge (EK) provided at inference time via retrieval-augmented generation (RAG). However, real-world deployment exposes models to complex interactions between these sources that existing work addresses only in isolation:

Core Problem: Prior methods like COIECD (conflict-only), RetRobust (robustness-only), and KAFT (conflict + robustness) each address at most 3 of 4 scenarios. When evaluated across all four, these fragmented approaches exhibit severe scenario-specific biases: methods that excel at answerability hallucinate in the Unknown scenario, while abstention-capable methods over-refuse when answers are available. UniKnow is the first framework to define, evaluate, and train for all four scenarios jointly.

Proposed Method

UniKnow overview
Figure 2. Overall architecture of the UniKnow framework: Systematically evaluates 4 scenarios across 7 QA datasets with controlled knowledge conditions.
1. Parametric Knowledge Estimation
For each question, sample n=10 responses from the model without any context. If 70% or more are correct (threshold τ=0.7), classify as ∃PK (parametric knowledge present). If zero correct, classify as ∅PK. Questions between thresholds are excluded as ambiguous. This enables controlled scenario assignment per question.
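
A minimal sketch of this step, assuming a hypothetical `sample_fn` that returns one closed-book (no-context) model answer per call; the correctness check is simplified here to normalized exact match, which may differ from the paper's:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation (EM-style normalization)."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = re.sub(r"[^a-z0-9 ]", "", text)
    return " ".join(text.split())

def estimate_pk(question, gold_answers, sample_fn, n=10, tau=0.7):
    """Classify a question as 'PK' (∃PK), 'no-PK' (∅PK), or 'ambiguous'."""
    golds = {normalize(a) for a in gold_answers}
    hits = sum(normalize(sample_fn(question)) in golds for _ in range(n))
    if hits / n >= tau:   # >=70% of the 10 samples correct -> ∃PK
        return "PK"
    if hits == 0:         # never correct -> ∅PK
        return "no-PK"
    return "ambiguous"    # between thresholds: excluded from the benchmark
```
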
2. External Knowledge Construction
Each question is paired with four context types: (a) original context containing the answer, (b) conflicting context generated via LLM instruction to contradict the answer while maintaining part-of-speech consistency, (c) random context — topically unrelated passages from the same dataset, and (d) incorrectly retrieved context — highest-ranked passages from Contriever-msmarco that do not contain the answer, simulating realistic retrieval failures.
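
A sketch of how one example could be paired with its four contexts; `rewrite_llm` and `retriever` are hypothetical stand-ins for the instruction-following LLM and the Contriever-msmarco retriever, and the prompt wording is illustrative:

```python
import random
from dataclasses import dataclass

@dataclass
class ContextSet:
    original: str      # (a) gold passage containing the answer
    conflicting: str   # (b) LLM rewrite that contradicts the answer
    random_ctx: str    # (c) topically unrelated passage from the same dataset
    misretrieved: str  # (d) top-ranked retrieved passage lacking the answer

def build_contexts(example, rewrite_llm, retriever, corpus):
    conflicting = rewrite_llm(
        "Rewrite this passage so it contradicts the answer "
        f"'{example['answer']}', keeping part-of-speech consistency:\n"
        f"{example['context']}"
    )
    # Highest-ranked passage that does NOT contain the gold answer,
    # simulating a realistic retrieval failure.
    misretrieved = next(
        p for p in retriever(example["question"]) if example["answer"] not in p
    )
    return ContextSet(
        original=example["context"],
        conflicting=conflicting,
        random_ctx=random.choice(corpus),
        misretrieved=misretrieved,
    )
```
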
3. Scenario-Aligned Evaluation
Crossing PK presence (∃PK/∅PK) with EK type (relevant/conflicting/irrelevant) produces the four scenarios. Models are evaluated on 7 QA datasets: NaturalQuestions (3,994 test), TriviaQA (7,712), HotpotQA (4,760), SQuAD (7,918), BioASQ (697), TextbookQA (1,056), and RelationExtraction (1,974), with all contexts limited to ~100 words.
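
In code, the crossing is a small lookup table. The mapping below is one reading of Figure 1, not the paper's verbatim definition; cells not listed (e.g. ∃PK with a relevant context, where both sources agree) fall outside the four scenarios in this reading:

```python
SCENARIO = {
    ("PK",    "conflicting"): "Conflict",         # PK and EK disagree
    ("no-PK", "relevant"):    "External-Only",    # only the context answers
    ("PK",    "irrelevant"):  "Parametric-Only",  # context unhelpful, PK suffices
    ("no-PK", "irrelevant"):  "Unknown",          # neither source answers
}

def assign_scenario(pk_state: str, ek_type: str) -> str | None:
    """Scenario for one (PK state, context class) pair, or None if undefined."""
    return SCENARIO.get((pk_state, ek_type))
```
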
4. LMUniKnow Training
The proposed training method uses balanced sampling: 250 questions each from the ∃PK and ∅PK categories of NaturalQuestions and TriviaQA, paired with all 4 context types, yielding 4,000 training instances in total. For the answerable scenarios (Conflict, External-Only, Parametric-Only), the model is trained to produce scenario-specific expected answers; for the Unknown scenario, it is trained to output an "unknown" abstention token. Fine-tuning is performed efficiently with QLoRA.
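
A sketch of the balanced data construction (2 datasets × 2 PK states × 250 questions × 4 contexts = 4,000 instances); the field names are assumed, and the answerable-scenario targets are simplified to the gold answer where the paper uses scenario-specific expected answers:

```python
import random

def build_training_set(questions_by_cell, contexts, n_per_cell=250):
    """questions_by_cell: {("nq"|"tqa", "PK"|"no-PK"): [qa dicts]} (names assumed).
    contexts[qa_id]: dict keyed original/conflicting/random_ctx/misretrieved.
    """
    instances = []
    for (dataset, pk_state), pool in questions_by_cell.items():
        for qa in random.sample(pool, n_per_cell):
            for ek_type, passage in contexts[qa["id"]].items():
                # Unknown cell: no PK and an irrelevant context -> abstain.
                unknown = pk_state == "no-PK" and ek_type in (
                    "random_ctx", "misretrieved"
                )
                target = "unknown" if unknown else qa["answer"]
                instances.append(
                    {"question": qa["question"], "context": passage,
                     "target": target}
                )
    return instances  # 4 cells x 250 questions x 4 contexts = 4,000
```
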

The paper also proposes COIECDPrompt, an inference-only variant that extends the COIECD decoding strategy with explicit prompting for all four scenarios, enabling complete coverage without any training.

Experimental Results

Methods are evaluated on 8 LLMs (Llama 2 7B/13B, Llama 3 8B, Mistral 7B v0.3, Qwen 2.5 1.5B/3B/7B/14B) across the 7 datasets, using Exact Match (EM) and a Reliability score that balances correctness with appropriate abstention.
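
The page does not spell out the Reliability formula; the sketch below shows one plausible instantiation of "correctness balanced with appropriate abstention," purely for illustration rather than the paper's exact definition:

```python
def reliability(predictions):
    """predictions: list of (pred, gold) pairs, with gold=None when the
    question is unanswerable (Unknown scenario). Credits exact-match
    answers on answerable items and 'unknown' abstentions otherwise.
    (Illustrative metric, not necessarily the paper's definition.)
    """
    def credit(pred, gold):
        if gold is None:
            return pred.strip().lower() == "unknown"          # appropriate abstention
        return pred.strip().lower() == gold.strip().lower()   # correct answer
    return sum(credit(p, g) for p, g in predictions) / len(predictions)
```
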

EM scores by scenario
Figure 4. EM scores by scenario (Llama 3 8B): Confirms that each method is biased toward specific scenarios, while LMUniKnow achieves the best overall performance.
Error type distributions
Figure 6. Error type distributions: Parametric errors dominate in the Conflict scenario (40–45%), while hallucinations are most frequent in the Unknown scenario for methods without abstention training.

Scenario-Specific Findings

Ablation: Context Diversity (Llama 3 8B)

| Training Configuration | TriviaQA Rely | NQ Rely |
| --- | --- | --- |
| LMUniKnow (full) | 0.842 | 0.740 |
| w/o conflicting contexts | 0.733 | 0.623 |
| w/o incorrectly retrieved contexts | 0.733 | 0.623 |
| w/o both | 0.708 | 0.621 |

Ablation: Abstention Data Proportion (Llama 3 8B)

| Unknown Proportion | Accuracy | Truth | Reliability |
| --- | --- | --- | --- |
| 0% (no abstention) | Highest | Lowest | ~0.75 |
| 25% (LMUniKnow default) | 0.69 | 0.88 | ~0.84 |
| 50% | Lowest | Highest | ~0.79 |

Accuracy and Reliability scores
Figure 7. Accuracy (Acc) vs. Reliability (Rely): LMUniKnow achieves the highest reliability score while maintaining competitive accuracy. Methods on the diagonal sacrifice answerability for safety.

Why It Matters

In real-world RAG systems, knowledge conflicts, irrelevant retrievals, and unanswerable queries occur simultaneously and unpredictably. UniKnow makes three key contributions to building reliable AI across all knowledge states: it defines the four scenarios under a single taxonomy, evaluates them jointly under controlled knowledge conditions across seven QA datasets, and trains for them jointly via LMUniKnow (or, without any training, via COIECDPrompt).
