UniKnow: A Unified Framework for Reliable Language Model Behavior across Parametric and External Knowledge
arXiv 2025
Youna Kim, Hyuhng Joon Kim, Minjoon Choi, Sungmin Cho, Hyunsoo Cho, Sang-goo Lee, Taeuk Kim
One-Line Summary
UniKnow is a unified evaluation framework and training methodology that addresses all four knowledge scenarios (conflict, external-only, parametric-only, unknown) arising when LLMs combine parametric and external knowledge, demonstrating that scenario-comprehensive supervision with just 4,000 training instances significantly outperforms fragmented approaches on reliability.
Figure 1. UniKnow's 4 knowledge scenarios: Classified as Conflict, External-Only, Parametric-Only, and Unknown based on the presence of parametric knowledge (PK) and external knowledge (EK).
Background & Motivation
Language models rely on two knowledge sources: parametric knowledge (PK) encoded during pretraining, which remains static, and external knowledge (EK) provided at inference time via retrieval-augmented generation (RAG). However, real-world deployment exposes models to complex interactions between these sources that existing work addresses only in isolation:
Conflict (C): Both PK and EK are present but contradict each other — the model must prioritize external knowledge for its recency and task-specificity
External-Only (E-Only): The model lacks relevant parametric knowledge and must rely entirely on the provided context
Parametric-Only (P-Only): External context is irrelevant noise (random or misleadingly retrieved) and only internal knowledge yields correct answers
Unknown (U): Neither knowledge source contains the answer — the model should abstain rather than hallucinate
Core Problem: Prior methods such as COIECD (conflict-only), RetRobust (robustness-only), and KAFT (conflict + robustness) each address only a subset of the four scenarios. When evaluated across all four, these fragmented approaches exhibit severe scenario-specific biases: methods that excel at answerability hallucinate in the Unknown scenario, while abstention-capable methods over-refuse when answers are available. UniKnow is the first framework to define, evaluate, and train for all four scenarios jointly.
Proposed Method
Figure 2. Overall architecture of the UniKnow framework: Systematically evaluates 4 scenarios across 7 QA datasets with controlled knowledge conditions.
1. Parametric Knowledge Estimation
For each question, sample n = 10 responses from the model without any context. If 70% or more are correct (threshold τ = 0.7), the question is classified as ∃PK (parametric knowledge present); if none are correct, as ∅PK (absent). Questions falling between the two thresholds are excluded as ambiguous. This enables controlled scenario assignment per question.
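A minimal sketch of this estimation step, assuming a HuggingFace-style `generate` interface; `model`, `tokenizer`, and the lenient `normalize` matcher are placeholders rather than the paper's code:

```python
import re

def normalize(text):
    """Lowercase and strip articles/punctuation for lenient answer matching."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    return re.sub(r"[^a-z0-9 ]", "", text).strip()

def estimate_pk(model, tokenizer, question, gold_answers, n=10, tau=0.7):
    """Sample n closed-book answers; classify PK presence per the paper's thresholds."""
    inputs = tokenizer(f"Question: {question}\nAnswer:", return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, do_sample=True, max_new_tokens=32, num_return_sequences=n
    )
    prompt_len = inputs["input_ids"].shape[1]
    answers = [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]
    correct = sum(
        any(normalize(g) in normalize(a) for g in gold_answers) for a in answers
    )
    if correct >= tau * n:
        return "PK_PRESENT"   # ∃PK: parametric knowledge available
    if correct == 0:
        return "PK_ABSENT"    # ∅PK: no parametric knowledge
    return None               # ambiguous; excluded from scenario assignment
```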
2. External Knowledge Construction
Each question is paired with four context types: (a) original context containing the answer, (b) conflicting context generated via LLM instruction to contradict the answer while maintaining part-of-speech consistency, (c) random context — topically unrelated passages from the same dataset, and (d) incorrectly retrieved context — highest-ranked passages from Contriever-msmarco that do not contain the answer, simulating realistic retrieval failures.
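A sketch of how the four variants might be assembled per question; `llm_rewrite` and `retrieve` stand in for an instruction-following LLM and Contriever-msmarco respectively, and are placeholders, not the paper's pipeline:

```python
import random

def build_contexts(question, answer, original_ctx, corpus, retrieve, llm_rewrite):
    """Return the four context types paired with a single question."""
    conflicting = llm_rewrite(
        f"Rewrite the passage so it contradicts the answer '{answer}' "
        f"while keeping part-of-speech consistency:\n{original_ctx}"
    )
    # Topically unrelated passage drawn from the same dataset.
    random_ctx = random.choice([p for p in corpus if answer not in p])
    # Highest-ranked retrieved passage that does not contain the answer,
    # simulating a realistic retrieval failure.
    retrieved_wrong = next(p for p in retrieve(question) if answer not in p)
    return {
        "original": original_ctx,            # supports the gold answer
        "conflicting": conflicting,          # contradicts the gold answer
        "random": random_ctx,                # unrelated noise
        "retrieved_wrong": retrieved_wrong,  # plausible but answerless
    }
```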
3. Scenario-Aligned Evaluation
Crossing PK presence (∃PK/∅PK) with EK type (relevant/conflicting/irrelevant) produces the four scenarios. Models are evaluated on 7 QA datasets: NaturalQuestions (3,994 test), TriviaQA (7,712), HotpotQA (4,760), SQuAD (7,918), BioASQ (697), TextbookQA (1,056), and RelationExtraction (1,974), with all contexts limited to ~100 words.
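The crossing itself is a small lookup; a sketch with my own labels and function name, where combinations outside the four target scenarios are discarded:

```python
def assign_scenario(pk_present: bool, ek_type: str) -> str | None:
    """ek_type: 'original' | 'conflicting' | 'random' | 'retrieved_wrong'."""
    irrelevant = ek_type in ("random", "retrieved_wrong")
    if pk_present and ek_type == "conflicting":
        return "Conflict"         # PK contradicts EK; EK should take priority
    if not pk_present and ek_type == "original":
        return "External-Only"    # answer recoverable only from the context
    if pk_present and irrelevant:
        return "Parametric-Only"  # context is noise; rely on PK
    if not pk_present and irrelevant:
        return "Unknown"          # neither source helps; abstain
    return None                   # remaining crossings fall outside the four scenarios
```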
4. LMUniKnow Training
The proposed training method uses balanced sampling: 250 questions each from ∃PK and ∅PK categories from NaturalQuestions and TriviaQA, paired with all 4 context types, yielding 4,000 training instances total. For answerable scenarios (C, E-Only, P-Only), the model is trained to produce scenario-specific expected answers. For the Unknown scenario, it is trained to output an "unknown" abstention token. Efficient fine-tuning is done via QLoRA.
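A hedged sketch of this setup using the HuggingFace peft/bitsandbytes stack; the LoRA hyperparameters, target modules, and base checkpoint are illustrative assumptions, not the paper's reported configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantized base model for QLoRA fine-tuning.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))

# Supervision target per scenario: answerable scenarios get the
# scenario-specific expected answer; Unknown gets an abstention token.
def build_target(scenario, context_answer=None, parametric_answer=None):
    if scenario in ("Conflict", "External-Only"):
        return context_answer      # follow the provided context
    if scenario == "Parametric-Only":
        return parametric_answer   # ignore noisy context, use PK
    return "unknown"               # abstain when neither source suffices
```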
The paper also proposes COIECDPrompt, an inference-only variant that extends the COIECD decoding strategy with explicit prompting for all four scenarios, enabling complete coverage without any training.
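The summary does not reproduce the exact prompt; the following only gestures at the flavor of scenario-explicit instruction such a variant would layer on top of COIECD decoding, and is not the authors' wording:

```python
# Hypothetical scenario-aware instruction template (my paraphrase, not the paper's).
UNIKNOW_PROMPT = """Answer the question using the context and your own knowledge.
- If the context contradicts what you know, trust the context.
- If only the context contains the answer, answer from the context.
- If the context is irrelevant but you know the answer, answer from memory.
- If neither the context nor your knowledge contains the answer, reply "unknown".

Context: {context}
Question: {question}
Answer:"""
```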
Experimental Results
Evaluated on 8 LLMs (Llama 2 7B/13B, Llama 3 8B, Mistral 7B v0.3, Qwen 2.5 1.5B/3B/7B/14B) across 7 datasets, using Exact Match (EM) and a Reliability score that balances correctness with appropriate abstention.
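The exact Reliability formula is not reproduced here; one plausible formulation, assuming abstentions are marked by an "unknown" string, credits correct answers on answerable instances and correct refusals on Unknown instances:

```python
def reliability(predictions, golds):
    """predictions/golds: parallel lists of strings; 'unknown' marks abstention targets.

    An assumption for illustration; the paper's definition may differ.
    """
    credit = 0
    for pred, gold in zip(predictions, golds):
        if gold == "unknown":
            credit += pred.strip().lower() == "unknown"     # appropriate abstention
        else:
            credit += pred.strip().lower() == gold.lower()  # exact-match correctness
    return credit / len(golds)
```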
Figure 4. EM scores by scenario (Llama 3 8B): Confirms that each method is biased toward specific scenarios, while LMUniKnow achieves the best overall performance.
Figure 6. Error type distributions: Parametric errors dominate in the Conflict scenario (40–45%), while hallucinations are most frequent in the Unknown scenario for methods without abstention training.
Scenario-Specific Findings
LMUniKnow achieves highest overall reliability (~0.82–0.85): Training that covers all four scenarios yields the best combined performance, confirming that broader scenario coverage leads to better generalization
Conflict is the hardest scenario: EM scores cluster tightly across methods, indicating that overcoming parametric knowledge interference is fundamentally harder than incorporating new information
Answerability vs. reliability trade-off: COIECD, RetRobust, and KAFT excel at P-Only but generate hallucinations in Unknown; abstention-capable methods (Prompting, COIECDPrompt, LMUniKnow) show the inverse pattern, with LMUniKnow achieving the best balance
Scale effects are scenario-dependent: Testing on Qwen 1.5B–14B reveals that E-Only performance is stable across sizes (~65–70% EM), while Conflict and P-Only improve significantly with scale when explicitly trained, and Unknown abstention consistently improves with larger models
Ablation: Context Diversity (Llama 3 8B)
| Training Configuration | TriviaQA (Rely) | NQ (Rely) |
|---|---|---|
| LMUniKnow (full) | 0.842 | 0.740 |
| w/o conflicting contexts | 0.733 | 0.623 |
| w/o incorrectly retrieved contexts | 0.733 | 0.623 |
| w/o both | 0.708 | 0.621 |
Ablation: Abstention Data Proportion (Llama 3 8B)
| Unknown Proportion | Accuracy | Truth | Reliability |
|---|---|---|---|
| 0% (no abstention) | Highest | Lowest | ~0.75 |
| 25% (LMUniKnow default) | 0.69 | 0.88 | ~0.84 |
| 50% | Lowest | Highest | ~0.79 |
Figure 7. Accuracy (Acc) vs. Reliability (Rely): LMUniKnow achieves the highest reliability score while maintaining competitive accuracy. Methods on the diagonal sacrifice answerability for safety.
Diverse context types are essential: Removing conflicting or incorrectly retrieved contexts from training drops reliability by 10+ points, confirming each context type contributes meaningfully
Equal scenario allocation is optimal: 25% abstention data (equal across 4 scenarios) achieves the best reliability balance between correct answering and appropriate refusal
Over-reliance depends on PK presence: Error analysis reveals that parametric errors concentrate in Conflict (40–45% of errors), a finding obscured in prior work that treats conflict in isolation
Why It Matters
In real-world RAG systems, knowledge conflicts, irrelevant retrievals, and unanswerable queries occur simultaneously and unpredictably. UniKnow makes three key contributions to building reliable AI across all knowledge states:
First unified evaluation: Defines and operationalizes four exhaustive knowledge scenarios, enabling systematic diagnosis of model weaknesses that fragmented benchmarks miss
Efficient training recipe: LMUniKnow achieves state-of-the-art reliability with only 4,000 training instances via QLoRA, making it practical to apply to any open-weight LLM
Actionable insights: Reveals that parametric knowledge interactions fundamentally alter model behavior — a critical finding for designing future knowledge-grounded systems that must handle conflict, distraction, and absence gracefully