
MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation

EMNLP 2025 Findings
Jungyeon Lee, Kangmin Lee, Taeuk Kim

One-Line Summary

A knowledge-graph-based benchmark of 1,080 instances that systematically generates inter-context knowledge conflicts in RAG systems across eight conflict types (single/multi-hop × 1-4 conflicts), revealing that even the best LLMs achieve only 55% localization accuracy versus 83.3% for humans.

Figure 1. A 3-hop conflict example: Subtle inconsistencies regarding the release order of songs appear across multiple documents. Detecting such conflicts requires multi-step reasoning.

Background & Motivation

Retrieval-Augmented Generation (RAG) enhances LLM responses by retrieving external documents, but what happens when those retrieved documents contradict each other? This problem of inter-context conflict is critical for safe RAG deployment, yet existing benchmarks fail to capture its full complexity.

Four Key Limitations of Existing Benchmarks:

  • Narrow task focus: Prior datasets (ECON: 168 instances, WikiContradict: 103 instances) are limited to QA settings where conflicts occur only among answer candidates.
  • Oversimplified construction: Heavy reliance on entity substitution fails to capture nuanced, real-world knowledge conflicts.
  • Limited conflict typology: No systematic distinction between single-hop and multi-hop conflicts; ~78% of existing instances are single-hop only.
  • Underexplored inter-context conflicts: Most research focuses on conflicts between parametric and external knowledge, not among multiple retrieved documents.

In practice, real-world discrepancies often involve multi-hop reasoning and multiple simultaneous conflicts across documents -- scenarios that existing benchmarks rarely cover. MAGIC addresses all four limitations with a scalable, KG-based framework.

Proposed Method: KG-Based Conflict Generation Framework

Figure 2. Overall architecture of the MAGIC framework: Subgraphs are extracted from Wikidata5M, knowledge conflicts are generated, and then converted into natural language documents.

MAGIC uses a three-step pipeline built on Wikidata5M (~20 million triplets), producing 1,080 carefully curated instances across 46 relation types organized into 7 semantic domains (Human, Geography, Organization, Creative Work, Class/Concept, Cause-Effect, General).

Step 1: Subgraph Extraction
From Wikidata5M's 825 relations, 46 are selected based on (a) semantic clarity for controlled conflict manipulation and (b) support for meaningful multi-hop reasoning chains. Depth-First Search (DFS) traversal extracts subgraphs with constraints: max 15 edges per subgraph and max 5 edges per node, with randomly determined DFS depth to ensure structural diversity.
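The traversal is easy to picture in code. Below is a minimal sketch of such a constrained DFS in Python, assuming triplets are stored as (head, relation, tail) tuples; the depth range and the shuffled neighbor order are illustrative assumptions, since the paper states only that the depth is randomized and that subgraphs are capped at 15 edges with at most 5 edges per node.

```python
import random
from collections import defaultdict

def extract_subgraph(triples, seed, max_edges=15, max_node_degree=5):
    """Sketch of constrained DFS subgraph extraction (not the authors' code).

    Walks outward from a seed entity to a randomly chosen depth,
    capping the subgraph at `max_edges` edges and each node at
    `max_node_degree` incident edges.
    """
    adjacency = defaultdict(list)
    for head, relation, tail in triples:
        adjacency[head].append((relation, tail))

    max_depth = random.randint(2, 5)  # assumed range; paper says depth is random
    subgraph, degree = [], defaultdict(int)

    def dfs(node, depth):
        if depth >= max_depth or len(subgraph) >= max_edges:
            return
        neighbors = adjacency[node][:]
        random.shuffle(neighbors)  # randomized order for structural diversity
        for relation, tail in neighbors:
            if len(subgraph) >= max_edges:
                return
            # enforce the per-node edge cap on both endpoints
            if degree[node] >= max_node_degree or degree[tail] >= max_node_degree:
                continue
            subgraph.append((node, relation, tail))
            degree[node] += 1
            degree[tail] += 1
            dfs(tail, depth + 1)

    dfs(seed, 0)
    return subgraph
```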
Step 2: Conflict Generation
OpenAI o3-mini generates conflicts via few-shot prompting (3 validated examples per relation type), receiving the target seed triplet plus its surrounding subgraph context. Eight conflict types are produced along two dimensions: number of hops (single/multi) × number of conflicts (1/2/3/4). A two-stage human-in-the-loop process ensures quality: manual filtering of demonstrations before generation, then expert review to remove trivial or incoherent outputs.
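A minimal sketch of what such a few-shot call might look like, using the OpenAI chat completions API. The prompt wording, the demonstration format, and the `generate_conflict` helper are assumptions for illustration, not the authors' actual prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_conflict(seed_triplet, subgraph, demonstrations):
    """Hypothetical few-shot conflict generation (prompt format is assumed).

    `demonstrations` holds the 3 validated examples for the seed triplet's
    relation type, each a dict with 'seed', 'context', and 'conflict' keys.
    """
    shots = "\n\n".join(
        f"Seed: {d['seed']}\nContext: {d['context']}\nConflict: {d['conflict']}"
        for d in demonstrations
    )
    prompt = (
        "Given a seed triplet and its surrounding subgraph, produce a "
        "triplet that plausibly contradicts the seed.\n\n"
        f"{shots}\n\n"
        f"Seed: {seed_triplet}\nContext: {subgraph}\nConflict:"
    )
    response = client.chat.completions.create(
        model="o3-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```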
Step 3: KG-to-Text Conversion
GPT-4o-mini converts subgraphs to coherent natural language while preserving all semantic relations. Claude 3.5 Sonnet then validates triplet coverage automatically, achieving 95.21% accuracy on conflict triplets and 82.04% on full subgraph triplets. Human inspection of 167 sampled outputs confirmed high reliability across all data types.
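The automatic validation step can be pictured as a per-triplet coverage check. Here is a minimal sketch assuming a simple YES/NO protocol over the Anthropic messages API; the actual validation prompt and answer parsing are not specified in this summary.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

def triplet_coverage(document, triplets, model="claude-3-5-sonnet-20241022"):
    """Assumed coverage check: ask the validator model whether each subgraph
    triplet is expressed in the generated document, then report the fraction
    covered. The YES/NO protocol here is an illustrative assumption."""
    covered = 0
    for head, relation, tail in triplets:
        prompt = (
            f"Document:\n{document}\n\n"
            f"Does the document express the fact ({head}, {relation}, {tail})? "
            "Answer strictly YES or NO."
        )
        reply = client.messages.create(
            model=model,
            max_tokens=5,
            messages=[{"role": "user", "content": prompt}],
        )
        if reply.content[0].text.strip().upper().startswith("YES"):
            covered += 1
    return covered / len(triplets)
```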
Figure 4. Four conflict types: Combinations of single-hop/multi-hop × single-conflict/multi-conflict enable evaluation at varying levels of difficulty.

Dataset Composition

Type         1 Conflict   2 Conflicts   3 Conflicts   4 Conflicts   Total
Single-Hop   208          154           80            50            492
Multi-Hop    300          158           80            50            588
Total        508          312           160           100           1,080

Experimental Results

Five LLMs are evaluated on MAGIC using two metrics: Identification (ID) -- binary detection of whether conflicts exist (scored across 3 inference runs; any failure = 0), and Localization (LOC) -- pinpointing all exact conflict sources (full score only if all locations correctly identified). A multi-step prompting strategy (asking for conflict count, reasoning, and conflicting sentences) outperforms simple binary prompting by up to 39.41%.
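The two metrics are simple to state in code. A minimal sketch, assuming exact-match semantics for LOC (the summary says full credit requires all locations to be correctly identified, but does not define partial credit):

```python
def id_score(run_predictions):
    """ID metric sketch: the model is queried over 3 inference runs and
    earns credit only if every run correctly detects that a conflict exists
    (any failed run zeroes the instance)."""
    return 1.0 if all(run_predictions) else 0.0

def loc_score(predicted_locations, gold_locations):
    """LOC metric sketch: full credit only when the predicted conflict
    sources exactly match the gold set (exact-match is an assumption)."""
    return 1.0 if set(predicted_locations) == set(gold_locations) else 0.0
```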

Cross-Benchmark Comparison (5-Model Average)

Benchmark                        ID Score (%)   LOC Score (%)
ECON (168 instances)             74.73          57.09
WikiContradict (103 instances)   69.93          55.74
MAGIC (1,080 instances)          64.54          40.51

Per-Model Performance on MAGIC

Model              ID Score (%)   LOC Score (%)
Mixtral 8x7B       37.92          17.40
Llama 3.1 70B      72.86          37.92
Claude 3.5 Haiku   60.28          42.50
OpenAI o1          68.06          49.72
GPT-4o-mini        83.61          55.00
Human Baseline     92.50          83.30
Figure 7. Performance by conflict type: Multi-hop conflicts are far more challenging than single-hop conflicts across all models.
Figure 9. Both ID and LOC scores decline as context length increases, with LOC dropping more sharply.

Why It Matters

As RAG becomes the dominant paradigm for deploying LLMs in production, robust handling of conflicting information is essential for safe and reliable systems. MAGIC makes three contributions that advance this goal:

  • A scalable, KG-based framework that systematically generates inter-context conflicts, rather than relying on simple entity substitution.
  • A systematic typology of eight conflict types spanning single- and multi-hop reasoning with one to four simultaneous conflicts.
  • Evidence of a substantial human-model gap: the best LLM reaches only 55% localization accuracy against a human baseline of 83.3%.
