A three-stage framework that systematically integrates knowledge graph triples into multimodal entity linking via VLM-based triple generation, contrastive retrieval with gated fusion, and LLM-based reranking, achieving up to 19.13% improvement in HITS@1 over prior methods across three benchmarks.
Figure 1. An example of multimodal entity linking (MEL) using KGMEL. KGMEL generates triples for the mention to be matched with knowledge graph (KG) triples in the knowledge base (KB). Blue and yellow arrows point to triples derived from visual and textual context, respectively.
Background & Motivation
Entity linking (EL) aligns textual mentions with their corresponding entities in a knowledge base, facilitating applications such as semantic search, question answering, and knowledge-grounded dialogue. Recent advances in multimodal entity linking (MEL) have shown that combining text and images can reduce ambiguity and improve alignment accuracy. However, most existing MEL methods overlook the rich structural information available in the form of knowledge graph (KG) triples -- a resource that is orders of magnitude richer than concise textual descriptions.
Key Gap in Existing MEL Methods:
Reliance on surface-level cues: Prior approaches depend primarily on textual and visual similarities, which can be insufficient when entities are visually or textually similar. For example, distinguishing between two basketball players who share similar appearances and descriptions requires deeper structural knowledge.
Underutilized KG structure: Knowledge bases like Wikidata contain entities with hundreds of triples on average, far richer than single concise textual descriptions, yet this structural information has been largely ignored in MEL research.
Semantic bridging potential: Triple embeddings can bring distant mention-entity text embeddings closer together in latent space, providing a semantic bridge that surface-level features alone cannot offer. t-SNE visualization of BERT embeddings confirms that triple information pulls related mention-entity pairs closer.
No end-to-end KG integration: No prior work has systematically leveraged KG triples across all stages (generation, retrieval, and reranking) of the MEL pipeline.
Two Core Challenges for KG-Enhanced MEL:
Mentions lack inherent triples: Unlike entities in the knowledge base, mentions appearing in natural text do not come with structured triple representations. This asymmetry must be bridged before KG-based matching can occur.
Entities have noisy triples: Each entity may have hundreds of KG triples, many of which are irrelevant to a given mention. Naively using all triples introduces noise that can degrade performance rather than improve it.
KGMEL addresses both challenges by (1) using VLMs to generate triples for mentions from multimodal context, and (2) applying semantic filtering to select only the most relevant entity triples before matching and reranking.
Figure 3. Overview of KGMEL. The framework consists of three stages: (1) Generation: triples are generated for mentions using VLMs. (2) Retrieval: joint embeddings integrating textual, visual, and triple-based embeddings are used to retrieve K candidates. (3) Reranking: after filtering irrelevant KG triples, the best-matching entity is determined using LLMs.
Stage 1: Triple Generation (VLM-Based)
A vision-language model (GPT-4o-mini) analyzes both textual and visual context for each mention through a structured three-step prompt: (1) identify entity type via NER categories, (2) generate a concise entity description, (3) produce structured (head, relation, tail) triples. Twenty relation types are carefully selected based on frequency in the knowledge base and semantic relevance: instance of, subclass of, part of, has characteristic, field of work, occupation, sex or gender, country of citizenship, position held, religion or worldview, member of, owner of, country, capital, continent, located in, industry, participant, genre, and named after. The VLM extracts triples from both visual cues (e.g., "occupation: basketball player" from an image) and textual cues (e.g., "appeared_in: Thunderstruck" from surrounding text), formalized as Tm = VLM(Ptriple(tm, vm)).
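The prompt construction above can be sketched as follows. This is a minimal illustration, not the paper's actual prompt: the `build_triple_prompt` helper and the exact wording are assumptions, while the three-step structure and the 20 relation types are taken from the text.

```python
# Sketch of the three-step triple-generation prompt described above.
# The wording is hypothetical; the step structure and relation list follow the text.
RELATIONS = [
    "instance of", "subclass of", "part of", "has characteristic",
    "field of work", "occupation", "sex or gender", "country of citizenship",
    "position held", "religion or worldview", "member of", "owner of",
    "country", "capital", "continent", "located in", "industry",
    "participant", "genre", "named after",
]

def build_triple_prompt(mention_text: str) -> str:
    """Builds the structured text prompt sent to the VLM alongside the mention image."""
    return (
        f"Mention context: {mention_text}\n"
        "Step 1: Identify the entity type of the mention (NER category).\n"
        "Step 2: Write a concise one-sentence description of the entity.\n"
        "Step 3: Output (head, relation, tail) triples, using only these relations: "
        + ", ".join(RELATIONS) + "."
    )
```

The returned string would be paired with the mention image in a single multimodal request, i.e. Tm = VLM(Ptriple(tm, vm)).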
Stage 2: Candidate Entity Retrieval (Contrastive Learning with Gated Fusion)
Encoding: Frozen CLIP encodes text and images into d'-dimensional embeddings. Triple relations and tails are encoded separately into matrices Rm and Om, then combined via an MLP with residual connections: Z̃m = Om + MLP([Om || Rm]), preserving tail entity information.
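The residual combination of relation and tail embeddings can be sketched as below. Only the form Z̃m = Om + MLP([Om ‖ Rm]) comes from the text; the MLP depth, hidden size, and activation are assumptions.

```python
import torch
import torch.nn as nn

class TripleEncoder(nn.Module):
    """Combine per-triple relation (R) and tail (O) embeddings via a residual MLP:
    Z~ = O + MLP([O || R]). The residual path preserves tail-entity information."""

    def __init__(self, d: int):
        super().__init__()
        # Two-layer MLP over the concatenated (tail || relation) embedding;
        # depth and activation are illustrative choices, not from the paper.
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, O: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
        # O, R: (num_triples, d) tail and relation embeddings for one mention
        return O + self.mlp(torch.cat([O, R], dim=-1))
```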
Dual Cross-Attention: Computes relevance scores for each triple relative to both text and visual modalities: sm = Softmax((β · Z̃m · TmT + (1-β) · Z̃m · VmT) / τatt), with β=0.5 balancing modalities and τatt=0.1 as temperature. Top-p selection (p ∈ {3, 5}) retains only the highest-ranked triples for denoising.
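A simplified version of this scoring step is sketched below, treating the mention's text and image embeddings as single vectors (the paper's Tm and Vm may be token-level matrices, so this is an assumption). The defaults mirror β = 0.5, τ_att = 0.1, and top-p selection.

```python
import torch

def triple_scores(Z: torch.Tensor, T: torch.Tensor, V: torch.Tensor,
                  beta: float = 0.5, tau: float = 0.1, p: int = 3):
    """Dual cross-attention sketch: score each triple embedding against the
    text (T) and image (V) embeddings, softmax with temperature tau, and
    keep the top-p triples for denoising.
    Z: (n, d) triple embeddings; T, V: (d,) mention text/image embeddings."""
    logits = (beta * (Z @ T) + (1 - beta) * (Z @ V)) / tau
    s = torch.softmax(logits, dim=0)          # relevance distribution over triples
    top = torch.topk(s, k=min(p, Z.size(0)))  # top-p selection
    return Z[top.indices], s
```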
Gated Fusion: Learned sigmoid gates combine three modality embeddings: Xm = gT · WT · Tm + gV · WV · Vm + WZ · Zm, where gT = σ(WT(g) · Tm + bT(g)) adaptively weights each modality.
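The gated fusion layer can be sketched as follows. The scalar-per-example gate and the bias-free projections are assumptions; only the overall form Xm = gT·WT·Tm + gV·WV·Vm + WZ·Zm and the sigmoid gates come from the text.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Adaptive modality weighting: X = g_T * W_T(T) + g_V * W_V(V) + W_Z(Z),
    where g_T = sigmoid(W_T^(g) T + b_T^(g)) and likewise for g_V.
    Gates are modeled here as one scalar per example (an assumption)."""

    def __init__(self, d: int):
        super().__init__()
        self.W_T, self.W_V, self.W_Z = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.gate_T = nn.Linear(d, 1)  # learned sigmoid gate for text
        self.gate_V = nn.Linear(d, 1)  # learned sigmoid gate for image

    def forward(self, T: torch.Tensor, V: torch.Tensor, Z: torch.Tensor) -> torch.Tensor:
        g_T = torch.sigmoid(self.gate_T(T))
        g_V = torch.sigmoid(self.gate_V(V))
        return g_T * self.W_T(T) + g_V * self.W_V(V) + self.W_Z(Z)
```

Note that the triple embedding Z enters ungated, matching the formula above where only the text and image terms carry gates.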
Training: Three contrastive losses drive learning: (1) mention-entity loss LME aligns mentions with ground-truth entities, (2) mention-mention loss LMM separates distinct mentions, and (3) entity-entity loss LEE separates distinct entities. Combined as L = LME + λMM · LMM + λEE · LEE with λMM = λEE = 0.1 and contrastive temperature τcl = 0.1. Top-K=16 candidates are retrieved via dot-product similarity.
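The combined objective can be sketched with an in-batch InfoNCE-style loss. The exact formulation of the mention-mention and entity-entity terms is simplified here (an assumption); the weighting L = LME + λMM·LMM + λEE·LEE with λ = 0.1 and τ_cl = 0.1 follows the text.

```python
import torch
import torch.nn.functional as F

def info_nce(X: torch.Tensor, Y: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """In-batch contrastive loss: X[i] should match Y[i] against all other rows."""
    logits = (F.normalize(X, dim=-1) @ F.normalize(Y, dim=-1).T) / tau
    targets = torch.arange(X.size(0))
    return F.cross_entropy(logits, targets)

def total_loss(M: torch.Tensor, E: torch.Tensor,
               lam_mm: float = 0.1, lam_ee: float = 0.1,
               tau: float = 0.1) -> torch.Tensor:
    """L = L_ME + lam_mm * L_MM + lam_ee * L_EE.
    M: (B, d) mention embeddings; E: (B, d) ground-truth entity embeddings.
    The self-contrastive MM/EE terms below push distinct mentions (resp.
    entities) apart -- a simplification of the paper's formulation."""
    l_me = info_nce(M, E, tau)  # align mentions with ground-truth entities
    l_mm = info_nce(M, M, tau)  # separate distinct mentions
    l_ee = info_nce(E, E, tau)  # separate distinct entities
    return l_me + lam_mm * l_mm + lam_ee * l_ee
```

At inference, the fused mention embedding is scored against all entity embeddings by dot product and the top-K = 16 candidates are retained.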
Stage 3: Entity Reranking (LLM-Based with Triple Filtering)
A semantic triple filtering step identifies the top-n (n ∈ {10, 15}) most relevant relations and tails from each candidate entity's KG triples by computing similarity to the generated mention triples: Te(filt) = {(e,r,o) ∈ Te | r ∈ R(C(m), Tm) ∧ o ∈ O(C(m), Tm)}. This filtering removes noisy triples that could mislead the LLM. GPT-3.5-turbo then performs zero-shot reranking using a structured step-by-step prompt: (1) identify supporting triples that serve as evidence for each candidate, (2) determine the most appropriate entity from the filtered candidates. The final selection is em* = LLM(Prerank(tm, Tm, {te, Te(filt)}e ∈ C(m))).
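The filtering step can be sketched as a similarity-based top-n selection. This is a minimal sketch: scoring each entity triple by its maximum cosine similarity to any generated mention triple is an assumed concrete choice, as is the `embed_fn` text encoder; the paper filters relations and tails via its R(·) and O(·) sets.

```python
import torch
import torch.nn.functional as F

def filter_triples(entity_triples, mention_embs, embed_fn, n: int = 10):
    """Keep the top-n entity triples most similar to the mention's generated
    triples. mention_embs: (m, d) L2-normalized embeddings of the generated
    mention triples; embed_fn maps a triple string to a (d,) vector."""
    scored = []
    for t in entity_triples:
        e = F.normalize(embed_fn(t), dim=0)
        # Best match against any generated mention triple
        scored.append((float((mention_embs @ e).max()), t))
    scored.sort(key=lambda pair: -pair[0])
    return [t for _, t in scored[:n]]
```

The surviving triples for each candidate are then serialized into the zero-shot reranking prompt, so the LLM only sees relations likely to bear on the mention.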
Experimental Results
KGMEL is evaluated on three MEL benchmarks with Wikidata as the knowledge base. Results are reported as the mean over 3 runs. It achieves state-of-the-art HITS@1 across all three datasets.
Dataset Statistics
| Statistic | WikiDiverse | RichpediaMEL | WikiMEL |
|---|---|---|---|
| Sentences | 7,405 | 17,724 | 22,070 |
| Mentions | 15,093 | 17,805 | 25,846 |
| KG Triples | 60.8M | 32.8M | 65.1M |
| Candidate Entities | 132,460 | 160,935 | 109,976 |
| Total Entities | 776,407 | 831,737 | 761,343 |
| Relations | 1,322 | 1,288 | 1,289 |
Main Results (HITS@1)
| Method | WikiDiverse | RichpediaMEL | WikiMEL |
|---|---|---|---|
| M3EL | 74.06 | -- | -- |
| IIER | -- | 84.63 | -- |
| OT-MEL | -- | -- | 88.97 |
| KGMEL (retrieval only) | 82.12 | 76.40 | 87.29 |
| KGMEL (retrieval + rerank) | 88.23 | 85.21 | 90.58 |
Extended Metrics
| Dataset | Metric | Retrieval | + Rerank |
|---|---|---|---|
| WikiDiverse | HITS@3 | 90.28 | 92.82 |
| | HITS@5 | 92.07 | 93.61 |
| | MRR | 86.00 | 90.84 |
| RichpediaMEL | HITS@3 | 85.92 | 89.85 |
| | HITS@5 | 88.82 | 91.32 |
| | MRR | 80.94 | 88.08 |
| WikiMEL | HITS@3 | 92.47 | 95.18 |
| | HITS@5 | 93.94 | 95.87 |
| | MRR | 89.99 | 93.04 |
VLM Comparison for Triple Generation (HITS@1 after Reranking)
| VLM Model | WikiDiverse | RichpediaMEL | WikiMEL |
|---|---|---|---|
| LLaVA-1.6-7B (open-source) | 86.43 | 81.94 | 86.22 |
| LLaVA-1.6-13B (open-source) | 85.94 | 83.26 | 85.96 |
| GPT-4o-mini (proprietary) | 88.23 | 85.21 | 90.58 |
Ablation Study (Average HITS@1 Drop)
| Removed Component | Avg. HITS@1 Drop | Interpretation |
|---|---|---|
| Without image embeddings (V) | -5.54% | Visual modality is the most impactful single component |
| Without triple embeddings (Z) | -1.62% | KG triples provide meaningful complementary signal |
Consider a mention with both text and image context referring to a basketball player. KGMEL generates triples such as "occupation: basketball player" from the image and "appeared_in: Thunderstruck" from surrounding text. These generated triples align with the correct entity's KG triples in Wikidata, enabling disambiguation when text or image features alone would match multiple similar entities. The triple-based semantic bridge is especially valuable for entities that share visual appearance or textual descriptions but differ in structured relational properties.
Up to 19.13% improvement: KGMEL achieves a 14.17-point absolute gain in HITS@1 on WikiDiverse over the previous best (M3EL), a 19.13% relative improvement, demonstrating that KG structure is a powerful complement to text and image signals.
Reranking adds significant value: The reranking stage improves HITS@1 by 6.11 points on WikiDiverse, 8.81 on RichpediaMEL, and 3.29 on WikiMEL over retrieval alone, with the largest gains where retrieval accuracy is lower.
Image embeddings are the most impactful modality: Removing image embeddings causes the largest drop (-5.54%), confirming the importance of visual information in MEL. This makes intuitive sense given that images provide complementary identity cues not present in text.
Robust across VLMs: The framework works with models ranging from open-source LLaVA-1.6-7B (86.43 on WikiDiverse) to proprietary GPT-4o-mini (88.23), showing that even smaller open-source VLMs produce useful triples, though GPT-4o-mini yields the best overall performance.
KG triples as semantic bridges: Triple embeddings bring distant mention-entity text embeddings closer in latent space, providing complementary signals when surface-level cues are ambiguous. t-SNE visualization of BERT embeddings empirically confirms this bridging effect.
Massive-scale KG utilization: The framework effectively handles knowledge bases with 32.8M to 65.1M triples across hundreds of thousands of entities, demonstrating scalability to real-world knowledge bases.
Triple filtering is essential: Semantic filtering of entity triples before LLM reranking removes noisy relations that would otherwise mislead the language model, improving both accuracy and computational efficiency.
Why It Matters
Entity linking is a foundational task for information retrieval, question answering, and knowledge-grounded dialogue systems. KGMEL makes four key contributions that advance the state of the art:
First end-to-end KG integration for MEL: By incorporating KG triples into all three stages (generation, retrieval, reranking), KGMEL shows that structural knowledge provides substantial and consistent improvements over methods using only text and image features. No prior work has systematically leveraged KG information across the entire MEL pipeline.
Practical and flexible design: The framework is compatible with various VLMs (from 7B open-source LLaVA to proprietary GPT-4o-mini) and uses zero-shot LLM reranking without task-specific fine-tuning, making it adaptable to different deployment constraints and budgets.
Novel technical components: The dual cross-attention mechanism with top-p selection for triple denoising, the gated fusion layer for adaptive modality weighting, and the semantic triple filtering before reranking are each individually effective innovations, as confirmed by the ablation study.
New research direction: The results demonstrate that entities' rich KG structure (averaging hundreds of triples across 32.8M-65.1M total triples per dataset) is a significantly underutilized resource in multimodal NLP, opening opportunities for KG-enhanced methods in related tasks such as visual question answering, multimodal knowledge base population, and cross-modal retrieval.