
KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking

SIGIR 2025
Juyeon Kim, Geon Lee, Taeuk Kim, Kijung Shin

One-Line Summary

A three-stage framework that systematically integrates knowledge graph triples into multimodal entity linking via VLM-based triple generation, contrastive retrieval with gated fusion, and LLM-based reranking, achieving up to 19.13% improvement in HITS@1 over prior methods across three benchmarks.

Example of multimodal entity linking using KGMEL
Figure 1. An example of multimodal entity linking (MEL) using KGMEL. KGMEL generates triples for the mention to be matched with knowledge graph (KG) triples in the knowledge base (KB). Blue and yellow arrows point to triples derived from visual and textual context, respectively.

Background & Motivation

Entity linking (EL) aligns textual mentions with their corresponding entities in a knowledge base, facilitating applications such as semantic search, question answering, and knowledge-grounded dialogue. Recent advances in multimodal entity linking (MEL) have shown that combining text and images can reduce ambiguity and improve alignment accuracy. However, most existing MEL methods overlook the rich structural information available in the form of knowledge graph (KG) triples -- a resource that is orders of magnitude richer than concise textual descriptions.

Key Gap in Existing MEL Methods:

  • Reliance on surface-level cues: Prior approaches depend primarily on textual and visual similarities, which can be insufficient when entities are visually or textually similar. For example, distinguishing between two basketball players who share similar appearances and descriptions requires deeper structural knowledge.
  • Underutilized KG structure: Knowledge bases like Wikidata contain entities with hundreds of triples on average, far richer than single concise textual descriptions, yet this structural information has been largely ignored in MEL research.
  • Semantic bridging potential: Triple embeddings can bring distant mention-entity text embeddings closer together in latent space, providing a semantic bridge that surface-level features alone cannot offer. A t-SNE visualization of BERT embeddings confirms that triple information pulls related mention-entity pairs closer.
  • No end-to-end KG integration: No prior work has systematically leveraged KG triples across all stages (generation, retrieval, and reranking) of the MEL pipeline.

Two Core Challenges for KG-Enhanced MEL:

  • Mentions lack inherent triples: Unlike entities in the knowledge base, mentions appearing in natural text do not come with structured triple representations. This asymmetry must be bridged before KG-based matching can occur.
  • Entities have noisy triples: Each entity may have hundreds of KG triples, many of which are irrelevant to a given mention. Naively using all triples introduces noise that can degrade performance rather than improve it.

KGMEL addresses both challenges by (1) using VLMs to generate triples for mentions from multimodal context, and (2) applying semantic filtering to select only the most relevant entity triples before matching and reranking.

Proposed Method: Three-Stage KG-Enhanced Framework

Overview of the KGMEL framework with three stages: Generation, Retrieval, and Reranking
Figure 3. Overview of KGMEL. The framework consists of three stages: (1) Generation: triples are generated for mentions using VLMs. (2) Retrieval: joint embeddings integrating textual, visual, and triple-based embeddings are used to retrieve K candidates. (3) Reranking: after filtering irrelevant KG triples, the best-matching entity is determined using LLMs.
Stage 1: Triple Generation (VLM-Based)
A vision-language model (GPT-4o-mini) analyzes both textual and visual context for each mention through a structured three-step prompt: (1) identify entity type via NER categories, (2) generate a concise entity description, (3) produce structured (head, relation, tail) triples. Twenty relation types are carefully selected based on frequency in the knowledge base and semantic relevance: instance of, subclass of, part of, has characteristic, field of work, occupation, sex or gender, country of citizenship, position held, religion or worldview, member of, owner of, country, capital, continent, located in, industry, participant, genre, and named after. The VLM extracts triples from both visual cues (e.g., "occupation: basketball player" from an image) and textual cues (e.g., "appeared_in: Thunderstruck" from surrounding text), formalized as Tm = VLM(Ptriple(tm, vm)).
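
The three-step prompt and output parsing described above can be sketched as follows. The prompt wording, function names, and the `head | relation | tail` line format are illustrative assumptions, not the paper's exact prompt; only the twenty relation types are taken from the paper.

```python
# Hypothetical sketch of Stage 1: building the structured three-step
# triple-generation prompt and parsing the VLM reply into triples.
# The exact prompt text and reply format are assumptions.

RELATIONS = [
    "instance of", "subclass of", "part of", "has characteristic",
    "field of work", "occupation", "sex or gender", "country of citizenship",
    "position held", "religion or worldview", "member of", "owner of",
    "country", "capital", "continent", "located in", "industry",
    "participant", "genre", "named after",
]

def build_triple_prompt(mention_text: str) -> str:
    """Three-step prompt: entity type -> description -> triples."""
    return (
        f"Mention: {mention_text}\n"
        "Step 1: Identify the entity type (NER category).\n"
        "Step 2: Write a concise one-sentence description.\n"
        "Step 3: Output triples as lines 'head | relation | tail', "
        f"using only these relations: {', '.join(RELATIONS)}."
    )

def parse_triples(vlm_reply: str) -> list[tuple[str, str, str]]:
    """Keep only well-formed lines whose relation is in the allowed set."""
    triples = []
    for line in vlm_reply.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and parts[1] in RELATIONS:
            triples.append(tuple(parts))
    return triples
```

Constraining the relation vocabulary at parse time mirrors the paper's design choice of restricting generation to twenty frequent, semantically relevant relations, which keeps mention triples comparable to Wikidata triples.
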
Stage 2: Candidate Entity Retrieval (Contrastive Learning with Gated Fusion)

Encoding: Frozen CLIP encodes text and images into d'-dimensional embeddings. Triple relations and tails are encoded separately into matrices Rm and Om, then combined via an MLP with residual connections: Z̃m = Om + MLP([Om || Rm]), preserving tail entity information.

Dual Cross-Attention: Computes relevance scores for each triple relative to both the text and visual modalities: sm = Softmax((β · Z̃m · Tmᵀ + (1 − β) · Z̃m · Vmᵀ) / τatt), with β = 0.5 balancing the modalities and τatt = 0.1 as the temperature. Top-p selection (p ∈ {3, 5}) then retains only the p highest-scoring triples, denoising the generated set.
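
The scoring and selection step can be sketched with toy 1-D embeddings; the real model applies the same formula as matrix products over CLIP embeddings, and the values below are illustrative only.

```python
import math

# Sketch of dual cross-attention triple scoring: each triple embedding z
# is scored against the text embedding t and the visual embedding v, the
# two scores are mixed with beta = 0.5, softmaxed at temperature 0.1,
# and only the top-p triples are kept.

def dual_attention_scores(Z, t, v, beta=0.5, tau=0.1):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    logits = [(beta * dot(z, t) + (1 - beta) * dot(z, v)) / tau for z in Z]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_p_triples(Z, scores, p=3):
    order = sorted(range(len(Z)), key=lambda i: -scores[i])
    return [Z[i] for i in order[:p]]
```

Note that a triple aligned with both modalities outscores one aligned with only text or only image, which is the intended denoising behavior.
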

Gated Fusion: Learned sigmoid gates combine three modality embeddings: Xm = gT · WT · Tm + gV · WV · Vm + WZ · Zm, where gT = σ(WT(g) · Tm + bT(g)) adaptively weights each modality.
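
A toy sketch of the gated fusion step with scalar weights standing in for the learned weight matrices WT, WV, WZ; the parameter values are hypothetical.

```python
import math

# Sketch of gated fusion Xm = gT·WT·Tm + gV·WV·Vm + WZ·Zm: sigmoid gates
# computed from each modality's own embedding adaptively weight how much
# text and visual signal enters the joint embedding.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(t, v, z, w_t=1.0, w_v=1.0, w_z=1.0,
                 wg_t=1.0, wg_v=1.0, b_t=0.0, b_v=0.0):
    g_t = sigmoid(wg_t * sum(t) + b_t)   # text gate gT
    g_v = sigmoid(wg_v * sum(v) + b_v)   # visual gate gV
    return [g_t * w_t * ti + g_v * w_v * vi + w_z * zi
            for ti, vi, zi in zip(t, v, z)]
```

Because the gates are functions of the inputs, an uninformative modality (e.g. a generic image) can be down-weighted per mention rather than globally, which is what the ablation credits over naive concatenation or addition.
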

Training: Three contrastive losses drive learning: (1) mention-entity loss LME aligns mentions with ground-truth entities, (2) mention-mention loss LMM separates distinct mentions, and (3) entity-entity loss LEE separates distinct entities. Combined as L = LME + λMM · LMM + λEE · LEE with λMM = λEE = 0.1 and contrastive temperature τcl = 0.1. Top-K=16 candidates are retrieved via dot-product similarity.
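
The mention-entity loss LME can be sketched as an InfoNCE-style objective: for each mention, the ground-truth entity in the batch is the positive and the other in-batch entities are negatives. This is a generic contrastive-loss sketch under that assumption, not the paper's exact implementation; LMM and LEE reuse the same form on mention pairs and entity pairs.

```python
import math

# InfoNCE-style sketch of the mention-entity contrastive loss L_ME with
# temperature tau_cl = 0.1. mentions[i] is assumed paired with
# entities[i] as its ground-truth entity.

def info_nce(mentions, entities, tau=0.1):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    loss = 0.0
    for i, m in enumerate(mentions):
        logits = [dot(m, e) / tau for e in entities]
        mx = max(logits)                 # log-sum-exp with max subtraction
        log_denom = mx + math.log(sum(math.exp(l - mx) for l in logits))
        loss += -(logits[i] - log_denom)  # positive is entities[i]
    return loss / len(mentions)
```

The combined objective would then be `info_nce(M, E) + 0.1 * info_nce(M, M') + 0.1 * info_nce(E, E')` with λMM = λEE = 0.1, pushing matched pairs together while separating distinct mentions and distinct entities.
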

Stage 3: Entity Reranking (LLM-Based with Triple Filtering)
A semantic triple filtering step identifies the top-n (n ∈ {10, 15}) most relevant relations and tails from each candidate entity's KG triples by computing similarity to the generated mention triples: Te(filt) = {(e,r,o) ∈ Te | r ∈ R(C(m), Tm) ∧ o ∈ O(C(m), Tm)}. This filtering removes noisy triples that could mislead the LLM. GPT-3.5-turbo then performs zero-shot reranking using a structured step-by-step prompt: (1) identify supporting triples that serve as evidence for each candidate, (2) determine the most appropriate entity from the filtered candidates. The final selection is em* = LLM(Prerank(tm, Tm, {te, Te(filt)}e ∈ C(m))).
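
The filtering rule can be sketched as follows. A toy token-overlap score stands in for the embedding similarity the paper computes, so the scoring function is an assumption; the structure (keep a triple only when both its relation and its tail rank in the top-n against the mention's generated triples) follows the formula above.

```python
# Sketch of semantic triple filtering before LLM reranking: a candidate
# entity's triple survives only if its relation AND its tail both rank
# among the top-n most similar to the mention's generated triples.

def overlap(a: str, b: str) -> float:
    # Jaccard token overlap as a toy stand-in for embedding similarity.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def filter_triples(entity_triples, mention_triples, n=10):
    rels = {r for _, r, _ in mention_triples}
    tails = {o for _, _, o in mention_triples}
    rel_score = lambda r: max((overlap(r, mr) for mr in rels), default=0.0)
    tail_score = lambda o: max((overlap(o, mo) for mo in tails), default=0.0)
    top_rels = set(sorted({r for _, r, _ in entity_triples},
                          key=rel_score, reverse=True)[:n])
    top_tails = set(sorted({o for _, _, o in entity_triples},
                           key=tail_score, reverse=True)[:n])
    return [t for t in entity_triples if t[1] in top_rels and t[2] in top_tails]
```

The filtered triples, not the full (often hundreds-large) triple sets, are what reach the reranking prompt, which keeps the LLM's context focused on evidence relevant to the mention.
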

Experimental Results

KGMEL is evaluated on three MEL benchmarks that use Wikidata as the knowledge base. Results are reported as the mean over three runs, and KGMEL achieves state-of-the-art HITS@1 on all three datasets.

Dataset Statistics

| Statistic | WikiDiverse | RichpediaMEL | WikiMEL |
| --- | --- | --- | --- |
| Sentences | 7,405 | 17,724 | 22,070 |
| Mentions | 15,093 | 17,805 | 25,846 |
| KG Triples | 60.8M | 32.8M | 65.1M |
| Candidate Entities | 132,460 | 160,935 | 109,976 |
| Total Entities | 776,407 | 831,737 | 761,343 |
| Relations | 1,322 | 1,288 | 1,289 |

Main Results (HITS@1)

| Method | WikiDiverse | RichpediaMEL | WikiMEL |
| --- | --- | --- | --- |
| M3EL | 74.06 | -- | -- |
| IIER | -- | 84.63 | -- |
| OT-MEL | -- | -- | 88.97 |
| KGMEL (retrieval only) | 82.12 | 76.40 | 87.29 |
| KGMEL (retrieval + rerank) | 88.23 | 85.21 | 90.58 |

Extended Metrics

| Dataset | Metric | Retrieval | + Rerank |
| --- | --- | --- | --- |
| WikiDiverse | HITS@3 | 90.28 | 92.82 |
| | HITS@5 | 92.07 | 93.61 |
| | MRR | 86.00 | 90.84 |
| RichpediaMEL | HITS@3 | 85.92 | 89.85 |
| | HITS@5 | 88.82 | 91.32 |
| | MRR | 80.94 | 88.08 |
| WikiMEL | HITS@3 | 92.47 | 95.18 |
| | HITS@5 | 93.94 | 95.87 |
| | MRR | 89.99 | 93.04 |

VLM Comparison for Triple Generation (HITS@1 after Reranking)

| VLM Model | WikiDiverse | RichpediaMEL | WikiMEL |
| --- | --- | --- | --- |
| LLaVA-1.6-7B (open-source) | 86.43 | 81.94 | 86.22 |
| LLaVA-1.6-13B (open-source) | 85.94 | 83.26 | 85.96 |
| GPT-4o-mini (proprietary) | 88.23 | 85.21 | 90.58 |

Ablation Study (Average HITS@1 Drop)

| Removed Component | Avg. HITS@1 Drop | Interpretation |
| --- | --- | --- |
| Without image embeddings (V) | -5.54% | Visual modality is the most impactful single component |
| Without triple embeddings (Z) | -1.62% | KG triples provide meaningful complementary signal |
| Without gated fusion layer | -1.29% | Adaptive gating outperforms naive concatenation/addition |

Case Study: How KG Triples Disambiguate Entities

Consider a mention with both text and image context referring to a basketball player. KGMEL generates triples such as "occupation: basketball player" from the image and "appeared_in: Thunderstruck" from surrounding text. These generated triples align with the correct entity's KG triples in Wikidata, enabling disambiguation when text or image features alone would match multiple similar entities. The triple-based semantic bridge is especially valuable for entities that share visual appearance or textual descriptions but differ in structured relational properties.

Why It Matters

Entity linking is a foundational task for information retrieval, question answering, and knowledge-grounded dialogue systems. KGMEL advances the state of the art with four key contributions:

  • The first systematic integration of KG triples across every stage of the MEL pipeline (generation, retrieval, and reranking).
  • VLM-based triple generation that equips mentions with the structured representations they otherwise lack.
  • A gated-fusion retriever trained with mention-entity, mention-mention, and entity-entity contrastive losses.
  • LLM-based reranking over semantically filtered entity triples, yielding up to 19.13% HITS@1 improvement over prior methods.
