A three-stage framework that systematically integrates knowledge graph triples into multimodal entity linking via VLM-based triple generation, contrastive retrieval with gated fusion, and LLM-based reranking, achieving up to 19.13% improvement in HITS@1 over prior methods across three benchmarks.
Figure 1. An example of multimodal entity linking (MEL) using KGMEL. KGMEL generates triples for the mention to be matched with knowledge graph (KG) triples in the knowledge base (KB). Blue and yellow arrows point to triples derived from visual and textual context, respectively.
Background & Motivation
Entity linking (EL) aligns textual mentions with their corresponding entities in a knowledge base, facilitating applications such as semantic search, question answering, and knowledge-grounded dialogue. Recent advances in multimodal entity linking (MEL) have shown that combining text and images can reduce ambiguity and improve alignment accuracy. However, most existing MEL methods overlook the rich structural information available in the form of knowledge graph (KG) triples -- a resource that is orders of magnitude richer than concise textual descriptions.
Key Gap in Existing MEL Methods:
Reliance on surface-level cues: Prior approaches depend primarily on textual and visual similarities, which can be insufficient when entities are visually or textually similar. For example, distinguishing between two basketball players who share similar appearances and descriptions requires deeper structural knowledge.
Underutilized KG structure: Knowledge bases like Wikidata contain entities with hundreds of triples on average, far richer than single concise textual descriptions, yet this structural information has been largely ignored in MEL research.
Semantic bridging potential: Triple embeddings can bring distant mention-entity text embeddings closer together in latent space, providing a semantic bridge that surface-level features alone cannot offer. t-SNE visualization of BERT embeddings confirms that triple information pulls related mention-entity pairs closer.
No end-to-end KG integration: No prior work has systematically leveraged KG triples across all stages (generation, retrieval, and reranking) of the MEL pipeline.
Two Core Challenges for KG-Enhanced MEL:
Mentions lack inherent triples: Unlike entities in the knowledge base, mentions appearing in natural text do not come with structured triple representations. This asymmetry must be bridged before KG-based matching can occur.
Entities have noisy triples: Each entity may have hundreds of KG triples, many of which are irrelevant to a given mention. Naively using all triples introduces noise that can degrade performance rather than improve it.
KGMEL addresses both challenges by (1) using VLMs to generate triples for mentions from multimodal context, and (2) applying semantic filtering to select only the most relevant entity triples before matching and reranking.
Figure 3. Overview of KGMEL. The framework consists of three stages: (1) Generation: triples are generated for mentions using VLMs. (2) Retrieval: joint embeddings integrating textual, visual, and triple-based embeddings are used to retrieve K candidates. (3) Reranking: after filtering irrelevant KG triples, the best-matching entity is determined using LLMs.
Stage 1: Triple Generation (VLM-Based)
A vision-language model (GPT-4o-mini) analyzes both textual and visual context for each mention through a structured three-step prompt: (1) identify entity type via NER categories, (2) generate a concise entity description, (3) produce structured (head, relation, tail) triples. Twenty relation types are carefully selected based on frequency in the knowledge base and semantic relevance: instance of, subclass of, part of, has characteristic, field of work, occupation, sex or gender, country of citizenship, position held, religion or worldview, member of, owner of, country, capital, continent, located in, industry, participant, genre, and named after. The VLM extracts triples from both visual cues (e.g., "occupation: basketball player" from an image) and textual cues (e.g., "appeared_in: Thunderstruck" from surrounding text), formalized as Tm = VLM(Ptriple(tm, vm)).
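The prompt construction above can be sketched as follows. This is a minimal illustration, not the paper's actual prompt: the `build_triple_prompt` helper and the exact wording are assumptions, while the three-step structure and the 20 relation types are taken from the text.

```python
# Sketch of the three-step triple-generation prompt described above.
# The wording is hypothetical; the step structure and relation list follow the text.
RELATIONS = [
    "instance of", "subclass of", "part of", "has characteristic",
    "field of work", "occupation", "sex or gender", "country of citizenship",
    "position held", "religion or worldview", "member of", "owner of",
    "country", "capital", "continent", "located in", "industry",
    "participant", "genre", "named after",
]

def build_triple_prompt(mention_text: str) -> str:
    """Builds the structured text prompt sent to the VLM alongside the mention image."""
    return (
        f"Mention context: {mention_text}\n"
        "Step 1: Identify the entity type of the mention (NER category).\n"
        "Step 2: Write a concise one-sentence description of the entity.\n"
        "Step 3: Output (head, relation, tail) triples, using only these relations: "
        + ", ".join(RELATIONS) + "."
    )
```

The returned string would be paired with the mention image in a single multimodal request, i.e. Tm = VLM(Ptriple(tm, vm)).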
Stage 2: Candidate Entity Retrieval (Contrastive Learning with Gated Fusion)
Encoding: Frozen CLIP encodes text and images into d'-dimensional embeddings. Triple relations and tails are encoded separately into matrices Rm and Om, then combined via an MLP with residual connections: Z̃m = Om + MLP([Om || Rm]), preserving tail entity information.
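The residual combination of relation and tail embeddings can be sketched as below. Only the form Z̃m = Om + MLP([Om ‖ Rm]) comes from the text; the MLP depth, hidden size, and activation are assumptions.

```python
import torch
import torch.nn as nn

class TripleEncoder(nn.Module):
    """Combine per-triple relation (R) and tail (O) embeddings via a residual MLP:
    Z~ = O + MLP([O || R]). The residual path preserves tail-entity information."""

    def __init__(self, d: int):
        super().__init__()
        # Two-layer MLP over the concatenated (tail || relation) embedding;
        # depth and activation are illustrative choices, not from the paper.
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, O: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
        # O, R: (num_triples, d) tail and relation embeddings for one mention
        return O + self.mlp(torch.cat([O, R], dim=-1))
```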
Dual Cross-Attention: Computes relevance scores for each triple relative to both text and visual modalities: sm = Softmax((β · Z̃m · TmT + (1-β) · Z̃m · VmT) / τatt), with β=0.5 balancing modalities and τatt=0.1 as temperature. Top-p selection (p ∈ {3, 5}) retains only the highest-ranked triples for denoising.
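A simplified version of this scoring step is sketched below, treating the mention's text and image embeddings as single vectors (the paper's Tm and Vm may be token-level matrices, so this is an assumption). The defaults mirror β = 0.5, τ_att = 0.1, and top-p selection.

```python
import torch

def triple_scores(Z: torch.Tensor, T: torch.Tensor, V: torch.Tensor,
                  beta: float = 0.5, tau: float = 0.1, p: int = 3):
    """Dual cross-attention sketch: score each triple embedding against the
    text (T) and image (V) embeddings, softmax with temperature tau, and
    keep the top-p triples for denoising.
    Z: (n, d) triple embeddings; T, V: (d,) mention text/image embeddings."""
    logits = (beta * (Z @ T) + (1 - beta) * (Z @ V)) / tau
    s = torch.softmax(logits, dim=0)          # relevance distribution over triples
    top = torch.topk(s, k=min(p, Z.size(0)))  # top-p selection
    return Z[top.indices], s
```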
Gated Fusion: Learned sigmoid gates combine three modality embeddings: Xm = gT · WT · Tm + gV · WV · Vm + WZ · Zm, where gT = σ(WT(g) · Tm + bT(g)) adaptively weights each modality.
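The gated fusion layer can be sketched as follows. The scalar-per-example gate and the bias-free projections are assumptions; only the overall form Xm = gT·WT·Tm + gV·WV·Vm + WZ·Zm and the sigmoid gates come from the text.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Adaptive modality weighting: X = g_T * W_T(T) + g_V * W_V(V) + W_Z(Z),
    where g_T = sigmoid(W_T^(g) T + b_T^(g)) and likewise for g_V.
    Gates are modeled here as one scalar per example (an assumption)."""

    def __init__(self, d: int):
        super().__init__()
        self.W_T, self.W_V, self.W_Z = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.gate_T = nn.Linear(d, 1)  # learned sigmoid gate for text
        self.gate_V = nn.Linear(d, 1)  # learned sigmoid gate for image

    def forward(self, T: torch.Tensor, V: torch.Tensor, Z: torch.Tensor) -> torch.Tensor:
        g_T = torch.sigmoid(self.gate_T(T))
        g_V = torch.sigmoid(self.gate_V(V))
        return g_T * self.W_T(T) + g_V * self.W_V(V) + self.W_Z(Z)
```

Note that the triple embedding Z enters ungated, matching the formula above where only the text and image terms carry gates.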
Training: Three contrastive losses drive learning: (1) mention-entity loss LME aligns mentions with ground-truth entities, (2) mention-mention loss LMM separates distinct mentions, and (3) entity-entity loss LEE separates distinct entities. Combined as L = LME + λMM · LMM + λEE · LEE with λMM = λEE = 0.1 and contrastive temperature τcl = 0.1. Top-K=16 candidates are retrieved via dot-product similarity.
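The combined objective can be sketched with an in-batch InfoNCE-style loss. The exact formulation of the mention-mention and entity-entity terms is simplified here (an assumption); the weighting L = LME + λMM·LMM + λEE·LEE with λ = 0.1 and τ_cl = 0.1 follows the text.

```python
import torch
import torch.nn.functional as F

def info_nce(X: torch.Tensor, Y: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """In-batch contrastive loss: X[i] should match Y[i] against all other rows."""
    logits = (F.normalize(X, dim=-1) @ F.normalize(Y, dim=-1).T) / tau
    targets = torch.arange(X.size(0))
    return F.cross_entropy(logits, targets)

def total_loss(M: torch.Tensor, E: torch.Tensor,
               lam_mm: float = 0.1, lam_ee: float = 0.1,
               tau: float = 0.1) -> torch.Tensor:
    """L = L_ME + lam_mm * L_MM + lam_ee * L_EE.
    M: (B, d) mention embeddings; E: (B, d) ground-truth entity embeddings.
    The self-contrastive MM/EE terms below push distinct mentions (resp.
    entities) apart -- a simplification of the paper's formulation."""
    l_me = info_nce(M, E, tau)  # align mentions with ground-truth entities
    l_mm = info_nce(M, M, tau)  # separate distinct mentions
    l_ee = info_nce(E, E, tau)  # separate distinct entities
    return l_me + lam_mm * l_mm + lam_ee * l_ee
```

At inference, the fused mention embedding is scored against all entity embeddings by dot product and the top-K = 16 candidates are retained.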
Stage 3: Entity Reranking (LLM-Based with Triple Filtering)
A semantic triple filtering step identifies the top-n (n ∈ {10, 15}) most relevant relations and tails from each candidate entity's KG triples by computing similarity to the generated mention triples: Te(filt) = {(e,r,o) ∈ Te | r ∈ R(C(m), Tm) ∧ o ∈ O(C(m), Tm)}. This filtering removes noisy triples that could mislead the LLM. GPT-3.5-turbo then performs zero-shot reranking using a structured step-by-step prompt: (1) identify supporting triples that serve as evidence for each candidate, (2) determine the most appropriate entity from the filtered candidates. The final selection is em* = LLM(Prerank(tm, Tm, {te, Te(filt)}e ∈ C(m))).
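The filtering step can be sketched as a similarity-based top-n selection. This is a minimal sketch: scoring each entity triple by its maximum cosine similarity to any generated mention triple is an assumed concrete choice, as is the `embed_fn` text encoder; the paper filters relations and tails via its R(·) and O(·) sets.

```python
import torch
import torch.nn.functional as F

def filter_triples(entity_triples, mention_embs, embed_fn, n: int = 10):
    """Keep the top-n entity triples most similar to the mention's generated
    triples. mention_embs: (m, d) L2-normalized embeddings of the generated
    mention triples; embed_fn maps a triple string to a (d,) vector."""
    scored = []
    for t in entity_triples:
        e = F.normalize(embed_fn(t), dim=0)
        # Best match against any generated mention triple
        scored.append((float((mention_embs @ e).max()), t))
    scored.sort(key=lambda pair: -pair[0])
    return [t for _, t in scored[:n]]
```

The surviving triples for each candidate are then serialized into the zero-shot reranking prompt, so the LLM only sees relations likely to bear on the mention.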
Experimental Results
KGMEL is evaluated on three MEL benchmarks with Wikidata as the knowledge base. Results are reported as the mean over 3 runs. It achieves state-of-the-art HITS@1 across all three datasets.
Dataset Statistics
| Statistic | WikiDiverse | RichpediaMEL | WikiMEL |
|---|---|---|---|
| Sentences | 7,405 | 17,724 | 22,070 |
| Mentions | 15,093 | 17,805 | 25,846 |
| KG Triples | 60.8M | 32.8M | 65.1M |
| Candidate Entities | 132,460 | 160,935 | 109,976 |
| Total Entities | 776,407 | 831,737 | 761,343 |
| Relations | 1,322 | 1,288 | 1,289 |
Main Results (HITS@1)
| Method | WikiDiverse | RichpediaMEL | WikiMEL |
|---|---|---|---|
| M3EL | 74.06 | -- | -- |
| IIER | -- | 84.63 | -- |
| OT-MEL | -- | -- | 88.97 |
| KGMEL (retrieval only) | 82.12 | 76.40 | 87.29 |
| KGMEL (retrieval + rerank) | 88.23 | 85.21 | 90.58 |
Extended Metrics
| Dataset | Metric | Retrieval | + Rerank |
|---|---|---|---|
| WikiDiverse | HITS@3 | 90.28 | 92.82 |
| | HITS@5 | 92.07 | 93.61 |
| | MRR | 86.00 | 90.84 |
| RichpediaMEL | HITS@3 | 85.92 | 89.85 |
| | HITS@5 | 88.82 | 91.32 |
| | MRR | 80.94 | 88.08 |
| WikiMEL | HITS@3 | 92.47 | 95.18 |
| | HITS@5 | 93.94 | 95.87 |
| | MRR | 89.99 | 93.04 |
VLM Comparison for Triple Generation (HITS@1 after Reranking)
| VLM Model | WikiDiverse | RichpediaMEL | WikiMEL |
|---|---|---|---|
| LLaVA-1.6-7B (open-source) | 86.43 | 81.94 | 86.22 |
| LLaVA-1.6-13B (open-source) | 85.94 | 83.26 | 85.96 |
| GPT-4o-mini (proprietary) | 88.23 | 85.21 | 90.58 |
Ablation Study (Average HITS@1 Drop)
| Removed Component | Avg. HITS@1 Drop | Interpretation |
|---|---|---|
| Without image embeddings (V) | -5.54% | Visual modality is the most impactful single component |
| Without triple embeddings (Z) | -1.62% | KG triples provide meaningful complementary signal |
Consider a mention with both text and image context referring to a basketball player. KGMEL generates triples such as "occupation: basketball player" from the image and "appeared_in: Thunderstruck" from surrounding text. These generated triples align with the correct entity's KG triples in Wikidata, enabling disambiguation when text or image features alone would match multiple similar entities. The triple-based semantic bridge is especially valuable for entities that share visual appearance or textual descriptions but differ in structured relational properties.
Up to 19.13% improvement: KGMEL achieves a 14.17-point absolute gain in HITS@1 on WikiDiverse over the previous best (M3EL), a 19.13% relative improvement, demonstrating that KG structure is a powerful complement to text and image signals.
Reranking adds significant value: The reranking stage improves HITS@1 by 6.11 points on WikiDiverse, 8.81 on RichpediaMEL, and 3.29 on WikiMEL over retrieval alone, with the largest gains where retrieval accuracy is lower.
Image embeddings are the most impactful modality: Removing image embeddings causes the largest drop (-5.54%), confirming the importance of visual information in MEL. This makes intuitive sense given that images provide complementary identity cues not present in text.
Robust across VLMs: The framework works with models ranging from open-source LLaVA-1.6-7B (86.43 on WikiDiverse) to proprietary GPT-4o-mini (88.23), showing that even smaller open-source VLMs produce useful triples, though GPT-4o-mini yields the best overall performance.
KG triples as semantic bridges: Triple embeddings bring distant mention-entity text embeddings closer in latent space, providing complementary signals when surface-level cues are ambiguous. t-SNE visualization of BERT embeddings empirically confirms this bridging effect.
Massive-scale KG utilization: The framework effectively handles knowledge bases with 32.8M to 65.1M triples across hundreds of thousands of entities, demonstrating scalability to real-world knowledge bases.
Triple filtering is essential: Semantic filtering of entity triples before LLM reranking removes noisy relations that would otherwise mislead the language model, improving both accuracy and computational efficiency.
Why It Matters
Entity linking is a foundational task for information retrieval, question answering, and knowledge-grounded dialogue systems. KGMEL makes four key contributions that advance the state of the art:
First end-to-end KG integration for MEL: By incorporating KG triples into all three stages (generation, retrieval, reranking), KGMEL shows that structural knowledge provides substantial and consistent improvements over methods using only text and image features. No prior work has systematically leveraged KG information across the entire MEL pipeline.
Practical and flexible design: The framework is compatible with various VLMs (from 7B open-source LLaVA to proprietary GPT-4o-mini) and uses zero-shot LLM reranking without task-specific fine-tuning, making it adaptable to different deployment constraints and budgets.
Novel technical components: The dual cross-attention mechanism with top-p selection for triple denoising, the gated fusion layer for adaptive modality weighting, and the semantic triple filtering before reranking are each individually effective innovations, as confirmed by the ablation study.
New research direction: The results demonstrate that entities' rich KG structure (averaging hundreds of triples across 32.8M-65.1M total triples per dataset) is a significantly underutilized resource in multimodal NLP, opening opportunities for KG-enhanced methods in related tasks such as visual question answering, multimodal knowledge base population, and cross-modal retrieval.