
Subgraph-Aware Training of Language Models for Knowledge Graph Completion Using Structure-Aware Contrastive Learning

WWW 2025
Youmin Ko, Hyemin Yang, Taeuk Kim, Hyunjoon Kim

One-Line Summary

FLAME extracts context-aware hidden states from the intermediate layers of frozen LLMs and trains only a lightweight classifier for knowledge graph completion, matching fine-tuned performance with 188x less GPU memory and a 26x end-to-end speedup by bridging the LLM-KG semantic gap through subgraph-based entity descriptions and sliced-mutual-information layer selection.

Figure 1. The overall architecture of FLAME for triple classification. Positive and negative sample pairs are constructed, then the middle layers of a frozen language model are probed to obtain hidden states for the KGC task. Entity descriptions are generated by the subgraph entity description generator. The only trainable component is the data-efficient classifier used to classify hidden states.

Background & Motivation

Knowledge Graph Completion (KGC) -- predicting missing links in knowledge graphs -- is essential for maintaining and expanding large-scale knowledge bases like Freebase, WordNet, and UMLS. Traditional structural embedding methods (e.g., TransE, DistMult, ComplEx, RotatE) learn vector representations of entities and relations but struggle with sparse entities that have few connections, because they rely solely on graph topology. Recent work has turned to large language models (LLMs) for their rich semantic understanding and encyclopedic world knowledge (acquired through pretraining on Wikipedia, CommonCrawl, etc.), but this creates a fundamental trade-off.

The Core Dilemma in LLM-Based KGC:

  • Fine-tuning is effective but expensive: Methods like KG-LLAMA achieve strong KGC performance but require 14.68 GB of GPU memory and 83 hours of training -- prohibitive for many research groups and real-world deployments.
  • Non-fine-tuned approaches are cheap but weak: Directly prompting frozen LLMs for KGC yields surprisingly poor results -- a frozen LLaMA-7B achieves only 9.1% accuracy on FB13 triple classification (well below the 50% chance level for a binary task), and even with in-context learning (ICL) it reaches only 50.1%, far below the 89.2% of fine-tuned KG-LLAMA.
  • Semantic gap between LLMs and KGs: Structured triples (e.g., (Einstein, bornIn, Ulm)) deviate significantly from the natural language distributions that LLMs are pretrained on, limiting the effectiveness of naive probing approaches. Directly concatenating raw triples as entity descriptions can actually hurt performance.
  • Unclear where task knowledge resides: It is unknown which intermediate layers of a frozen LLM encode the most task-relevant information for KGC. Top layers tend to suffer from hallucination effects, while bottom layers lack sufficient abstraction. This makes feature extraction a guessing game without a principled selection criterion.

FLAME addresses all four challenges by (1) generating natural-language entity descriptions from local subgraph neighborhoods to bridge the semantic gap, (2) probing intermediate layers of frozen LLMs to extract KGC-relevant representations, (3) using sliced mutual information to identify the optimal layers, and (4) training only a lightweight classifier -- leaving the LLM entirely frozen. The key insight is that frozen LLMs already possess sufficient encyclopedic knowledge for KGC on common knowledge graphs; the challenge is unlocking this knowledge through proper representation extraction rather than expensive parameter updates.

Proposed Method: FLAME Framework

FLAME (Frozen LLM Approach for KGC with Model-Friendly Entity Descriptions) consists of three main components that work together to extract knowledge graph completion capabilities from frozen language models without any parameter updates to the base model.

Figure 2. Prompt templates for triple classification. The templates structure positive and negative triple pairs with entity descriptions to stimulate the internal classification mechanisms of frozen LLMs.
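The exact template text appears in Figure 2 rather than being reproduced here; the following hypothetical `build_prompt` helper only illustrates the general shape -- entity descriptions prepended to a yes/no question about the triple -- and its wording is an assumption, not the paper's template:

```python
def build_prompt(head, relation, tail, desc_head="", desc_tail=""):
    """Hypothetical triple-classification prompt in the spirit of
    Figure 2; the exact wording in the paper may differ."""
    lines = []
    if desc_head:
        lines.append(f"{head}: {desc_head}")   # description of the head entity
    if desc_tail:
        lines.append(f"{tail}: {desc_tail}")   # description of the tail entity
    lines.append(f"Is it true that {head} {relation} {tail}?")
    return "\n".join(lines)

prompt = build_prompt("Einstein", "was born in", "Ulm",
                      desc_head="Einstein was a physicist who worked at Princeton.")
```

The same template is filled with a corrupted (negative) triple to produce the paired negative sample.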
1. Subgraph Entity Description Generator
For each entity e in a triple, the generator creates textual descriptions D(e) using two approaches. Structured verbalization (Tri) directly concatenates verbalized one-hop triples from local subgraphs (e.g., "Einstein was born in Ulm. Einstein worked at Princeton."). However, this raw triple format can actually mislead the model because it deviates from pretraining distributions. Model-friendly narrative (GPT) transforms these structured triples into fluent natural language using in-context learning with GPT-3.5-turbo, producing semantically aligned descriptions that better match the LLM's pretraining corpus. This bridging step is critical: while the Tri approach yields only 0.847 on FB13 (worse than no descriptions at 0.851), the GPT-generated narrative achieves 0.890 -- a 6.7% improvement over baseline. One notable exception is the UMLS biomedical dataset, where domain-specific entity names already align well with the LLM's internal representations, making generated descriptions unnecessary.
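As a concrete illustration, the Tri-style verbalization above can be sketched in a few lines of Python. The triples, the camelCase-to-phrase heuristic, and the `describe_entity` helper are illustrative stand-ins, not the paper's implementation (which further rewrites these sentences into fluent narrative with GPT-3.5-turbo):

```python
# Sketch of the structured-verbalization (Tri) step: turn an entity's
# one-hop subgraph into a textual description D(e). The triples and the
# relation-to-phrase heuristic below are illustrative, not from the paper.

def verbalize_triple(head, relation, tail):
    """Render one (head, relation, tail) triple as a short sentence."""
    # Naive camelCase relation -> phrase, e.g. "bornIn" -> "born in".
    words = []
    for ch in relation:
        if ch.isupper():
            words.append(" " + ch.lower())
        else:
            words.append(ch)
    phrase = "".join(words)
    return f"{head} {phrase} {tail}."

def describe_entity(entity, triples, max_triples=5):
    """Concatenate verbalized one-hop triples where `entity` is the head."""
    hops = [t for t in triples if t[0] == entity][:max_triples]
    return " ".join(verbalize_triple(*t) for t in hops)

kg = [
    ("Einstein", "bornIn", "Ulm"),
    ("Einstein", "workedAt", "Princeton"),
    ("Ulm", "locatedIn", "Germany"),
]
print(describe_entity("Einstein", kg))
# -> Einstein born in Ulm. Einstein worked at Princeton.
```

The GPT step then rewrites this stilted output into fluent narrative (e.g. "Einstein was born in Ulm and worked at Princeton"), which is what closes the gap to the pretraining distribution.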
2. Frozen LLM Probing with SMI-Guided Layer Selection
Rather than fine-tuning, FLAME extracts context-aware hidden states from intermediate layers of frozen LLMs at the final token position using task-specific prompts constructed from positive and negative triple pairs. The key question is: which layer to probe? Sliced Mutual Information (SMI) -- which measures the expectation of mutual information between labels and random one-dimensional projections of representations -- provides a principled answer. SMI analysis reveals that intermediate layers (around layer 16 for LLaMA-7B) are optimal, while top layers degrade due to hallucination effects and bottom layers lack sufficient abstraction. Model-friendly descriptions boost SMI values by 34.1%, confirming effective semantic alignment. This pattern is consistent across different model architectures (LLaMA, Mistral, Gemma).
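To make the layer-selection criterion concrete, here is a NumPy-only Monte-Carlo sketch of SMI: average the mutual information between labels and random one-dimensional projections of the representations, using a simple histogram estimator per slice. The binning scheme and the `hidden_by_layer` name are assumptions for illustration; the paper's estimator may differ:

```python
import numpy as np

def sliced_mutual_info(X, y, n_projections=100, n_bins=16, seed=0):
    """Monte-Carlo estimate of sliced mutual information between
    representations X (n_samples, dim) and discrete labels y: average
    I(theta^T X; y) over random unit directions theta, with a simple
    histogram (binning) MI estimator for each 1-D projection."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)          # random unit direction
        z = X @ theta                           # 1-D slice of the features
        # Discretize the projection and estimate I(bin(z); y).
        edges = np.histogram_bin_edges(z, bins=n_bins)
        zb = np.clip(np.digitize(z, edges[1:-1]), 0, n_bins - 1)
        joint = np.zeros((n_bins, len(classes)))
        for i, c in enumerate(classes):
            joint[:, i] = np.bincount(zb[y == c], minlength=n_bins)
        joint /= joint.sum()
        pz = joint.sum(axis=1, keepdims=True)   # marginal over bins
        py = joint.sum(axis=0, keepdims=True)   # marginal over labels
        nz = joint > 0
        total += (joint[nz] * np.log(joint[nz] / (pz @ py)[nz])).sum()
    return total / n_projections

# Layer selection: probe each layer's hidden states and keep the layer
# with the highest SMI estimate (hidden_by_layer is hypothetical).
# best = max(hidden_by_layer, key=lambda l: sliced_mutual_info(hidden_by_layer[l], labels))
```

Run per layer, this score peaks at the intermediate layers and drops at the top, matching the pattern the paper reports.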
3. Data-Efficient KGC Classifier
A lightweight classifier is trained on the extracted hidden-state representations to perform triple classification, distinguishing valid triples from invalid ones. The paper evaluates three classifier architectures: logistic regression, SVM, and MLP. MLP achieves the best performance (0.851 on FB13, 0.874 on WN11, 0.679 on FB15K-237N), outperforming both simpler alternatives. Because only the classifier is trained (not the LLM), this step requires minimal compute -- just 0.078 GB of GPU memory versus 14.68 GB for full fine-tuning. The method generalizes across multiple LLM architectures including Mistral, Gemma, and Qwen2.5.
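A minimal sketch of this probing-classifier step, assuming binary triple labels and using a plain logistic-regression head trained by gradient descent (the paper's best-performing head is an MLP, and here synthetic features stand in for the real layer-16 hidden states):

```python
import numpy as np

def train_probe_classifier(H, y, lr=0.1, epochs=500, seed=0):
    """Train a logistic-regression probe on frozen hidden states H
    (n_samples, dim) with binary labels y. A stand-in for the paper's
    lightweight classifier (they also evaluate SVM and MLP heads)."""
    rng = np.random.default_rng(seed)
    n, d = H.shape
    w = rng.normal(scale=0.01, size=d)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(H @ w + b)))   # sigmoid probabilities
        grad_w = H.T @ (p - y) / n               # cross-entropy gradient
        grad_b = (p - y).mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(H, w, b):
    """Classify triples as valid (1) or invalid (0)."""
    return (H @ w + b > 0).astype(int)

# Usage: H would hold final-token hidden states from the chosen layer;
# here separable random features stand in for them.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
H = rng.normal(size=(200, 32)) + y[:, None] * 2.0
w, b = train_probe_classifier(H, y)
acc = (predict(H, w, b) == y).mean()
```

Because the LLM never receives gradients, the whole training loop fits comfortably on a single small GPU or even a CPU.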
Figure 3. Prompt template used to generate model-friendly entity descriptions via GPT-3.5-turbo. One-hop subgraph triples are transformed into fluent natural language narratives through in-context learning.

Experimental Results

FLAME is evaluated on six benchmark datasets spanning triple classification, relation prediction, and entity prediction tasks. The datasets include FB13, WN11, FB15K-237N, and WN18RR (derived from Freebase and WordNet), UMLS (biomedical domain), and YAGO3-10 (large-scale with over 1M training triples). Baselines span both structural methods (TransE, DistMult, ComplEx, RotatE) and LLM-based approaches (KG-BERT, KG-T5, KG-LLAMA, and frozen variants with various prompting strategies).

Triple Classification Accuracy

| Method | FB13 | WN11 | FB15K-237N | WN18RR | UMLS |
|---|---|---|---|---|---|
| LLaMA-7B (frozen, no prompt) | 0.091 | -- | -- | -- | -- |
| LLaMA-7B-ICL (frozen, in-context) | 0.501 | -- | -- | -- | -- |
| KG-LLAMA-7B (fine-tuned, full data) | 0.892 | 0.955 | 0.748 | 0.921 | 0.858 |
| FLAME w/ MLP only (no desc.) | 0.851 | 0.874 | 0.679 | -- | -- |
| FLAME w/ Non-Generated (3k samples) | 0.901 | -- | 0.738 | 0.934 | 0.862 |
| FLAME w/ GPT desc. (3k samples) | 0.912 | 0.917 | 0.726 | 0.924 | 0.860 |
| FLAME w/ GPT desc. (full data) | 0.925 | 0.937 | 0.744 | 0.938 | 0.866 |

Efficiency Comparison (WN11, Full Dataset)

| Metric | KG-LLAMA (fine-tuned) | FLAME (frozen) | Gain |
|---|---|---|---|
| Training GPU Memory | 14.68 GB | 0.078 GB | 188x reduction |
| Training Time | 83 hours | 33 minutes | 150x faster |
| Total Time (incl. inference) | 85h 50min | 2h 44min + 15s | 26.11x speedup |

Relation & Entity Prediction (Hits@1)

| Task | Method | Hits@1 | Training Data |
|---|---|---|---|
| Relation Prediction | ChatGLM-6B (frozen) | 0.0658 | -- |
| Relation Prediction | KG-LLAMA-7B | 0.7028 | Full (1.08M) |
| Relation Prediction | FLAME w/ GPT | 0.7015 | 6,996 (0.6%) |
| Entity Prediction | KG-LLAMA-7B | 0.2415 | Full |
| Entity Prediction | FLAME w/ GPT | 0.2495 | 10k |
Figure 4. Layer-wise performance analysis across LLaMA and Mistral models. Intermediate layers (around layer 16) consistently outperform both shallow and deep layers. Top layers degrade due to hallucination effects, while bottom layers lack sufficient abstraction for the KGC task.
Figure 5. Data efficiency analysis: impact of training set size on triple classification accuracy. FLAME achieves 98.3% of full performance on FB13 with only 0.06% of training data, 99.6% on FB15K-237N with 0.57%, and 98.8% on WN18RR with 0.46%.

Ablation: Classifier Architecture

| Classifier | FB13 | WN11 | FB15K-237N |
|---|---|---|---|
| Logistic Regression | 0.837 | 0.857 | 0.665 |
| SVM | 0.842 | 0.862 | 0.671 |
| MLP | 0.851 | 0.874 | 0.679 |

Ablation: Cross-Model Versatility (7B Models)

| Model | Description Type | FB13 | WN11 |
|---|---|---|---|
| LLaMA-7B | GPT narrative | 0.890 | 0.892 |
| Mistral-7B | GPT narrative | 0.875 | 0.912 |
| Gemma-7B | GPT narrative | -- | -- |

Entity descriptions provide consistent 4.5-6.2% improvements across all tested architectures, confirming the approach is model-agnostic rather than specific to LLaMA.

Figure 6. PCA visualization (3D projection) of hidden states from FLAME with GPT descriptions at layer 16 on the FB13 test set. Positive (valid) and negative (invalid) triples form clearly separable clusters in the representation space, demonstrating that the probed hidden states encode meaningful structural information for KGC.
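The visualization in Figure 6 amounts to centering the hidden states and projecting them onto their top principal components. A NumPy sketch with synthetic stand-in clusters (the real inputs would be layer-16 hidden states from the FB13 test set):

```python
import numpy as np

def pca_project(H, k=3):
    """Project hidden states H (n_samples, dim) onto their top-k
    principal components, as in a 3-D PCA visualization."""
    Hc = H - H.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Hc @ Vt[:k].T

# Synthetic stand-in: two clusters mimicking valid/invalid triples.
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.5, size=(100, 64))    # "valid" triples
neg = rng.normal(loc=-1.5, size=(100, 64))   # "invalid" triples
Z = pca_project(np.vstack([pos, neg]))
# Z[:100] and Z[100:] separate along the first principal component.
```

With real FLAME representations, this linear separability is exactly what lets a lightweight classifier succeed.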

Why It Matters

FLAME demonstrates that frozen LLMs already encode sufficient knowledge for KGC tasks when properly probed with structure-aware descriptions, fundamentally challenging the assumption that fine-tuning is necessary. The paper makes both practical and theoretical contributions with broad implications for the field.

Core Contribution: FLAME establishes that the gap between frozen and fine-tuned LLM performance on KGC is not due to missing knowledge, but rather a representation alignment problem -- structured KG triples do not match the natural language distributions LLMs were trained on. By solving this alignment through model-friendly entity descriptions and principled layer selection, frozen LLMs can match or exceed fine-tuned performance.

Limitations & Future Directions
