
Does Localization Inform Unlearning? A Rigorous Examination of Local Parameter Attribution for Knowledge Unlearning in Language Models

EMNLP 2025
Hwiyeong Lee, Uiji Hwang, Hyelim Lim, Taeuk Kim

One-Line Summary

We rigorously test the prevailing assumption that "identifying and modifying specific parameters can effectively remove unwanted knowledge" through controlled experiments, and demonstrate that this assumption does not hold.

Background & Motivation

Large language models inevitably retain harmful biases, sensitive personal data, and copyrighted content during pre-training. As regulations such as the EU's GDPR "right to be forgotten" tighten, the ability to selectively remove specific knowledge from trained models -- known as knowledge unlearning -- has become a critical research problem. Recent unlearning methods have adopted a localization-based strategy: first identify the specific parameter regions (typically MLP value vectors) that "store" the target knowledge, then confine parameter updates to those regions in order to remove the knowledge while preserving unrelated general capabilities.

Core Assumption Under Scrutiny: Specific knowledge is stored "locally" in identifiable parameter subsets, so finding and modifying those parameters should be sufficient for effective unlearning. However, existing evaluations rely on unreliable surface-level metrics, and the causal connection between localization accuracy and unlearning effectiveness has never been rigorously verified.

This paper designs a controlled experimental framework that eliminates confounding factors -- most importantly, the accuracy of the localization method itself -- to directly test whether parameter locality is truly indicative of effective knowledge removal. The results fundamentally challenge the foundational assumption underlying localized unlearning approaches.

Proposed Method: Controlled Experimental Framework

The key innovation is a controlled setup where the ground-truth parameter region storing the target knowledge is known by construction, eliminating localization accuracy as a confounding factor:

1. Retain-Only Fine-Tuning: Start from a pretrained model θp and fine-tune it on the retain set only to obtain θr. This serves as the gold-standard reference model that has never seen the forget set.
2. Controlled Knowledge Injection: Train θr on the forget set with parameter updates restricted to a randomly selected 10% of MLP value vectors (the target region Vtgt), producing the "contaminated" model θo. By construction, Vtgt is the exact set of parameters that encode the forget knowledge.
3. Oracle vs. Random Comparison: Apply unlearning methods in two scenarios: Oracle (updates confined to Vtgt, the true knowledge region) and Random (updates confined to Vrdm, a randomly selected alternative region of equal size). If locality matters, Oracle should significantly outperform Random.
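The injection-and-restriction recipe above amounts to masking gradient updates outside a chosen set of value vectors. A minimal sketch with a toy stand-in for one MLP weight matrix (all names and dimensions are illustrative assumptions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one MLP block: d_ff value vectors of width d_model.
d_ff, d_model = 200, 32
W_value = rng.normal(size=(d_ff, d_model))      # rows = MLP value vectors

# Step 2: pick a random 10% of value vectors as the target region V_tgt.
mask = np.zeros(d_ff, dtype=bool)
mask[rng.choice(d_ff, size=d_ff // 10, replace=False)] = True

def masked_step(W, grad, region_mask, lr=0.1):
    """Apply one gradient step only to the rows (value vectors) in
    region_mask, leaving every other parameter untouched."""
    W = W.copy()
    W[region_mask] -= lr * grad[region_mask]
    return W

grad = rng.normal(size=W_value.shape)           # stand-in for a forget-set gradient
W_after = masked_step(W_value, grad, mask)
```

The same mask defines the Oracle scenario (reuse Vtgt during unlearning) versus the Random scenario (draw a fresh region of equal size).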

Additionally, the paper benchmarks three existing localization methods -- Activations, MemFlex, and WAGLE -- against random parameter selection under four unlearning algorithms: WGA (weighted gradient ascent), NPO (negative preference optimization), DPO (direct preference optimization), and RMU (representation misdirection unlearning).
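As one concrete example of these objectives, NPO's per-example loss on a forget example is commonly written as (2/β)·log(1 + (π_θ/π_ref)^β), which is bounded below by 0 and saturates as the model's probability drops below the reference. A minimal sketch (the log-probabilities and the β value are hypothetical):

```python
import math

def npo_loss(logp_theta, logp_ref, beta=0.1):
    """Per-example NPO loss on a forget example:
    (2 / beta) * log(1 + (pi_theta / pi_ref) ** beta).
    Minimizing it pushes pi_theta below pi_ref on forget data, with a
    gradient that saturates instead of diverging."""
    ratio = math.exp(logp_theta - logp_ref)        # pi_theta / pi_ref
    return (2.0 / beta) * math.log1p(ratio ** beta)

# At equal probability the loss is (2/beta) * ln 2; it shrinks toward 0
# as the model down-weights the forget example relative to the reference.
baseline = npo_loss(-2.0, -2.0)
after = npo_loss(-5.0, -2.0)
```

The boundedness is one reason NPO is often reported to be more stable than plain gradient ascent, which can degrade the model without limit.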

Experimental Results

Experiments are conducted on the TOFU benchmark (4,000 synthetic QA pairs about fictitious authors; 10% forget set, 90% retain set) using LLaMA-3.1-8B-Instruct and OLMo2-7B-Instruct. Evaluation uses robust metrics: Exact Strength (ES), Forget Strength (FS = 1 − ES on the forget set), Retain Strength (RS = ES on the retain set), AUES (area under the FS-RS curve), and MU95 (forget quality at 95% model utility).
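Under the definitions above, AUES can be read as the area under the FS-RS trade-off curve traced out over an unlearning run. The sketch below uses made-up checkpoint values and a simple trapezoidal rule; the paper's exact integration scheme may differ:

```python
import numpy as np

# Made-up checkpoints along an unlearning run: ES on forget and retain sets.
es_forget = np.array([0.95, 0.70, 0.40, 0.15, 0.05])
es_retain = np.array([0.95, 0.92, 0.85, 0.70, 0.50])

fs = 1.0 - es_forget     # Forget Strength
rs = es_retain           # Retain Strength

# AUES as the trapezoidal area under the FS-RS trade-off curve.
order = np.argsort(fs)
fs_sorted, rs_sorted = fs[order], rs[order]
aues = float(np.sum(0.5 * (rs_sorted[1:] + rs_sorted[:-1]) * np.diff(fs_sorted)))
```

A method that forgets strongly while retaining utility pushes the curve toward the top-right corner and so earns an AUES closer to 1.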

Localization Methods vs. Random Selection (LLaMA-3.1-8B)

| Method | AUES ↑ | MU95 ↑ |
|---|---|---|
| Random | 0.529 | -14.87 |
| Activations | 0.522 | -16.84 |
| MemFlex | 0.491 | -15.97 |
| WAGLE | 0.525 | -16.61 |

Random parameter selection outperformed all dedicated localization methods on both metrics.

Oracle vs. Random -- Controlled Experiment (LLaMA-3.1-8B)

| Unlearning Method | AUES (Random) | AUES (Oracle) | Δ | p-value |
|---|---|---|---|---|
| WGA | 0.586 | 0.593 | 0.018 | 0.61 |
| NPO | 0.625 | 0.619 | 0.011 | 0.71 |
| DPO | 0.497 | 0.492 | 0.007 | 0.66 |
| RMU | 0.506 | 0.502 | 0.017 | 0.37 |

Across all four unlearning methods, the Oracle scenario shows no statistically significant advantage over Random (all p-values well above 0.05).
Figure 1. Comparison of Oracle and Random scenarios: L2 minimization of MLP outputs at each layer. Even when a random parameter region is adjusted, the MLP outputs can be reproduced to a degree comparable to adjusting the true knowledge region, demonstrating flexible parameter adaptation.
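The effect the figure describes can be reproduced in miniature: when the edited region contains at least as many value vectors as there are output constraints, a least-squares edit can match the target MLP outputs from almost any region, not just the true one. A toy sketch (all dimensions and the least-squares formulation are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_ff, d_model = 15, 200, 16

A = rng.normal(size=(n_tokens, d_ff))    # MLP activations over value vectors
V = rng.normal(size=(d_ff, d_model))     # value vectors (rows of output matrix)

# "Contaminate" a known 10% region V_tgt, as in the controlled setup.
k = d_ff // 10
tgt = rng.choice(d_ff, size=k, replace=False)
delta = np.zeros_like(V)
delta[tgt] = rng.normal(size=(k, d_model))
Y = A @ (V + delta)                      # target MLP outputs to reproduce

def region_fit_error(region):
    """L2 error after a least-squares edit of only the rows in `region`."""
    R = Y - A @ V                        # output change to be explained
    dV, *_ = np.linalg.lstsq(A[:, region], R, rcond=None)
    return float(np.linalg.norm(A[:, region] @ dV - R))

rdm = rng.choice(d_ff, size=k, replace=False)
err_tgt = region_fit_error(tgt)          # oracle region
err_rdm = region_fit_error(rdm)          # random region of equal size
# In this over-parameterized toy, both regions match the outputs essentially
# exactly, illustrating why locality need not matter for output behavior.
```

This is only a linear-algebra caricature of the paper's empirical observation, but it shows the core mechanism: many different parameter subsets can realize the same output change.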

Why It Matters

This work delivers a critical message to the unlearning research community: "knowing where knowledge is stored" does not mean "being able to effectively forget." The widely adopted localization-based paradigm rests on an unverified assumption, and this paper is the first to rigorously test and refute it through controlled experiments with known ground-truth parameter regions.

The practical implications are significant. As AI regulations like GDPR's "right to be forgotten" create real legal obligations for model providers, the field needs unlearning methods that actually work. By demonstrating that multiple parameter configurations can achieve equivalent unlearning performance, the authors suggest that future research should shift away from strict parameter locality and instead explore flexible parameter adaptation strategies across diverse model regions. This finding provides a foundation for the development of more robust and reliable unlearning approaches.
