The authors rigorously test the prevailing assumption that "identifying and modifying specific parameters can effectively remove unwanted knowledge" through controlled experiments, and demonstrate that this assumption does not hold.
Large language models inevitably absorb harmful biases, sensitive personal data, and copyrighted content during pre-training. As regulations such as the EU's GDPR "right to be forgotten" tighten, the ability to selectively remove specific knowledge from trained models -- known as knowledge unlearning -- has become a critical research problem. Recent unlearning methods have adopted a localization-based strategy: first identify the specific parameter regions (typically MLP value vectors) that "store" the target knowledge, then confine parameter updates to those regions in order to remove the knowledge while preserving unrelated general capabilities.
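To make the strategy concrete, the sketch below confines a gradient-ascent-style unlearning update to a pre-identified parameter subset by zeroing all other gradients. The Hugging-Face-style forward call and the `localized_param_names` argument are illustrative assumptions, not any particular method's actual implementation.

```python
# Illustrative sketch of localization-constrained unlearning (assumes a
# PyTorch / Hugging Face-style model; not any specific method's implementation).
def localized_unlearning_step(model, forget_batch, optimizer, localized_param_names):
    """One unlearning update on the forget set, confined to a pre-identified region."""
    optimizer.zero_grad()
    # Plain gradient ascent on the forget data: push the loss up on the target knowledge.
    loss = -model(**forget_batch).loss
    loss.backward()
    # Confine the update: zero gradients of every parameter outside the localized
    # region (e.g., selected MLP value vectors), so only that region changes.
    for name, param in model.named_parameters():
        if param.grad is not None and name not in localized_param_names:
            param.grad.zero_()
    optimizer.step()
```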
Core Assumption Under Scrutiny: Specific knowledge is stored "locally" in identifiable parameter subsets, so finding and modifying those parameters should be sufficient for effective unlearning. However, existing evaluations rely on unreliable surface-level metrics, and the causal connection between localization accuracy and unlearning effectiveness has never been rigorously verified.
This paper designs a controlled experimental framework that eliminates confounding factors -- most importantly, the accuracy of the localization method itself -- to directly test whether parameter locality is truly indicative of effective knowledge removal. The results fundamentally challenge the foundational assumption underlying localized unlearning approaches.
The key innovation is a controlled setup in which the ground-truth parameter region storing the target knowledge is known by construction (the "oracle" region), eliminating localization accuracy as a confounding factor. Unlearning confined to this oracle region can then be compared directly against unlearning confined to other parameter subsets, such as random selections.
Additionally, the paper benchmarks three existing localization methods -- Activations, MemFlex, and WAGLE -- against random parameter selection under four unlearning algorithms: WGA (weighted gradient ascent), NPO (negative preference optimization), DPO (direct preference optimization), and RMU (representation misdirection unlearning).
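A random baseline in this comparison can be as simple as sampling a parameter subset of the size a localization method would pick. Below is a minimal sketch, assuming selection operates over named MLP parameters; the `budget` argument and the name filter are illustrative assumptions.

```python
# Sketch of a random-selection baseline (assumed setup): sample a random subset
# of MLP parameters matching a localization method's budget, so the comparison
# isolates *where* updates are applied rather than how many parameters change.
import random

def random_parameter_subset(model, budget, seed=0, name_filter="mlp"):
    """Return `budget` randomly chosen parameter names from MLP modules."""
    rng = random.Random(seed)
    candidates = [name for name, _ in model.named_parameters() if name_filter in name]
    return set(rng.sample(candidates, k=min(budget, len(candidates))))
```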
Experiments are conducted on the TOFU benchmark (4,000 synthetic QA pairs about fictitious authors; 10% forget set, 90% retain set) using LLaMA-3.1-8B-Instruct and OLMo2-7B-Instruct. Evaluation uses robust metrics: Exact Strength (ES), Forget Strength (FS = 1 − ES on the forget set), Retain Strength (RS = ES on the retain set), AUES (area under the FS-RS curve), and MU95 (forget quality at 95% model utility).
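Taken at face value, these definitions make AUES the integral of the forget-retain trade-off traced by successive checkpoints. The sketch below is one plausible reading; the trapezoidal integration along the RS axis is an assumption, not necessarily the benchmark's exact procedure.

```python
# Sketch of the metrics as defined above: FS = 1 - ES on the forget set,
# RS = ES on the retain set, AUES = area under the FS-RS curve traced by a
# series of checkpoints. Trapezoidal integration over RS is an assumption.
import numpy as np

def forget_retain_strength(es_forget, es_retain):
    fs = 1.0 - np.asarray(es_forget, dtype=float)
    rs = np.asarray(es_retain, dtype=float)
    return fs, rs

def aues(es_forget, es_retain):
    fs, rs = forget_retain_strength(es_forget, es_retain)
    order = np.argsort(rs)          # integrate along the RS axis
    fs, rs = fs[order], rs[order]
    return float(np.sum(0.5 * (fs[1:] + fs[:-1]) * (rs[1:] - rs[:-1])))
```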
Comparing the three localization methods against random parameter selection:

| Localization Method | AUES ↑ | MU95 ↑ |
|---|---|---|
| Random | 0.529 | -14.87 |
| Activations | 0.522 | -16.84 |
| MemFlex | 0.491 | -15.97 |
| WAGLE | 0.525 | -16.61 |
Random parameter selection outperformed all dedicated localization methods on both metrics.
The controlled comparison between the known ground-truth (oracle) region and random parameter selection, under each of the four unlearning algorithms:

| Unlearning Method | AUES (Random) | AUES (Oracle) | Δ | p-value |
|---|---|---|---|---|
| WGA | 0.586 | 0.593 | 0.018 | 0.61 |
| NPO | 0.625 | 0.619 | 0.011 | 0.71 |
| DPO | 0.497 | 0.492 | 0.007 | 0.66 |
| RMU | 0.506 | 0.502 | 0.017 | 0.37 |
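Every p-value sits well above conventional significance thresholds: knowing exactly where the knowledge "lives" confers no measurable unlearning advantage. As a hedged illustration (the exact statistical test used is not specified here), p-values of this kind could come from a paired test over per-seed AUES scores:

```python
# Hedged illustration: a paired t-test over per-seed AUES scores is one standard
# way such p-values could be computed; the paper's exact test is not specified here.
from scipy import stats

def compare_random_vs_oracle(aues_random_per_seed, aues_oracle_per_seed):
    """Paired comparison of AUES across matched random seeds."""
    t_stat, p_value = stats.ttest_rel(aues_random_per_seed, aues_oracle_per_seed)
    return t_stat, p_value
```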
This work delivers a critical message to the unlearning research community: "knowing where knowledge is stored" does not mean "being able to effectively forget." The widely adopted localization-based paradigm rests on an unverified assumption, and this paper is the first to rigorously test and refute it through controlled experiments with known ground-truth parameter regions.
The practical implications are significant. As AI regulations like GDPR's "right to be forgotten" create real legal obligations for model providers, the field needs unlearning methods that actually work. By demonstrating that multiple parameter configurations can achieve equivalent unlearning performance, the authors suggest that future research should shift away from strict parameter locality and instead explore flexible parameter adaptation strategies across diverse model regions. This finding provides a foundation for the development of more robust and reliable unlearning approaches.