The authors rigorously test the prevailing assumption that "identifying and modifying specific parameters can effectively remove unwanted knowledge" through controlled experiments, and demonstrate that this assumption does not hold.
Large language models inevitably absorb harmful biases, sensitive personal data, and copyrighted content during pre-training. As regulations such as the EU's GDPR "right to be forgotten" tighten, the ability to selectively remove specific knowledge from trained models -- known as knowledge unlearning -- has become a critical research problem. Recent unlearning methods have adopted a localization-based strategy: first identify the specific parameter regions (typically MLP value vectors) that "store" the target knowledge, then confine parameter updates to those regions in order to remove the knowledge while preserving unrelated general capabilities.
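To make the strategy concrete, the sketch below confines a gradient-ascent-style unlearning update to a pre-identified parameter subset by zeroing all other gradients. The Hugging-Face-style forward call and the `localized_param_names` argument are illustrative assumptions, not any particular method's actual implementation.

```python
# Illustrative sketch of localization-constrained unlearning (assumes a
# PyTorch / Hugging Face-style model; not any specific method's implementation).
def localized_unlearning_step(model, forget_batch, optimizer, localized_param_names):
    """One unlearning update on the forget set, confined to a pre-identified region."""
    optimizer.zero_grad()
    # Plain gradient ascent on the forget data: push the loss up on the target knowledge.
    loss = -model(**forget_batch).loss
    loss.backward()
    # Confine the update: zero gradients of every parameter outside the localized
    # region (e.g., selected MLP value vectors), so only that region changes.
    for name, param in model.named_parameters():
        if param.grad is not None and name not in localized_param_names:
            param.grad.zero_()
    optimizer.step()
```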
Core Assumption Under Scrutiny: Specific knowledge is stored "locally" in identifiable parameter subsets, so finding and modifying those parameters should be sufficient for effective unlearning. However, existing evaluations rely on unreliable surface-level metrics, and the causal connection between localization accuracy and unlearning effectiveness has never been rigorously verified.
This paper designs a controlled experimental framework that eliminates confounding factors -- most importantly, the accuracy of the localization method itself -- to directly test whether parameter locality is truly indicative of effective knowledge removal. The results fundamentally challenge the foundational assumption underlying localized unlearning approaches.
The key innovation is a controlled setup in which the ground-truth parameter region storing the target knowledge is known by construction (the "oracle" region), eliminating localization accuracy as a confounding factor. Unlearning confined to this oracle region can then be compared directly against unlearning confined to other parameter subsets, such as random selections.
Additionally, the paper benchmarks three existing localization methods -- Activations, MemFlex, and WAGLE -- against random parameter selection under four unlearning algorithms: WGA (weighted gradient ascent), NPO (negative preference optimization), DPO (direct preference optimization), and RMU (representation misdirection unlearning).
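A random baseline in this comparison can be as simple as sampling a parameter subset of the size a localization method would pick. Below is a minimal sketch, assuming selection operates over named MLP parameters; the `budget` argument and the name filter are illustrative assumptions.

```python
# Sketch of a random-selection baseline (assumed setup): sample a random subset
# of MLP parameters matching a localization method's budget, so the comparison
# isolates *where* updates are applied rather than how many parameters change.
import random

def random_parameter_subset(model, budget, seed=0, name_filter="mlp"):
    """Return `budget` randomly chosen parameter names from MLP modules."""
    rng = random.Random(seed)
    candidates = [name for name, _ in model.named_parameters() if name_filter in name]
    return set(rng.sample(candidates, k=min(budget, len(candidates))))
```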
Experiments are conducted on the TOFU benchmark (4,000 synthetic QA pairs about fictitious authors; 10% forget set, 90% retain set) using LLaMA-3.1-8B-Instruct and OLMo2-7B-Instruct. Evaluation uses robust metrics: Exact Strength (ES), Forget Strength (FS = 1 − ES on the forget set), Retain Strength (RS = ES on the retain set), AUES (area under the FS-RS curve), and MU95 (forget quality at 95% model utility).
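Taken at face value, these definitions make AUES the integral of the forget-retain trade-off traced by successive checkpoints. The sketch below is one plausible reading; the trapezoidal integration along the RS axis is an assumption, not necessarily the benchmark's exact procedure.

```python
# Sketch of the metrics as defined above: FS = 1 - ES on the forget set,
# RS = ES on the retain set, AUES = area under the FS-RS curve traced by a
# series of checkpoints. Trapezoidal integration over RS is an assumption.
import numpy as np

def forget_retain_strength(es_forget, es_retain):
    fs = 1.0 - np.asarray(es_forget, dtype=float)
    rs = np.asarray(es_retain, dtype=float)
    return fs, rs

def aues(es_forget, es_retain):
    fs, rs = forget_retain_strength(es_forget, es_retain)
    order = np.argsort(rs)          # integrate along the RS axis
    fs, rs = fs[order], rs[order]
    return float(np.sum(0.5 * (fs[1:] + fs[:-1]) * (rs[1:] - rs[:-1])))
```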
Comparing the three localization methods against random parameter selection:

| Localization Method | AUES ↑ | MU95 ↑ |
|---|---|---|
| Random | 0.529 | -14.87 |
| Activations | 0.522 | -16.84 |
| MemFlex | 0.491 | -15.97 |
| WAGLE | 0.525 | -16.61 |
Random parameter selection outperformed all dedicated localization methods on both metrics.
The controlled comparison between the known ground-truth (oracle) region and random parameter selection, under each of the four unlearning algorithms:

| Unlearning Method | AUES (Random) | AUES (Oracle) | Δ | p-value |
|---|---|---|---|---|
| WGA | 0.586 | 0.593 | 0.018 | 0.61 |
| NPO | 0.625 | 0.619 | 0.011 | 0.71 |
| DPO | 0.497 | 0.492 | 0.007 | 0.66 |
| RMU | 0.506 | 0.502 | 0.017 | 0.37 |
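Every p-value sits well above conventional significance thresholds: knowing exactly where the knowledge "lives" confers no measurable unlearning advantage. As a hedged illustration (the exact statistical test used is not specified here), p-values of this kind could come from a paired test over per-seed AUES scores:

```python
# Hedged illustration: a paired t-test over per-seed AUES scores is one standard
# way such p-values could be computed; the paper's exact test is not specified here.
from scipy import stats

def compare_random_vs_oracle(aues_random_per_seed, aues_oracle_per_seed):
    """Paired comparison of AUES across matched random seeds."""
    t_stat, p_value = stats.ttest_rel(aues_random_per_seed, aues_oracle_per_seed)
    return t_stat, p_value
```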
This work delivers a critical message to the unlearning research community: "knowing where knowledge is stored" does not mean "being able to effectively forget." The widely adopted localization-based paradigm rests on an unverified assumption, and this paper is the first to rigorously test and refute it through controlled experiments with known ground-truth parameter regions.
The practical implications are significant. As AI regulations like GDPR's "right to be forgotten" create real legal obligations for model providers, the field needs unlearning methods that actually work. By demonstrating that multiple parameter configurations can achieve equivalent unlearning performance, the authors suggest that future research should shift away from strict parameter locality and instead explore flexible parameter adaptation strategies across diverse model regions. This finding provides a foundation for the development of more robust and reliable unlearning approaches.