One-Line Summary
A comprehensive survey that systematically categorizes machine unlearning techniques for large language models into three paradigms (parameter-level, gradient-based, and input-level methods), critically analyzes their effectiveness, limitations, and evaluation gaps, and charts concrete open challenges, including the questionable locality assumption, scalability barriers, and the forgetting-retention trade-off.
Background & Motivation
Large language models (LLMs) memorize vast amounts of information during training, including personal data, copyrighted content, and potentially harmful knowledge. Growing regulatory requirements such as the EU's GDPR "right to be forgotten" and concerns about AI safety have made the ability to selectively remove specific knowledge from trained models — known as machine unlearning — a pressing research problem.
Why Machine Unlearning for LLMs Is Uniquely Challenging:
- Scale prohibits retraining: With billions of parameters and training corpora containing trillions of tokens, full retraining from scratch to exclude specific data is computationally prohibitive, often costing millions of dollars and weeks of GPU time.
- Distributed knowledge representation: Unlike traditional databases where records can be deleted individually, knowledge in LLMs is encoded across millions of parameters in a distributed and entangled manner, making surgical removal extremely difficult.
- Catastrophic forgetting risk: Naively modifying parameters to remove target knowledge often degrades the model's general capabilities — a phenomenon analogous to catastrophic forgetting in continual learning.
- Verification difficulty: Even after applying unlearning, it is hard to verify that knowledge has been truly removed rather than merely suppressed, as adversarial prompting can often recover "unlearned" information.
These challenges have spawned a rapidly growing body of research, with dozens of new methods proposed in 2023–2025 alone. However, the field lacks a unified taxonomy and systematic comparison. This survey addresses that gap by providing the Korean research community with a structured overview of the machine unlearning landscape for LLMs, organizing the literature by method type, unlearning target, and evaluation approach.
Survey Structure: A Three-Paradigm Taxonomy
The survey organizes the machine unlearning literature for LLMs into three major paradigms, each with distinct mechanisms, strengths, and limitations:
1. Parameter-Level Methods
These methods attempt to localize specific knowledge within the model's parameters and then directly modify or erase those parameters. Techniques include knowledge neuron identification (using attribution methods to find neurons that activate for specific facts), direct model editing via rank-one or multi-layer weight updates (ROME, MEMIT), and targeted weight masking. While conceptually elegant, these methods rest on the assumption that knowledge is stored locally, an assumption increasingly challenged by empirical evidence showing that factual knowledge is distributed across layers and attention heads.
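To make the knowledge-neuron idea concrete, the sketch below scores intermediate MLP neurons for a single factual prompt. It is a deliberate simplification: the checkpoint, prompt, and the gradient-times-activation scoring rule are illustrative assumptions (the original knowledge-neuron work uses integrated gradients), not a procedure taken from the survey.

```python
# Minimal sketch of knowledge-neuron attribution for one factual prompt.
# Simplification: gradient-times-activation instead of the integrated
# gradients used in the original knowledge-neuron work.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative checkpoint; any causal LM would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The Eiffel Tower is located in the city of"
target_id = tok(" Paris", add_special_tokens=False).input_ids[0]
inputs = tok(prompt, return_tensors="pt")

acts = {}
def make_hook(layer_idx):
    def hook(module, inp, out):
        out.retain_grad()          # keep the gradient of this activation
        acts[layer_idx] = out
    return hook

handles = [blk.mlp.c_fc.register_forward_hook(make_hook(i))
           for i, blk in enumerate(model.transformer.h)]

# Log-probability of the target token at the final position.
logits = model(**inputs).logits[0, -1]
torch.log_softmax(logits, dim=-1)[target_id].backward()

# Attribution per intermediate MLP neuron: activation * gradient.
scores = {i: (a[0, -1] * a.grad[0, -1]).detach() for i, a in acts.items()}
best_layer = max(scores, key=lambda i: scores[i].abs().max())
best_neuron = scores[best_layer].abs().argmax().item()
print(f"highest-attribution neuron: layer {best_layer}, index {best_neuron}")

for h in handles:
    h.remove()
```

Parameter-level methods then zero, mask, or edit the weights feeding the highest-scoring neurons; whether that actually removes the fact is exactly what the locality debate is about.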
2. Gradient-Based Methods
These methods use optimization-based strategies to "reverse" the learning of target knowledge. The most common approach is gradient ascent on the forget set (maximizing loss on data to be forgotten), often combined with gradient descent on a retain set to preserve general capabilities. Variants include influence function-based approaches that estimate which training examples most affect specific predictions, and KL-divergence-based regularization that constrains the unlearned model to remain close to the original on non-target data. These methods are more flexible than parameter-level approaches but face challenges with training stability and hyperparameter sensitivity.
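As a rough illustration of this recipe, the sketch below combines gradient ascent on a forget batch, gradient descent on a retain batch, and a KL term toward a frozen copy of the original model. The function name, weighting coefficients, and the assumption that batches carry `labels` are my own choices for the example, not prescriptions from the survey.

```python
# Minimal sketch of one gradient-ascent unlearning step with a retain-set
# term and KL regularization toward a frozen copy of the original model.
import torch
import torch.nn.functional as F

def unlearning_step(model, ref_model, forget_batch, retain_batch,
                    optimizer, alpha=1.0, beta=1.0):
    optimizer.zero_grad()

    # Gradient ascent: maximize the loss on data to be forgotten.
    forget_loss = model(**forget_batch).loss

    # Gradient descent: preserve behavior on the retain set.
    retain_out = model(**retain_batch)
    retain_loss = retain_out.loss

    # KL term: keep retain-set output distributions close to the original model.
    with torch.no_grad():
        ref_logits = ref_model(**retain_batch).logits
    kl = F.kl_div(F.log_softmax(retain_out.logits, dim=-1),
                  F.softmax(ref_logits, dim=-1),
                  reduction="batchmean")

    # The negative sign turns descent into ascent for the forget term only.
    total = -forget_loss + alpha * retain_loss + beta * kl
    total.backward()
    optimizer.step()
    return total.item()
```

The weights `alpha` and `beta` are exactly the kind of hyperparameters whose sensitivity makes these methods difficult to tune in practice.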
3. Input-Level Methods
Rather than modifying model parameters, these methods operate at the input or inference stage. Approaches include prompt-based suppression (crafting system prompts or in-context instructions that direct the model to refuse or avoid generating certain knowledge), retrieval-augmented filtering (intercepting and filtering outputs at generation time), and representation engineering (steering internal activations away from target knowledge during inference). These methods are lightweight and reversible but generally offer weaker unlearning guarantees, as the underlying knowledge remains encoded in the model's parameters.
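A minimal sketch of the representation-engineering flavor is shown below: a forward hook subtracts a steering direction from one block's hidden states at inference time. The layer index, steering strength, and the random placeholder direction are purely illustrative assumptions; a real steering vector is typically estimated from contrastive prompts that do and do not express the target knowledge.

```python
# Minimal sketch of inference-time activation steering via a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

layer_idx, strength = 6, 4.0
direction = torch.randn(model.config.hidden_size)    # placeholder direction
direction = direction / direction.norm()

def steer(module, inputs, output):
    hidden = output[0]            # GPT-2 block output: (hidden_states, ...)
    # Subtract the component of the hidden states along the steering direction.
    hidden = hidden - strength * (hidden @ direction).unsqueeze(-1) * direction
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
out = model.generate(**tok("The capital of France is", return_tensors="pt"),
                     max_new_tokens=10)
print(tok.decode(out[0]))
handle.remove()   # removing the hook restores the original model exactly
```

Because the intervention lives entirely in the hook, removing it restores the original behavior, which is precisely why such methods are reversible yet offer only weak erasure guarantees.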
Comparative Analysis of the Three Paradigms
| Aspect | Parameter-Level | Gradient-Based | Input-Level |
| --- | --- | --- | --- |
| Mechanism | Localize & edit specific parameters | Optimize to reverse learned knowledge | Filter/steer at inference time |
| Model Modification | Targeted weight changes | Global weight updates via fine-tuning | No weight changes |
| Forgetting Strength | High for localized facts | High but variable | Low (superficial suppression) |
| Utility Preservation | Good if locality holds | Moderate (risk of collateral damage) | Excellent (model unchanged) |
| Adversarial Robustness | Weak (indirect probing recovers info) | Moderate | Weak (knowledge still in weights) |
| Scalability | Challenging for distributed knowledge | Computationally expensive at scale | Lightweight and scalable |
| Reversibility | Irreversible | Irreversible | Fully reversible |
| Representative Methods | ROME, MEMIT, Knowledge Neurons | Gradient Ascent, Influence Functions, KL-Reg | Prompt Engineering, Representation Engineering |
Unlearning Targets
The survey further distinguishes methods by what they aim to remove:
- Factual knowledge removal: Deleting specific facts (e.g., "Person X was born in City Y") while retaining related but non-target knowledge — the most commonly studied setting in current benchmarks like TOFU and CounterFact.
- Privacy-sensitive data deletion: Removing personally identifiable information (PII) such as names, addresses, and phone numbers that the model may have memorized verbatim from training data, motivated by GDPR Article 17 compliance.
- Harmful content suppression: Eliminating the model's ability to generate dangerous content including bioweapon synthesis instructions, cyberattack code, or detailed instructions for illegal activities — a key focus of AI safety research.
- Copyright-infringing material elimination: Removing memorized copyrighted text such as book passages, song lyrics, or proprietary code that the model can reproduce verbatim, addressing growing legal concerns around generative AI.
Evaluation Dimensions
The survey identifies three critical dimensions for evaluating unlearning methods:
| Dimension | What It Measures | Common Metrics |
| --- | --- | --- |
| Forgetting efficacy | How completely the target knowledge has been removed | Forget set accuracy, membership inference attack resistance, extraction likelihood |
| Utility preservation | How well the model retains its general capabilities after unlearning | Retain set accuracy, downstream task performance (MMLU, TruthfulQA), perplexity |
| Adversarial robustness | Whether the "forgotten" knowledge can be recovered through adversarial means | Jailbreak success rate, paraphrased query accuracy, multi-turn extraction attacks |
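As a crude example of how the first two dimensions are often operationalized, the sketch below compares perplexity on held-out forget and retain sets before and after unlearning. The helper and variable names are placeholders and are not tied to any particular benchmark.

```python
# Crude sketch of forgetting efficacy and utility preservation as perplexity
# on held-out forget and retain sets.
import math
import torch

@torch.no_grad()
def perplexity(model, batches):
    """Average perplexity over tokenized batches that already include `labels`."""
    total, n = 0.0, 0
    for batch in batches:
        total += model(**batch).loss.item()
        n += 1
    return math.exp(total / max(n, 1))

# After unlearning, forget-set perplexity should rise sharply (forgetting
# efficacy) while retain-set perplexity stays near its pre-unlearning value
# (utility preservation). Adversarial robustness requires separate red-teaming.
# forget_ppl = perplexity(unlearned_model, forget_batches)
# retain_ppl = perplexity(unlearned_model, retain_batches)
```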
Key Findings
- No Silver Bullet: No single unlearning method consistently achieves complete knowledge removal while preserving general model capabilities across all settings. Parameter-level methods excel at targeted removal but suffer from incomplete erasure of distributed knowledge; gradient-based methods offer broader coverage but risk destabilizing the model; input-level methods are non-destructive but provide only superficial suppression.
- Locality Assumption Under Scrutiny: Methods that assume knowledge is localized in specific parameters often fail to achieve robust unlearning, as knowledge in LLMs tends to be distributed across layers and components. Empirical studies show that even after "erasing" identified knowledge neurons, the target information can often be recovered through indirect prompting or by probing different layers — a finding corroborated by our lab's own research (see "Does Localization Inform Unlearning?", EMNLP 2025).
- Evaluation Gaps Are Severe: Current evaluation protocols are insufficient — models may appear to have "forgotten" knowledge under standard probing but can still reveal it under adversarial prompting, reformulated queries, or multi-turn conversations. Many published methods report success on narrow benchmarks but fail under more rigorous evaluation that includes paraphrased queries, indirect reasoning chains, and red-teaming attacks.
- Scale-Dependent Challenges: Unlearning difficulty increases with model scale, and methods effective for smaller models (e.g., 7B parameters) do not always transfer to larger ones (e.g., 70B+). Larger models exhibit greater knowledge redundancy, meaning the same fact may be encoded through multiple pathways, making complete removal harder.
- Trade-off Between Forgetting and Retaining: A persistent tension exists between thoroughly removing target knowledge and maintaining the model's broader language understanding and generation abilities. Aggressive unlearning (e.g., high learning rates for gradient ascent) achieves better forgetting but causes greater collateral damage to unrelated capabilities, while conservative approaches preserve utility but leave target knowledge partially intact.
- Temporal and Sequential Challenges: Most methods are evaluated on single-shot unlearning (removing one batch of knowledge), but real-world deployment requires sequential unlearning — handling multiple removal requests over time. The cumulative effect of repeated unlearning operations on model quality remains poorly understood.
Open Research Directions Identified by the Survey:
- Hybrid approaches: Combining parameter-level precision with gradient-based coverage and input-level safety nets to achieve robust multi-layered unlearning.
- Unlearning-aware pretraining: Designing model architectures and training procedures that facilitate future knowledge removal, such as modular knowledge storage or factored representations.
- Formal verification: Developing mathematical guarantees that target knowledge has been provably removed, analogous to differential privacy guarantees but for knowledge erasure.
- Continual unlearning benchmarks: Creating evaluation protocols that test sequential removal of multiple knowledge units over time, measuring cumulative degradation and interaction effects.
- Cross-lingual unlearning: Addressing the challenge that knowledge expressed in one language may persist through translation or multilingual representations, requiring unlearning across all linguistic manifestations.
Why It Matters
As LLMs are deployed at scale in commercial and public-sector applications, the ability to remove specific knowledge becomes critical for regulatory compliance, user privacy, and AI safety. This survey makes several important contributions:
- Structured entry point for researchers: By providing a clear three-paradigm taxonomy (parameter-level, gradient-based, input-level) and systematic comparison of methods, this survey enables researchers — especially those in the Korean NLP community — to quickly understand the landscape and identify promising research directions.
- Critical analysis of the locality assumption: The survey highlights that the widespread assumption underlying many unlearning methods (that knowledge is locally stored) is empirically questionable, redirecting research attention toward methods that account for distributed knowledge representations.
- Identification of evaluation blind spots: By cataloging the gaps in current evaluation protocols, the survey motivates the development of more rigorous and adversarially robust benchmarks for assessing unlearning effectiveness.
- Bridge to empirical research: This survey complements our lab's empirical work on the same topic (see "Does Localization Inform Unlearning?", EMNLP 2025), which provides controlled experimental evidence that parameter locality does not reliably inform effective unlearning — situating that finding within the broader theoretical and methodological landscape reviewed here.
Tags: Unlearning, Safety