A Survey on Machine Unlearning for Large Language Models

Communications of the Korean Institute of Information Scientists and Engineers (정보과학회지), Vol. 43, No. 11, pp. 72–78
Uiji Hwang, Taeuk Kim

One-Line Summary

A comprehensive survey that systematically categorizes machine unlearning techniques for large language models into three paradigms — parameter-level, gradient-based, and input-level methods — and critically analyzes their effectiveness, limitations, and evaluation gaps, while charting concrete open challenges including the locality assumption, scalability barriers, and the forgetting-retention trade-off.

Background & Motivation

Large language models (LLMs) memorize vast amounts of information during training, including personal data, copyrighted content, and potentially harmful knowledge. Growing regulatory requirements such as the EU's GDPR "right to be forgotten" and concerns about AI safety have made the ability to selectively remove specific knowledge from trained models — known as machine unlearning — a pressing research problem.

Why Machine Unlearning for LLMs Is Uniquely Challenging:

  • Scale prohibits retraining: With billions of parameters and training corpora containing trillions of tokens, full retraining from scratch to exclude specific data is computationally prohibitive, often costing millions of dollars and weeks of GPU time.
  • Distributed knowledge representation: Unlike traditional databases where records can be deleted individually, knowledge in LLMs is encoded across millions of parameters in a distributed and entangled manner, making surgical removal extremely difficult.
  • Catastrophic forgetting risk: Naively modifying parameters to remove target knowledge often degrades the model's general capabilities — a phenomenon analogous to catastrophic forgetting in continual learning.
  • Verification difficulty: Even after applying unlearning, it is hard to verify that knowledge has been truly removed rather than merely suppressed, as adversarial prompting can often recover "unlearned" information.

These challenges have spawned a rapidly growing body of research, with dozens of new methods proposed in 2023–2025 alone. However, the field lacks a unified taxonomy and systematic comparison. This survey addresses that gap by providing the Korean research community with a structured overview of the machine unlearning landscape for LLMs, organizing the literature by method type, unlearning target, and evaluation approach.

Survey Structure: A Three-Paradigm Taxonomy

The survey organizes the machine unlearning literature for LLMs into three major paradigms, each with distinct mechanisms, strengths, and limitations:

1. Parameter-Level Methods
These methods attempt to localize specific knowledge within the model's parameters and then directly modify or erase those parameters. Techniques include knowledge neuron identification (using attribution methods to find neurons that activate for specific facts), rank-one model editing (ROME, MEMIT), and targeted weight masking. While conceptually elegant, these methods rest on the assumption that knowledge is stored locally — an assumption increasingly challenged by empirical evidence showing that factual knowledge is distributed across layers and attention heads.
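
As a concrete illustration of the rank-one editing idea, the sketch below gives a simplified toy version of a ROME-style closed-form update: a feed-forward weight matrix is treated as a linear associative memory, and a minimal rank-one correction remaps one key to a new value. The key vector `k`, new value `v_new`, and key covariance `C` are assumed inputs here; the actual method derives them from corpus statistics and an inner optimization, and for unlearning `v_new` would encode a neutral or refusal target rather than a new fact.

```python
import torch

def rank_one_edit(W, k, v_new, C):
    """Toy ROME-style closed-form update (simplified sketch).

    Treats W (d_out x d_in) as a linear associative memory and returns
    W' such that W' @ k == v_new exactly, while perturbing W minimally
    under the key covariance C (d_in x d_in). The real method obtains
    k from the edited subject's activations, v_new from an inner
    optimization, and C from large-corpus statistics.
    """
    C_inv_k = torch.linalg.solve(C, k)       # C^{-1} k
    u = C_inv_k / torch.dot(k, C_inv_k)      # update direction; u . k == 1
    residual = v_new - W @ k                 # what the edit must add at key k
    return W + torch.outer(residual, u)      # rank-one correction

# Quick check on random data: the edited matrix maps k exactly to v_new.
d_in, d_out = 16, 8
W = torch.randn(d_out, d_in)
k, v_new = torch.randn(d_in), torch.randn(d_out)
C = torch.eye(d_in)                          # identity covariance in the toy case
W_edited = rank_one_edit(W, k, v_new, C)
assert torch.allclose(W_edited @ k, v_new, atol=1e-5)
```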
2. Gradient-Based Methods
These methods use optimization-based strategies to "reverse" the learning of target knowledge. The most common approach is gradient ascent on the forget set (maximizing loss on data to be forgotten), often combined with gradient descent on a retain set to preserve general capabilities. Variants include influence function-based approaches that estimate which training examples most affect specific predictions, and KL-divergence-based regularization that constrains the unlearned model to remain close to the original on non-target data. These methods are more flexible than parameter-level approaches but face challenges with training stability and hyperparameter sensitivity.
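
A minimal sketch of this combined recipe follows, assuming a Hugging Face-style causal LM whose forward pass returns `.loss` and `.logits` when the batch dicts include `labels`; the loss weights `alpha`, `beta`, and `gamma` are illustrative hyperparameters, not values from any specific paper.

```python
import torch
import torch.nn.functional as F

def unlearning_step(model, ref_model, forget_batch, retain_batch,
                    optimizer, alpha=1.0, beta=1.0, gamma=0.5):
    """One combined unlearning step: gradient ascent on the forget set,
    gradient descent on the retain set, and a KL penalty that keeps the
    model close to the frozen original (ref_model) on retained data."""
    optimizer.zero_grad()

    # Ascent term: *maximize* the LM loss on data to be forgotten,
    # implemented by negating it in the total objective below.
    forget_loss = model(**forget_batch).loss

    # Descent term: standard LM loss on the retain set.
    retain_out = model(**retain_batch)
    retain_loss = retain_out.loss

    # KL regularization between the original and current output
    # distributions on retained data.
    with torch.no_grad():
        ref_logits = ref_model(**retain_batch).logits
    kl = F.kl_div(F.log_softmax(retain_out.logits, dim=-1),
                  F.softmax(ref_logits, dim=-1),
                  reduction="batchmean")

    total = -alpha * forget_loss + beta * retain_loss + gamma * kl
    total.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item(), kl.item()
```

Note that the unbounded ascent term tends to diverge if left unchecked, so practical implementations clip or anneal it over training, which is one concrete source of the stability and hyperparameter sensitivity mentioned above.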
3. Input-Level Methods
Rather than modifying model parameters, these methods operate at the input or inference stage. Approaches include prompt-based suppression (crafting system prompts or in-context instructions that direct the model to refuse or avoid generating certain knowledge), retrieval-augmented filtering (intercepting and filtering outputs at generation time), and representation engineering (steering internal activations away from target knowledge during inference). These methods are lightweight and reversible but generally offer weaker unlearning guarantees, as the underlying knowledge remains encoded in the model's parameters.
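
Of these, representation engineering is the most code-concrete: one estimates a direction in activation space associated with the target knowledge (for example, the mean activation difference between prompts that elicit it and neutral prompts) and projects it out of the hidden states at inference time. The sketch below assumes a PyTorch transformer layer that returns its hidden states as the first element of its output; the steering `direction` is an assumed, precomputed input.

```python
import torch

def ablate_direction_hook(direction):
    """Build a forward hook that removes the component of the hidden
    states lying along `direction` (a d-dimensional vector), so the
    model cannot carry that feature forward during inference."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Project each token's hidden state onto the direction and
        # subtract it out:  h' = h - (h . u) u
        coeff = (hidden @ direction).unsqueeze(-1)   # (batch, seq, 1)
        steered = hidden - coeff * direction          # (batch, seq, d)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook

# Hypothetical usage on one layer of an HF-style decoder model:
# handle = model.model.layers[15].register_forward_hook(
#     ablate_direction_hook(knowledge_direction))
# ... run generation with the hook active ...
# handle.remove()   # fully reversible: the weights were never touched
```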

Comparative Analysis of the Three Paradigms

| Aspect | Parameter-Level | Gradient-Based | Input-Level |
|---|---|---|---|
| Mechanism | Localize & edit specific parameters | Optimize to reverse learned knowledge | Filter/steer at inference time |
| Model Modification | Targeted weight changes | Global weight updates via fine-tuning | No weight changes |
| Forgetting Strength | High for localized facts | High but variable | Low (superficial suppression) |
| Utility Preservation | Good if locality holds | Moderate (risk of collateral damage) | Excellent (model unchanged) |
| Adversarial Robustness | Weak (indirect probing recovers info) | Moderate | Weak (knowledge still in weights) |
| Scalability | Challenging for distributed knowledge | Computationally expensive at scale | Lightweight and scalable |
| Reversibility | Irreversible | Irreversible | Fully reversible |
| Representative Methods | ROME, MEMIT, Knowledge Neurons | Gradient Ascent, Influence Functions, KL-Reg | Prompt Engineering, Representation Engineering |

Unlearning Targets

The survey further distinguishes methods by what they aim to remove:

  • Private and personal information: personally identifiable data memorized from training corpora, the primary target of "right to be forgotten" requests.
  • Copyrighted content: verbatim or near-verbatim reproduction of protected text.
  • Harmful knowledge: dangerous capabilities and unsafe content whose removal is motivated by AI safety.

Evaluation Dimensions

The survey identifies three critical dimensions for evaluating unlearning methods:

| Dimension | What It Measures | Common Metrics |
|---|---|---|
| Forgetting efficacy | How completely the target knowledge has been removed | Forget set accuracy, membership inference attack resistance, extraction likelihood |
| Utility preservation | How well the model retains its general capabilities after unlearning | Retain set accuracy, downstream task performance (MMLU, TruthfulQA), perplexity |
| Adversarial robustness | Whether the "forgotten" knowledge can be recovered through adversarial means | Jailbreak success rate, paraphrased query accuracy, multi-turn extraction attacks |
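
As a concrete example spanning the first and third dimensions, a simple loss-thresholding membership inference test checks whether forget-set sequences remain distinguishable from unseen data after unlearning; an AUC near 0.5 suggests the forget set no longer looks memorized. This is a minimal sketch assuming a Hugging Face-style causal LM and tokenizer, with scikit-learn used for the AUC.

```python
import torch
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def sequence_loss(model, input_ids):
    """Mean per-token cross-entropy of a sequence under the model."""
    return model(input_ids=input_ids, labels=input_ids).loss.item()

def loss_mia_auc(model, tokenizer, forget_texts, unseen_texts):
    """Loss-based membership inference: score each text by its negated
    LM loss (lower loss => more 'member-like') and measure how well
    that score separates forget-set texts from unseen ones. After
    successful unlearning the AUC should drop toward 0.5 (chance)."""
    scores, labels = [], []
    for texts, label in ((forget_texts, 1), (unseen_texts, 0)):
        for text in texts:
            ids = tokenizer(text, return_tensors="pt").input_ids
            scores.append(-sequence_loss(model, ids))
            labels.append(label)
    return roc_auc_score(labels, scores)
```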

Key Findings

Across the three paradigms, the survey's central finding is a persistent trade-off: the methods that forget most thoroughly (parameter-level and gradient-based) are irreversible and risk collateral damage to general capabilities, while the lightweight, fully reversible input-level methods leave the target knowledge encoded in the weights, and none of the paradigms is yet robust to adversarial probing.

Open Research Directions Identified by the Survey:

  • Hybrid approaches: Combining parameter-level precision with gradient-based coverage and input-level safety nets to achieve robust multi-layered unlearning.
  • Unlearning-aware pretraining: Designing model architectures and training procedures that facilitate future knowledge removal, such as modular knowledge storage or factored representations.
  • Formal verification: Developing mathematical guarantees that target knowledge has been provably removed, analogous to differential privacy guarantees but for knowledge erasure.
  • Continual unlearning benchmarks: Creating evaluation protocols that test sequential removal of multiple knowledge units over time, measuring cumulative degradation and interaction effects.
  • Cross-lingual unlearning: Addressing the challenge that knowledge expressed in one language may persist through translation or multilingual representations, requiring unlearning across all linguistic manifestations.

Why It Matters

As LLMs are deployed at scale in commercial and public-sector applications, the ability to remove specific knowledge becomes critical for regulatory compliance, user privacy, and AI safety. This survey makes several important contributions:

  • A unified three-paradigm taxonomy (parameter-level, gradient-based, input-level) that organizes a fast-growing but fragmented literature.
  • A systematic comparison of the paradigms across mechanism, forgetting strength, utility preservation, adversarial robustness, scalability, and reversibility.
  • An analysis of evaluation practice and its gaps along forgetting efficacy, utility preservation, and adversarial robustness.
  • A concrete agenda of open challenges, from the locality assumption and scalability barriers to continual and cross-lingual unlearning.
  • A structured overview of the machine unlearning landscape for the Korean research community.