One-Line Summary
MAdapter refines the standard adapter architecture by strategically inserting additional lightweight modules at the middle layers of pretrained language models, achieving comparable or superior performance to conventional adapters while using only about 1% of total model parameters -- roughly half the trainable parameters of standard adapters.
Background & Motivation
Parameter-Efficient Fine-Tuning (PEFT) methods like adapters have become essential for adapting large pretrained language models to downstream tasks without updating all parameters. Standard adapters insert small bottleneck layers uniformly, after the feed-forward sub-layer of every transformer block. While effective, this uniform placement strategy does not account for what each layer actually contributes to the task.
Key Observations:
- Middle layers are most task-relevant: Research on the layerwise behavior of transformers consistently shows that middle layers encode the most task-discriminative features, while the earliest layers capture low-level patterns and the final layers tend toward more general representations.
- Uniform placement is suboptimal: Allocating identical adapter capacity to every layer wastes parameters on layers that contribute less to task adaptation, and under-serves the critical middle layers.
- Parameter budget is tight: In practical scenarios, the total number of trainable parameters must remain small. A smarter allocation strategy could achieve better results with fewer parameters.
MAdapter addresses this gap by augmenting the middle layers with additional adapter capacity while keeping the overall parameter budget extremely lean -- around 1% of total model parameters.
Proposed Method
MAdapter modifies the standard adapter paradigm with a targeted insertion strategy that concentrates additional capacity where it matters most.
1. Standard Adapter Baseline
Start with the conventional adapter setup: small bottleneck modules (down-projection, non-linearity, up-projection with residual connection) inserted after the feed-forward sub-layer of each transformer block. The pretrained model weights remain frozen; only the adapter parameters are trained.
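A minimal sketch of such a bottleneck adapter in PyTorch follows; the module and dimension names are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, apply a non-linearity, up-project,
    and add a residual connection back to the input."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        # Zero-init the up-projection so the adapter starts as a near-identity
        # mapping and does not perturb the frozen pretrained model.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))
```

With the zero-initialized up-projection, the wrapped model behaves identically to the frozen base model at step zero, a common trick in adapter implementations.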
2. Middle-Layer Identification
Identify the middle layers of the transformer stack -- the layers empirically shown to encode the most task-relevant representations. For a model with L layers, the middle region is defined as a contiguous subset of layers centered around L/2.
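One simple way to realize this selection is a fixed-width window centered on L/2; the window width here is a hypothetical hyperparameter, not a value from the paper:

```python
def middle_layers(num_layers: int, width: int = 4) -> list[int]:
    """Return a contiguous block of `width` layer indices centered on
    num_layers // 2, clamped to the valid range."""
    center = num_layers // 2
    start = max(0, min(center - width // 2, num_layers - width))
    return list(range(start, start + width))

# Example: a 12-layer encoder with a 4-layer middle region.
print(middle_layers(12, 4))  # [4, 5, 6, 7]
```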
3. Efficient Middle-Layer Augmentation
Insert additional lightweight adapter modules at the identified middle layers. These extra adapters are designed with reduced bottleneck dimensions to keep the overall parameter count low. The total trainable parameter count remains at approximately 1% of the full model -- about half the parameters used by conventional adapters applied uniformly across all layers.
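A hedged sketch of this augmentation, reusing the Adapter module above; wrapping each frozen block this way and the specific bottleneck sizes (48 base, 24 extra) are assumptions for illustration, not the paper's configuration:

```python
class AugmentedBlock(nn.Module):
    """Wraps a frozen transformer block: every layer gets a base adapter,
    and middle layers additionally get a second, smaller-bottleneck adapter."""
    def __init__(self, block: nn.Module, hidden_dim: int,
                 base_bottleneck: int = 48, extra_bottleneck: int = 24,
                 is_middle: bool = False):
        super().__init__()
        self.block = block
        self.adapter = Adapter(hidden_dim, base_bottleneck)
        self.extra = Adapter(hidden_dim, extra_bottleneck) if is_middle else None

    def forward(self, hidden_states, *args, **kwargs):
        out = self.block(hidden_states, *args, **kwargs)
        # Transformer blocks often return tuples (hidden states, attentions, ...).
        h = out[0] if isinstance(out, tuple) else out
        h = self.adapter(h)
        if self.extra is not None:
            h = self.extra(h)
        return (h,) + out[1:] if isinstance(out, tuple) else h
```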
4. Architecture-Agnostic Application
The approach is compatible with various pretrained language models (e.g., BERT, RoBERTa) and requires minimal modification to existing adapter implementations, making it straightforward to adopt in practice.
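For example, attaching the sketch above to a HuggingFace BERT encoder only requires freezing the base weights and wrapping the layer list. This is a minimal sketch; task heads and training loop are omitted:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# Freeze all pretrained weights; only the adapter parameters will train.
for p in model.parameters():
    p.requires_grad = False

hidden_dim = model.config.hidden_size        # 768 for bert-base
num_layers = model.config.num_hidden_layers  # 12 for bert-base
middle = set(middle_layers(num_layers, width=4))

# Wrap every encoder layer; middle layers receive the extra adapter.
# The adapters are created after the freeze loop, so they stay trainable.
for i, layer in enumerate(model.encoder.layer):
    model.encoder.layer[i] = AugmentedBlock(
        layer, hidden_dim, is_middle=(i in middle)
    )
```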
Experimental Results
MAdapter is evaluated on multiple NLU benchmarks and compared against standard adapter baselines and other PEFT methods.
Parameter Efficiency
| Method | Trainable Params (% of total) | Relative to Standard Adapter |
| --- | --- | --- |
| Full Fine-Tuning | 100% | -- |
| Standard Adapter | ~2% | 1.0x |
| MAdapter | ~1% | ~0.5x |
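As a sanity check on these budgets, a back-of-the-envelope calculation with assumed bottleneck sizes (the paper's exact dimensions may differ) reproduces the rough 2% vs. 1% split:

```python
# Rough budget check for bert-base (~110M parameters); bottleneck sizes
# are illustrative assumptions, not values reported in the paper.
hidden, layers, total = 768, 12, 110_000_000

def adapter_params(bottleneck: int) -> int:
    # Down- and up-projection weights plus both bias vectors.
    return 2 * hidden * bottleneck + bottleneck + hidden

# Standard baseline: one bottleneck-128 adapter after each FFN sub-layer.
standard = layers * adapter_params(128)
# MAdapter-style allocation: a leaner bottleneck-48 adapter at every layer,
# plus an extra bottleneck-24 adapter at 4 middle layers.
madapter = layers * adapter_params(48) + 4 * adapter_params(24)

print(f"standard: {standard / total:.2%}")  # 2.15%
print(f"madapter: {madapter / total:.2%}")  # 0.95%
```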
- Comparable or superior performance: MAdapter achieves performance on par with or better than standard adapters across NLU benchmarks, despite using approximately half the trainable parameters.
- Efficiency gains: By concentrating adapter capacity at the most informative middle layers, MAdapter avoids wasting parameters on layers that contribute less to task adaptation.
- Consistent improvements: The benefit of middle-layer augmentation is observed across different model architectures and task types, confirming the generalizability of the approach.
- Complex tasks benefit most: Tasks requiring richer feature interactions see the largest gains from the targeted middle-layer strategy, aligning with the hypothesis that these layers encode the most task-discriminative representations.
Why It Matters
MAdapter offers practical and conceptual contributions to the growing field of parameter-efficient fine-tuning:
- Placement strategy matters: This work demonstrates that where adapters are placed is just as important as how they are designed. A simple change in allocation strategy can halve the parameter count without sacrificing performance.
- Bridging interpretability and efficiency: By leveraging insights from transformer interpretability research (that middle layers are most task-relevant), MAdapter shows how understanding model internals can directly inform better PEFT designs.
- Practical applicability: With only ~1% of total model parameters being trainable and no architectural changes to the base model, MAdapter is easy to integrate into existing fine-tuning pipelines for BERT, RoBERTa, and similar models.
Tags: Efficiency, Representation Learning