
Superficial Success vs. Internal Breakdown: An Empirical Study of Generalization in Adaptive Multi-Agent Systems

ACL 2026 Findings
Namyoung So*, Seokgyu Jang*, Taeuk Kim (*: equal contribution)

One-Line Summary

An empirical study of two representative adaptive multi-agent systems (AFlow, AgentDropout) across six domains shows that learned topologies (i) fail to transfer out-of-distribution (topological overfitting) and (ii) often retain reasonable accuracy while the underlying collaboration has already collapsed (illusory coordination)—with role-related and connection-related breakdowns accounting for ~59% of failures under domain shift.

Figure 1. An adaptive MAS trained on the legal domain is deployed on science. Agents make multiple collaboration errors (Legal Text Extractor solving a physics problem, Verifier with nothing to verify, Holding Selector re-solving the problem from scratch), yet the final answer remains correct—surface accuracy masks illusory coordination.

Background & Motivation

Adaptive multi-agent systems (MAS) optimize both the set of agents A (roles) and their connections C (communication topology) from data, much like supervised topology search. The appeal is obvious: plug in a strong base LLM, learn a task-specific collaboration graph, and enjoy the gains. The paradox is that these systems are built from general-purpose LLMs yet are routinely tuned on a narrow slice of tasks—so it is unclear whether they behave as general-purpose systems at all.

This is more than an academic concern. Constructing an adaptive MAS is expensive (multiple LLM calls, repeated search, orchestration overhead), and deploying a separate MAS per task defeats the original motivation. If adaptive MAS only work in-domain, they are effectively narrow solvers in disguise.

Central question: When an adaptive MAS transfers well across domains, is the transfer driven by genuine collective intelligence, or simply by the raw ability of the underlying LLM? The paper argues the latter is alarmingly common and proposes metrics that expose it.

Study Setup

1. Two Representative Adaptive MAS. AFlow (bottom-up) incrementally constructs communication paths and jointly optimizes roles A and connections C. AgentDropout (top-down) prunes redundant links from a fully connected graph and optimizes only C, with AgentInit fixing A beforehand.
2. Six Domains / Reasoning Types. CaseHOLD (Legal), COM2 (Detective), MuSiQue (Multi-Hop), SciBench (Scientific), TheoremQA (Math), StrategyQA (Commonsense). Training uses the conventional small budgets (60 instances for AgentDropout, 100 for AFlow); the learned topology is frozen and applied to the other five domains for OOD evaluation, plus a multi-domain variant that mixes all six while keeping the total instance count constant.
3. Base LLMs and Judge. Main experiments use GPT-oss-20B as the base agent LLM, with Qwen3-30B-A3B confirming the trends in the appendix. GPT-oss-120B serves as the LLM-as-judge for failure taxonomy labeling. All numbers are averaged over three independent runs.
4. Qualitative Lens: MAST Taxonomy. Execution traces are labeled with the 14-category Multi-Agent System Failure Taxonomy (MAST) of Cemri et al. (2025). One hundred logs are judged per setting, isolating role violations, miscommunication, task derailment, step repetition, and missing verification.
5. Quantitative Lens: Two New Metrics. Role Alignment R_i = S1_i · (1 − S2_i), where S1_i is the cosine similarity between agent i's role prompt and its output (all-MiniLM-L6-v2 embeddings) and S2_i is the average similarity of that output to the other agents' outputs; a large R_i means the agent contributes unique, role-faithful content. Connection Significance O_i = Σ_ℓ α_{i,ℓ} · s_{i,ℓ}, where α_{i,ℓ} is a softmax influence weight of incoming message ℓ against the static priors (role, query) and s_{i,ℓ} ∈ {+1, −1} is a per-message usefulness judgment from an LLM-as-judge. O_i near 0 means messages are ignored; O_i < 0 means influential but unhelpful; O_i > 0 means influential and helpful.
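Both metrics reduce to a few lines of vector arithmetic. The sketch below uses toy 3-dimensional vectors in place of real sentence embeddings (the paper uses all-MiniLM-L6-v2 for S1/S2 and an LLM-as-judge for s); the function names are illustrative, not from the paper's code.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def role_alignment(role_emb, output_emb, other_output_embs):
    # R_i = S1_i * (1 - S2_i): high when agent i's output matches its own
    # role prompt (S1) but differs from the other agents' outputs (S2).
    s1 = cosine(role_emb, output_emb)
    s2 = float(np.mean([cosine(output_emb, o) for o in other_output_embs]))
    return s1 * (1.0 - s2)

def connection_significance(alpha, s):
    # O_i = sum_l alpha_{i,l} * s_{i,l}: softmax influence weights alpha
    # times per-message usefulness s in {+1, -1} from an LLM judge.
    return float(np.dot(alpha, s))

# Toy vectors: an agent faithful to its role and distinct from peers
# scores a high R.
role = np.array([1.0, 0.0, 0.0])
output = np.array([0.9, 0.1, 0.0])
peers = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]
print(round(role_alignment(role, output, peers), 3))

# Three incoming messages: two judged helpful, one harmful.
print(round(connection_significance([0.5, 0.3, 0.2], [+1, -1, +1]), 3))
```

A message that is heavily weighted (large α) but judged harmful (s = −1) drags O below zero, which is exactly the "influential but unhelpful" regime the metric is designed to expose.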

Finding 1: Topological Overfitting

Single-domain MAS optimization produces topologies that are surprisingly brittle under distribution shift. AgentDropout trained on CaseHOLD (Legal) drops from 63.5% in-domain to an average of 55.78% across the five unseen domains, and AgentDropout trained on StrategyQA (Commonsense) cannot even produce valid outputs on most other domains (e.g., 0.6% / 0.5% / 0.1% / 15.7% on Legal / Detective / Science / Math) because topologies tuned for binary true/false answers fail to produce multiple-choice or numerical solutions.

AgentDropout on GPT-oss-20B — Train Domain (row) → Test Domain (col) Accuracy

| Train / Test | Legal | Detective | Multi-Hop | Science | Math | Commonsense |
|---|---|---|---|---|---|---|
| CaseHOLD (Legal) | 63.5 | 44.2 | 57.4 | 41.8 | 65.5 | 70.0 |
| COM2 (Detective) | 53.4 | 47.9 | 53.8 | 35.8 | 54.4 | 19.5 |
| MuSiQue (Multi-Hop) | 63.2 | 49.0 | 58.4 | 40.1 | 65.4 | 73.8 |
| SciBench (Science) | 61.8 | 34.2 | 54.9 | 38.9 | 62.8 | 47.5 |
| TheoremQA (Math) | 62.2 | 47.2 | 57.5 | 36.9 | 63.8 | 75.1 |
| StrategyQA (Commonsense) | 0.6 | 0.5 | 41.5 | 0.1 | 15.7 | 72.5 |
| Multi-Domain Training | 60.2 | 46.7 | 52.9 | 41.1 | 64.4 | 75.3 |

A simple mitigation—mixing training data across all six domains while holding the total instance budget constant—recovers most of the in-domain baselines and significantly stabilizes the worst cases, hinting that generalization failures are driven by narrow training scope rather than an intrinsic limit of adaptive MAS.
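The mitigation is straightforward to operationalize. The sketch below is a hypothetical helper (not from the paper's code) that draws a near-equal share of instances from each domain pool while holding the total budget at the single-domain value:

```python
import random
from collections import Counter

def mixed_budget(domain_pools, total_budget, seed=0):
    # Sample an (almost) equal share from each domain so the total
    # instance count matches the single-domain training budget.
    rng = random.Random(seed)
    names = sorted(domain_pools)
    base, extra = divmod(total_budget, len(names))
    mix = []
    for i, name in enumerate(names):
        k = base + (1 if i < extra else 0)
        mix.extend((name, x) for x in rng.sample(domain_pools[name], k))
    rng.shuffle(mix)
    return mix

# Six domains under AgentDropout's budget of 60 -> 10 instances each.
pools = {d: list(range(100)) for d in
         ["Legal", "Detective", "Multi-Hop", "Science", "Math", "Commonsense"]}
mix = mixed_budget(pools, 60)
print(len(mix), Counter(d for d, _ in mix))
```

Because the total count is unchanged, any accuracy gain from the mixed variant is attributable to training breadth, not extra data.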

Finding 2: Illusory Coordination

Even where accuracy looks acceptable, the qualitative and quantitative analyses reveal that collaboration is often not the reason. Applying MAST to 100 execution logs per setting shows that role- and connection-related failures (categories 1–6) account for roughly 59% of all errors under domain transfer.

Failure Distribution under Domain Transfer (MAST)

| Failure Type | Share |
|---|---|
| Disobey Role Specification | 15.22% |
| No or Incorrect Verification | 10.10% |
| Task Derailment | 8.97% |
| Disobey Task Specification | 8.94% |
| Step Repetition | 8.65% |
| Ignored Other Agent's Input | 7.21% |
| Miscellaneous (8 other MAST cases) | 40.90% |

Representative case studies show the pattern vividly: a Legal Text Extractor attempting Carnot-efficiency physics (role misalignment), a Validator ignoring prior agent outputs and re-solving from scratch (input neglect), and an Answer Synthesizer replying “True” to a multiple-choice question (task violation).

Role Alignment (R) and Connection Significance (O) under Domain Transfer — AgentDropout

Each cell reports R / O, normalized per row by the row-wise maximum, so the largest entry in each row equals 1.00 (usually, though not always, the in-domain diagonal). Low normalized values expose illusory coordination even when raw accuracy is deceptively reasonable.

| Train / Test | Legal | Detective | Multi-Hop | Science | Math | Commonsense |
|---|---|---|---|---|---|---|
| CaseHOLD (Legal) | 1.00 / 1.00 | 0.56 / 0.07 | 0.04 / −1.79 | 0.22 / −2.07 | 0.25 / −1.89 | 0.54 / −1.56 |
| COM2 (Detective) | 0.79 / 0.90 | 1.00 / 1.00 | 0.04 / 0.17 | 0.43 / 0.65 | 0.47 / 0.58 | 0.82 / 0.46 |
| MuSiQue (Multi-Hop) | 0.69 / 1.00 | 1.00 / 0.96 | 0.38 / 0.15 | 0.50 / 0.95 | 0.58 / 0.80 | 0.58 / 0.86 |
| SciBench (Science) | 0.44 / −0.75 | 0.49 / −0.50 | 0.04 / −0.57 | 1.00 / 1.00 | 0.62 / 0.77 | 0.46 / −0.07 |
| TheoremQA (Math) | 0.38 / −0.12 | 0.36 / 0.21 | 0.04 / −0.07 | 0.60 / 1.00 | 1.00 / 0.88 | 0.32 / 0.48 |
| StrategyQA (Commonsense) | 1.00 / 0.93 | 0.96 / 0.95 | 0.07 / 0.17 | 0.32 / 1.00 | 0.45 / 0.90 | 0.95 / 0.81 |
| Multi-Domain Training | 0.89 / 0.69 | 1.00 / 0.99 | 0.31 / −0.23 | 0.58 / 0.88 | 0.62 / 0.85 | 0.98 / 1.00 |

Correlation and Ablation

A component-swap ablation isolates what exactly overfits. Role-OOD (keep in-domain connections, swap roles with OOD ones) drops accuracy by an average of −13.00 pp, while Connection-OOD (swap only connections) drops by only −1.24 pp—indicating that learned roles are substantially more task-specific than learned connections. A notable exception is MuSiQue (Multi-Hop), where Connection-OOD alone causes a 5.36 pp drop, showing that valid inter-agent links matter most when the task itself requires multi-hop integration.

| Benchmark | Acc–R (Pearson) | Acc–O (Pearson) | In-Domain | Connection-OOD | Role-OOD |
|---|---|---|---|---|---|
| CaseHOLD | −0.007 | 0.0002 | 63.50 | 62.88 (−0.62) | 48.26 (−15.24) |
| COM2 | −0.035** | 0.045*** | 47.90 | 50.68 (+2.78) | 34.50 (−13.40) |
| MuSiQue | 0.003 | 0.123*** | 58.40 | 53.04 (−5.36) | 48.44 (−9.96) |
| SciBench | 0.084*** | −0.039* | 38.90 | 38.69 (−0.21) | 30.29 (−8.61) |
| TheoremQA | 0.113*** | −0.081*** | 63.80 | 61.26 (−2.54) | 51.64 (−12.16) |
| StrategyQA | −0.096*** | 0.067** | 72.50 | 71.00 (−1.50) | 53.89 (−18.61) |

Asterisks mark the statistical significance of the correlations.
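The ablation logic amounts to recombining the two learned components. A schematic sketch (the lists below stand in for whatever role prompts and communication links the optimizer actually produces; names are illustrative):

```python
def component_swap(in_domain, out_of_domain):
    # Each learned system is a (roles, connections) pair.
    # Role-OOD keeps in-domain connections but swaps in OOD roles;
    # Connection-OOD keeps in-domain roles but swaps in OOD connections.
    roles_in, conns_in = in_domain
    roles_ood, conns_ood = out_of_domain
    return {
        "Role-OOD": (roles_ood, conns_in),
        "Connection-OOD": (roles_in, conns_ood),
    }

variants = component_swap(
    (["Legal Text Extractor", "Verifier"], [("Extractor", "Verifier")]),
    (["Physicist", "Calculator"], [("Physicist", "Calculator")]),
)
print(sorted(variants))
```

Evaluating each variant in-domain isolates which component carries the overfitting: here, the Role-OOD variant accounts for nearly all of the accuracy loss.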

Why It Matters

The paper reframes what it means for an adaptive MAS to “work.” A system can post competitive numbers on a new benchmark while internally collapsing into a single strong LLM carrying the load, with other agents producing irrelevant or actively misleading messages. Benchmarks that reward only final-answer accuracy miss this, and the field’s current optimization objectives actively encourage it.
