BlendX: Complex Multi-Intent Detection with Blended Patterns
LREC-COLING 2024
Yejin Yoon, Jungyeon Lee, Kangsan Kim, Chanhee Park, Taeuk Kim
One-Line Summary
BlendX is a suite of refined multi-intent detection datasets that exposes the fragility of existing benchmarks by constructing linguistically diverse blended utterances via rule-based heuristics and ChatGPT-assisted generation, revealing accuracy drops of up to 43%p in state-of-the-art models.
Figure 1. Motivation for BlendX: existing multi-intent detection datasets (MixX) rely on overly simple concatenation patterns, highlighting the need for a more rigorous testbed for multi-intent detection.
Background & Motivation
Core Problem: Dominant multi-intent detection benchmarks (MixATIS, MixSNIPS) construct examples using only four conjunction templates — "and," "and then," "and also," and a comma — allowing models to achieve high accuracy by exploiting shallow concatenation cues rather than truly understanding compositional semantics.
Multi-intent detection (MID) addresses the realistic scenario where a single user utterance conveys multiple intents at once. Studies show that over half of utterances in production dialogue systems contain more than one intent, making MID a critical capability for real-world task-oriented dialogue systems.
However, existing MID benchmarks suffer from three fundamental limitations:
Limited concatenation patterns: MixATIS and MixSNIPS use only four conjunctions to combine single-intent utterances, creating highly predictable multi-intent examples that do not reflect the diversity of natural language.
Narrow dataset scope: Prior work has been limited to only two source datasets (ATIS and SNIPS), restricting the diversity of intents and domains evaluated.
Inflated performance: The overly simplistic construction allows models to achieve near-perfect scores by detecting surface-level concatenation cues (e.g., the word "and") rather than genuinely decomposing complex utterances into their constituent intents.
These limitations motivate BlendX, which aims to provide a more rigorous and linguistically diverse benchmark that truly tests a model's ability to handle multi-intent utterances as they appear in real conversations.
Figure 3. Illustration of the complexity (Left) and methodology (Right) aspects of concatenation. Each approach triggers a distinct part of the possible variations (Middle) arising in the process of concatenation.
Proposed Method
BlendX constructs its benchmark along two orthogonal dimensions — complexity (explicit vs. implicit) and methodology (Naïve, Manual, Generative) — and introduces three novel complexity metrics to quantify dataset difficulty.
1. Naïve Concatenation (Baseline)
Replicates the original MixX recipe by combining single-intent utterances using only the four standard AND-variant conjunctions (and, and then, and also, comma). Serves as a direct comparison baseline with zero word reduction (W = 0%) across all datasets.
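A minimal sketch of this recipe, assuming random choice among the four AND-variant connectors; the function and variable names are illustrative, not the authors' code:

```python
# Naive (MixX-style) concatenation: join single-intent utterances
# with one of the four standard AND-variant connectors.
import random

AND_VARIANTS = [" and ", " and then ", " and also ", ", "]

def naive_concat(utterances):
    """Join single-intent utterances with randomly sampled connectors."""
    blended = utterances[0]
    for utt in utterances[1:]:
        blended += random.choice(AND_VARIANTS) + utt
    return blended

# naive_concat(["book a flight to boston", "play some jazz"])
# -> e.g. "book a flight to boston and then play some jazz"
```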
2. Manual Concatenation (Rule-Based Heuristics)
Applies diverse linguistic transformations through explicit patterns (varied conjunctions: or, before, after, additionally, meanwhile) and implicit patterns (omissions that reduce word count, coreferences via pronoun introduction, and gerund phrases). Achieves the highest complexity: 37–48% word reduction across datasets.
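For a flavor of the implicit patterns, here is a toy sketch of one coreference rule; the paper's actual rule set is considerably richer:

```python
# A toy stand-in for one implicit Manual pattern: coreference via
# pronoun introduction, where a repeated mention in the second
# utterance is replaced by a pronoun before blending.
def coreference_blend(first, second, mention, pronoun="it"):
    """Blend two utterances, replacing a shared mention with a pronoun."""
    return f"{first}, {second.replace(mention, pronoun)}"

# coreference_blend("find the cheapest flight to denver",
#                   "tell me when the flight departs",
#                   mention="the flight")
# -> "find the cheapest flight to denver, tell me when it departs"
```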
3. Generative Concatenation (ChatGPT-Assisted)
Leverages ChatGPT to produce natural multi-intent utterances, augmented with SBERT-based similarity filtering (cosine similarity threshold τ = 0.7) to ensure semantic fidelity. This similarity-driven selection dramatically reduces the ChatGPT error rate (e.g., from 41% to 10% on ATIS). Achieves 18–37% word reduction.
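A sketch of how such a filter might look using the sentence-transformers library; the encoder choice and the way the reference text is built are assumptions, not the paper's exact setup:

```python
# SBERT-based similarity filter: keep a ChatGPT-blended utterance only
# if it stays semantically close to its single-intent source utterances.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder choice

def keep_generation(source_utterances, generated, tau=0.7):
    """Accept a generated blend iff cosine similarity to the sources >= tau."""
    reference = " ".join(source_utterances)
    embeddings = model.encode([reference, generated], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return similarity >= tau
```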
4. Novel Complexity Metrics
Three binary metrics quantify dataset difficulty: W(utt, n) flags word-count reduction after concatenation (detecting omissions), C(utt, n) flags the absence of an explicit conjunction (detecting implicit joins), and P(utt, n) flags pronoun introduction (detecting coreferences). Together, these metrics show that BlendX is measurably more complex than MixX.
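As a rough illustration, the three checks could be implemented as binary per-utterance indicators along the following lines; the word lists and exact formulations are assumptions, not the paper's definitions:

```python
# Illustrative binary indicators for the W, C, and P complexity checks.
CONJUNCTIONS = {"and", "or", "then", "also", "before", "after",
                "additionally", "meanwhile"}
PRONOUNS = {"it", "its", "they", "them", "their", "this", "that"}

def w_metric(sources, blended):
    """W: blending dropped words relative to the raw sources (omission)."""
    return len(blended.split()) < sum(len(s.split()) for s in sources)

def c_metric(blended):
    """C: no explicit conjunction appears in the blend (implicit join)."""
    return not any(tok in CONJUNCTIONS for tok in blended.lower().split())

def p_metric(sources, blended):
    """P: a pronoun appears in the blend but in no source (coreference)."""
    source_tokens = {tok for s in sources for tok in s.lower().split()}
    return any(tok in PRONOUNS and tok not in source_tokens
               for tok in blended.lower().split())
```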
Four widely used single-intent datasets are extended into multi-intent versions with a 3:5:2 ratio of single-, double-, and triple-intent utterances (a sampling sketch follows the table):
| Dataset   | Intents | Train  | Dev   | Test  | Total  |
|-----------|---------|--------|-------|-------|--------|
| SNIPS     | 7       | 50,625 | 2,613 | 2,615 | 55,853 |
| ATIS      | 18      | 20,250 | 1,125 | 1,125 | 22,500 |
| Banking77 | 77      | 36,390 | 2,009 | 2,021 | 40,420 |
| CLINC150  | 147     | 54,896 | 2,889 | 2,977 | 60,762 |
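A minimal sketch of assembling such a split under the stated 3:5:2 ratio; `blend_fn` is a hypothetical placeholder for any of the three concatenation methods:

```python
# Assemble a split with single-, double-, and triple-intent utterances
# drawn in 3:5:2 proportions from a pool of single-intent examples.
import random

def build_split(singles, total_size, blend_fn, ratio=(3, 5, 2)):
    """Draw utterances with 1, 2, or 3 intents in the stated proportions."""
    split = []
    for _ in range(total_size):
        n_intents = random.choices([1, 2, 3], weights=ratio)[0]
        sources = random.sample(singles, n_intents)
        split.append(sources[0] if n_intents == 1 else blend_fn(sources))
    return split
```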
Experimental Results
Three representative models are evaluated: TFMN (threshold-based multi-intent detection), SLIM (binary classification approach), and ChatGPT (in-context learning). The cross-evaluation paradigm (train on MixX, test on BlendX) reveals critical weaknesses.
TFMN Performance (Accuracy %)
| Train  | Test   | SNIPS | ATIS  | Banking77 | CLINC150 |
|--------|--------|-------|-------|-----------|----------|
| MixX   | MixX   | 95.68 | 77.98 | 76.61     | 85.88    |
| MixX   | BlendX | 52.51 | 42.51 | 37.31     | 42.45    |
| BlendX | BlendX | 94.93 | 76.50 | 63.99     | 77.96    |
SLIM Performance (Accuracy %)
| Train  | Test   | SNIPS | ATIS  | Banking77 | CLINC150 |
|--------|--------|-------|-------|-----------|----------|
| MixX   | MixX   | 95.97 | 77.10 | 83.71     | 88.67    |
| MixX   | BlendX | 93.51 | 72.80 | 69.89     | 73.39    |
| BlendX | BlendX | 95.73 | 76.92 | 75.30     | 85.62    |
Ablation: TFMN Trained on MixX, Tested on BlendX Subsets (Accuracy %)
| Method     | SNIPS | ATIS  | Banking77 | CLINC150 |
|------------|-------|-------|-----------|----------|
| Naïve      | 95.32 | 73.23 | 62.30     | 80.73    |
| Manual     | 25.32 | 42.40 | 8.05      | 25.73    |
| Generative | 81.58 | 53.93 | 27.95     | 60.17    |
Dramatic cross-evaluation drops: TFMN accuracy plummets by up to 43%p when moving from MixX to BlendX evaluation (e.g., SNIPS: 95.68% → 52.51%; Banking77: 76.61% → 37.31%), exposing heavy reliance on superficial concatenation cues.
Manual patterns are most challenging: The ablation reveals that Manual concatenation (implicit patterns) devastates model performance — TFMN drops to just 8.05% on Banking77 and 25.32% on SNIPS — confirming that implicit blending without conjunction markers is the primary source of difficulty.
SLIM is more robust but still affected: SLIM shows smaller drops (e.g., Banking77: 83.71% → 69.89%) due to its binary classification approach, but still suffers meaningfully on BlendX, especially on Banking77 and CLINC150.
Training on BlendX helps but does not close the gap: Even when models are both trained and tested on BlendX (e.g., TFMN: 63.99% on Banking77), performance remains substantially below MixX-only evaluation (76.61%), demonstrating BlendX's inherent complexity beyond simple distributional shift.
Similarity-based selection improves generation quality: SBERT-based similarity filtering reduces ChatGPT's error rate from 41% to 10% on ATIS (cosine similarity jumps from 0.214 to 0.758), validating the importance of controlled generation.
Why It Matters
Key Takeaway: State-of-the-art MID results on MixATIS/MixSNIPS substantially overestimate real-world capability. BlendX provides the community with a more honest evaluation framework for multi-intent detection.
BlendX makes three significant contributions to the field of multi-intent detection:
Exposes benchmark weakness: By demonstrating performance drops of up to 43%p, BlendX reveals that existing models rely on shallow pattern matching rather than genuine semantic understanding, challenging the perceived progress in the MID field.
Broader evaluation scope: Extending from 2 to 4 source datasets (adding Banking77 and CLINC150 with 77 and 147 intent types) provides a far more comprehensive evaluation that covers diverse domains and intent granularities.
Principled complexity framework: The two-dimensional construction framework (complexity × methodology) and three novel statistical metrics give researchers systematic tools for understanding and measuring dataset difficulty, enabling future benchmark development.
Practical implications: For production dialogue systems, BlendX's findings suggest that deployed models may perform far worse on real user utterances than lab evaluations indicate, motivating the development of more robust multi-intent detection approaches.