BlendX: Complex Multi-Intent Detection with Blended Patterns
LREC-COLING 2024
Yejin Yoon, Jungyeon Lee, Kangsan Kim, Chanhee Park, Taeuk Kim
One-Line Summary
BlendX is a suite of refined multi-intent detection datasets that exposes the fragility of existing benchmarks by constructing linguistically diverse blended utterances via rule-based heuristics and ChatGPT-assisted generation, revealing accuracy drops of up to 43%p in state-of-the-art models.
Figure 1. Motivation for BlendX: existing multi-intent detection datasets (MixX) rely on overly simple concatenation patterns, highlighting the need for a more rigorous testbed for multi-intent detection.
Background & Motivation
Core Problem: Dominant multi-intent detection benchmarks (MixATIS, MixSNIPS) construct examples using only four conjunction templates — "and," "and then," "and also," and a comma — allowing models to achieve high accuracy by exploiting shallow concatenation cues rather than truly understanding compositional semantics.
Multi-intent detection (MID) addresses the realistic scenario where a single user utterance conveys multiple intents at once. Studies show that over half of utterances in production dialogue systems contain more than one intent, making MID a critical capability for real-world task-oriented dialogue systems.
However, existing MID benchmarks suffer from three fundamental limitations:
Limited concatenation patterns: MixATIS and MixSNIPS use only four conjunctions to combine single-intent utterances, creating highly predictable multi-intent examples that do not reflect the diversity of natural language.
Narrow dataset scope: Prior work has been limited to only two source datasets (ATIS and SNIPS), restricting the diversity of intents and domains evaluated.
Inflated performance: The overly simplistic construction allows models to achieve near-perfect scores by detecting surface-level concatenation cues (e.g., the word "and") rather than genuinely decomposing complex utterances into their constituent intents.
These limitations motivate BlendX, which aims to provide a more rigorous and linguistically diverse benchmark that truly tests a model's ability to handle multi-intent utterances as they appear in real conversations.
Figure 3. Illustration of the complexity (Left) and methodology (Right) aspects of concatenation. Each approach triggers a distinct part of the possible variations (Middle) arising in the process of concatenation.
Proposed Method
BlendX constructs its benchmark along two orthogonal dimensions — complexity (explicit vs. implicit) and methodology (Naïve, Manual, Generative) — and introduces three novel complexity metrics to quantify dataset difficulty.
1. Naïve Concatenation (Baseline)
Replicates the original MixX recipe by combining single-intent utterances using only the four standard AND-variant conjunctions (and, and then, and also, comma). Serves as a direct comparison baseline with zero word reduction (W = 0%) across all datasets.
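A minimal sketch of this recipe, assuming random choice among the four AND-variant connectors; the function and variable names are illustrative, not the authors' code:

```python
# Naive (MixX-style) concatenation: join single-intent utterances
# with one of the four standard AND-variant connectors.
import random

AND_VARIANTS = [" and ", " and then ", " and also ", ", "]

def naive_concat(utterances):
    """Join single-intent utterances with randomly sampled connectors."""
    blended = utterances[0]
    for utt in utterances[1:]:
        blended += random.choice(AND_VARIANTS) + utt
    return blended

# naive_concat(["book a flight to boston", "play some jazz"])
# -> e.g. "book a flight to boston and then play some jazz"
```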
2. Manual Concatenation (Rule-Based Heuristics)
Applies diverse linguistic transformations through explicit patterns (varied conjunctions: or, before, after, additionally, meanwhile) and implicit patterns (omissions that reduce word count, coreferences via pronoun introduction, and gerund phrases). Achieves the highest complexity: 37–48% word reduction across datasets.
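For a flavor of the implicit patterns, here is a toy sketch of one coreference rule; the paper's actual rule set is considerably richer:

```python
# A toy stand-in for one implicit Manual pattern: coreference via
# pronoun introduction, where a repeated mention in the second
# utterance is replaced by a pronoun before blending.
def coreference_blend(first, second, mention, pronoun="it"):
    """Blend two utterances, replacing a shared mention with a pronoun."""
    return f"{first}, {second.replace(mention, pronoun)}"

# coreference_blend("find the cheapest flight to denver",
#                   "tell me when the flight departs",
#                   mention="the flight")
# -> "find the cheapest flight to denver, tell me when it departs"
```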
3. Generative Concatenation (ChatGPT-Assisted)
Leverages ChatGPT to produce natural multi-intent utterances, augmented with SBERT-based similarity filtering (cosine similarity threshold τ = 0.7) to ensure semantic fidelity. This similarity-driven selection dramatically reduces the ChatGPT error rate (e.g., from 41% to 10% on ATIS). Achieves 18–37% word reduction.
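A sketch of how such a filter might look using the sentence-transformers library; the encoder choice and the way the reference text is built are assumptions, not the paper's exact setup:

```python
# SBERT-based similarity filter: keep a ChatGPT-blended utterance only
# if it stays semantically close to its single-intent source utterances.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder choice

def keep_generation(source_utterances, generated, tau=0.7):
    """Accept a generated blend iff cosine similarity to the sources >= tau."""
    reference = " ".join(source_utterances)
    embeddings = model.encode([reference, generated], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return similarity >= tau
```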
4. Novel Complexity Metrics
Three binary metrics quantify dataset difficulty: W(utt, n) flags word-count reduction after concatenation (detecting omissions), C(utt, n) flags the absence of an explicit conjunction (detecting implicit joins), and P(utt, n) flags pronoun introduction (detecting coreferences). Together, these metrics show that BlendX is measurably more complex than MixX.
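As a rough illustration, the three checks could be implemented as binary per-utterance indicators along the following lines; the word lists and exact formulations are assumptions, not the paper's definitions:

```python
# Illustrative binary indicators for the W, C, and P complexity checks.
CONJUNCTIONS = {"and", "or", "then", "also", "before", "after",
                "additionally", "meanwhile"}
PRONOUNS = {"it", "its", "they", "them", "their", "this", "that"}

def w_metric(sources, blended):
    """W: blending dropped words relative to the raw sources (omission)."""
    return len(blended.split()) < sum(len(s.split()) for s in sources)

def c_metric(blended):
    """C: no explicit conjunction appears in the blend (implicit join)."""
    return not any(tok in CONJUNCTIONS for tok in blended.lower().split())

def p_metric(sources, blended):
    """P: a pronoun appears in the blend but in no source (coreference)."""
    source_tokens = {tok for s in sources for tok in s.lower().split()}
    return any(tok in PRONOUNS and tok not in source_tokens
               for tok in blended.lower().split())
```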
Four widely used single-intent datasets are extended into multi-intent versions with a 3:5:2 ratio of single-, double-, and triple-intent utterances (a sampling sketch follows the table):
| Dataset   | Intents | Train  | Dev   | Test  | Total  |
|-----------|---------|--------|-------|-------|--------|
| SNIPS     | 7       | 50,625 | 2,613 | 2,615 | 55,853 |
| ATIS      | 18      | 20,250 | 1,125 | 1,125 | 22,500 |
| Banking77 | 77      | 36,390 | 2,009 | 2,021 | 40,420 |
| CLINC150  | 147     | 54,896 | 2,889 | 2,977 | 60,762 |
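A minimal sketch of assembling such a split under the stated 3:5:2 ratio; `blend_fn` is a hypothetical placeholder for any of the three concatenation methods:

```python
# Assemble a split with single-, double-, and triple-intent utterances
# drawn in 3:5:2 proportions from a pool of single-intent examples.
import random

def build_split(singles, total_size, blend_fn, ratio=(3, 5, 2)):
    """Draw utterances with 1, 2, or 3 intents in the stated proportions."""
    split = []
    for _ in range(total_size):
        n_intents = random.choices([1, 2, 3], weights=ratio)[0]
        sources = random.sample(singles, n_intents)
        split.append(sources[0] if n_intents == 1 else blend_fn(sources))
    return split
```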
Experimental Results
Three representative models are evaluated: TFMN (threshold-based multi-intent detection), SLIM (binary classification approach), and ChatGPT (in-context learning). The cross-evaluation paradigm (train on MixX, test on BlendX) reveals critical weaknesses.
TFMN Performance (Accuracy %)
| Train  | Test   | SNIPS | ATIS  | Banking77 | CLINC150 |
|--------|--------|-------|-------|-----------|----------|
| MixX   | MixX   | 95.68 | 77.98 | 76.61     | 85.88    |
| MixX   | BlendX | 52.51 | 42.51 | 37.31     | 42.45    |
| BlendX | BlendX | 94.93 | 76.50 | 63.99     | 77.96    |
SLIM Performance (Accuracy %)
| Train  | Test   | SNIPS | ATIS  | Banking77 | CLINC150 |
|--------|--------|-------|-------|-----------|----------|
| MixX   | MixX   | 95.97 | 77.10 | 83.71     | 88.67    |
| MixX   | BlendX | 93.51 | 72.80 | 69.89     | 73.39    |
| BlendX | BlendX | 95.73 | 76.92 | 75.30     | 85.62    |
Ablation: TFMN Trained on MixX, Tested on BlendX Subsets (Accuracy %)
| Method     | SNIPS | ATIS  | Banking77 | CLINC150 |
|------------|-------|-------|-----------|----------|
| Naïve      | 95.32 | 73.23 | 62.30     | 80.73    |
| Manual     | 25.32 | 42.40 | 8.05      | 25.73    |
| Generative | 81.58 | 53.93 | 27.95     | 60.17    |
Dramatic cross-evaluation drops: TFMN accuracy plummets by up to 43%p when moving from MixX to BlendX evaluation (e.g., SNIPS: 95.68% → 52.51%; Banking77: 76.61% → 37.31%), exposing heavy reliance on superficial concatenation cues.
Manual patterns are most challenging: The ablation reveals that Manual concatenation (implicit patterns) devastates model performance — TFMN drops to just 8.05% on Banking77 and 25.32% on SNIPS — confirming that implicit blending without conjunction markers is the primary source of difficulty.
SLIM is more robust but still affected: SLIM shows smaller drops (e.g., Banking77: 83.71% → 69.89%) due to its binary classification approach, but still suffers meaningfully on BlendX, especially on Banking77 and CLINC150.
Training on BlendX helps but does not close the gap: Even when models are both trained and tested on BlendX (e.g., TFMN: 63.99% on Banking77), performance remains substantially below MixX-only evaluation (76.61%), demonstrating BlendX's inherent complexity beyond simple distributional shift.
Similarity-based selection improves generation quality: SBERT-based similarity filtering reduces ChatGPT's error rate from 41% to 10% on ATIS (cosine similarity jumps from 0.214 to 0.758), validating the importance of controlled generation.
Why It Matters
Key Takeaway: State-of-the-art MID results on MixATIS/MixSNIPS substantially overestimate real-world capability. BlendX provides the community with a more honest evaluation framework for multi-intent detection.
BlendX makes three significant contributions to the field of multi-intent detection:
Exposes benchmark weakness: By demonstrating performance drops of up to 43%p, BlendX reveals that existing models rely on shallow pattern matching rather than genuine semantic understanding, challenging the perceived progress in the MID field.
Broader evaluation scope: Extending from 2 to 4 source datasets (adding Banking77 and CLINC150 with 77 and 147 intent types) provides a far more comprehensive evaluation that covers diverse domains and intent granularities.
Principled complexity framework: The two-dimensional construction framework (complexity × methodology) and three novel statistical metrics give researchers systematic tools for understanding and measuring dataset difficulty, enabling future benchmark development.
Practical implications: For production dialogue systems, BlendX's findings suggest that deployed models may perform far worse on real user utterances than lab evaluations indicate, motivating the development of more robust multi-intent detection approaches.