BlendX: Complex Multi-Intent Detection with Blended Patterns

LREC-COLING 2024
Yejin Yoon, Jungyeon Lee, Kangsan Kim, Chanhee Park, Taeuk Kim

One-Line Summary

BlendX is a suite of refined multi-intent detection datasets that exposes the fragility of existing benchmarks by constructing linguistically diverse blended utterances via rule-based heuristics and ChatGPT-assisted generation, revealing up to 40%p accuracy drops in state-of-the-art models.

Figure 1. Motivation for BlendX: existing multi-intent detection datasets (MixX) rely on overly simple concatenation patterns, highlighting the need for a more rigorous testbed for multi-intent detection.

Background & Motivation

Core Problem: Dominant multi-intent detection benchmarks (MixATIS, MixSNIPS) construct examples using only four conjunction templates — "and," "and then," "and also," and a comma — allowing models to achieve high accuracy by exploiting shallow concatenation cues rather than truly understanding compositional semantics.

Multi-intent detection (MID) addresses the realistic scenario where a single user utterance conveys multiple intents at once. Studies show that over half of utterances in production dialogue systems contain more than one intent, making MID a critical capability for real-world task-oriented dialogue systems.

However, existing MID benchmarks suffer from two fundamental limitations:

- The concatenation patterns are overly simple: utterances are joined with only four AND-variant conjunctions, so models can score well by exploiting these surface cues rather than compositional semantics.
- The resulting utterances are unrealistic: they lack the omissions, coreferences, and varied connectives that characterize genuine multi-intent queries.
These limitations motivate BlendX, which aims to provide a more rigorous and linguistically diverse benchmark that truly tests a model's ability to handle multi-intent utterances as they appear in real conversations.

Figure 3. Illustration of the complexity (Left) and methodology (Right) aspects of concatenation. Each approach triggers a distinct part of the possible variations (Middle) arising in the process of concatenation.

Proposed Method

BlendX constructs its benchmark along two orthogonal dimensions — complexity (explicit vs. implicit) and methodology (Naïve, Manual, Generative) — and introduces three novel complexity metrics to quantify dataset difficulty.

1. Naïve Concatenation (Baseline)
Replicates the original MixX recipe by combining single-intent utterances using only the four standard AND-variant conjunctions (and, and then, and also, comma). Serves as a direct comparison baseline with zero word reduction (W = 0%) across all datasets.

2. Manual Concatenation (Rule-Based Heuristics)
Applies diverse linguistic transformations through explicit patterns (varied conjunctions: or, before, after, additionally, meanwhile) and implicit patterns (omissions that reduce word count, coreferences via pronoun introduction, and gerund phrases). Achieves the highest complexity: 37–48% word reduction across datasets. A toy sketch contrasting the naïve and manual styles follows this list.

3. Generative Concatenation (ChatGPT + Similarity Selection)
Leverages ChatGPT to produce natural multi-intent utterances, augmented with SBERT-based similarity filtering (cosine similarity threshold τ = 0.7) to ensure semantic fidelity. This similarity-driven selection sharply reduces the ChatGPT error rate (e.g., from 41% to 10% on ATIS) and yields 18–37% word reduction. See the filtering sketch after this list.

4. Novel Complexity Metrics
Three binary metrics quantify dataset difficulty: W(utt, n) detects word-count reduction after concatenation (omissions), C(utt, n) detects the absence of an explicit conjunction (implicit joins), and P(utt, n) detects newly introduced pronouns (coreferences). These metrics objectively show that BlendX is more challenging than MixX; a sketch of how such flags might be computed also appears after this list.
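
To make the first two strategies concrete, here is a minimal, hypothetical sketch of the two concatenation styles. The helper names are illustrative only; the actual BlendX rules also cover omission, coreference, and gerund patterns not shown here.

```python
# Toy sketch of Naive vs. Manual concatenation (illustrative only; the
# real BlendX pipeline also applies omission, coreference, and gerund rules).
import random

NAIVE_CONJUNCTIONS = ["and", "and then", "and also", ","]      # MixX recipe
MANUAL_CONJUNCTIONS = ["or", "before", "after", "additionally", "meanwhile"]


def naive_concat(u1: str, u2: str) -> str:
    """MixX-style blend: join two utterances with an AND-variant conjunction."""
    conj = random.choice(NAIVE_CONJUNCTIONS)
    return f"{u1}, {u2}" if conj == "," else f"{u1} {conj} {u2}"


def manual_concat(u1: str, u2: str) -> str:
    """Explicit rule-based blend using a more varied conjunction."""
    return f"{u1} {random.choice(MANUAL_CONJUNCTIONS)} {u2}"
```

For example, manual_concat("book a table for two", "play some jazz") might yield "book a table for two meanwhile play some jazz", which no longer contains an AND-variant cue.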
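
The similarity-driven selection used in the Generative strategy can be pictured with the short sketch below. It assumes the sentence-transformers library; the checkpoint name and the keep_generation helper are placeholders rather than the authors' exact setup, and only the τ = 0.7 threshold comes from the description above.

```python
# Hypothetical similarity filter for ChatGPT-generated blends (sketch only).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder SBERT checkpoint
TAU = 0.7  # cosine-similarity threshold quoted above


def keep_generation(source_utterances: list[str], generated: str, tau: float = TAU) -> bool:
    """Accept a generated blend only if it stays semantically close to its sources."""
    src_emb = model.encode(" ".join(source_utterances), convert_to_tensor=True)
    gen_emb = model.encode(generated, convert_to_tensor=True)
    return util.cos_sim(src_emb, gen_emb).item() >= tau
```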
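
Finally, the three complexity metrics can be approximated as per-utterance flags averaged over a dataset. The following is a minimal sketch under that reading, with simplified word lists; it is not the paper's exact formulation.

```python
# Sketch of the W / C / P complexity flags (simplified; the paper's
# definitions may differ in detail).
CONJUNCTIONS = {"and", "or", "then", "also", "before", "after",
                "additionally", "meanwhile"}
PRONOUNS = {"it", "its", "they", "them", "their", "one"}


def w_flag(blend: str, sources: list[str]) -> bool:
    """Words were dropped relative to the source utterances (omission)."""
    return len(blend.split()) < sum(len(s.split()) for s in sources)


def c_flag(blend: str) -> bool:
    """No explicit conjunction appears in the blended utterance (implicit join)."""
    return not any(tok in CONJUNCTIONS for tok in blend.lower().split())


def p_flag(blend: str, sources: list[str]) -> bool:
    """A pronoun appears in the blend but in none of the sources (coreference)."""
    src_tokens = {t for s in sources for t in s.lower().split()}
    return any(t in PRONOUNS and t not in src_tokens for t in blend.lower().split())


def dataset_complexity(pairs: list[tuple[str, list[str]]]) -> dict[str, float]:
    """Average the binary flags (as percentages) over (blend, sources) pairs."""
    n = len(pairs)
    return {
        "W": 100 * sum(w_flag(b, s) for b, s in pairs) / n,
        "C": 100 * sum(c_flag(b) for b, _ in pairs) / n,
        "P": 100 * sum(p_flag(b, s) for b, s in pairs) / n,
    }
```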

Four widely used single-intent datasets are extended into multi-intent versions with a 3:5:2 ratio of single / double / triple-intent utterances:

Dataset     Intents   Train    Dev     Test    Total
SNIPS       7         50,625   2,613   2,615   55,853
ATIS        18        20,250   1,125   1,125   22,500
Banking77   77        36,390   2,009   2,021   40,420
CLINC150    147       54,896   2,889   2,977   60,762
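
As a rough illustration of how a split with the 3:5:2 ratio could be assembled, here is a hypothetical sampler; build_split and blend are made-up names, and the real construction pipeline is more involved.

```python
# Hypothetical sampler for a split with a 3:5:2 single/double/triple ratio.
import random


def build_split(single_utts: list[str], total: int, blend) -> list[str]:
    """blend(list_of_utterances) stands in for any concatenation strategy above."""
    examples = []
    for _ in range(total):
        k = random.choices([1, 2, 3], weights=[3, 5, 2])[0]
        picks = random.sample(single_utts, k)
        examples.append(picks[0] if k == 1 else blend(picks))
    return examples
```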

Experimental Results

Three representative models are evaluated: TFMN (threshold-based multi-intent detection), SLIM (binary classification approach), and ChatGPT (in-context learning). The cross-evaluation paradigm (train on MixX, test on BlendX) reveals critical weaknesses.
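
The tables below report accuracy, which in the MID literature usually means exact match over the full predicted intent set; a minimal sketch of that scoring, with predict standing in for any of the three models, looks like this.

```python
# Sketch of exact-match accuracy for the cross-evaluation setting, assuming
# an utterance counts as correct only if the full intent set is recovered.
def exact_match_accuracy(examples, predict):
    """examples: iterable of (utterance, gold_intent_list); predict: utterance -> intents."""
    examples = list(examples)
    correct = sum(set(predict(utt)) == set(gold) for utt, gold in examples)
    return 100.0 * correct / len(examples)
```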

TFMN Performance (Accuracy %)

Train    Test     SNIPS   ATIS    Banking77   CLINC150
MixX     MixX     95.68   77.98   76.61       85.88
MixX     BlendX   52.51   42.51   37.31       42.45
BlendX   BlendX   94.93   76.50   63.99       77.96

SLIM Performance (Accuracy %)

Train    Test     SNIPS   ATIS    Banking77   CLINC150
MixX     MixX     95.97   77.10   83.71       88.67
MixX     BlendX   93.51   72.80   69.89       73.39
BlendX   BlendX   95.73   76.92   75.30       85.62

Ablation: TFMN Trained on MixX, Tested on BlendX Subsets (Accuracy %)

Method       SNIPS   ATIS    Banking77   CLINC150
Naïve        95.32   73.23   62.30       80.73
Manual       25.32   42.40   8.05        25.73
Generative   81.58   53.93   27.95       60.17

Why It Matters

Key Takeaway: State-of-the-art MID results on MixATIS/MixSNIPS substantially overestimate real-world capability. BlendX provides the community with a more honest evaluation framework for multi-intent detection.

BlendX makes three significant contributions to the field of multi-intent detection:

- It exposes how heavily state-of-the-art MID results depend on the simplistic concatenation patterns of MixATIS/MixSNIPS (MixX).
- It releases BlendX, a suite of four refined multi-intent datasets built with rule-based heuristics and ChatGPT-assisted generation, together with new complexity metrics (W, C, P) for quantifying dataset difficulty.
- It demonstrates empirically that strong models (TFMN, SLIM, ChatGPT) suffer large accuracy drops on BlendX, motivating more robust approaches to multi-intent detection.
