One-Line Summary
A new multi-intent detection dataset that incorporates realistic linguistic phenomena -- ellipsis and coreference -- to create naturalistic multi-intent utterances that go beyond simple concatenation of single-intent sentences. Evaluations on the dataset reveal significant performance drops in existing models when they face natural-language complexity.
Background & Motivation
Multi-intent detection is the task of identifying multiple user intents within a single utterance, a common scenario in real-world dialogue systems (e.g., "Book a flight to Seoul and find me a hotel nearby"). While this capability is critical for building practical conversational AI, existing datasets fall short of capturing the true complexity of human language.
Key Limitations of Existing Multi-Intent Datasets:
- Mechanical concatenation: Most datasets (e.g., MixATIS, MixSNIPS) are constructed by simply joining two or more single-intent utterances with a conjunction, producing stilted and artificial examples like "Play jazz music and what is the weather in Seoul."
- Missing ellipsis: In natural speech, speakers omit repeated elements (e.g., "Book a flight to Seoul and a hotel there" instead of "Book a flight to Seoul and book a hotel in Seoul"), but existing datasets rarely model this phenomenon.
- Absent coreference: Real users naturally use pronouns and anaphoric references across intents (e.g., "Find a restaurant nearby and reserve it for two"), which existing datasets do not capture.
- Evaluation gap: Models trained and evaluated on artificially concatenated data may appear to perform well but fail on real-world utterances that exhibit these natural linguistic phenomena.
This gap between artificial benchmarks and real-world language use motivates the creation of a dataset that incorporates implicit concatenation -- multi-intent utterances where ellipsis and coreference make the expressions compact and natural, yet significantly more challenging for automated systems to parse.
Explicit vs. Implicit Concatenation: Concrete Examples
| Type | Example | Linguistic Phenomenon |
| --- | --- | --- |
| Explicit (Existing) | "Book a flight to Seoul and book a hotel in Seoul" | Mechanical conjunction; repeated elements preserved |
| Implicit (Ellipsis) | "Book a flight and a hotel in Seoul" | Shared verb and location collapsed |
| Implicit (Coreference) | "Find an Italian restaurant and reserve it for two" | Anaphoric pronoun replaces repeated entity |
| Implicit (Combined) | "Find a hotel near the airport and check in at 3 PM" | Both ellipsis (hotel omitted) and deixis (implied location) |
The examples illustrate a spectrum from artificial to natural: while explicit concatenation preserves all information redundantly, implicit concatenation removes redundancy in ways that require inference to recover the full semantic content -- exactly the kind of reasoning that current models lack.
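The contrast can be made concrete with gold-annotation sketches. The dict schema and the intent/slot names below are hypothetical illustrations, not the dataset's actual format:

```python
# Hypothetical annotation sketch: in explicit concatenation every intent
# carries its own surface copy of each slot value, while in implicit
# concatenation a shared value appears only once and must be recovered
# by inference for the other intent.

explicit = {
    "utterance": "Book a flight to Seoul and book a hotel in Seoul",
    "intents": [
        {"intent": "book_flight", "slots": {"destination": "Seoul"}},
        {"intent": "book_hotel",  "slots": {"location": "Seoul"}},
    ],
}

implicit = {
    "utterance": "Book a flight and a hotel in Seoul",
    "intents": [
        {"intent": "book_flight", "slots": {"destination": "Seoul"}},
        {"intent": "book_hotel",  "slots": {"location": "Seoul"}},
    ],
}

# Both gold annotations need "Seoul" twice, but the implicit surface
# string mentions it only once.
assert explicit["utterance"].count("Seoul") == 2
assert implicit["utterance"].count("Seoul") == 1
```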
Proposed Method: Implicit Concatenation Dataset Construction
The core idea is to transform mechanically concatenated multi-intent utterances into linguistically natural ones by systematically applying ellipsis and coreference. Unlike previous datasets that simply join single-intent sentences with conjunctions, this approach produces utterances that mirror how real users naturally compress multiple requests into a single, compact expression. The construction follows a structured pipeline:
1. Base Pair Selection
Single-intent utterances from existing benchmarks (e.g., ATIS, SNIPS) are paired to form multi-intent combinations. Pairs are selected to cover a diverse range of intent combinations across domains such as travel, music, weather, and restaurant booking.
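A minimal sketch of how base pairs might be enumerated, assuming a small pool of single-intent utterances; the pool entries and intent labels are illustrative, not taken from the actual ATIS/SNIPS data:

```python
import itertools

# Illustrative single-intent pool: (utterance, intent, domain).
pool = [
    ("Book a flight to Seoul", "book_flight", "travel"),
    ("Find a hotel in Seoul", "find_hotel", "travel"),
    ("Play some jazz music", "play_music", "music"),
    ("What is the weather in Seoul", "get_weather", "weather"),
]

def candidate_pairs(pool):
    """Pair utterances with distinct intents to form multi-intent bases."""
    return [
        (a, b)
        for a, b in itertools.combinations(pool, 2)
        if a[1] != b[1]  # never pair two utterances of the same intent
    ]

pairs = candidate_pairs(pool)
```

Domain information can then be used to balance same-domain pairs (which admit more ellipsis) against cross-domain pairs.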
2. Implicit Concatenation via Ellipsis
Repeated elements shared between the two intents are identified and removed from one of the utterances. For example, "Book a flight to Seoul" + "Book a hotel in Seoul" becomes "Book a flight and a hotel in Seoul," where the repeated verb and location are collapsed into a single, natural expression.
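A toy version of this ellipsis rule can be sketched as follows. It assumes the shared verb opens both utterances and the shared entity ends the first one -- a deliberate simplification of the construction process described above:

```python
def apply_ellipsis(first: str, second: str, shared: str) -> str:
    """Toy ellipsis rule (an illustration, not the exact procedure):
    drop the repeated leading verb from the second utterance, drop the
    shared trailing "<prep> <entity>" phrase from the first, then join
    the remainders with "and"."""
    f_words, s_words = first.split(), second.split()
    # Collapse the repeated verb at the start of the second utterance.
    if f_words[0].lower() == s_words[0].lower():
        s_words = s_words[1:]
    # Collapse "<prep> <shared>" at the end of the first utterance.
    if f_words[-1] == shared:
        f_words = f_words[:-2]
    return " ".join(f_words) + " and " + " ".join(s_words)

merged = apply_ellipsis("Book a flight to Seoul", "Book a hotel in Seoul", "Seoul")
# merged == "Book a flight and a hotel in Seoul"
```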
3. Implicit Concatenation via Coreference
Entities mentioned in the first intent are replaced with pronouns or deictic expressions in the second intent. For instance, "Find an Italian restaurant" + "Make a reservation at the Italian restaurant" becomes "Find an Italian restaurant and make a reservation there," introducing anaphoric reference across intents.
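The coreference step can likewise be sketched as a simple substitution. Here `mention` and `pronoun` are supplied by hand for illustration, whereas the real construction involves annotator judgment:

```python
def apply_coreference(first: str, second: str, mention: str, pronoun: str) -> str:
    """Toy coreference rule (illustrative only): replace the entity
    mention repeated in the second utterance with an anaphoric
    pronoun, then join the two utterances with "and"."""
    second = second.replace(mention, pronoun)
    # Lowercase the second utterance's first letter when joining.
    second = second[0].lower() + second[1:]
    return first + " and " + second

merged = apply_coreference(
    "Find an Italian restaurant",
    "Make a reservation at the Italian restaurant",
    "at the Italian restaurant",
    "there",
)
# merged == "Find an Italian restaurant and make a reservation there"
```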
4. Quality Validation & Annotation
Human annotators review the transformed utterances for linguistic naturalness, semantic preservation, and correct intent/slot labeling. Utterances that sound unnatural or lose intent information are revised or discarded to maintain dataset quality.
Experimental Results
State-of-the-art multi-intent detection models were evaluated on both the standard (explicit) concatenation dataset and the new implicit concatenation dataset to quantify the impact of realistic linguistic phenomena. The evaluation is designed to answer a critical question: do models that succeed on artificial benchmarks truly understand multi-intent utterances, or have they merely learned to exploit surface-level concatenation patterns?
Performance Comparison: Explicit vs. Implicit Concatenation
| Evaluation Setting | Intent Detection | Slot Filling | Overall Difficulty |
| --- | --- | --- | --- |
| Explicit Concatenation (Baseline) | High | High | Standard |
| Implicit Concatenation (Ours) | Significantly Lower | Significantly Lower | Challenging |
- Significant performance degradation: All evaluated models showed notable accuracy drops when tested on the implicit concatenation dataset, confirming that ellipsis and coreference pose genuine challenges for current multi-intent detection systems.
- Ellipsis is particularly challenging: Utterances where repeated elements are omitted proved especially difficult, as models struggle to recover the missing information needed to correctly identify all intents and fill all slots.
- Coreference adds complexity: Pronominalized references across intents increased parsing difficulty, as models must resolve what "it," "there," or "that" refers to in order to correctly assign slot values to the right intents.
- Slot filling hit hardest: While intent detection accuracy declined, the largest performance drops were observed in slot filling, where models must map specific values to specific intents -- a task made much harder when slot values are shared or implicitly referenced across intents.
- Gap between artificial and natural evaluation: The results demonstrate that strong performance on standard concatenation-based benchmarks does not guarantee robustness on more naturalistic multi-intent utterances, underscoring the need for more realistic evaluation data.
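To make the evaluation gap concrete, here is a sketch of exact-match accuracy over intent *sets*, a common multi-intent metric. The predictions and numbers are illustrative examples, not the paper's reported results:

```python
def intent_exact_match(preds, golds):
    """Exact-match accuracy over intent sets: an utterance counts as
    correct only if every one of its intents is recovered (a common
    multi-intent metric; the actual evaluation may differ)."""
    hits = sum(set(p) == set(g) for p, g in zip(preds, golds))
    return hits / len(golds)

# Illustrative scenario: a model that handles explicit concatenation
# but misses the elided second intent on implicit data.
gold = [
    ["book_flight", "book_hotel"],
    ["find_restaurant", "make_reservation"],
]
explicit_preds = [
    ["book_flight", "book_hotel"],
    ["find_restaurant", "make_reservation"],
]
implicit_preds = [
    ["book_flight"],  # the elided "book_hotel" intent was missed
    ["find_restaurant", "make_reservation"],
]

assert intent_exact_match(explicit_preds, gold) == 1.0
assert intent_exact_match(implicit_preds, gold) == 0.5
```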
Detailed Error Analysis by Phenomenon Type:
- Shared-verb ellipsis: When two intents share a verb (e.g., "book") and only one instance is retained, models frequently fail to propagate the action to both intent slots, resulting in incomplete slot filling for the second intent.
- Shared-entity ellipsis: When a location or entity is mentioned once but applies to both intents (e.g., "in Seoul"), models sometimes assign it only to the first intent, leaving the second intent's location slot empty.
- Pronominal coreference: When "it" or "there" refers back to an entity from the first intent, models struggle to resolve the reference, often leaving the referent slot unfilled or filling it with a generic placeholder.
- Combined phenomena: Utterances exhibiting both ellipsis and coreference simultaneously show the steepest performance drops, as models must perform multiple types of inference to recover the full intent structure.
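The shared-entity failure mode above can be illustrated with a small post-hoc repair heuristic. This is our illustration of the missing behavior, not a method proposed in the work:

```python
def propagate_shared_slots(intents, slot_key):
    """Illustrative heuristic: when exactly one intent carries a value
    for `slot_key`, copy it to sibling intents whose slot is empty --
    precisely the propagation that models fail to perform on
    shared-entity ellipsis."""
    filled = [i["slots"][slot_key] for i in intents if i["slots"].get(slot_key)]
    if len(filled) == 1:
        for i in intents:
            if not i["slots"].get(slot_key):
                i["slots"][slot_key] = filled[0]
    return intents

# "Book a flight and a hotel in Seoul": the location is stated once
# but applies to both intents.
parsed = [
    {"intent": "book_flight", "slots": {"location": "Seoul"}},
    {"intent": "book_hotel", "slots": {}},  # model left the elided slot empty
]
repaired = propagate_shared_slots(parsed, "location")
```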
Why It Matters
As dialogue systems are deployed in increasingly complex real-world scenarios, the ability to understand natural multi-intent utterances becomes essential. The findings of this work have direct implications for both academic research on dialogue understanding and the practical engineering of production-grade conversational AI systems. This work makes three important contributions:
- More realistic evaluation benchmark: By incorporating ellipsis and coreference, the dataset provides a benchmark that better reflects how real users express multiple intents, enabling more meaningful evaluation of dialogue systems.
- Exposing model blind spots: The significant performance drops observed on implicit concatenation data reveal that current models have been overfitting to artificial concatenation patterns rather than learning robust multi-intent understanding.
- Guiding future research: The findings motivate the development of multi-intent detection models that can handle ellipsis resolution, coreference resolution, and other natural language phenomena -- capabilities that are critical for building dialogue systems that work reliably in practice.