
Beyond Task-Oriented and Chitchat Dialogues: Proactive and Transition-Aware Conversational Agents

EMNLP 2025
Yejin Yoon, Yuri Son, Namyoung So, Minseo Kim, Minsoo Cho, Chanhee Park, Seungshin Lee, Taeuk Kim

One-Line Summary

TACT is a large-scale dataset (9,936 dialogues from SLURP + 7,109 from MultiWOZ) with structurally diverse mode transitions between task-oriented dialogue and chitchat, paired with novel Switch/Recovery metrics and a DPO-based training framework that achieves 75.74% joint mode-intent accuracy and a 70.1% human win rate against GPT-4o.

Figure 2. Only the TACT-trained agent exhibits transition-awareness and proactivity by successfully returning to the original train-booking task after a chitchat interruption about scenic routes. Models trained on FusedChat, InterfereChat, and GPT-4o-mini all fail to recover the original context.

Background & Motivation

Conversational agents have traditionally been developed for either task-oriented dialogue (TOD) systems or open-ended chitchat, with limited progress in unifying the two. Yet real-world conversations naturally involve fluid transitions between these modes -- for example, a user booking a train ticket might casually mention enjoying scenic routes, then expect the agent to return to the booking task seamlessly.

Problem with Existing Datasets: Prior mode-switching datasets like FusedChat and InterfereChat allow only a single transition between TOD and chitchat (at most 1 switch point), and lack structural diversity. They adhere to a TOD-centric perspective, inserting chitchat at fixed points as single exchanges. This makes them unsuitable for modeling the dynamic, multi-turn transitions that occur in real conversations.

Two Missing Capabilities: Current systems lack (1) transition-awareness -- the ability to detect and adapt to mode changes, and (2) proactivity -- the ability to plan ahead and guide the conversation flow when appropriate. TACT is the first dataset that requires agents to both initiate and recover from mode transitions across multiple turns.

Proposed Method

Dataset Construction

TACT (TOD-And-Chitchat Transition) is built by augmenting two established TOD corpora -- MultiWOZ 2.2 and SLURP -- with structurally diverse chitchat transitions. Two core dialogue flow types are defined, followed by an automatic validation step:

1. TCT Flow (TOD → Chitchat → TOD): Extract task segments of 4+ turns from MultiWOZ 2.2 and insert a chitchat block at a natural boundary between intents. The chitchat briefly diverges from the task before the dialogue returns to its original goal.

2. CTC Flow (Chitchat → TOD → Chitchat): Begin with a short TOD segment (2-3 turns) and attach chitchat before and after the task, forming a wrap-around flow that simulates users casually engaging in a task during a social exchange.

3. Automatic Validation: A hybrid validation pipeline combines human-aligned criteria (from G-Eval) with model-based reasoning (from Active-Critic), scoring each dialogue on Intent Accuracy, Transition Quality, and Dialogue Naturalness using GPT-4o-mini as judge.
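The TCT insertion step above can be sketched as follows. This is a minimal illustration of the flow construction, not the authors' released pipeline; the turn format and the `build_tct_flow` helper are assumptions made for this sketch.

```python
# Illustrative sketch of TCT-flow construction (TOD -> Chitchat -> TOD).
# The (speaker, mode, text) turn format is an assumption, not TACT's schema.

def build_tct_flow(tod_turns, chitchat_block, boundary):
    """Insert a chitchat block at an intent boundary of a TOD segment.

    tod_turns: list of (speaker, mode, text) tuples for a task segment
               of 4+ turns, e.g. extracted from MultiWOZ 2.2.
    chitchat_block: list of (speaker, mode, text) chitchat turns.
    boundary: turn index at a natural boundary between intents.
    """
    assert len(tod_turns) >= 4, "TCT flows use task segments of 4+ turns"
    # The dialogue diverges into chitchat, then returns to the original goal.
    return tod_turns[:boundary] + chitchat_block + tod_turns[boundary:]

tod = [
    ("user", "tod", "I need a train to Cambridge on Friday."),
    ("agent", "tod", "What time would you like to leave?"),
    ("user", "tod", "Around 9am, please."),
    ("agent", "tod", "TR1234 departs at 9:00. Shall I book it?"),
]
chitchat = [
    ("user", "chitchat", "By the way, I love scenic train routes."),
    ("agent", "chitchat", "The Cambridge line has lovely countryside views!"),
]
dialogue = build_tct_flow(tod, chitchat, boundary=2)
modes = [m for _, m, _ in dialogue]
# Mode sequence: tod, tod, chitchat, chitchat, tod, tod
```

A CTC flow would instead concatenate chitchat, a short TOD segment, and chitchat, so the same helper pattern covers both structures.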

Dataset Statistics

| Statistic | TACT_MultiWOZ | TACT_SLURP |
|---|---|---|
| # Intents | 11 | 50+ |
| # Dialogues | 7,109 | 9,936 |
| # Avg. Turns | 15.04 | 16.42 |
| # Avg. Switch | 1.93 | 2.06 |
| # Avg. Recovery | 0.93 | 1.07 |
| # Unique Flow Types | 11 | 12 |
| Flow Patterns | TCT, CTC, TCTCT, etc. | TCT, CTC, TCTCT, etc. |

Training Framework

1. SFT with FnCTOD: Supervised fine-tuning using the FnCTOD architecture on LLaMA-3.1-8B-Instruct. The model reinterprets function-calling as structured intent representation: at each turn, it first predicts an intent, then generates a response conditioned on that intent -- enabling unified intent prediction and response generation in a single auto-regressive pass.

2. DPO (Direct Preference Optimization): Applied on top of SFT to align model outputs with human preferences. Preference pairs (3,009 instances) are generated by comparing FnCTOD and GPT-4o-mini outputs, with Gemini-2.5-Pro judging on sensibleness, specificity, interestingness, and transition naturalness. This is the first application of DPO in a unified dialogue generation setting combining structured function-calling with TOD and chitchat.
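For intuition, the objective applied to each of those preference pairs can be sketched with the standard DPO formulation. The pair construction (FnCTOD vs. GPT-4o-mini outputs judged by Gemini-2.5-Pro) happens offline; the `beta` value and log-probabilities below are illustrative assumptions, not the paper's hyperparameters.

```python
# Minimal sketch of the standard DPO objective on one preference pair.
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (policy log-ratio minus reference log-ratio))."""
    pi_logratio = policy_logp_chosen - policy_logp_rejected
    ref_logratio = ref_logp_chosen - ref_logp_rejected
    margin = beta * (pi_logratio - ref_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy prefers the chosen response more strongly than the
# frozen reference model does, the loss drops below log(2) (~0.693,
# the indifference point); training pushes it toward zero.
loss = dpo_loss(-10.0, -14.0, -12.0, -12.5, beta=0.1)
```

The loss only ever compares log-probability ratios, which is why the offline judge's chosen/rejected labels are all the supervision DPO needs.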

Evaluation Metrics: Switch & Recovery

- Switch: Measures when the agent shifts from one mode to another (e.g., TOD→Chitchat). Reported as Attempt (average number of agent-driven mode shifts per dialogue) and Success (average number of transitions accepted by the user). A switch is successful only when the user subsequently accepts the agent-suggested mode transition.

- Recovery: Measures when the agent returns to a previously suspended mode (e.g., the final return to TOD in a TCT flow). Also reported as Attempt and Success. Only about 34% of successful recoveries return to the exact previous intent -- the rest initiate a new but relevant intent within the correct mode, showing that recovery does not always mean returning to the same task.
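The two metrics above can be computed from a mode-annotated dialogue roughly as follows. The annotation scheme (each turn tagged with the mode it proposes or follows) and the counting rules are assumptions made for this sketch, not the paper's exact evaluation code.

```python
# Illustrative Switch/Recovery Attempt and Success counting over a
# mode-annotated dialogue. Annotation scheme is an assumption.

def switch_recovery_counts(turns):
    """turns: ordered list of (speaker, mode); modes are 'tod'/'chitchat'."""
    switch_att = switch_suc = recov_att = recov_suc = 0
    seen_modes = []
    for i, (speaker, mode) in enumerate(turns):
        prev_mode = turns[i - 1][1] if i > 0 else None
        if speaker == "agent" and prev_mode and mode != prev_mode:
            # Agent-driven mode shift counts as a switch attempt; shifting
            # back to a previously suspended mode is also a recovery attempt.
            is_recovery = mode in seen_modes
            switch_att += 1
            recov_att += is_recovery
            # Success requires the user's next turn to accept the new mode.
            accepted = i + 1 < len(turns) and turns[i + 1][1] == mode
            switch_suc += accepted
            recov_suc += accepted and is_recovery
        if mode not in seen_modes:
            seen_modes.append(mode)
    return switch_att, switch_suc, recov_att, recov_suc

# TCT flow: agent switches to chitchat, then recovers the suspended task.
turns = [
    ("user", "tod"), ("agent", "tod"),
    ("agent", "chitchat"), ("user", "chitchat"),
    ("agent", "tod"), ("user", "tod"),
]
counts = switch_recovery_counts(turns)  # (2, 2, 1, 1)
```

Note that this sketch treats any return to the suspended mode as a successful recovery; per the paper, only about a third of those returns resume the exact previous intent.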

Experimental Results

All SFT models are initialized with LLaMA-3.1-8B-Instruct and trained for 3 epochs with a learning rate of 1e-5 and batch size of 256. Four methods are compared: ICL (zero-shot and few-shot with GPT-4o), SFT, SFT-DPO, and a generative-classifier-based Pipeline.

Method Comparison (Table 4)

| Method | Mode Sel. Acc. | Mode Sel. F1 | Intent Acc./turn | Joint Acc./turn | Joint Acc./dlg | Switch Att. | Switch Suc. | Recov. Att. | Recov. Suc. | Chitchat Win-Rate |
|---|---|---|---|---|---|---|---|---|---|---|
| ICL-ZS | 90.46 | 86.21 | 87.57 | 85.01 | 30.00 | 0.879 | 0.374 | 0.880 | 0.099 | - |
| ICL-FS | 91.45 | 88.98 | 84.09 | 86.89 | 36.76 | 1.577 | 0.865 | 1.571 | 0.652 | - |
| SFT | 98.95 | 98.50 | 96.35 | 96.41 | 75.59 | 1.322 | 1.300 | 0.977 | 0.856 | 23.16 |
| SFT-DPO | 98.82 | 98.32 | 96.03 | 96.21 | 75.74 | 1.343 | 1.322 | 0.977 | 0.859 | 40.86 |
| Pipeline | 98.95 | 98.50 | 96.35 | 96.41 | 75.59 | 1.322 | 1.300 | 0.977 | 0.856 | 24.32 |

Human Preference Evaluation (Figure 8)

Human annotators (10 evaluators) assessed DPO vs. GPT-4o (few-shot) on 77 dialogues without a tie option:

| Criterion | DPO Win % | DPO Lose % |
|---|---|---|
| Sensibleness | 71.4 | 28.6 |
| Specificity | 77.9 | 22.1 |
| Interestingness | 71.4 | 28.6 |
| Transition Naturalness | 81.9 | 18.2 |
| Overall | 70.1 | 14.3 |

Cross-Dataset Comparison (Table 3)

Only the TACT-trained agent achieves non-zero transition-aware scores. Models trained on FusedChat and InterfereChat produce zero switch and recovery attempts due to their lack of multi-turn transition structures:

| Training Set | Avg. Joint Acc./turn | Avg. Joint Acc./dlg | Switch Att. | Switch Suc. | Recov. Suc. |
|---|---|---|---|---|---|
| FusedChat | 92.25 | 56.13 | 0.000 | 0.000 | - |
| InterfereChat | 84.82 | 35.39 | 0.000 | 0.000 | - |
| TACT_MultiWOZ | 92.11 | 58.60 | 1.322 | 1.300 | 0.856 |

Why It Matters

Real-world deployed dialogue systems frequently encounter users who shift between task-oriented requests and casual conversation within a single session. TACT is the first dataset to model this conversational fluidity with structurally diverse, multi-turn transitions and recoverable dialogue structures. By combining TACT's diverse training data with preference optimization via DPO, the resulting agent learns soft conversational skills -- engagement, flow continuity, and transition smoothness -- that go beyond mere response accuracy. The open-source dataset (HuggingFace) and code (GitHub) pave the way for building more autonomous and predictive conversational agents.
