OmniACBench: A Benchmark for Evaluating Context-Grounded Acoustic Control in Omni-Modal Models
arXiv 2026
Seunghee Kim, Bumkyu Park, Kyudan Jung, Joosung Lee, Soyoon Kim, Jeonghoon Kim, Taeuk Kim, Hwiyeol Jo
One-Line Summary
OmniACBench is a 3,559-instance benchmark that systematically evaluates whether omni-modal models can generate speech with acoustically appropriate delivery -- covering speech rate, phonation, pronunciation, emotion, accent, and timbre -- by grounding in multimodal context (spoken instructions, text scripts, and images), revealing that all eight tested models fall drastically short of human-level acoustic control.
Figure 1. Comparison with existing omni-modal benchmarks: OmniACBench is the first benchmark to evaluate acoustic control capability grounded in multimodal context.
Background & Motivation
Omni-modal models have made remarkable strides in processing text, vision, and audio inputs simultaneously while generating speech outputs. However, existing evaluation frameworks (e.g., OmniBench, AnyBench) have focused almost exclusively on the textual content of model responses, ignoring a critical dimension: how the speech sounds. Speech responses encode meaning through both linguistic content and paralinguistic cues -- the same sentence can convey comfort, urgency, or indifference depending on tone, speed, and vocal quality.
This gap means we have no systematic way to evaluate whether a model that scores well on text-output benchmarks can actually speak appropriately when the situation demands it. For instance, can a model whisper when shown an image of a sleeping baby, or speak with urgency when shown an emergency scene?
Key Question: Can omni-modal models go beyond generating semantically correct speech to producing acoustically appropriate delivery that matches multimodal context -- combining what they see, hear, and read into natural-sounding vocalization?
The authors define "context-grounded acoustic control" as the ability to generate speech with appropriate acoustic characteristics inferred from the combination of text, visual, and audio inputs. To address this, they introduce OmniACBench, targeting six acoustic features selected by two criteria: multimodal groundability (features naturally inferable from visual scenes) and evaluation diversity (a mixture of objectively measurable and perceptually abstract properties).
Proposed Method
Figure 2. OmniACBench construction pipeline: Illustrates the 3-stage process with representative examples.
OmniACBench is built through a rigorous 3-stage pipeline that produces tri-modal test instances (spoken instruction + text script + image), each targeting a specific acoustic feature value.
Stage 1: Acoustic Feature Selection
Six features are defined with specific target values: Speech Rate (fast/slow), Phonation (whisper), Pronunciation (heteronyms -- words spelled the same but pronounced differently), Emotion (joy, surprise, anger, disgust, fear, sadness), Global Accent (India, UK, Australia), and Timbre (adult male/female, elderly male/female). For each target value, image keywords are manually curated as visual concepts (e.g., "emergency scene" for fast speech rate, "sleeping baby" for whisper).
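To make the structure concrete, the feature-to-target-to-keyword mapping can be pictured as a small configuration table. The sketch below is illustrative only: "emergency scene" and "sleeping baby" are the examples quoted above, while every other keyword is a hypothetical placeholder, not the paper's curated list.

```python
# Illustrative sketch of the feature -> target value -> image keyword mapping.
# Only "emergency scene" and "sleeping baby" come from the paper; the rest
# are hypothetical placeholders standing in for the manually curated keywords.
ACOUSTIC_FEATURES = {
    "speech_rate": {
        "fast": ["emergency scene"],            # from the paper
        "slow": ["quiet reading room"],         # placeholder
    },
    "phonation": {
        "whisper": ["sleeping baby"],           # from the paper
    },
    "emotion": {
        "joy": ["birthday party"],              # placeholder
        "sadness": ["rainy farewell scene"],    # placeholder
        # ... surprise, anger, disgust, fear
    },
    "timbre": {
        "elderly_male": ["grandfather portrait"],  # placeholder
        # ... adult male/female, elderly female
    },
}
```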
Stage 2: Tri-Modal Instance Generation
Text scripts: LLM-generated neutral scripts explicitly designed to avoid encoding acoustic cues -- remaining agnostic regarding emotion, nationality, gender, and age.
Speech instructions: Control-signal templates paraphrased by LLMs for linguistic diversity (Word Position Deviation = 0.11, Lexical Deviation = 0.69), then synthesized via TTS.
Images: A meta-prompting strategy expands each image keyword into a visual description of 5-8 elements before image generation, yielding significantly greater diversity (CLIP distance 0.124 vs. 0.067; LPIPS 0.466 vs. 0.373) than direct keyword prompting.
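A rough sketch of how image-set diversity could be quantified with CLIP embeddings is shown below. It assumes "CLIP distance" means the mean pairwise cosine distance between images generated for the same keyword, which is an interpretation rather than the paper's documented procedure; the model checkpoint and file paths are placeholders.

```python
# Sketch: mean pairwise CLIP cosine distance over a set of generated images.
# Assumption: "CLIP distance" = average pairwise cosine distance between
# image embeddings; this is an interpretation, not the authors' exact recipe.
from itertools import combinations

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_diversity(image_paths):
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize embeddings
    dists = [1.0 - float(feats[i] @ feats[j])
             for i, j in combinations(range(len(feats)), 2)]
    return sum(dists) / len(dists)

# e.g. clip_diversity(["img_0.png", "img_1.png", "img_2.png"])
```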
Stage 3: Two-Stage Quality Control
LLM-based filtering checks three criteria: semantic preservation of paraphrases, text neutrality (removing scripts that leak target acoustic values), and image-keyword alignment. Human verification then validates each surviving instance. Starting from 3,640 candidates, 3,586 passed LLM filtering and 3,559 passed human review (a 97.78% retention rate). Speech quality is verified with near-perfect fidelity: WER = 0.004, CER = 0.001, STOI = 0.994. The final dataset contains approximately 600 instances per feature.
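The speech-quality checks can be reproduced in spirit with standard tooling. The snippet below is a minimal sketch using jiwer for WER/CER and pystoi for STOI; which signals are paired for STOI is not specified in the paper summary, so the reference/synthesized pairing here is an assumption.

```python
# Sketch: quality checks for a synthesized spoken instruction.
# jiwer scores transcripts (WER/CER); pystoi scores waveform intelligibility.
import jiwer
import soundfile as sf
from pystoi import stoi

def verify_instance(ref_text, hyp_text, ref_wav_path, syn_wav_path):
    wer = jiwer.wer(ref_text, hyp_text)   # word error rate vs. source text
    cer = jiwer.cer(ref_text, hyp_text)   # character error rate vs. source text

    # Assumption: STOI is computed between a reference recording and the
    # synthesized audio of the same utterance, truncated to equal length.
    ref_wav, sr = sf.read(ref_wav_path)
    syn_wav, sr2 = sf.read(syn_wav_path)
    assert sr == sr2, "sample rates must match for STOI"
    n = min(len(ref_wav), len(syn_wav))
    intelligibility = stoi(ref_wav[:n], syn_wav[:n], sr, extended=False)

    return {"wer": wer, "cer": cer, "stoi": intelligibility}
```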
Evaluation metrics are split by feature type. Measurable features use signal-level metrics: Speech Rate via delta Words Per Minute (DWPM) between fast/slow conditions, Pronunciation via Phoneme Error Rate (PER), and Phonation via Voiced Frame Ratio (VFR@0.3). Abstract features use WavLM-Large-based classifiers trained on curated datasets, achieving strong reference accuracies: 89.43% (Emotion), 97.29% (Global Accent), and 96.67% (Timbre). Semantic fidelity is measured via WER using Whisper-large-v3 transcription.
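For the signal-level metrics, a minimal sketch of DWPM and a voiced-frame ratio follows, assuming DWPM is the mean words-per-minute difference between fast and slow conditions and that VFR@0.3 counts an utterance as whispered when its voiced-frame ratio falls below 0.3; voicing is estimated here with librosa's pYIN, which may differ from the authors' tooling.

```python
# Sketch: delta-WPM between fast/slow conditions and voiced-frame ratio (VFR).
# Assumptions: DWPM = mean WPM(fast) - mean WPM(slow); an utterance is treated
# as whispered if its VFR is below 0.3. Voicing comes from librosa.pyin.
import librosa
import numpy as np

def words_per_minute(transcript: str, duration_sec: float) -> float:
    return len(transcript.split()) / (duration_sec / 60.0)

def delta_wpm(fast_wpms, slow_wpms) -> float:
    return float(np.mean(fast_wpms) - np.mean(slow_wpms))

def voiced_frame_ratio(wav_path: str) -> float:
    y, sr = librosa.load(wav_path, sr=16000)
    _, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    return float(np.mean(voiced_flag))  # fraction of frames judged voiced

def is_whisper(wav_path: str, threshold: float = 0.3) -> bool:
    return voiced_frame_ratio(wav_path) < threshold
```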
Experimental Results
Figure 3. Results across all evaluation metrics: Most models fall significantly short of baseline performance, particularly on abstract acoustic properties (emotion, accent).
Eight omni-modal models were evaluated: MiniCPM-o 4.5, InteractiveOmni (8B/4B), Qwen3-Omni 30B, Qwen2.5-Omni (7B/3B), Uni-MoE-2.0-Omni, and MGM-Omni 7B. The table below reports the reference values and three representative models.
| Model | Semantic WER ↓ | DWPM ↑ | Pronunciation PER ↓ | Emotion Acc ↑ | Accent Acc ↑ | Timbre Acc ↑ |
|---|---|---|---|---|---|---|
| Reference | 0.05 | 65.87 | 1.21 | 89.43% | 97.29% | 96.67% |
| MiniCPM-o 4.5 | 1.04 | 6.42 | 5.46 | 21.44% | 39.34% | 24.66% |
| Qwen3-Omni 30B | 2.14 | -1.81 | 7.40 | 17.09% | 31.33% | 25.17% |
| Qwen2.5-Omni 7B | 4.15 | 0.76 | 10.27 | 19.10% | 28.96% | 24.66% |
Massive performance gap: Even the best model (MiniCPM-o 4.5) achieves only DWPM 6.42 vs. the reference 65.87 for speech rate, and 21.44% vs. 89.43% for emotion -- roughly 10x and 4x gaps respectively.
Near-zero speech rate modulation: Most models show DWPM values near zero (Qwen3-Omni even has -1.81), indicating virtually no ability to adjust speaking speed based on context.
Timbre is essentially uncontrollable: All models cluster around the 25% random baseline for timbre, meaning they cannot adjust voice characteristics (male/female, young/old) at all.
Phonation control absent: Near-zero VFR@0.3 detection rates across all models indicate an inability to produce whispered speech.
Emotion relatively strongest but still limited: Emotion shows the best relative performance among abstract features, but even MiniCPM-o 4.5's 21.44% is only slightly above the 16.7% random baseline for 6-class classification.
Three Distinct Failure Types Identified via Controlled Input Decomposition:
Type I -- Lack of Direct Acoustic Control: Performance stays at chance even under Oracle conditions (explicit target specification). Timbre exhibits this completely; Global Accent and Phonation show it for most models except MiniCPM-o 4.5.
Type II -- Failure of Implicit Inference: Models can execute acoustic control with explicit instructions (Oracle) but fail to infer the correct target from context. Observed in Speech Rate for Qwen models and Phonation for MiniCPM-o 4.5 -- sharp Oracle improvement without corresponding context-based improvement.
Type III -- Failure in Multimodal Grounding: Models infer correctly from textualized context but fail when the information is distributed across modalities. MiniCPM-o 4.5's Speech Rate performance is strong with text-only context but degrades significantly when the cue is speech-only or image-only.
Figure 4. Linear probing analysis: Visualizes how context-related information is processed across internal model layers.
Architectural insight from context flow analysis: Linear probing of hidden states reveals that MiniCPM-o 4.5 retains acoustic context information throughout its language model backbone into the TTS decoder, maintaining high probing accuracy. In contrast, Qwen3-Omni 30B shows high decodability in its "Thinker" (language understanding) component but drops to chance in its "Talker" (speech generation) component -- suggesting that tighter integration between language understanding and speech generation is a key architectural factor for better acoustic control.
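The layer-wise probing analysis can be approximated with a simple linear classifier per layer. The sketch below assumes hidden states have already been extracted and mean-pooled per utterance into fixed-size vectors; how the authors pool, split, and label the representations is not specified here.

```python
# Sketch: linear probing of per-layer hidden states for the target acoustic class.
# Assumes `hidden_states` is a dict {layer_name: (n_samples, dim) array} of
# mean-pooled representations and `labels` holds the target class per sample.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layers(hidden_states: dict, labels: np.ndarray, seed: int = 0) -> dict:
    accuracies = {}
    for layer, feats in hidden_states.items():
        X_tr, X_te, y_tr, y_te = train_test_split(
            feats, labels, test_size=0.2, random_state=seed, stratify=labels
        )
        clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
        accuracies[layer] = clf.score(X_te, y_te)  # probing accuracy per layer
    return accuracies
```

A layer whose probing accuracy stays high indicates that the target acoustic information is still linearly decodable at that point; a drop to chance (as reported for Qwen3-Omni's Talker) suggests the information is lost before speech generation.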
Fundamental capability assessment: Testing components in isolation shows script-only generation WER is uniformly low (0.09-0.15) and visual cue selection accuracy reaches 92-97%. The bottleneck is not any individual capability but the integration of these capabilities into coherent context-grounded speech generation.
Why It Matters
As voice AI assistants become ubiquitous in daily life, the quality of speech generation extends far beyond getting the words right. "How it is said" matters as much as "what is said" -- imagine a medical assistant delivering a cancer diagnosis in a cheerful tone, or a navigation system calmly announcing an emergency detour. OmniACBench makes three key contributions:
First systematic evaluation of acoustic control: It fills a critical gap in omni-modal evaluation by moving beyond text-output metrics to assess whether models can actually speak appropriately.
Precise diagnosis of failure modes: The three identified failure types (direct control, implicit inference, multimodal grounding) provide clear and actionable directions for model improvement.
Architectural guidance: The finding that tighter LM-TTS integration correlates with better acoustic control offers a concrete design principle for next-generation omni-modal architectures.