
OmniACBench: A Benchmark for Evaluating Context-Grounded Acoustic Control in Omni-Modal Models

arXiv 2026
Seunghee Kim, Bumkyu Park, Kyudan Jung, Joosung Lee, Soyoon Kim, Jeonghoon Kim, Taeuk Kim, Hwiyeol Jo

One-Line Summary

OmniACBench is a 3,559-instance benchmark that systematically evaluates whether omni-modal models can generate speech with acoustically appropriate delivery -- covering speech rate, phonation, pronunciation, emotion, accent, and timbre -- by grounding in multimodal context (spoken instructions, text scripts, and images), revealing that all eight tested models fall drastically short of human-level acoustic control.

Figure 1. Comparison with existing omni-modal benchmarks: OmniACBench is the first benchmark to evaluate acoustic control capability grounded in multimodal context.

Background & Motivation

Omni-modal models have made remarkable strides in processing text, vision, and audio inputs simultaneously while generating speech outputs. However, existing evaluation frameworks (e.g., OmniBench, AnyBench) have focused almost exclusively on the textual content of model responses, ignoring a critical dimension: how the speech sounds. Speech responses encode meaning through both linguistic content and paralinguistic cues -- the same sentence can convey comfort, urgency, or indifference depending on tone, speed, and vocal quality.

This gap means we have no systematic way to evaluate whether a model that scores well on text-output benchmarks can actually speak appropriately when the situation demands it. For instance, can a model whisper when shown an image of a sleeping baby, or speak with urgency when shown an emergency scene?

Key Question: Can omni-modal models go beyond generating semantically correct speech to producing acoustically appropriate delivery that matches multimodal context -- combining what they see, hear, and read into natural-sounding vocalization?

The authors define "context-grounded acoustic control" as the ability to generate speech with appropriate acoustic characteristics inferred from the combination of text, visual, and audio inputs. To address this, they introduce OmniACBench, targeting six acoustic features selected by two criteria: multimodal groundability (features naturally inferable from visual scenes) and evaluation diversity (a mixture of objectively measurable and perceptually abstract properties).

Proposed Method

Figure 2. OmniACBench construction pipeline: Illustrates the 3-stage process with representative examples.

OmniACBench is built through a rigorous 3-stage pipeline that produces tri-modal test instances (spoken instruction + text script + image), each targeting a specific acoustic feature value.

1. Acoustic Feature Selection
Six features are defined with specific target values: Speech Rate (fast/slow), Phonation (whisper), Pronunciation (heteronyms -- words spelled the same but pronounced differently), Emotion (joy, surprise, anger, disgust, fear, sadness), Global Accent (India, UK, Australia), and Timbre (adult male/female, elderly male/female). For each target value, image keywords are manually curated as visual concepts (e.g., "emergency scene" for fast speech rate, "sleeping baby" for whisper).
2. Tri-Modal Instance Generation
Text scripts: LLM-generated neutral scripts explicitly designed to avoid encoding acoustic cues -- remaining agnostic regarding emotion, nationality, gender, and age.
Speech instructions: Control signal templates paraphrased by LLMs for linguistic diversity (Word Position Deviation = 0.11, Lexical Deviation = 0.69), then synthesized via TTS.
Images: A meta-prompting strategy expands image keywords into 5-8 element visual descriptions before image generation, yielding significantly greater diversity (CLIP distance 0.124 vs. 0.067; LPIPS 0.466 vs. 0.373) compared to direct keyword prompting.
3. Two-Stage Quality Control
LLM-based filtering checks three criteria: semantic preservation of paraphrases, text neutrality (removing scripts that leak target acoustic values), and image-keyword alignment. Human verification then validates each surviving instance. Starting from 3,640 candidates, 3,586 passed LLM filtering and 3,559 passed human review (97.78% retention rate). Speech quality is verified with near-perfect fidelity: WER = 0.004, CER = 0.001, STOI = 0.994. The final dataset contains approximately 600 instances per feature.
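The WER and CER fidelity figures above are edit-distance ratios. A minimal sketch of how such scores can be computed (the authors' exact tooling is not specified; this is a standard Levenshtein-based formulation):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (one-row DP)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # deletion
                        dp[j - 1] + 1,                          # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))      # substitution
            prev = cur
    return dp[n]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)
```

A WER of 0.004 thus means roughly 4 word-level edits per 1,000 reference words between the target script and the transcription of the synthesized instruction.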

Evaluation metrics are split by feature type. Measurable features use signal-level metrics: Speech Rate via delta Words Per Minute (DWPM) between fast/slow conditions, Pronunciation via Phoneme Error Rate (PER), and Phonation via Voiced Frame Ratio (VFR@0.3). Abstract features use WavLM-Large-based classifiers trained on curated datasets, achieving strong reference accuracies: 89.43% (Emotion), 97.29% (Global Accent), and 96.67% (Timbre). Semantic fidelity is measured via WER using Whisper-large-v3 transcription.
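The signal-level metrics can be sketched as follows. Assumptions (the paper's exact definitions may differ): DWPM is taken as mean words-per-minute under "fast" instructions minus the mean under "slow" instructions, and VFR@0.3 as the fraction of frames whose voicing probability (e.g., from a pitch tracker) exceeds 0.3:

```python
def wpm(transcript: str, duration_sec: float) -> float:
    """Words per minute for a single utterance."""
    return len(transcript.split()) / duration_sec * 60.0

def dwpm(fast_utts, slow_utts):
    """Delta WPM: mean WPM under 'fast' instructions minus mean WPM
    under 'slow' instructions. Each argument is a list of
    (transcript, duration_sec) pairs. Near-zero or negative values
    mean the model ignored (or inverted) the rate instruction."""
    mean = lambda xs: sum(xs) / len(xs)
    return (mean([wpm(t, d) for t, d in fast_utts])
            - mean([wpm(t, d) for t, d in slow_utts]))

def vfr(voicing_probs, threshold=0.3):
    """Voiced Frame Ratio: share of frames whose voicing probability
    exceeds the threshold. Whisper is unvoiced phonation, so a model
    that whispers on cue should score low here."""
    return sum(p > threshold for p in voicing_probs) / len(voicing_probs)
```

Under these definitions, the reference DWPM of 65.87 says human speakers widen the fast/slow gap by roughly 66 words per minute, while Qwen3-Omni's -1.81 means its "fast" speech is actually marginally slower than its "slow" speech.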

Experimental Results

Figure 3. Results across all evaluation metrics: Most models fall significantly short of baseline performance, particularly on abstract acoustic properties (emotion, accent).

Eight omni-modal models were evaluated: MiniCPM-o 4.5, InteractiveOmni (8B/4B), Qwen3-Omni 30B, Qwen2.5-Omni (7B/3B), Uni-MoE-2.0-Omni, and MGM-Omni 7B.

| Model | Semantic WER ↓ | DWPM ↑ | Pronunciation PER ↓ | Emotion Acc ↑ | Accent Acc ↑ | Timbre Acc ↑ |
|---|---|---|---|---|---|---|
| Reference | 0.05 | 65.87 | 1.21 | 89.43% | 97.29% | 96.67% |
| MiniCPM-o 4.5 | 1.04 | 6.42 | 5.46 | 21.44% | 39.34% | 24.66% |
| Qwen3-Omni 30B | 2.14 | -1.81 | 7.40 | 17.09% | 31.33% | 25.17% |
| Qwen2.5-Omni 7B | 4.15 | 0.76 | 10.27 | 19.10% | 28.96% | 24.66% |

Three Distinct Failure Types Identified via Controlled Input Decomposition:

Figure 4. Linear probing analysis: Visualizes how context-related information is processed across internal model layers.

Architectural insight from context flow analysis: Linear probing of hidden states reveals that MiniCPM-o 4.5 retains acoustic context information throughout its language model backbone into the TTS decoder, maintaining high probing accuracy. In contrast, Qwen3-Omni 30B shows high decodability in its "Thinker" (language understanding) component but drops to chance in its "Talker" (speech generation) component -- suggesting that tighter integration between language understanding and speech generation is a key architectural factor for better acoustic control.
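The probing setup can be approximated as follows: for each layer, fit a simple linear probe on hidden states labeled with the target acoustic value and measure held-out accuracy. This sketch uses a nearest-class-mean probe for self-containment (the paper's analysis presumably trains a proper linear classifier, and extracting per-layer states from the actual models is omitted):

```python
import numpy as np

def probe_accuracy(train_states, train_labels, test_states, test_labels):
    """Nearest-class-mean probe over one layer's hidden states.
    If acoustic-context information is (linearly) decodable at this
    layer, states cluster around their class means and accuracy is
    high; accuracy at chance means the information has been lost."""
    classes = sorted(set(train_labels))
    train_labels = np.asarray(train_labels)
    means = np.stack([train_states[train_labels == c].mean(axis=0)
                      for c in classes])
    # Assign each test state to the nearest class mean.
    dists = np.linalg.norm(test_states[:, None, :] - means[None, :, :], axis=-1)
    preds = [classes[i] for i in dists.argmin(axis=1)]
    return float(np.mean([p == y for p, y in zip(preds, test_labels)]))
```

Run per layer, a curve of probe accuracy against depth makes the reported contrast visible: flat and high through MiniCPM-o 4.5's backbone and TTS decoder, versus a collapse to chance at the Thinker-to-Talker boundary in Qwen3-Omni.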

Fundamental capability assessment: Testing components in isolation shows script-only generation WER is uniformly low (0.09-0.15) and visual cue selection accuracy reaches 92-97%. The bottleneck is not any individual capability but the integration of these capabilities into coherent context-grounded speech generation.
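The controlled input decomposition behind this assessment amounts to re-running generation on every non-empty subset of the tri-modal input. A minimal sketch of that enumeration (the field names are hypothetical, not the paper's actual data schema):

```python
from itertools import combinations

# Hypothetical modality keys for one OmniACBench instance.
MODALITIES = ["instruction_audio", "script_text", "image"]

def decompose_inputs(instance):
    """Yield every non-empty modality subset of a tri-modal instance,
    so each capability (script reading, visual cue selection, ...)
    can be tested in isolation and in combination."""
    for r in range(1, len(MODALITIES) + 1):
        for subset in combinations(MODALITIES, r):
            yield {m: instance[m] for m in subset}
```

Comparing a model's outputs across these seven conditions separates "cannot read the script" and "cannot pick the visual cue" from the failure the paper actually finds: inability to integrate intact individual capabilities into one context-grounded utterance.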

Why It Matters

As voice AI assistants become ubiquitous in daily life, the quality of speech generation extends far beyond getting the words right. "How it is said" matters as much as "what is said" -- imagine a medical assistant delivering a cancer diagnosis in a cheerful tone, or a navigation system calmly announcing an emergency detour. OmniACBench supplies the first systematic way to measure and diagnose this capability.
