OMHBench: Benchmarking Balanced and Grounded Omni-Modal Multi-Hop Reasoning
ACL 2026 Findings
Seunghee Kim, Ingyu Bang, Seokgyu Jang, Changhyeon Kim, Sanghwan Bae, Jihun Choi, Richeng Xuan, Taeuk Kim
One-Line Summary
A 6,144-question benchmark that enforces balanced, three-hop reasoning across text, image, and speech modalities, revealing that even the best models exhibit asymmetric omni-modal grounding — particularly when transitioning information to the speech modality.
Figure 1. Limitations of existing benchmarks: OMU (left) lacks textual context and contains modality shortcuts, while CMR (right) excludes audio and has imbalanced reasoning paths. OMHBench addresses both of these limitations.
Background & Motivation
Multimodal large language models (MLLMs) now claim to process text, images, and audio simultaneously, yet two fundamental questions remain unanswered: (1) Can omni-modal understanding (OMU) benchmarks truly evaluate all three modalities if most questions are solvable without using each modality? (2) Can cross-modal reasoning (CMR) benchmarks reliably measure reasoning when they are dominated by a single reasoning path?
The authors systematically analyzed existing evaluation frameworks and uncovered two critical shortcomings:
Problem 1 — Modality Shortcuts in OMU Benchmarks: Approximately 70-80% of instances in existing OMU benchmarks can be solved without accessing specific modalities (e.g., without visual or audio input), allowing models to take shortcuts that bypass true omni-modal understanding.
Problem 2 — Path Imbalance in CMR Benchmarks: Existing cross-modal reasoning datasets exhibit severely imbalanced reasoning paths (e.g., MuMuQA contains only Image-to-Text instances, MMQA skews ~2:1 toward Image-to-Text). When researchers forcibly balanced the paths, model accuracy dropped by up to 18%, revealing that previously reported results were overestimated due to path bias.
These findings motivated the creation of OMHBench, which bridges OMU and CMR paradigms while enforcing three requirements: (1) no shortcut-prone evaluation via enforced multi-hop reasoning, (2) incorporation of all three modalities (text, image, speech), and (3) explicit control of reasoning paths for unbiased assessment.
Figure 2. OMHBench task examples: Answers must be derived through diverse reasoning paths such as image→text, image→text→audio, etc. Attributes are unique to each table (and thus each modality), while entities are shared across all three tables.
Benchmark Construction Pipeline
Figure 5. OMHBench construction pipeline: A balanced benchmark is created through a 4-stage process — from structured table triplets to fully diversified omni-modal questions.
Stage 1: Table Triplet Formation
Construct triplets of three tables (10 entities × 3 attributes each) that share identical entities but carry distinct attributes. Data spans 4 real-world domains: Finance (23 companies, 15 financial attributes), Economics (18 countries, 18 economic indicators from World Bank), Climate (20 cities, 12 meteorological attributes), and Nutrition (24 foods, 19 nutritional attributes). Value ratios are constrained (max/min ≤ 30) for stable visualization.
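The entity sharing and the value-ratio constraint are straightforward to express as rejection sampling; the sketch below is a minimal illustration under assumed attribute names and value ranges, not the authors' construction code.

```python
# Minimal sketch (assumed names and ranges, not the authors' code): build one
# triplet of tables that share 10 anonymized entities but carry distinct
# attributes, rejecting any attribute column whose max/min ratio exceeds 30.
import random
import string

N_ENTITIES, ATTRS_PER_TABLE, MAX_RATIO = 10, 3, 30

def sample_column(low=1.0, high=500.0):
    """Draw one attribute column; keep it only if max/min <= MAX_RATIO."""
    while True:
        values = [round(random.uniform(low, high), 1) for _ in range(N_ENTITIES)]
        if max(values) / min(values) <= MAX_RATIO:
            return values

def build_triplet(attribute_pool):
    # Alphabetical entity codes mirror the anonymization used to block
    # parametric-knowledge shortcuts.
    entities = [f"Entity {c}" for c in string.ascii_uppercase[:N_ENTITIES]]
    attrs = random.sample(attribute_pool, 3 * ATTRS_PER_TABLE)
    tables = []
    for t in range(3):  # one table per slot; modalities are assigned later
        cols = attrs[t * ATTRS_PER_TABLE:(t + 1) * ATTRS_PER_TABLE]
        tables.append({a: dict(zip(entities, sample_column())) for a in cols})
    return entities, tables

entities, tables = build_triplet(
    ["revenue", "net income", "employees", "R&D spend", "market cap",
     "debt ratio", "dividend yield", "operating margin", "total assets"])
```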
Stage 2: Multi-Hop QA Construction
Generate 3-hop reasoning chains using 8 deterministic operations: Lookup (retrieve attribute values), Comparison (filter by inequality), Ranking (select top/bottom entities), Range (select within value intervals), Proximity (find closest to reference), Retrieval (extract final values), Summation, and Mean. This yields 33 valid operation combinations and two subsets: OMHBench-Connect (3,072 instances, entity-selection focused) and OMHBench-Reasoning (3,072 instances, aggregation-focused). Construction is fully automated without generative AI.
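Because the operations are deterministic, a question's gold answer can be recomputed exactly. The following sketch shows plausible implementations of the 8 operations and one 3-hop chain over a toy triplet; the function names, signatures, and toy data are illustrative assumptions rather than the paper's code.

```python
# Illustrative sketch of the deterministic operation set. Each hop consumes one
# table and either narrows the entity set or produces final values, so a 3-hop
# chain touches all three tables.
def lookup(table, attr, entities):
    """Retrieve attribute values for the given entities."""
    return {e: table[attr][e] for e in entities}

def comparison(values, threshold, greater=True):
    """Filter entities by an inequality against a threshold."""
    return {e for e, v in values.items() if (v > threshold) == greater}

def ranking(values, k=1, top=True):
    """Select the top-k (or bottom-k) entities by value."""
    return set(sorted(values, key=values.get, reverse=top)[:k])

def in_range(values, lo, hi):
    """Select entities whose value falls within [lo, hi]."""
    return {e for e, v in values.items() if lo <= v <= hi}

def proximity(values, reference):
    """Select the entity whose value is closest to a reference value."""
    return {min(values, key=lambda e: abs(values[e] - reference))}

def retrieval(table, attr, entities):
    """Extract the final answer values."""
    return sorted(table[attr][e] for e in entities)

def summation(vals): return sum(vals)
def mean(vals): return sum(vals) / len(vals)

# Toy triplet: same entities, distinct attribute per table.
t1 = {"revenue":   {"A": 120, "B": 90,  "C": 40, "D": 310}}
t2 = {"employees": {"A": 800, "B": 150, "C": 60, "D": 950}}
t3 = {"assets":    {"A": 55,  "B": 20,  "C": 10, "D": 210}}

# A "Connect"-style 3-hop chain (Ranking -> Comparison -> Retrieval):
hop1 = ranking(lookup(t1, "revenue", {"A", "B", "C", "D"}), k=2)   # {"A", "D"}
hop2 = comparison(lookup(t2, "employees", hop1), 900)              # {"D"}
answer = retrieval(t3, "assets", hop2)                             # [210]
```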
Stage 3: Omni-Modal Context Generation
Each table is converted into one of three modalities. Image: Matplotlib/Seaborn charts with 10 chart types, 20 fonts, 20 color palettes. Text: Domain-specific scenarios (analyst reports, news articles, meeting minutes) generated by 3 LLMs (GPT-5.1, Grok-4, Claude Sonnet 4.5) for linguistic diversity. Speech: Kokoro-82M TTS with 22 speech types, 27 voice variations, and multi-speaker dialogue format.
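As a rough illustration of the image branch, the sketch below renders a single table as a randomized chart with Matplotlib/Seaborn. The specific chart types, fonts, and palettes sampled here are placeholders, not the benchmark's actual configuration (which uses 10 chart types, 20 fonts, and 20 palettes).

```python
# Hedged sketch of the image-context step: render one table as a chart with a
# randomly sampled chart type, font family, and color palette.
import random
import matplotlib.pyplot as plt
import seaborn as sns

table = {"revenue": {"A": 120, "B": 90, "C": 40, "D": 310}}
attr = "revenue"
entities = list(table[attr])
values = [table[attr][e] for e in entities]

sns.set_palette(random.choice(["deep", "muted", "pastel", "colorblind"]))
plt.rcParams["font.family"] = random.choice(["DejaVu Sans", "serif", "monospace"])

fig, ax = plt.subplots(figsize=(5, 3))
chart_type = random.choice(["bar", "barh", "line"])
if chart_type == "bar":
    ax.bar(entities, values)
elif chart_type == "barh":
    ax.barh(entities, values)
else:
    ax.plot(entities, values, marker="o")
ax.set_title(f"{attr} by entity")
fig.savefig("context_image.png", dpi=150, bbox_inches="tight")
```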
Stage 4: Reasoning Path Diversification
By permuting modality assignments across 3 tables, each question is instantiated in all 3! = 6 reasoning paths (S-I-T, S-T-I, I-S-T, T-S-I, I-T-S, T-I-S). This yields 1,024 instances per path, preserving identical question-answer pairs while varying the required modality sequence — enabling controlled analysis of path sensitivity.
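The diversification step amounts to enumerating modality permutations over the three tables; a minimal sketch (with assumed record fields) is shown below.

```python
# Sketch of reasoning-path diversification: the same 3-table question is
# re-instantiated under every permutation of modality assignments, so the
# answer stays fixed while the required modality sequence varies.
from itertools import permutations

MODALITIES = ("text", "image", "speech")

def diversify(question_id, hop_tables):
    """Yield 3! = 6 path variants for one (question, table-triplet) pair."""
    for perm in permutations(MODALITIES):
        assignment = dict(zip(hop_tables, perm))        # hop i uses modality perm[i]
        path = "-".join(m[0].upper() for m in perm)     # e.g. "T-I-S"
        yield {"question_id": question_id, "path": path, "assignment": assignment}

for variant in diversify("q0001", ["table_1", "table_2", "table_3"]):
    print(variant["path"], variant["assignment"])
```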
Quality Assurance: Entity names are anonymized with alphabetical codes to prevent parametric knowledge exploitation. QA-based validation and LLM-based table reconstruction achieve 100% consistency. Question rephrasing uses multiple LLMs, achieving higher lexical deviation than the PAWS dataset (0.32 vs. 0.13). TTS quality is verified with WER: 0.03, CER: 0.02, STOI: 99.2, and SI-SDR: 21.0.
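For the speech checks, one way to obtain WER/CER figures of this kind is a round-trip comparison between the source script and an ASR transcript of the synthesized audio. The sketch below uses the jiwer package as an assumed tool; the ASR step itself is omitted, and STOI / SI-SDR would be computed from the waveforms with a separate audio-metrics library.

```python
# Hedged sketch of a TTS round-trip check: compare each source script against
# an ASR transcript of the synthesized speech using word/character error rates.
import jiwer

source_script = "Company A reported the highest revenue among the listed firms."
asr_transcript = "company a reported the highest revenue among the listed firms"

wer = jiwer.wer(source_script.lower(), asr_transcript.lower())
cer = jiwer.cer(source_script.lower(), asr_transcript.lower())
print(f"WER: {wer:.2f}  CER: {cer:.2f}")  # the paper reports WER 0.03, CER 0.02
```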
Experimental Results
13 state-of-the-art models were evaluated: 6 proprietary (Gemini family) and 7 open-source (including Qwen3-Omni, Phi-4 Multimodal, Qwen2.5-Omni, OmniVinci, MiniCPM-o, and Omni-AutoThink). The paper introduces the Path Balance Score (PBS), which counts an instance as correct only when the model answers it correctly across all 6 path variations.
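PBS can be computed directly from per-path correctness records; the sketch below assumes a simple results format (question id mapped to per-path correctness) purely for illustration.

```python
# Minimal sketch of the Path Balance Score (PBS): an instance counts only if
# the model answers it correctly under all 6 path variants.
def path_balance_score(results):
    """results: dict mapping question_id -> dict of path -> bool (correct?)."""
    solved_on_all_paths = [
        qid for qid, per_path in results.items()
        if len(per_path) == 6 and all(per_path.values())
    ]
    return 100.0 * len(solved_on_all_paths) / len(results)

results = {
    "q1": {"S-I-T": True,  "S-T-I": True,  "I-S-T": True,
           "I-T-S": True,  "T-S-I": True,  "T-I-S": True},
    "q2": {"S-I-T": True,  "S-T-I": False, "I-S-T": True,
           "I-T-S": False, "T-S-I": True,  "T-I-S": True},
}
print(path_balance_score(results))  # 50.0: only q1 is solved on every path
```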
OMHBench-Connect (Entity Selection)
| Model | Type | Avg Accuracy | PBS | Path Range |
|---|---|---|---|---|
| Gemini 3 Flash | Proprietary | 78.3% | 32.2 | 60.2% - 98.4% |
| Gemini 2.5 Pro | Proprietary | 72.5% | 25.0 | 50.8% - 96.9% |
| Gemini 2.5 Flash | Proprietary | 53.6% | 4.7 | 21.9% - 85.9% |
| Qwen3-Omni 30B | Open-source | 46.8% | 2.3 | 16.0% - 77.0% |
| Most other open-source | Open-source | < 5% | ~0 | - |
OMHBench-Reasoning (Aggregation)
| Model | Type | Avg Accuracy | PBS | Path Range |
|---|---|---|---|---|
| Gemini 3 Flash | Proprietary | 49.4% | 8.6 | 40.0% - 58.8% |
| Gemini 2.5 Pro | Proprietary | 48.8% | 10.9 | 41.4% - 53.9% |
| Qwen3-Omni 30B | Open-source | 15.0% | 0.0 | 2.7% - 28.5% |
| Most other open-source | Open-source | ~0% | 0 | - |
Figure 6. Key trends: A large performance gap between proprietary and open-source models, difficulty with the speech modality, and high sensitivity to reasoning path changes are confirmed across all evaluated models.
Proprietary vs. Open-source Gap: Gemini 3 Flash (78.3%) vastly outperforms the best open-source model, Qwen3-Omni 30B (46.8%), on Connect; the gap widens further on Reasoning (49.4% vs. 15.0%)
Asymmetric Speech Grounding: Transitions to speech (I→S, T→S) are particularly challenging, while transitions from speech are relatively robust. For example, Qwen3-Omni scores 77.0% on S-T-I but only 16.0% on I-T-S
Path Sensitivity: Large accuracy swings across paths (up to 38% for Qwen3-Omni on Connect and 25.8% for Gemini 3 Flash on Reasoning) show that single-path evaluation cannot reliably characterize model capabilities
Operation Difficulty: Performance degrades progressively: Ranking (easiest) > Comparison > Proximity > Range (hardest), indicating MLLMs handle ordinal comparisons well but struggle with interval constraints
Cross-Domain Gap: Gemini 3 Flash shows up to 21.8% difference between Economics and Nutrition domains, indicating imperfect domain generalization
Prompting Ineffective: Advanced strategies (Self-Ask, Least-to-Most, Plan-and-Solve) yield no consistent improvements over standard chain-of-thought, suggesting asymmetric grounding reflects fundamental model limitations
Why It Matters
OMHBench is the first benchmark to simultaneously enforce balanced reasoning across text, image, and speech while eliminating modality shortcuts. Its key contributions are threefold:
Reliable Evaluation: By controlling all 6 reasoning paths with identical Q&A pairs, OMHBench prevents the path bias that inflated results on prior benchmarks by up to 18%
Diagnosing Asymmetric Grounding: The benchmark reveals a systematic weakness — models struggle specifically when grounding information into the speech modality, sometimes behaving as if speech input is absent entirely. This "asymmetric omni-modal grounding" is a fundamental limitation, not a prompting issue
Actionable Roadmap: The stepwise failure analysis and operation-level breakdowns pinpoint exactly where models fail (e.g., early-stage failures in weak models, cross-modal attribute grounding in stronger ones), providing concrete development directions for next-generation multimodal AI