
OMHBench: Benchmarking Balanced and Grounded Omni-Modal Multi-Hop Reasoning

ACL 2026 Findings
Seunghee Kim, Ingyu Bang, Seokgyu Jang, Changhyeon Kim, Sanghwan Bae, Jihun Choi, Richeng Xuan, Taeuk Kim

One-Line Summary

A 6,144-question benchmark that enforces balanced, three-hop reasoning across text, image, and speech modalities, revealing that even the best models exhibit asymmetric omni-modal grounding — particularly when transitioning information to the speech modality.

Comparison of OMU and CMR benchmarks
Figure 1. Limitations of existing benchmarks: OMU (left) lacks textual context and contains modality shortcuts, while CMR (right) excludes audio and has imbalanced reasoning paths. OMHBench addresses both of these limitations.

Background & Motivation

Multimodal large language models (MLLMs) now claim to process text, images, and audio simultaneously, yet two fundamental questions remain unanswered: (1) Can omni-modal understanding (OMU) benchmarks truly evaluate all three modalities if most questions are solvable without using each modality? (2) Can cross-modal reasoning (CMR) benchmarks reliably measure reasoning when they are dominated by a single reasoning path?

The authors systematically analyzed existing evaluation frameworks and uncovered two critical shortcomings:

Problem 1 — Modality Shortcuts in OMU Benchmarks: Approximately 70-80% of instances in existing OMU benchmarks can be solved without accessing specific modalities (e.g., without visual or audio input), allowing models to take shortcuts that bypass true omni-modal understanding.

Problem 2 — Path Imbalance in CMR Benchmarks: Existing cross-modal reasoning datasets exhibit severely imbalanced reasoning paths (e.g., MuMuQA contains only Image-to-Text instances, MMQA skews ~2:1 toward Image-to-Text). When researchers forcibly balanced the paths, model accuracy dropped by up to 18%, revealing that previously reported results were overestimated due to path bias.

These findings motivated the creation of OMHBench, which bridges OMU and CMR paradigms while enforcing three requirements: (1) no shortcut-prone evaluation via enforced multi-hop reasoning, (2) incorporation of all three modalities (text, image, speech), and (3) explicit control of reasoning paths for unbiased assessment.

Task formulation examples
Figure 2. OMHBench task examples: Answers must be derived through diverse reasoning paths such as image→text, image→text→audio, etc. Attributes are unique to each modality, but entities are shared across all three tables.

Benchmark Construction Pipeline

OMHBench construction pipeline
Figure 5. OMHBench construction pipeline: A balanced benchmark is created through a 4-stage process — from structured table triplets to fully diversified omni-modal questions.
Stage 1: Table Triplet Formation
Construct triplets of 3 tables (10 entities x 3 attributes each), sharing identical entities but distinct attributes. Data spans 4 real-world domains: Finance (23 companies, 15 financial attributes), Economics (18 countries, 18 economic indicators from World Bank), Climate (20 cities, 12 meteorological attributes), and Nutrition (24 foods, 19 nutritional attributes). Value ratios are constrained (max/min ≤ 30) for stable visualization.
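The triplet-formation stage can be sketched in a few lines. This is a structural illustration under stated assumptions: the helper names (`ratio_ok`, `make_triplet`) and the toy data are hypothetical, not the paper's pipeline code; only the shape (3 tables, shared entities, disjoint attributes, max/min ≤ 30 per value column) follows the description above.

```python
MAX_RATIO = 30  # value-ratio constraint for stable chart rendering

def ratio_ok(values):
    """Check that max/min over positive attribute values stays within bounds."""
    vals = [v for v in values if v > 0]
    return bool(vals) and max(vals) / min(vals) <= MAX_RATIO

def make_triplet(entities, attr_groups, data):
    """Build 3 tables sharing the same entities but carrying distinct attributes.

    attr_groups: 3 disjoint lists of 3 attribute names each.
    data: {(entity, attribute): value}
    """
    tables = []
    for attrs in attr_groups:
        table = {e: {a: data[(e, a)] for a in attrs} for e in entities}
        # Enforce the visualization constraint per attribute column.
        for a in attrs:
            assert ratio_ok([table[e][a] for e in entities]), f"{a} violates max/min <= 30"
        tables.append(table)
    return tables

entities = [f"Entity_{c}" for c in "ABCDEFGHIJ"]  # 10 anonymized entities
attr_groups = [[f"attr{i}_{j}" for j in range(3)] for i in range(3)]
data = {(e, a): 10 + 2 * k for k, (e, a) in
        enumerate((e, a) for e in entities for g in attr_groups for a in g)}
t1, t2, t3 = make_triplet(entities, attr_groups, data)
print(len(t1), len(t2), len(t3))  # 10 entities per table
```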
Stage 2: Multi-Hop QA Construction
Generate 3-hop reasoning chains using 8 deterministic operations: Lookup (retrieve attribute values), Comparison (filter by inequality), Ranking (select top/bottom entities), Range (select within value intervals), Proximity (find closest to reference), Retrieval (extract final values), Summation, and Mean. This yields 33 valid operation combinations and two subsets: OMHBench-Connect (3,072 instances, entity-selection focused) and OMHBench-Reasoning (3,072 instances, aggregation-focused). Construction is fully automated without generative AI.
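To make the composition of deterministic operations concrete, here is a minimal 3-hop chain using three of the eight operations. The operation names follow the paper; the implementations, table values, and chain (Ranking → Comparison → Retrieval) are illustrative assumptions.

```python
def ranking(tbl, top, k=1):
    """Select the top-k (or bottom-k) entities by attribute value."""
    order = sorted(tbl, key=tbl.get, reverse=top)
    return order[:k]

def comparison(tbl, entities, threshold):
    """Filter the given entities by an inequality on a second attribute."""
    return [e for e in entities if tbl[e] > threshold]

def retrieval(tbl, entities):
    """Extract the final attribute values for the selected entities."""
    return [tbl[e] for e in entities]

# Hop 1 (table 1): top-2 entities by attribute X
table1 = {"A": 12.0, "B": 30.0, "C": 7.5, "D": 21.0}
# Hop 2 (table 2): keep those whose attribute Y exceeds 5
table2 = {"A": 4.0, "B": 9.0, "C": 6.0, "D": 8.0}
# Hop 3 (table 3): read off attribute Z for the survivors
table3 = {"A": 100.0, "B": 200.0, "C": 300.0, "D": 400.0}

hop1 = ranking(table1, top=True, k=2)         # ['B', 'D']
hop2 = comparison(table2, hop1, threshold=5)  # ['B', 'D']
answer = retrieval(table3, hop2)
print(answer)  # [200.0, 400.0]
```

Because every operation is deterministic, the gold answer is fixed by construction, which is what allows the pipeline to run fully automatically without generative AI.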
Stage 3: Omni-Modal Context Generation
Each table is converted into one of three modalities. Image: Matplotlib/Seaborn charts with 10 chart types, 20 fonts, 20 color palettes. Text: Domain-specific scenarios (analyst reports, news articles, meeting minutes) generated by 3 LLMs (GPT-5.1, Grok-4, Claude Sonnet 4.5) for linguistic diversity. Speech: Kokoro-82M TTS with 22 speech types, 27 voice variations, and multi-speaker dialogue format.
Stage 4: Reasoning Path Diversification
By permuting modality assignments across 3 tables, each question is instantiated in all 3! = 6 reasoning paths (S-I-T, S-T-I, I-S-T, T-S-I, I-T-S, T-I-S). This yields 1,024 instances per path, preserving identical question-answer pairs while varying the required modality sequence — enabling controlled analysis of path sensitivity.
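Structurally, this diversification step is a permutation over modality-to-table assignments. The sketch below illustrates it with `itertools.permutations`; the function name and record layout are assumptions, but the 3! = 6 path labels match those listed above.

```python
from itertools import permutations

MODALITIES = ("T", "I", "S")  # text, image, speech

def diversify(question_id, hop_tables=("table1", "table2", "table3")):
    """Instantiate one question under all 3! = 6 modality-to-table assignments."""
    instances = []
    for assignment in permutations(MODALITIES):
        # e.g. "S-I-T": hop 1 rendered as speech, hop 2 as image, hop 3 as text
        path = "-".join(assignment)
        instances.append({
            "question_id": question_id,
            "path": path,
            "context": dict(zip(hop_tables, assignment)),
        })
    return instances

variants = diversify("q0001")
print(len(variants))  # 6
```

Since the underlying question-answer pair never changes, any accuracy difference between the six variants is attributable purely to the modality sequence.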

Quality Assurance: Entity names are anonymized with alphabetical codes to prevent parametric knowledge exploitation. QA-based validation and LLM-based table reconstruction achieve 100% consistency. Question rephrasing uses multiple LLMs, achieving higher lexical deviation than the PAWS dataset (0.32 vs. 0.13). TTS quality is verified with WER: 0.03, CER: 0.02, STOI: 99.2, and SI-SDR: 21.0.
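The WER figure reported for TTS verification is a standard word-level edit-distance metric; the generic computation can be sketched as below (this is not the paper's evaluation script, and the example sentences are invented).

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + insertions + deletions) / #reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via a rolling dynamic-programming row.
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dist[0] = dist[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dist[j] + 1,        # deletion
                      dist[j - 1] + 1,    # insertion
                      prev + (r != h))    # substitution (or match)
            prev, dist[j] = dist[j], cur
    return dist[-1] / max(len(ref), 1)

print(wer("the top entity is b", "the top entity is b"))   # 0.0
print(wer("the top entity is b", "the top entity was b"))  # 0.2
```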

Experimental Results

13 state-of-the-art models were evaluated: 6 proprietary (Gemini family) and 7 open-source (Qwen3-Omni, Phi-4 Multimodal, Qwen2.5-Omni, OmniVinci, MiniCPM-o, Omni-AutoThink). The paper introduces the Path Balance Score (PBS), which counts an instance as correct only when the model answers correctly across all 6 path variations.
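The PBS definition above can be expressed directly in code. A minimal sketch, assuming a per-question mapping from path label to correctness (the data layout is hypothetical; the all-6-paths criterion follows the paper):

```python
PATHS = ("S-I-T", "S-T-I", "I-S-T", "T-S-I", "I-T-S", "T-I-S")

def path_balance_score(results):
    """results: {question_id: {path: bool}} with 6 path variants per question.

    Returns the percentage of questions answered correctly under ALL 6 paths.
    """
    solved = sum(len(per_path) == 6 and all(per_path.values())
                 for per_path in results.values())
    return 100.0 * solved / len(results)

results = {
    "q1": {p: True for p in PATHS},               # correct on every path
    "q2": {p: (p != "T-I-S") for p in PATHS},     # fails one path variant
}
print(path_balance_score(results))  # 50.0
```

Because a single failed path variant zeroes out the whole instance, PBS is far stricter than average accuracy, which explains the large gaps between the two columns in the tables below.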

OMHBench-Connect (Entity Selection)

| Model | Type | Avg Accuracy | PBS | Path Range |
| --- | --- | --- | --- | --- |
| Gemini 3 Flash | Proprietary | 78.3% | 32.2 | 60.2% - 98.4% |
| Gemini 2.5 Pro | Proprietary | 72.5% | 25.0 | 50.8% - 96.9% |
| Gemini 2.5 Flash | Proprietary | 53.6% | 4.7 | 21.9% - 85.9% |
| Qwen3-Omni 30B | Open-source | 46.8% | 2.3 | 16.0% - 77.0% |
| Most other open-source | Open-source | < 5% | ~0 | - |

OMHBench-Reasoning (Aggregation)

| Model | Type | Avg Accuracy | PBS | Path Range |
| --- | --- | --- | --- | --- |
| Gemini 3 Flash | Proprietary | 49.4% | 8.6 | 40.0% - 58.8% |
| Gemini 2.5 Pro | Proprietary | 48.8% | 10.9 | 41.4% - 53.9% |
| Qwen3-Omni 30B | Open-source | 15.0% | 0.0 | 2.7% - 28.5% |
| Most other open-source | Open-source | ~0% | 0 | - |
Core trends
Figure 6. Key trends: A large performance gap between proprietary and open-source models, difficulty with the speech modality, and high sensitivity to reasoning path changes are confirmed across all evaluated models.

Why It Matters

OMHBench is the first benchmark to simultaneously enforce balanced reasoning across text, image, and speech while eliminating modality shortcuts. Its key contributions are threefold: it removes shortcut-prone evaluation through enforced multi-hop reasoning, it incorporates all three modalities (text, image, speech), and it explicitly controls reasoning paths for unbiased assessment.
