FCMR is a 2,199-instance benchmark for evaluating multimodal LLMs on cross-modal multi-hop reasoning across financial text reports, tables, and charts at three difficulty tiers, revealing that even the best model (Claude 3.5 Sonnet) achieves only 30.4% accuracy on the hardest tier requiring three-hop reasoning.
Real-world financial analysis demands synthesizing information across multiple modalities simultaneously -- reading text reports, verifying numbers in tables, and identifying trends in charts. MMQA, the leading cross-modal multi-hop reasoning benchmark, suffers from two critical limitations that undermine its reliability as an evaluation tool:
Data Contamination: GPT-4o scores 43.4% on the most challenging portion of MMQA without seeing the images at all (vs. 63.4% with them), suggesting the benchmark data may already be present in model pretraining corpora; MMQA's reliance on Wikipedia heightens this risk. On FCMR, by contrast, accuracy without charts drops to 14.71%, near the 12.28% random baseline, indicating the benchmark cannot be solved from memorized or text-only information.
Lack of Complex Queries: Only 0.8% of MMQA instances require genuine three-hop cross-modal reasoning involving all three modalities (text, table, image). This severely limits evaluation of models' ability to perform deep multi-step integration across data formats.
FCMR addresses both issues by grounding its data in financial filings (SEC EDGAR 10-K reports and WRDS Compustat financial statements from the top 101 companies by net sales), covering five years (2019--2023), and explicitly constructing Easy (1-hop), Medium (2-hop), and Hard (3-hop) difficulty tiers.
The Cross-Modal Multi-Hop Reasoning Generator (CMRGen) is an automated framework that constructs benchmark instances in three stages. Each instance presents three statements that may or may not be true, and the model must select every correct one (anywhere from zero to all three may be correct). The dataset totals 2,199 instances: 757 Easy, 728 Medium, and 714 Hard.
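For concreteness, here is a minimal sketch of what one instance and its scoring might look like in code; the field names and the exact-match metric are assumptions drawn from the description above, not the released schema.

```python
from dataclasses import dataclass

# Illustrative schema for a single FCMR instance. Field names are
# assumptions for exposition; the released data may use different keys.
@dataclass
class FCMRInstance:
    text_report: str       # passage from the textual financial report
    table: dict            # financial table (provided as JSON)
    chart_path: str        # path to the chart image
    statements: list[str]  # three candidate statements
    correct: set[int]      # indices of true statements: any subset of {0, 1, 2}

# Assumed scoring: an answer counts only if it matches the full set of
# correct statements exactly.
def exact_match(pred: set[int], gold: set[int]) -> bool:
    return pred == gold
```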
Models are evaluated in a zero-shot Chain-of-Thought setting with no task-specific tuning. Tables are provided in JSON format.
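A rough sketch of how such a zero-shot CoT prompt could be assembled (the wording and structure are illustrative assumptions, not the paper's exact template; the chart image would be attached to the request separately):

```python
import json

def build_prompt(text_report: str, table: dict, statements: list[str]) -> str:
    """Assemble a zero-shot CoT prompt; the chart image is sent alongside it."""
    parts = [
        "You are given a financial text report, a table, and a chart.",
        "Text report:\n" + text_report,
        "Table (JSON):\n" + json.dumps(table, indent=2),  # tables are given as JSON
        "Statements:",
        *(f"({i + 1}) {s}" for i, s in enumerate(statements)),
        "Let's think step by step, then output the numbers of all true "
        "statements (possibly none).",
    ]
    return "\n\n".join(parts)
```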
Accuracy by difficulty tier:

| Model | Easy | Medium | Hard | Average |
|---|---|---|---|---|
| Claude 3.5 Sonnet | 75.43% | 50.82% | 30.39% | 52.21% |
| GPT-4o | 64.20% | 43.70% | 24.37% | 44.09% |
| Gemini 1.5 Pro | 63.01% | 31.18% | 22.27% | 38.82% |
| Gemini 1.5 Flash | 57.33% | 26.65% | 13.43% | 32.80% |
| GPT-4o mini | 49.14% | 21.98% | 13.03% | 28.05% |
| Random Baseline | 12.20% | 12.91% | 12.28% | 12.46% |

Among open-source models, Llama 3.2 90B-Vision performed best, but it still trailed every proprietary model above.
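(Assuming answers are scored by exact match over the set of correct statements, a uniform random guess across the 2^3 = 8 possible answer sets succeeds 12.5% of the time, consistent with the ~12.2--12.9% baselines above.)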
Replacing each chart image with a data table extracted by DePlot, a chart-to-table conversion model, yields:

| Model | Easy | Medium | Hard |
|---|---|---|---|
| Claude 3.5 Sonnet + DePlot | 66.84% | 46.15% | 36.13% |
| GPT-4o + DePlot | 68.69% | 49.18% | 32.91% |
Notably, DePlot conversion improved Hard-level performance for both models (Claude: 30.39% → 36.13%; GPT-4o: 24.37% → 32.91%), suggesting that current MLLMs' visual chart interpretation remains imperfect.
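The conversion step itself can be reproduced with DePlot's public checkpoint on Hugging Face; the following is a minimal sketch following the model card's usage (the file path and generation settings are illustrative, and the paper's exact preprocessing is not specified):

```python
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Load the public DePlot checkpoint (a Pix2Struct model fine-tuned for
# chart-to-table conversion).
processor = Pix2StructProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

# Hypothetical path to one FCMR chart image.
image = Image.open("chart.png")

inputs = processor(
    images=image,
    text="Generate underlying data table of the figure below:",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=512)

# The decoded string is a linearized data table, which replaces the raw
# chart image in the MLLM's input.
print(processor.decode(outputs[0], skip_special_tokens=True))
```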
An error analysis attributing each failure to the modality where the reasoning broke down shows chart understanding dominating at every tier:

| Level | Text | Table | Chart | Total Errors |
|---|---|---|---|---|
| Easy | 4% | 21% | 75% | 24 |
| Medium | 16% | 19% | 65% | 31 |
| Hard | 14% | 32% | 54% | 41 |
Performance also varies markedly by chart type:

| Chart Type | Easy | Medium | Hard |
|---|---|---|---|
| Pie | 84.31% | N/A | N/A |
| Bar | 78.60% | 50.00% | 29.20% |
| Line | 74.89% | 52.70% | 39.22% |
| Scatter | 71.01% | 49.79% | 23.44% |
Finance is a high-stakes domain where accuracy is non-negotiable. For AI to meaningfully assist in financial analysis, it must reliably integrate and reason across heterogeneous data formats -- yet FCMR reveals that current state-of-the-art multimodal LLMs are far from this capability. The benchmark makes three lasting contributions: (1) it is robust against data contamination (near-random performance without charts), (2) it systematically evaluates genuine multi-hop cross-modal reasoning up to three hops, and (3) the CMRGen framework enables scalable, low-cost dataset generation at $0.004 per question, with potential extensibility to other domains such as law, medicine, and engineering.