
FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning

ACL 2025
Seunghee Kim, Changhyeon Kim, Taeuk Kim

One-Line Summary

FCMR is a 2,199-instance benchmark for evaluating multimodal LLMs on cross-modal multi-hop reasoning across financial text reports, tables, and charts at three difficulty tiers, revealing that even the best model (Claude 3.5 Sonnet) achieves only 30.4% accuracy on the hardest tier requiring three-hop reasoning.

FCMR vs existing benchmarks
Figure 1. Limitations of existing benchmarks (MMQA): only 0.8% of instances utilize all three modalities. FCMR addresses this issue by requiring genuine cross-modal reasoning.

Background & Motivation

Real-world financial analysis demands synthesizing information across multiple modalities simultaneously -- reading text reports, verifying numbers in tables, and identifying trends in charts. MMQA, the leading cross-modal multi-hop reasoning benchmark, suffers from two critical limitations that undermine its reliability as an evaluation tool:

Data Contamination: GPT-4o achieves 43.4% on the most challenging part of MMQA without relying on visual hints (vs. 63.4% with images), suggesting the benchmark data may already be present in model pretraining corpora. MMQA's reliance on Wikipedia further exacerbates this risk. On FCMR, performance without charts drops to 14.71% (near the 12.28% random baseline), confirming robustness against contamination.

Lack of Complex Queries: Only 0.8% of MMQA instances require genuine three-hop cross-modal reasoning involving all three modalities (text, table, image). This severely limits evaluation of models' ability to perform deep multi-step integration across data formats.

FCMR addresses both issues by grounding its data in financial filings (SEC EDGAR 10-K reports and WRDS Compustat financial statements from the top 101 companies by net sales), covering five years (2019--2023), and explicitly constructing Easy (1-hop), Medium (2-hop), and Hard (3-hop) difficulty tiers.

Hard-level example
Figure 2. Hard-level example: Requires 3-hop cross-modal reasoning -- identifying company information from text, verifying financial figures from tables, and analyzing trends from charts.

Proposed Method: CMRGen Pipeline

The Cross-Modal Multi-Hop Reasoning Generator (CMRGen) is an automated framework that constructs benchmark instances across three stages. Each instance presents three statements that may or may not be true, requiring the model to select all correct ones (0--3 correct answers possible). The dataset totals 2,199 instances: 757 Easy, 728 Medium, and 714 Hard.
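Because each instance asks for the exact subset of correct statements out of three, a random guesser chooses uniformly among 2^3 = 8 possible answer subsets, which explains the ~12.5% random baseline reported below. A minimal simulation confirming this:

```python
from itertools import product
import random

def random_baseline_accuracy(n_statements=3, trials=100_000, seed=0):
    """Empirically estimate exact-match accuracy of uniform random guessing
    over all true/false labelings of n_statements statements."""
    rng = random.Random(seed)
    labelings = list(product([False, True], repeat=n_statements))
    hits = 0
    for _ in range(trials):
        truth = rng.choice(labelings)   # hidden answer key for one instance
        guess = rng.choice(labelings)   # uniform random subset selection
        hits += (truth == guess)
    return hits / trials

acc = random_baseline_accuracy()
# Analytically 1 / 2**3 = 0.125, matching the ~12.2-12.9% random baseline.
```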

CMRGen framework
Figure 3. CMRGen dataset generation framework: Combines SEC EDGAR 10-K reports (text), WRDS Compustat financial statements (tables), and synthetic charts (images) to automatically generate questions at 3 difficulty levels.
Stage 1: Input Data Construction
Text: SEC EDGAR 10-K reports (Items 1, 2, 7, 7A, 8) from top 101 companies by net sales. Tables: WRDS Compustat Annual Simplified Financial Statements (70 columns after preprocessing) spanning 2019--2023. Charts: Four types (line, bar, scatter, pie) generated from table columns using matplotlib, seaborn, and plotly -- covering 98% of chart types found in actual 10-K filings.
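The chart-generation step above can be sketched with matplotlib, one of the three plotting libraries the pipeline uses. The column name and values here are illustrative, not taken from the actual Compustat data:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt

# Hypothetical slice of a Compustat-style table: net sales per fiscal year.
years = [2019, 2020, 2021, 2022, 2023]
net_sales = [260.2, 274.5, 365.8, 394.3, 383.3]  # illustrative values, $B

# Line charts are one of the four types (line, bar, scatter, pie) generated.
fig, ax = plt.subplots()
ax.plot(years, net_sales, marker="o")
ax.set_xlabel("Fiscal year")
ax.set_ylabel("Net sales ($B)")
ax.set_title("Net sales, 2019-2023")
fig.savefig("net_sales_line.png", dpi=150)
plt.close(fig)
```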
Stage 2: Statement Generation
Single-modal one-hop statements are created via GPT-4o-mini using five template types: Fact-Checking (FC), Conditional Threshold (CT), Arithmetic (AR), Trend (TR), and Ranking (RK). Multi-hop statements are composed by combining: 1-hop + 1-hop = 2-hop (Medium), and 2-hop + 1-hop = 3-hop (Hard), ensuring genuine cross-modal reasoning chains.
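The composition rule above (1-hop + 1-hop = 2-hop, 2-hop + 1-hop = 3-hop) can be sketched as chaining statements through a shared entity. The data structure and example statements here are hypothetical, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Statement:
    modalities: tuple   # which modalities the statement draws on
    text: str
    hops: int

def compose(a: Statement, b: Statement) -> Statement:
    """Chain two statements into one that requires both reasoning steps.
    Medium = 1-hop + 1-hop; Hard = 2-hop + 1-hop."""
    return Statement(
        modalities=a.modalities + b.modalities,
        text=f"{a.text}, and for that same company, {b.text}",
        hops=a.hops + b.hops,
    )

t  = Statement(("text",),  "the 10-K names semiconductors as the main segment", 1)
tb = Statement(("table",), "2022 net sales exceeded $50B", 1)
ch = Statement(("chart",), "the revenue line trends upward over 2019-2023", 1)

medium = compose(t, tb)       # 2-hop: text + table
hard   = compose(medium, ch)  # 3-hop: all three modalities
```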
Stage 3: Paraphrasing & Filtering
Two-stage lexical/syntactic paraphrasing using GPT-4o, followed by semantic verification with Claude 3.5 Sonnet. Quality metrics: Word Position Deviation = 0.2 and Lexical Deviation = 0.45, both surpassing established paraphrase benchmarks (MRPC, PAWS). Hard-level instances undergo additional human expert review. Cost: only $0.004 per question.
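The paraphrase-then-verify loop described above can be sketched as follows. The function names and retry logic are assumptions for illustration; in the actual pipeline the two stubs would be calls to GPT-4o (paraphraser) and Claude 3.5 Sonnet (verifier):

```python
def paraphrase_and_filter(statement, paraphrase_fn, verify_fn, max_tries=3):
    """Two-stage pipeline sketch: lexical then syntactic paraphrasing,
    keeping a candidate only if the verifier judges it semantically
    equivalent to the original statement."""
    for _ in range(max_tries):
        candidate = paraphrase_fn(statement, stage="lexical")
        candidate = paraphrase_fn(candidate, stage="syntactic")
        if verify_fn(statement, candidate):
            return candidate
    return statement  # fall back to the original if nothing verifies

# Stub functions for demonstration (real pipeline calls LLM APIs):
para = lambda s, stage: s.upper() if stage == "lexical" else s
verify = lambda a, b: a.lower() == b.lower()
out = paraphrase_and_filter("Net sales rose in 2022.", para, verify)
```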

Experimental Results

Models are evaluated in a zero-shot Chain-of-Thought setting with no task-specific tuning. Tables are provided in JSON format.
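A prompt for this setup might be assembled as below. The wording, field names, and example inputs are illustrative assumptions; only the JSON table serialization and the zero-shot chain-of-thought instruction follow the evaluation protocol described above:

```python
import json

def build_prompt(report_text, table_rows, chart_path, statements):
    """Assemble a zero-shot chain-of-thought prompt; the table is
    serialized as JSON, matching the evaluation setup."""
    table_json = json.dumps(table_rows, indent=2)
    numbered = "\n".join(f"({i + 1}) {s}" for i, s in enumerate(statements))
    return (
        "You are given a financial report excerpt, a table (JSON), and a chart.\n"
        f"Report:\n{report_text}\n\n"
        f"Table:\n{table_json}\n\n"
        f"Chart: see attached image ({chart_path})\n\n"
        "Statements:\n" + numbered + "\n\n"
        "Think step by step, then list every correct statement "
        "(zero or more of the three)."
    )

prompt = build_prompt(
    "Item 7: Management's discussion...",
    [{"fyear": 2022, "sale": 394.3}],
    "chart.png",
    ["Net sales grew in 2022.", "Net sales fell in 2022.", "Sales exceeded $400B."],
)
```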

Main MLLM Performance

| Model | Easy | Medium | Hard | Average |
|---|---|---|---|---|
| Claude 3.5 Sonnet | 75.43% | 50.82% | 30.39% | 52.21% |
| GPT-4o | 64.20% | 43.70% | 24.37% | 44.09% |
| Gemini 1.5 Pro | 63.01% | 31.18% | 22.27% | 38.82% |
| Gemini 1.5 Flash | 57.33% | 26.65% | 13.43% | 32.80% |
| GPT-4o mini | 49.14% | 21.98% | 13.03% | 28.05% |
| Random Baseline | 12.20% | 12.91% | 12.28% | 12.46% |

Llama 3.2 90B-Vision was the strongest open-source model but still trailed all proprietary models.

LLM + DePlot (Chart-to-Table Conversion)

| Model | Easy | Medium | Hard |
|---|---|---|---|
| Claude 3.5 Sonnet + DePlot | 66.84% | 46.15% | 36.13% |
| GPT-4o + DePlot | 68.69% | 49.18% | 32.91% |

Notably, DePlot conversion improved Hard-level performance for both models (Claude: 30.39% → 36.13%; GPT-4o: 24.37% → 32.91%), suggesting that current MLLMs' visual chart interpretation remains imperfect.

Error Analysis by Modality (Claude 3.5 Sonnet, 90 statements)

| Level | Text | Table | Chart | Total Errors |
|---|---|---|---|---|
| Easy | 4% | 21% | 75% | 24 |
| Medium | 16% | 19% | 65% | 31 |
| Hard | 14% | 32% | 54% | 41 |

Accuracy by Chart Type (Claude 3.5 Sonnet)

| Chart Type | Easy | Medium | Hard |
|---|---|---|---|
| Pie | 84.31% | N/A | N/A |
| Bar | 78.60% | 50.00% | 29.20% |
| Line | 74.89% | 52.70% | 39.22% |
| Scatter | 71.01% | 49.79% | 23.44% |
Fine-grained reasoning stage analysis
Figure 5. Fine-grained reasoning stage analysis: The Information Retrieval stage emerges as the primary failure point across all models.

Why It Matters

Finance is a high-stakes domain where accuracy is non-negotiable. For AI to meaningfully assist in financial analysis, it must reliably integrate and reason across heterogeneous data formats -- yet FCMR reveals that current state-of-the-art multimodal LLMs are far from this capability. The benchmark makes three lasting contributions: (1) it is robust against data contamination (near-random performance without charts), (2) it systematically evaluates genuine multi-hop cross-modal reasoning up to three hops, and (3) the CMRGen framework enables scalable, low-cost dataset generation at $0.004 per question, with potential extensibility to other domains such as law, medicine, and engineering.
