FCMR is a 2,199-instance benchmark for evaluating multimodal LLMs on cross-modal multi-hop reasoning across financial text reports, tables, and charts at three difficulty tiers, revealing that even the best model (Claude 3.5 Sonnet) achieves only 30.4% accuracy on the hardest tier requiring three-hop reasoning.
Real-world financial analysis demands synthesizing information across multiple modalities simultaneously -- reading text reports, verifying numbers in tables, and identifying trends in charts. MMQA, the leading cross-modal multi-hop reasoning benchmark, suffers from two critical limitations that undermine its reliability as an evaluation tool:
Data Contamination: GPT-4o scores 43.4% on the most challenging portion of MMQA without seeing the images at all (vs. 63.4% with them), suggesting the benchmark data may already be present in model pretraining corpora; MMQA's reliance on Wikipedia heightens this risk. On FCMR, by contrast, accuracy without charts drops to 14.71%, near the 12.28% random baseline, indicating the benchmark cannot be solved from memorized or text-only information.
Lack of Complex Queries: Only 0.8% of MMQA instances require genuine three-hop cross-modal reasoning involving all three modalities (text, table, image). This severely limits evaluation of models' ability to perform deep multi-step integration across data formats.
FCMR addresses both issues by grounding its data in financial filings (SEC EDGAR 10-K reports and WRDS Compustat financial statements from the top 101 companies by net sales), covering five years (2019--2023), and explicitly constructing Easy (1-hop), Medium (2-hop), and Hard (3-hop) difficulty tiers.
The Cross-Modal Multi-Hop Reasoning Generator (CMRGen) is an automated framework that constructs benchmark instances in three stages. Each instance presents three statements that may or may not be true, and the model must select every correct one (anywhere from zero to all three may be correct). The dataset totals 2,199 instances: 757 Easy, 728 Medium, and 714 Hard.
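For concreteness, here is a minimal sketch of what one instance and its scoring might look like in code; the field names and the exact-match metric are assumptions drawn from the description above, not the released schema.

```python
from dataclasses import dataclass

# Illustrative schema for a single FCMR instance. Field names are
# assumptions for exposition; the released data may use different keys.
@dataclass
class FCMRInstance:
    text_report: str       # passage from the textual financial report
    table: dict            # financial table (provided as JSON)
    chart_path: str        # path to the chart image
    statements: list[str]  # three candidate statements
    correct: set[int]      # indices of true statements: any subset of {0, 1, 2}

# Assumed scoring: an answer counts only if it matches the full set of
# correct statements exactly.
def exact_match(pred: set[int], gold: set[int]) -> bool:
    return pred == gold
```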
Models are evaluated in a zero-shot Chain-of-Thought setting with no task-specific tuning. Tables are provided in JSON format.
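A rough sketch of how such a zero-shot CoT prompt could be assembled (the wording and structure are illustrative assumptions, not the paper's exact template; the chart image would be attached to the request separately):

```python
import json

def build_prompt(text_report: str, table: dict, statements: list[str]) -> str:
    """Assemble a zero-shot CoT prompt; the chart image is sent alongside it."""
    parts = [
        "You are given a financial text report, a table, and a chart.",
        "Text report:\n" + text_report,
        "Table (JSON):\n" + json.dumps(table, indent=2),  # tables are given as JSON
        "Statements:",
        *(f"({i + 1}) {s}" for i, s in enumerate(statements)),
        "Let's think step by step, then output the numbers of all true "
        "statements (possibly none).",
    ]
    return "\n\n".join(parts)
```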
Accuracy by difficulty tier:

| Model | Easy | Medium | Hard | Average |
|---|---|---|---|---|
| Claude 3.5 Sonnet | 75.43% | 50.82% | 30.39% | 52.21% |
| GPT-4o | 64.20% | 43.70% | 24.37% | 44.09% |
| Gemini 1.5 Pro | 63.01% | 31.18% | 22.27% | 38.82% |
| Gemini 1.5 Flash | 57.33% | 26.65% | 13.43% | 32.80% |
| GPT-4o mini | 49.14% | 21.98% | 13.03% | 28.05% |
| Random Baseline | 12.20% | 12.91% | 12.28% | 12.46% |

Among open-source models, Llama 3.2 90B-Vision performed best, but it still trailed every proprietary model above.
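(Assuming answers are scored by exact match over the set of correct statements, a uniform random guess across the 2^3 = 8 possible answer sets succeeds 12.5% of the time, consistent with the ~12.2--12.9% baselines above.)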
Replacing each chart image with a data table extracted by DePlot, a chart-to-table conversion model, yields:

| Model | Easy | Medium | Hard |
|---|---|---|---|
| Claude 3.5 Sonnet + DePlot | 66.84% | 46.15% | 36.13% |
| GPT-4o + DePlot | 68.69% | 49.18% | 32.91% |
Notably, DePlot conversion improved Hard-level performance for both models (Claude: 30.39% → 36.13%; GPT-4o: 24.37% → 32.91%), suggesting that current MLLMs' visual chart interpretation remains imperfect.
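The conversion step itself can be reproduced with DePlot's public checkpoint on Hugging Face; the following is a minimal sketch following the model card's usage (the file path and generation settings are illustrative, and the paper's exact preprocessing is not specified):

```python
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

# Load the public DePlot checkpoint (a Pix2Struct model fine-tuned for
# chart-to-table conversion).
processor = Pix2StructProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

# Hypothetical path to one FCMR chart image.
image = Image.open("chart.png")

inputs = processor(
    images=image,
    text="Generate underlying data table of the figure below:",
    return_tensors="pt",
)
outputs = model.generate(**inputs, max_new_tokens=512)

# The decoded string is a linearized data table, which replaces the raw
# chart image in the MLLM's input.
print(processor.decode(outputs[0], skip_special_tokens=True))
```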
An error analysis attributing each failure to the modality where the reasoning broke down shows chart understanding dominating at every tier:

| Level | Text | Table | Chart | Total Errors |
|---|---|---|---|---|
| Easy | 4% | 21% | 75% | 24 |
| Medium | 16% | 19% | 65% | 31 |
| Hard | 14% | 32% | 54% | 41 |
Performance also varies markedly by chart type:

| Chart Type | Easy | Medium | Hard |
|---|---|---|---|
| Pie | 84.31% | N/A | N/A |
| Bar | 78.60% | 50.00% | 29.20% |
| Line | 74.89% | 52.70% | 39.22% |
| Scatter | 71.01% | 49.79% | 23.44% |
Finance is a high-stakes domain where accuracy is non-negotiable. For AI to meaningfully assist in financial analysis, it must reliably integrate and reason across heterogeneous data formats -- yet FCMR reveals that current state-of-the-art multimodal LLMs are far from this capability. The benchmark makes three lasting contributions: (1) it is robust against data contamination (near-random performance without charts), (2) it systematically evaluates genuine multi-hop cross-modal reasoning up to three hops, and (3) the CMRGen framework enables scalable, low-cost dataset generation at $0.004 per question, with potential extensibility to other domains such as law, medicine, and engineering.