
Memorization or Reasoning? Exploring the Idiom Understanding of LLMs

EMNLP 2025
Jisu Kim, Youngwoo Shin, Uiji Hwang, Jihun Choi, Richeng Xuan, Taeuk Kim

One-Line Summary

Introduces MIDAS, a multilingual idiom dataset of 64,660 expressions across 6 languages, and reveals that LLMs process idioms through a hybrid mechanism integrating memorization, compositional reasoning, and contextual cues -- not memorization alone.

Figure 1. The MIDAS framework: a systematic analysis of four factors in idiom understanding (memorization, compositionality, contextual cues, and reasoning).

Background & Motivation

Idioms such as "It's raining cats and dogs" carry figurative meanings that diverge sharply from their literal components. When an LLM correctly interprets such an expression, is it retrieving a memorized mapping from its training data, or reasoning about the expression from context and compositionality? Despite the centrality of this question to understanding LLM capabilities, prior idiom datasets have been limited to a single language, lacked verified meanings, or relied on GPT-generated definitions.

This work addresses three core research questions: (1) Do LLMs rely on memorization, reasoning, or both? (2) How do contextual cues and compositionality influence idiom processing? (3) What mechanisms underlie idiom understanding across typologically diverse languages? To answer these, the authors construct a large-scale, human-validated multilingual benchmark and design controlled experiments that isolate each factor.

Proposed Method

1. MIDAS Dataset Construction
MIDAS comprises 64,660 unique idioms (70,909 instances) collected from authoritative sources in six languages: English (9,766 from Wiktionary), German (10,097 from Duden), Chinese (11,851 from chinese-xinhua), Korean (11,316 from the Korean Standard Dictionary), Arabic (8,051 from the Dictionary of Arabic Idioms), and Turkish (13,579 from TDK). Each idiom is paired with a human-validated figurative meaning, refined through LLM-assisted processing and verified by native speakers.
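
To make the record structure concrete, the sketch below shows what a single MIDAS entry could look like in code. The field names and example values are illustrative assumptions, not the dataset's actual schema; only the set of fields (expression, figurative meaning, example sentence) follows the paper's description.

```python
# A minimal, hypothetical record type for one MIDAS entry.
# Field names are illustrative; see Figure 2 for an actual example.
from dataclasses import dataclass

@dataclass
class IdiomEntry:
    language: str  # one of: "en", "de", "zh", "ko", "ar", "tr"
    idiom: str     # surface form of the expression
    meaning: str   # human-validated figurative meaning
    example: str   # example sentence using the idiom
    source: str    # authoritative source, e.g. "Wiktionary"

entry = IdiomEntry(
    language="en",
    idiom="It's raining cats and dogs",
    meaning="It is raining very heavily.",
    example="Take an umbrella -- it's raining cats and dogs out there.",
    source="Wiktionary",
)
```
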
2. MCQ Evaluation for Idiom Understanding
Five-choice multiple-choice questions are constructed for each idiom: one correct figurative meaning and four distractors (two selected by surface-form similarity and two by meaning similarity, using multilingual-e5-large-instruct embeddings). Each question is administered three times with shuffled answer positions, and an idiom counts as understood only if all three trials are answered correctly, which controls for answer-position bias.
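
A minimal sketch of this pipeline is shown below, assuming the sentence-transformers library for the embedding model named above; the surface-similarity measure (difflib) and all function names are illustrative stand-ins, not the authors' implementation.

```python
# Sketch: select meaning-similar distractors by embedding similarity,
# surface-similar distractors by string similarity, and apply the
# all-three-trials correctness rule. Simplified relative to the paper.
import difflib
import random

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

def meaning_distractors(gold_meaning, pool, k=2):
    """The k pool meanings whose embeddings are closest to the gold meaning."""
    embs = model.encode([gold_meaning] + pool, normalize_embeddings=True)
    sims = embs[1:] @ embs[0]  # cosine similarity (vectors are normalized)
    return [m for _, m in sorted(zip(sims, pool), reverse=True)[:k]]

def surface_distractors(idiom, other_idioms_to_meanings, k=2):
    """Meanings of the k idioms whose surface form most resembles the target."""
    ranked = sorted(
        other_idioms_to_meanings,
        key=lambda s: difflib.SequenceMatcher(None, idiom, s).ratio(),
        reverse=True,
    )
    return [other_idioms_to_meanings[s] for s in ranked[:k]]

def answered_correctly(ask_mcq, idiom, options, gold, trials=3):
    """Count the idiom as understood only if every shuffled trial is correct."""
    for _ in range(trials):
        shuffled = random.sample(options, len(options))  # new answer order
        if ask_mcq(idiom, shuffled) != gold:
            return False
    return True
```
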
3. Memorization Assessment via Continuation Task
Models are given the beginning of an idiom and must predict its final word. Strict filtering ensures the target word is not trivially predictable: a minimum idiom length of 3-4 words, a context-target FastText similarity below 0.7, and no overlap across idioms. An idiom is classified as "memorized" if the correct first token of the target word appears among the model's top predictions (top-5 by log probability for open-source models; greedy temperature-0 generation for closed-source models).
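
For open-source models, the top-5 check can be implemented roughly as below with Hugging Face transformers. The model name is just an example, and treating the target word's first token (with a leading space) as the gold label is a simplifying assumption.

```python
# Sketch of the continuation-based memorization probe (open-source case):
# is the target word's first token among the model's top-5 next-token picks?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-32B"  # example; any causal LM works (32B needs big GPUs)
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto")

def is_memorized(prefix: str, target_word: str, top_k: int = 5) -> bool:
    inputs = tok(prefix, return_tensors="pt")
    with torch.no_grad():
        logits = lm(**inputs).logits[0, -1]  # logits for the next token
    top_ids = torch.topk(logits, top_k).indices.tolist()
    # Leading space so the token matches mid-sentence BPE tokenization.
    gold_id = tok(" " + target_word, add_special_tokens=False).input_ids[0]
    return gold_id in top_ids

# e.g. is_memorized("It's raining cats and", "dogs")
```
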
4. Compositionality Scoring
Each idiom receives a compositionality score (1-5 scale) via LLM prompting, measuring how inferable the figurative meaning is from individual component words. Score 1 = completely opaque; score 5 = highly inferable. The correlation between compositionality and MCQ success rate reveals whether models exploit compositional reasoning.
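
A sketch of how such a score could be elicited and related to MCQ outcomes is below. The prompt wording is paraphrased from the paper's description, `query_llm` is a placeholder for any chat API, and Spearman rank correlation is one reasonable way to operationalize the stated correlation analysis.

```python
# Sketch: 1-5 compositionality scoring via an LLM prompt, then correlating
# scores with per-idiom MCQ success. Prompt text and query_llm are placeholders.
from scipy.stats import spearmanr

PROMPT = (
    "On a scale of 1 to 5, how inferable is the figurative meaning of the "
    "idiom '{idiom}' from its individual words? 1 = completely opaque, "
    "5 = highly inferable. Answer with a single digit."
)

def compositionality_score(idiom, query_llm):
    return int(query_llm(PROMPT.format(idiom=idiom)).strip()[0])

def compositionality_effect(scores, solved):
    """scores: 1-5 per idiom; solved: 1 if all three MCQ trials were correct.
    A positive rank correlation suggests models exploit compositional reasoning."""
    rho, p_value = spearmanr(scores, solved)
    return rho, p_value
```
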
5. Context and Reasoning Analysis
MCQ accuracy is compared with and without example sentences to quantify contextual cue effects. Additionally, the reasoning-specialized model QwQ-32B is compared against its base model Qwen2.5-32B, and Chain-of-Thought (CoT) prompting effects are measured to isolate the contribution of explicit reasoning.
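
The context manipulation reduces to posing the same MCQ twice, with and without the example sentence, and measuring the accuracy gap. The sketch below shows one way to set that up; the prompt wording, `answer_mcq`, and the item fields are all hypothetical.

```python
# Sketch of the with/without-context comparison. Everything here is an
# illustrative harness, not the paper's exact prompts.
def mcq_prompt(idiom, options, example=None):
    ctx = f'Example usage: "{example}"\n' if example else ""
    choices = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return (
        f'{ctx}What does the idiom "{idiom}" mean?\n'
        f"{choices}\nAnswer with the letter only."
    )

def context_gain(items, answer_mcq):
    """Accuracy with example sentences minus accuracy without, in %p.
    Each item is assumed to carry .idiom, .options, .example, and .gold."""
    with_ctx = sum(answer_mcq(mcq_prompt(i.idiom, i.options, i.example)) == i.gold
                   for i in items)
    no_ctx = sum(answer_mcq(mcq_prompt(i.idiom, i.options)) == i.gold
                 for i in items)
    return 100 * (with_ctx - no_ctx) / len(items)
```
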
Figure 2. MIDAS data example: each idiom entry includes the expression, its figurative meaning, and an example sentence.

Experimental Results

MCQ Accuracy (%) Across Models and Languages

| Model | EN | DE | ZH | KO | AR | TR |
|---|---|---|---|---|---|---|
| Aya-Expanse-32B | 81.71 | 71.77 | 75.45 | 49.89 | 65.62 | 48.94 |
| Qwen2.5-32B | 83.71 | 73.94 | 93.35 | 51.39 | 71.25 | 40.31 |
| DeepSeek-V3 | 90.34 | 83.94 | 95.65 | 55.64 | 75.53 | 62.52 |
| GPT-4o | 91.13 | 88.08 | 91.44 | 72.72 | 72.85 | 71.82 |

Memorization Rates (%)

| Model | EN | DE | ZH | KO | AR | TR |
|---|---|---|---|---|---|---|
| Aya-Expanse-32B | 80.36 | 56.43 | 92.95 | 36.59 | 30.54 | 32.66 |
| Qwen2.5-32B | 73.72 | 45.27 | 77.97 | 31.61 | 29.87 | 22.28 |
| DeepSeek-V3 | 70.83 | 59.28 | 89.51 | 31.06 | 29.45 | 45.82 |
| GPT-4o | 67.18 | 49.53 | 70.26 | 26.13 | 27.25 | 35.08 |

Impact of Context: Accuracy (%) Without vs. With Example Sentences

| Model | Language | w/o Context | w/ Context | Gain (%p) |
|---|---|---|---|---|
| Aya-Expanse-32B | KO | 52.78 | 82.81 | +30.03 |
| Qwen2.5-32B | TR | 37.34 | 69.21 | +31.87 |
| Qwen2.5-32B | KO | 51.46 | 82.14 | +30.68 |
| DeepSeek-V3 | KO | 57.93 | 83.73 | +25.80 |
| DeepSeek-V3 | TR | 63.06 | 88.45 | +25.39 |
| GPT-4o | TR | 71.24 | 90.72 | +19.48 |
| GPT-4o | KO | 75.93 | 92.42 | +16.49 |
| GPT-4o | EN | 91.53 | 95.22 | +3.69 |
| DeepSeek-V3 | ZH | 95.44 | 95.87 | +0.43 |
Figure 3. Accuracy change when context is provided: the effect is dramatically large for the low-resource languages (Korean, Turkish).
Figure 4. Reasoning model (QwQ-32B) vs. base model (Qwen2.5-32B): explicit reasoning improves idiom understanding.

Key Findings

  • Hybrid Processing: LLMs process idioms through a hybrid mechanism that integrates memorization, compositional reasoning, and contextual cues -- not pure memorization alone. All three factors make statistically significant contributions across all model-language combinations.
  • Memorization Matters but Is Not Sufficient: Memorized idioms achieve 3.82-15.83%p higher accuracy (all differences statistically significant), with the largest gap for Qwen2.5 on English (+15.83%p). However, even unmemorized idioms are frequently answered correctly, confirming the role of reasoning.
  • Compositionality Enables Reasoning: Correctly answered idioms have 1.63-44.00% higher compositionality scores (Mann-Whitney U, all significant). Qwen2.5 shows the strongest effect: +44.00% in Arabic and +42.86% in English, indicating heavy reliance on compositional inference.
  • Context Is Critical for Low-Resource Languages: Providing example sentences yields 25-32%p gains for Korean and Turkish, compared to only 0.4-4.7%p for Chinese with the top models. Context compensates for lower memorization in underrepresented languages.
  • Reasoning Models Outperform Consistently: QwQ-32B's thinking mode outperforms the base model across all six languages. However, CoT prompting on the base Qwen2.5 model is a double-edged sword: it helps low-resource languages (+2 to +5%p) but hurts high-resource ones (-1 to -4%p).
  • Downstream Impact: Providing idiom meanings to GPT-4o as an evaluator increases alignment with human judgment from 0.118 to 0.739 (Kendall's tau) for sentence generation, confirming that idiom knowledge is critical for practical NLP tasks. A minimal sketch of the statistical tests behind these findings follows this list.
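
As referenced above, here is a minimal sketch of the two statistical tests the findings lean on, using SciPy. The variable names are illustrative; only the choice of tests (Mann-Whitney U, Kendall's tau) comes from the paper.

```python
# Sketch of the significance tests behind the findings above.
from scipy.stats import kendalltau, mannwhitneyu

def compositionality_gap(solved_scores, unsolved_scores):
    """Are correctly answered idioms significantly more compositional?"""
    u_stat, p_value = mannwhitneyu(solved_scores, unsolved_scores,
                                   alternative="greater")
    return u_stat, p_value  # small p -> solved idioms score higher

def judge_alignment(model_ratings, human_ratings):
    """Rank agreement between an LLM judge and human judgments
    (reported to rise from 0.118 to 0.739 once idiom meanings are supplied)."""
    tau, p_value = kendalltau(model_ratings, human_ratings)
    return tau, p_value
```
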

Why It Matters

Idiom understanding is one of the most revealing tests of whether LLMs genuinely comprehend language or merely pattern-match. This study provides the first large-scale, controlled, multilingual evidence that LLMs employ a hybrid mechanism combining memorization, compositional reasoning, and contextual inference for idiom processing. The finding that unmemorized idioms can still be understood through reasoning demonstrates genuine linguistic capability beyond rote retrieval.

Practically, the dramatic performance gaps between high-resource languages (EN/ZH: 91-96%) and low-resource ones (KO/TR: 40-73%) highlight an urgent need for better multilingual idiom resources. The MIDAS dataset itself -- 64,660 human-validated idioms across 6 typologically diverse languages -- fills a critical gap and enables future research on figurative language understanding. The downstream experiments further show that explicit idiom knowledge can substantially improve real-world tasks like translation and sentence generation.
