
HYU at SemEval-2022 Task 2: Effective Idiomaticity Detection with Consideration at Different Levels of Contextualization

International Workshop on Semantic Evaluation (SemEval 2022) at NAACL 2022
Youngju Joung, Taeuk Kim

One-Line Summary

A unified framework for idiomaticity detection that leverages four features computed at different levels of contextualization -- both inter-sentence and inner-sentence context -- to determine whether a multi-word expression (MWE) such as a two-noun compound is used idiomatically or literally, achieving strong cross-lingual generalization across English, Portuguese, and Galician.

Figure 1. Proposed framework with four features for idiomaticity detection. Features 1 and 2 (left) are based on surrounding inter-sentence context, while Features 3 and 4 (right) are derived from inner-sentence context inspired by metaphor identification theories.

Background & Motivation

Multi-word expressions (MWEs) are sequences of two or more words that exhibit strong collocation, co-occurring far more often than chance would predict. They enrich the expressiveness of a language by allowing diverse interpretations depending on context. For instance, the expression wet blanket can be interpreted either compositionally ("a piece of cloth soaked in liquid") or idiomatically ("a person who spoils the mood"). Detecting whether an MWE is used idiomatically is a challenging NLP problem because the same expression can have different meanings depending on context, and most current NLP models focus chiefly on capturing compositionality.

SemEval-2022 Task 2 focuses on classifying two-noun compounds into idiomatic and non-idiomatic usage under two configurations: a zero-shot setting where the model is evaluated on MWEs never seen during training, and a one-shot setting where the model is exposed to one idiomatic and one non-idiomatic example per MWE during training.

Key Challenge: Prior work (Tayyar Madabushi et al., 2021) showed that simply concatenating three sentences (previous, target, next) in order is generally unhelpful for idiomaticity detection. This naive approach roughly triples the input sequence length, making it harder for the encoder to distinguish the target sentence from its surrounding context, and can actually degrade performance. A more sophisticated approach to exploiting context is needed -- one that emphasizes the target sentence while still leveraging surrounding information.

Proposed Method: Multi-Level Contextualization Framework

The framework computes four features from a Transformer encoder (XLM-RoBERTa base), each capturing different aspects of contextualization. Each feature is derived from a [CLS] embedding (v[CLS]) and an MWE embedding (vMWE, the average of subword representations constituting the target MWE) extracted from the encoder's last layer. The four features are concatenated and passed through a linear classifier for the final idiomatic/non-idiomatic prediction:
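As a minimal sketch of the final prediction step -- using NumPy arrays as stand-ins for real encoder outputs, with XLM-R base's 768-dimensional hidden size and illustrative random weights -- the four features are concatenated and passed through a linear classifier:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 768  # XLM-R base hidden size

# one vector per contextualization level: f_prev, f_next, f_ctx, f_mwe
features = [rng.standard_normal(hidden) for _ in range(4)]

# linear classifier over the concatenated features -> 2 logits
# (idiomatic vs. non-idiomatic); W and b are illustrative random weights
W = 0.01 * rng.standard_normal((2, 4 * hidden))
b = np.zeros(2)
logits = W @ np.concatenate(features) + b  # shape (2,)
```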

1. Feature 1: Previous + Target Context (fprev)
The target sentence is concatenated with its previous sentence and fed into the encoder. Trainable segment embeddings are applied to distinguish MWE tokens (segment=1) from other tokens (segment=0), providing a positional clue for the MWE location. The MWE is also repeated at the tail of the input sequence. A linear transformation of the concatenation of v[CLS] and vMWE produces the feature.
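A rough sketch of this feature computation, with random arrays standing in for real encoder states and hypothetical MWE token positions:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, seq_len = 768, 24

# stand-in for the encoder's last-layer states over
# "[CLS] previous sentence [SEP] target sentence [SEP] MWE"
states = rng.standard_normal((seq_len, hidden))
mwe_positions = [14, 15]  # hypothetical subword indices of the MWE

# segment ids: 1 on MWE tokens, 0 elsewhere (fed through a trainable
# segment embedding in the real model to mark the MWE's location)
segment_ids = np.zeros(seq_len, dtype=int)
segment_ids[mwe_positions] = 1

v_cls = states[0]                           # [CLS] embedding
v_mwe = states[mwe_positions].mean(axis=0)  # average of MWE subword states

# f_prev = W [v_cls ; v_mwe]: a linear transform of the concatenation
W = 0.01 * rng.standard_normal((hidden, 2 * hidden))
f_prev = W @ np.concatenate([v_cls, v_mwe])
```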
2. Feature 2: Target + Next Context (fnext)
Same procedure as Feature 1, but the target sentence is paired with its next sentence instead. By splitting the context into two separate chunks rather than concatenating all three sentences, the target sentence is naturally emphasized (it appears in both chunks) while the encoder is not overwhelmed by excessively long inputs.
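Illustratively, the two inputs might be assembled as follows; the example sentences and the `</s></s>` separator (mimicking XLM-R's sentence-pair format) are assumptions, not the paper's exact preprocessing:

```python
prev = "The party had been lively all evening."
target = "Then Sam arrived and acted like a wet blanket."
nxt = "Everyone drifted home soon after."
mwe = "wet blanket"

# two chunks instead of one long three-sentence input: the target
# sentence appears in both, and the MWE is repeated at the tail of each
chunk_prev = f"{prev} </s></s> {target} </s></s> {mwe}"  # input for f_prev
chunk_next = f"{target} </s></s> {nxt} </s></s> {mwe}"   # input for f_next
```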
3. Feature 3: Context-Exclusive Representation (fctx)
Inspired by Selectional Preference Violation (SPV) theory from metaphor detection: the target sentence is fed to the encoder with the MWE tokens replaced by [MASK]. This produces a representation of the inner-sentence context independent of the MWE itself. Unlike prior SPV implementations that compute v[CLS] and vMWE from the same input (causing them to be intertwined by attention), masking the MWE ensures truly separate context semantics.
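A minimal sketch of the masking step (`<mask>` is XLM-R's mask token; the tokenization and positions here are simplified for illustration):

```python
def mask_mwe(tokens, mwe_positions, mask_token="<mask>"):
    """Replace the MWE's subword tokens so the encoder sees only the context."""
    mwe_set = set(mwe_positions)
    return [mask_token if i in mwe_set else tok
            for i, tok in enumerate(tokens)]

tokens = ["He", "is", "such", "a", "wet", "blanket", "at", "parties"]
masked = mask_mwe(tokens, [4, 5])
# -> ['He', 'is', 'such', 'a', '<mask>', '<mask>', 'at', 'parties']
```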
4. Feature 4: MWE-Exclusive Representation (fmwe)
Inspired by the Metaphor Identification Procedure (MIP): only the MWE itself (removed from its context) is presented as input to the encoder. This captures the static, context-free semantics of the expression. When the difference between this static representation and the contextualized one (from Features 1-2) is large, it signals idiomatic usage -- mirroring the MIP principle that a word is metaphorical when its contextualized meaning diverges from its literal meaning.
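The MIP intuition can be sketched as a similarity check between the context-free and contextualized MWE vectors (random stand-ins here; in the actual model this contrast is learned implicitly by the classifier rather than thresholded explicitly):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
v_static = rng.standard_normal(768)   # MWE encoded alone (Feature 4)
v_context = rng.standard_normal(768)  # MWE encoded inside its sentence

# a large divergence between static and contextual meaning suggests
# idiomatic (metaphorical) usage under MIP
divergence = 1.0 - cosine(v_static, v_context)
```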

Design Insight: The MWE tokens are copied from the target sentence as-is, preserving their inflected form, rather than taken from the expression's dictionary (lemma) form. This keeps the morphology consistent between the contextualized and isolated representations, which is important for a fair comparison.

Experimental Results

Evaluated on SemEval-2022 Task 2 Subtask A (idiomaticity detection for two-noun compounds) across three languages: English, Portuguese, and Galician. Galician is absent from the training data, making it a true cross-lingual transfer test. The model uses XLM-R (base) with a maximum sequence length of 300, the AdamW optimizer (lr=3e-5), batch size 16, and 10 training epochs. Five model instances are trained with different random seeds, and the checkpoint with the best dev macro F1 is selected.
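The training setup above, as a configuration sketch; the dev scores used for checkpoint selection are made-up placeholders, only the hyperparameters come from the paper:

```python
# hyperparameters as reported in the paper
config = dict(encoder="xlm-roberta-base", max_seq_len=300,
              optimizer="AdamW", lr=3e-5, batch_size=16, epochs=10)

# five runs with different random seeds; keep the run whose checkpoint
# achieves the best macro F1 on the dev set (scores here are illustrative)
dev_macro_f1 = {0: 0.71, 1: 0.73, 2: 0.70, 3: 0.72, 4: 0.69}
best_seed = max(dev_macro_f1, key=dev_macro_f1.get)  # -> 1
```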

Model / Setting                English   Portuguese   Galician   Overall
Baseline (BERT), zero-shot       70.70        68.03      50.65     65.40
Baseline (XLM-R), zero-shot      72.29        65.68      46.16     63.21
Ours (submitted), zero-shot      76.42        72.82      62.92     72.27
Baseline (BERT), one-shot        88.62        86.37      81.62     86.46
Baseline (XLM-R), one-shot       88.45        85.03      84.02     86.56
Ours (submitted), one-shot       91.59        84.57      82.87     87.50
Ours (post-eval), one-shot       92.29        88.05      87.10     89.96

The ablation study compared six variations: (A) no context, (B) naive three-sentence concatenation, (C) removing segment embeddings, (D) not repeating the MWE at the sequence tail, (E) recovering (unmasking) the MWE in the context-exclusive feature, and (F) removing the MWE-exclusive feature.

Why It Matters

Understanding idiomatic language is essential for NLP systems to properly process figurative expressions, which are pervasive in everyday communication. This work shows that context helps idiomaticity detection when it is structured carefully -- split into target-centered chunks at the inter-sentence level and contrasted with MWE-free and context-free representations at the inner-sentence level -- and that the resulting model transfers across languages, including to Galician, which is unseen during training.

Tags: Representation Learning, Multilingual