
IDS at SemEval-2020 Task 10: Does Pre-trained Language Model Know What to Emphasize?

International Workshop on Semantic Evaluation (SemEval 2020) at COLING 2020
Jaeyoul Shin, Taeuk Kim, Sang-goo Lee

One-Line Summary

A zero-shot emphasis selection method that exploits self-attention distributions of pre-trained language models (PLMs) to identify words deserving emphasis in visual media text, achieving a Ranking Score of 0.6898 on the SemEval-2020 Task 10 validation set without any task-specific training.

Figure 1. Sample attention map of the sentence "In honor of the brave", where each row represents the attention distribution of the corresponding word over other words.

Background & Motivation

In visual communication such as social media posts, posters, flyers, and advertisements, text emphasis is crucial for conveying the author's intent and facilitating comprehension. When designing visual media, content creators must decide which words to make bold, italic, larger, or differently colored — choices that dramatically affect how the message is perceived by the audience. An automatic system that recommends which words to emphasize could significantly accelerate the creation of visual media content and help non-expert designers produce more effective materials.

SemEval-2020 Task 10 formalizes this as the emphasis selection problem: given short English sentences from Adobe Spark (e.g., "In honor of the brave"), predict the correct ranking of words based on emphasis frequency annotations from nine human annotators. Each annotator independently selects up to four words they would emphasize in a visual context, and the emphasis frequency of a word is defined as the proportion of annotators who selected it. The task thus requires modeling a fundamentally subjective phenomenon — what humans collectively consider "important" in visual text.
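Concretely, a word's emphasis frequency is just the fraction of the nine annotators who selected it, and the task's gold ranking orders words by that fraction. A minimal sketch with a hypothetical `emphasis_frequency` helper and made-up annotations (not the task's official tooling):

```python
from collections import Counter

def emphasis_frequency(tokens, annotations, n_annotators=9):
    """Compute e_freq: the fraction of annotators who selected each word.

    `annotations` is one selection per annotator, each a set of token
    indices (at most four per annotator, per the task guidelines).
    """
    counts = Counter(i for selection in annotations for i in selection)
    return [counts.get(i, 0) / n_annotators for i in range(len(tokens))]

tokens = ["In", "honor", "of", "the", "brave"]
# Nine hypothetical annotators; most pick "honor" and "brave".
annotations = [{1, 4}] * 6 + [{4}] * 2 + [{0, 1}]
e_freq = emphasis_frequency(tokens, annotations)          # [1/9, 7/9, 0, 0, 8/9]
ranking = sorted(range(len(tokens)), key=lambda i: -e_freq[i])
```

The predicted ranking is then compared against this gold ordering rather than against any single annotator's choice.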

Core Insight: Recent studies have shown that Transformer-based PLMs such as BERT encode rich linguistic knowledge in their self-attention distributions — for example, certain attention heads can parse dependency trees (Clark et al., 2019) or induce constituency structures (Kim et al., 2020). This paper hypothesizes that some attention heads are naturally specialized for emphasis selection, making it possible to identify important words in a fully zero-shot manner, without any supervised training or gold-standard annotations. This is particularly compelling because emphasis selection is inherently subjective: rather than learning from potentially noisy labels, the method leverages the distributional semantics already captured during pre-training.

Unlike prior approaches to keyword extraction or keyphrase generation — which typically rely on statistical measures (TF-IDF), graph-based algorithms (TextRank), or supervised neural models — this work takes a fundamentally different path by treating attention weights as a direct proxy for word importance. The key question driving this research is: Does the way a PLM internally "attends" to words during language modeling correlate with human judgments of emphasis?

Proposed Method: Attention-Based Emphasis Selection

The method derives emphasis frequency (e_freq) for each word from PLM attention maps. Given a sentence fed to a PLM, an attention map g(i,j) is extracted for the j-th attention head on the i-th layer. Since most PLMs use subword tokenization (e.g., WordPiece for BERT), the subword-level attention maps must first be converted to word-level by averaging the attention weights of subword tokens belonging to the same word. Three strategies are then proposed for converting attention maps into emphasis scores, followed by a configuration search and an ensembling step:

1. Words2Target
Computes the emphasis frequency as the average attention weight that all words (including [CLS] and [SEP]) pay to the target word — i.e., the column average of the attention map. This measures how influential the target word is when constructing other words' hidden representations for the next layer. Intuitively, a word that many other words attend to strongly is likely semantically central to the sentence. This is the most broadly applicable method, working with both encoder (BERT) and decoder (GPT-2) models.

2. CLS2Target
Uses the attention weight from the [CLS] token to the target word. Since [CLS] is specifically designed to encode the sentence-level representation during pre-training (via next sentence prediction in BERT), its attention distribution can be interpreted as a relevance weighting of words for sentence-level understanding. Not applicable to GPT-2, which has no [CLS] token.

3. SEP2Target
Uses the attention weight from the [SEP] token to the target word as an alternative sentence-level signal for word importance. Clark et al. (2019) observed that [SEP] tokens often serve as a "no-op" attention target, but this method tests whether the reverse direction — [SEP] attending to content words — carries meaningful importance signals. Not applicable to GPT-2.

4. Exhaustive Configuration Search
For each PLM, all possible (layer, head, method) configurations are exhaustively evaluated — e.g., 12 × 12 × 3 = 432 for BERT-base and 24 × 16 × 3 = 1,152 for BERT-large. The configuration with the highest Ranking Score on the validation set is selected. This brute-force search is computationally feasible because inference on the short sentences in the dataset is fast, and no gradient computation or parameter updates are needed.

5. Top-K Ensemble
The final prediction is generated by ensembling the top-5 best-performing (model, layer, head, method) configurations. The e_freq predictions from each configuration are averaged to produce a smoothed emphasis ranking. This simple averaging strategy combines complementary signals from different models and attention heads, reducing variance and improving overall accuracy.
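Assuming a word-level attention map `A` where `A[i, j]` is the weight word `i` pays to word `j` (with `[CLS]` first and `[SEP]` last, as in BERT), the three scoring strategies and the ensemble averaging reduce to a few lines of NumPy. This is an illustrative sketch, not the authors' released code:

```python
import numpy as np

def words2target(A):
    # Column average: how much every token (incl. [CLS]/[SEP])
    # attends to each target word.
    return A.mean(axis=0)

def cls2target(A, cls_index=0):
    # The attention distribution of the [CLS] token over all words.
    return A[cls_index]

def sep2target(A, sep_index=-1):
    # The attention distribution of the [SEP] token over all words.
    return A[sep_index]

def ensemble(score_lists):
    # Top-K ensemble: average e_freq predictions across configurations.
    return np.mean(score_lists, axis=0)

# Toy 3x3 word-level attention map: row i is word i's distribution.
A = np.array([[0.2, 0.5, 0.3],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])
scores = words2target(A)   # column means: [0.2, 0.4667, 0.3333]
```

In the full method these functions would be applied to every (layer, head) attention map of each PLM, and the top-5 score vectors averaged with `ensemble`.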

A critical implementation detail is the subword aggregation step: when a word is split into multiple subword tokens (e.g., "emphasize" → "em", "##pha", "##size"), the attention weights of all subword tokens are averaged to produce a single word-level score. This ensures that the method operates at the word level, matching the granularity of the emphasis annotations.
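The aggregation step can be sketched as follows, assuming a token-to-word mapping in the style of the `word_ids()` method of HuggingFace fast tokenizers, where `None` marks special tokens. The helper is hypothetical, not the paper's implementation:

```python
import numpy as np

def subwords_to_words(token_scores, word_ids):
    """Average subword-level scores into word-level scores.

    `word_ids[t]` is the word index of token t, or None for special
    tokens such as [CLS]/[SEP], which are dropped from the output.
    """
    n_words = max(w for w in word_ids if w is not None) + 1
    totals = np.zeros(n_words)
    counts = np.zeros(n_words)
    for t, w in enumerate(word_ids):
        if w is None:
            continue
        totals[w] += token_scores[t]
        counts[w] += 1
    return totals / counts

# "we emphasize" -> [CLS] we em ##pha ##size [SEP]:
# the three subword scores of "emphasize" are averaged.
word_ids = [None, 0, 1, 1, 1, None]
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.3, 0.1])
word_scores = subwords_to_words(scores, word_ids)   # [0.2, 0.4]
```

The same averaging is applied along both rows and columns of the attention map before any of the three scoring strategies are computed.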

Experimental Results

The method was evaluated on the SemEval-2020 Task 10 dataset, which consists of short English sentences from Adobe Spark (flyers, posters, ads, motivational memes) with emphasis annotations from nine annotators. The dataset contains 3,000 instances split into training (70%), validation (10%), and test (20%) sets. Performance is measured by Match_m (the overlap of the top-m emphasized words between prediction and gold standard) and the Ranking Score (the average of Match_1 through Match_4). Seven PLMs were evaluated: BERT-base-uncased, BERT-large-uncased, DistilBERT-base-uncased, DistilBERT-base-multilingual, RoBERTa-base, XLNet-base, and GPT-2.
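A sketch of the evaluation, under the assumption that per-sentence Match_m is the overlap between the top-m gold and top-m predicted words normalized by min(m, sentence length); consult the official task scorer for the exact definition:

```python
def match_m(gold_freq, pred_scores, m):
    """Per-sentence Match_m: overlap of top-m gold and predicted words."""
    top = lambda s: set(sorted(range(len(s)), key=lambda i: -s[i])[:m])
    denom = min(m, len(gold_freq))
    return len(top(gold_freq) & top(pred_scores)) / denom

def ranking_score(gold_freq, pred_scores):
    # Ranking Score: average of Match_1 through Match_4.
    return sum(match_m(gold_freq, pred_scores, m) for m in (1, 2, 3, 4)) / 4

# Hypothetical gold e_freq values and predicted attention-based scores.
gold = [0.1, 0.8, 0.0, 0.6, 0.3]
pred = [0.2, 0.9, 0.1, 0.3, 0.5]
rs = ranking_score(gold, pred)   # 0.875 for this toy example
```

On the full dataset these per-sentence scores are averaged over all instances.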

| Model | Method | Match_1 | Match_2 | Match_3 | Match_4 | Ranking (dev) | Ranking (test) |
|---|---|---|---|---|---|---|---|
| Random | - | 0.173 | 0.309 | 0.375 | 0.452 | 0.327 | 0.318 |
| TF-IDF | - | 0.306 | 0.462 | 0.615 | 0.676 | 0.515 | 0.518 |
| BERT-base-uncased | Words2Target | 0.431 | 0.625 | 0.725 | 0.765 | 0.637 | 0.625 |
| BERT-large-uncased | Words2Target | 0.449 | 0.623 | 0.736 | 0.760 | 0.642 | 0.629 |
| DistilBERT-base-uncased | Words2Target | 0.454 | 0.619 | 0.726 | 0.768 | 0.642 | 0.629 |
| DistilBERT-base-multi. | Words2Target | 0.436 | 0.626 | 0.714 | 0.761 | 0.634 | 0.620 |
| RoBERTa-base | CLS2Target | 0.441 | 0.589 | 0.688 | 0.715 | 0.608 | - |
| GPT-2 | Words2Target | 0.225 | 0.435 | 0.569 | 0.625 | 0.463 | - |
| Ensemble (Top-5) | - | 0.485 | 0.679 | 0.780 | 0.815 | 0.690 | 0.666 |
| Supervised Baseline | - | 0.592 | 0.752 | 0.804 | 0.822 | 0.742 | 0.750 |

Why It Matters

This work provides concrete evidence that pre-trained language models implicitly learn word importance through their self-attention mechanisms during pre-training. The fully zero-shot approach eliminates the need for expensive, subjective emphasis annotations, making it practical for automated visual media creation tools — a significant advantage in real-world applications where emphasis judgments are inherently subjective and costly to collect at scale.

From a scientific perspective, the discovery of specialized attention heads deepens our understanding of what linguistic knowledge PLMs encode. The finding that certain heads are naturally "tuned" for emphasis selection — without ever being explicitly trained on emphasis data — raises intriguing questions about the emergence of pragmatic knowledge during language model pre-training. It suggests that the distributional patterns in large text corpora encode not just syntactic and semantic information, but also aspects of communicative importance.

Practically, the method's simplicity (requiring only a single forward pass and no gradient computation) makes it suitable for real-time deployment in content creation tools. The ensemble approach offers a principled way to improve accuracy when slightly higher latency is acceptable. This work also opens the door to probing PLMs for other word-level importance phenomena — such as prosodic stress prediction, summarization salience estimation, or information structure analysis — using the same attention-based framework.
