
Multilingual Chart-based Constituency Parse Extraction from Pre-trained Language Models

EMNLP 2021 Findings
Taeuk Kim, Bowen Li, Sang-goo Lee

One-Line Summary

A chart-based constituency parse extraction (CPE) framework that induces non-trivial parse trees from multilingual pre-trained language models across 9 typologically diverse languages in a fully language-agnostic manner, identifying universal attention heads that are consistently sensitive to syntactic structure regardless of input language.

Figure 2. Distribution of syntactically-sensitive attention heads across layers and heads in XLM-R for the nine languages studied. The significant overlap in the top-20 heads across languages reveals universal attention heads that capture constituency structure regardless of language.

Background & Motivation

Recent work has shown that pre-trained language models (PLMs) like BERT encode rich syntactic information in their internal representations. One line of research focuses on constituency parse extraction (CPE) -- recovering hierarchical phrase structure from PLM representations without any supervised syntactic training. The core intuition is that words belonging to the same syntactic constituent should share similar internal representations, while constituent boundaries should exhibit representational discontinuities. However, prior CPE methods have been evaluated almost exclusively on English using the Penn Treebank, and rely on a greedy top-down algorithm that is inherently limited in its ability to assess whole-phrase plausibility.

Key Gaps in Prior Work:

  • English-centric evaluation: Existing CPE methods (e.g., the syntactic distance approach of Kim et al., 2020) have been validated only on the Penn Treebank, leaving it unknown whether the syntactic structures they extract generalize across typologically different languages with varying word orders, morphological richness, and phrase structures.
  • Top-down limitations: Prior approaches rely on syntactic distances between adjacent words and use a greedy top-down procedure, which only considers boundary information at each split step. This means the algorithm cannot evaluate whether all words within a candidate phrase actually cohere as a constituent -- it simply picks the largest boundary distance and recurses.
  • Untapped multilingual PLMs: Models like mBERT (104 languages) and XLM-RoBERTa (100 languages) are trained on massive multilingual corpora, but their cross-lingual syntactic representations have not been systematically probed for constituency structure. These models offer a unique opportunity to test whether syntactic knowledge transfers across languages.
  • Unknown universality: It remains an open question whether syntactic structure encoded in multilingual PLMs is language-specific (learned separately for each language) or reflects universal grammatical patterns shared across languages -- a question with deep implications for linguistic theory and practical cross-lingual NLP.

This paper addresses all four gaps by proposing a chart-based CPE framework that considers all words within a candidate phrase (not just boundaries), applying it to 9 languages across two multilingual PLMs to investigate the universality of encoded syntactic structure.

Proposed Method: Chart-Based Constituency Parse Extraction

The method formulates CPE as finding the minimum-cost parse tree, where costs are derived from syntactic distances computed from PLM representations. Unlike the greedy top-down approach that makes locally optimal splits, the chart-based method uses dynamic programming to consider all possible span decompositions simultaneously, yielding a globally optimal binary tree.

Figure 1. Comparison of top-down (TD) and chart-based (CP/CC) parse extraction. TD greedily splits at the position of maximum syntactic distance, while the chart-based method uses CYK dynamic programming to find the globally optimal tree by scoring all possible spans.
1. Syntactic Distance Computation
For each pair of adjacent words, compute a syntactic distance d_i = f(g(w_i), g(w_{i+1})), where g(·) extracts representations from a specific PLM layer and f(·,·) is a distance metric. Two types of representations are used: hidden states (G_v) -- the contextualized word embeddings from each layer, paired with cosine, L1, or L2 distance; and attention distributions (G_d) -- the attention weight vectors from individual heads, paired with Jensen-Shannon divergence or Hellinger distance. The attention-based representations prove more effective for parse extraction because they encode relational patterns between tokens.
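As a concrete illustration, here is a minimal NumPy sketch of this step. It assumes each word is already represented by one attention row g(w_i) from a chosen head; function names such as `syntactic_distances` are ours, not from the paper's code:

```python
import numpy as np

def jensen_shannon(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence between two attention distributions."""
    p, q = p + eps, q + eps                  # avoid log(0) on sparse rows
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def syntactic_distances(reprs: np.ndarray, f=jensen_shannon) -> np.ndarray:
    """d_i = f(g(w_i), g(w_{i+1})) for every adjacent word pair.

    `reprs` is an (n_words, seq_len) array: one attention row per word.
    """
    return np.array([f(reprs[i], reprs[i + 1]) for i in range(len(reprs) - 1)])
```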
2. Chart-Based Span Scoring
Instead of greedily splitting top-down, the method assigns a compositionality score to every candidate span (i, j). Two scoring functions are proposed. The Pair Score s_p computes the average pairwise distance among all word combinations within a span -- s_p(i, j) = C(j-i+1, 2)^{-1} Σ_{i≤x<y≤j} f(g(w_x), g(w_y)) -- measuring internal coherence under the assumption that constituent members share similar attention patterns. The Characteristic Score s_c first computes a characteristic vector c as the mean attention distribution of all words in the span, then measures each word's average deviation from it: s_c(i, j) = (j-i+1)^{-1} Σ_{i≤x≤j} f(g(w_x), c). Lower scores indicate more cohesive spans that are likely valid constituents.
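Both scoring functions are simple to sketch on top of the distance function above (again an illustrative implementation, reusing `jensen_shannon`; scoring single-word spans as zero is our assumption, since they are trivially cohesive):

```python
from itertools import combinations

def pair_score(reprs, i, j, f=jensen_shannon):
    """s_p(i, j): average pairwise distance over all word pairs in the span."""
    if i == j:
        return 0.0                           # single-word span: no pairs
    pairs = list(combinations(range(i, j + 1), 2))
    return sum(f(reprs[x], reprs[y]) for x, y in pairs) / len(pairs)

def characteristic_score(reprs, i, j, f=jensen_shannon):
    """s_c(i, j): average distance of each word to the span's mean
    attention distribution (the characteristic vector c)."""
    c = reprs[i:j + 1].mean(axis=0)
    return sum(f(reprs[x], c) for x in range(i, j + 1)) / (j - i + 1)
```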
3. CYK Dynamic Programming
The total tree score decomposes as s_tree(T) = Σ_{(i,j)∈T} s(i, j), the sum of compositionality scores over all spans in the tree; dynamic programming finds the tree minimizing this sum by combining each span's score with the best scores of its two sub-spans at the optimal split point. The CYK algorithm operates in O(n³) time, filling a chart bottom-up to efficiently find the minimum-cost binary tree by considering all possible split points for every span. While more computationally expensive than the O(n log n) top-down approach, this global optimization overcomes the greedy method's limitation of making irrevocable local decisions based solely on boundary information.
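A compact sketch of the chart filling and backtracking, under the same assumptions as above (the function also returns the split table, which the ensemble sketch further below reuses):

```python
def min_cost_tree(reprs, span_score):
    """Chart-based extraction (CYK-style, O(n^3)): fill the chart bottom-up
    and return the minimum-cost bracketing plus the split-point table."""
    n = len(reprs)
    best = [[0.0] * n for _ in range(n)]     # best[i][j]: min cost of span
    split = [[0] * n for _ in range(n)]      # split[i][j]: argmin split k
    for length in range(2, n + 1):           # widen spans bottom-up
        for i in range(n - length + 1):
            j = i + length - 1
            sub, split[i][j] = min(
                (best[i][k] + best[k + 1][j], k) for k in range(i, j))
            best[i][j] = span_score(reprs, i, j) + sub

    def backtrack(i, j, spans):
        """Recover the bracketing (list of spans) from the split table."""
        if i < j:
            spans.append((i, j))
            k = split[i][j]
            backtrack(i, k, spans)
            backtrack(k + 1, j, spans)
        return spans

    return backtrack(0, n - 1, []), split
```

Usage is e.g. `min_cost_tree(reprs, pair_score)` for the CP variant or `min_cost_tree(reprs, characteristic_score)` for CC.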
4. Top-K Ensemble
To exploit the complementary strengths of different attention heads, a Top-K ensemble strategy is introduced. First, all attention heads are exhaustively evaluated on a validation set using every (f, g) combination. The K best-performing heads are selected (K=20 found optimal across all settings). At test time, each of the K heads generates its own parse tree, which is then converted to a syntactic distance vector. These K distance vectors are averaged, and the final parse is derived from the averaged distances. This provides consistent 2-4 F1 point gains orthogonal to the choice of scoring method, confirming that syntactic information is distributed across multiple attention heads rather than concentrated in a single one.
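One way to realize the tree-to-distance round trip, building on `min_cost_tree` above. The height-based conversion below is our illustrative choice (the paper describes converting each head's tree to a syntactic distance vector before averaging, without our committing to these exact helper names):

```python
def splits_to_distances(split, i, j, d):
    """Convert a tree (via its split table) into a syntactic-distance vector:
    a gap chosen nearer the root receives a larger value. Returns height."""
    if i >= j:
        return 0
    k = split[i][j]
    h = 1 + max(splits_to_distances(split, i, k, d),
                splits_to_distances(split, k + 1, j, d))
    d[k] = h                                 # gap k separates w_k and w_{k+1}
    return h

def distances_to_spans(d, i, j, spans):
    """Greedy top-down decoding: split each span at its largest distance."""
    if i < j:
        spans.append((i, j))
        k = i + int(np.argmax(d[i:j]))
        distances_to_spans(d, i, k, spans)
        distances_to_spans(d, k + 1, j, spans)
    return spans

def ensemble_parse(per_head_reprs, span_score):
    """Average the K per-head distance vectors, then decode a single tree."""
    n = len(per_head_reprs[0])
    avg = np.zeros(n - 1)
    for reprs in per_head_reprs:             # one entry per selected head
        _, split = min_cost_tree(reprs, span_score)
        d = np.zeros(n - 1)
        splits_to_distances(split, 0, n - 1, d)
        avg += d / len(per_head_reprs)
    return distances_to_spans(avg, 0, n - 1, [])
```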

Experimental Results

The method is evaluated on 9 typologically diverse languages spanning different language families (Indo-European, Uralic, Afro-Asiatic, and language isolates) and word orders (SVO, SOV, VSO, free). Treebanks include the Penn Treebank (English) and the SPMRL shared-task treebanks (Basque, French, German, Hebrew, Hungarian, Korean, Polish, Swedish). The multilingual models tested are mBERT and XLM-RoBERTa (XLM-R). Three CPE variants are compared: TD (top-down baseline), CP (chart-based with Pair Score), and CC (chart-based with Characteristic Score).

Monolingual English Results (Sentence-level F1)

Model          TD     CP     CC
BERT-base      37.0   38.5   39.0
RoBERTa-base   35.6   37.8   38.1
XLNet-base     40.1   42.3   41.8
XLNet-large    40.1   43.4   46.4

Comparison with Unsupervised Parsing Baselines (F1)

Language   PRPN   ON-LSTM   C-PCFG   CPE-PLM (Ours)
English    --     --        --       46.4
French     --     --        40.5     42.4
German     --     --        37.3     39.6
Korean     --     --        27.7     47.3
Swedish    --     --        23.7     38.4

CPE-PLM results use the best monolingual PLM for each language with Top-K ensemble. The method substantially outperforms Compound PCFG (C-PCFG), especially on Korean (+19.6 F1) and Swedish (+14.7 F1), without requiring any language-specific parser training.

Multilingual Results with XLM-R (Sentence-level F1)

Language    TD     CP     CC
English     45.5   46.7   47.0
Basque      43.7   43.8   45.1
French      45.8   44.2   45.5
German      41.4   42.2   41.6
Hebrew      45.0   43.2   45.3
Hungarian   42.4   44.0   43.4
Korean      55.9   55.7   54.3
Polish      43.1   43.7   44.6
Swedish     39.5   40.6   41.5

Cross-Lingual Transfer & Universal Attention Heads

A key question is whether the optimal attention heads for CPE are language-specific or universal. To test this, the authors perform cross-lingual transfer experiments: selecting the top-K heads using only the English PTB validation set, then applying those same heads to all other languages.

Cross-Lingual Transfer Results:

  • Using English-selected heads on other languages typically causes only 0-2 F1 point drops compared to language-specific head selection.
  • In some cases, transfer actually improves performance: Basque improves from 45.1 to 46.2 F1, suggesting that English validation data can identify syntactically informative heads that generalize better than language-specific selection with smaller validation sets.
  • Languages with smaller validation sets (Hebrew, Polish, Swedish) benefit most from cross-lingual transfer, as English provides a more reliable signal for head selection.
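The selection-then-transfer recipe amounts to ranking heads by validation F1 on one treebank and reusing the ranking elsewhere. A minimal sketch under the assumptions of the earlier code (`head_reprs` maps a head id to its per-sentence representation arrays, `gold` holds gold span sets; these names are illustrative):

```python
def span_f1(pred_spans, gold_spans):
    """Unlabeled bracketing F1 between predicted and gold span sets."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def select_top_k_heads(head_reprs, gold, span_score, k=20):
    """Rank every head by validation F1 (e.g., on the English PTB) and keep
    the best K; the returned head ids can then be reused for other languages."""
    ranked = sorted(
        head_reprs,
        key=lambda h: -np.mean([span_f1(min_cost_tree(r, span_score)[0], g)
                                for r, g in zip(head_reprs[h], gold)]))
    return ranked[:k]
```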

Visualizing the top-20 attention heads for each language in XLM-R reveals striking overlap: most heads that are syntactically sensitive for one language are also sensitive for others. These universal attention heads cluster in layers 6-12 (middle-to-upper layers), consistent with prior findings that lower layers encode surface-level features while upper layers encode more abstract, structural information.

Phrase-Type Analysis

Breaking down performance by phrase type reveals an interesting asymmetry across constituent categories.

Why It Matters

This work makes three contributions that advance our understanding of syntactic knowledge in multilingual PLMs:

  • A chart-based CPE framework that scores whole spans rather than only boundaries, replacing greedy top-down splitting with globally optimal dynamic programming.
  • The first systematic, language-agnostic evaluation of CPE across 9 typologically diverse languages using multilingual PLMs (mBERT and XLM-R).
  • Evidence for universal attention heads -- heads that are consistently sensitive to constituency structure regardless of the input language.

For low-resource languages that lack large treebanks, this approach offers a viable path to approximate syntactic analysis using only a multilingual PLM -- no language-specific parser training is needed. The dramatic improvements over C-PCFG on languages like Korean (+19.6 F1) and Swedish (+14.7 F1) are particularly promising for practical applications. The insights about universal attention heads also inform the broader study of how transformers represent grammar, and suggest that probing multilingual models for structural properties can reveal deep cross-lingual regularities.
