Revisiting the Practical Effectiveness of Constituency Parse Extraction from Pre-trained Language Models

COLING 2022
Taeuk Kim

One-Line Summary

A rigorous re-examination of constituency parse extraction from pre-trained language models (CPE-PLM) that introduces novel ensemble techniques combining multiple heterogeneous PLMs, achieving 55.7 F1 on PTB -- competitive with unsupervised parsers -- and demonstrating clear advantages over supervised parsers in few-shot settings.

Paper overview
Figure 1. Concept diagram of CPE-PLM (Constituency Parse Extraction from PLM) with various ensemble methods including single attention head, top-K ensemble, and layer-wise ensemble.

Background & Motivation

Constituency Parse Extraction from Pre-trained Language Models (CPE-PLM) is a recent paradigm that attempts to induce constituency parse trees relying only on the internal knowledge of pre-trained language models, without any task-specific fine-tuning. The key idea is that PLMs such as BERT and RoBERTa encode syntactic structure in their attention patterns and hidden representations, which can be decoded into parse trees using chart-based algorithms.

Key Challenges with Prior Work:

  • Inconsistent evaluation: Previous studies used different PLMs, layer selection strategies, distance metrics, and decoding algorithms, making fair comparison impossible.
  • Limited scope: Most evaluations focused on a single PLM or a narrow set of configurations, missing the bigger picture of what CPE-PLM can truly achieve.
  • Unexplored ensemble potential: While individual attention heads show limited parsing ability, the potential of combining information across multiple heads, layers, and even multiple PLMs had not been systematically explored.
  • Unknown practical value: It was unclear whether CPE-PLM -- a training-free approach -- could offer any practical advantage over established unsupervised or supervised parsers in real-world scenarios.

This paper addresses all of these gaps by providing a mathematical reformulation of CPE-PLM, proposing novel ensemble methods, and conducting comprehensive experiments across multiple languages, downstream tasks, and data regimes. The study systematically evaluates 16 PLMs (12 English and 4 multilingual), covering encoder-based (BERT, RoBERTa, ELECTRA), decoder-based (GPT-2, CTRL), and hybrid architectures (XLNet, BART), providing the most thorough assessment of CPE-PLM to date.

Proposed Method

The paper first reformulates CPE-PLM in a unified mathematical framework that clarifies the relationship between prior approaches and enables principled ensemble strategies.

Core Formulation

Given a sentence, each attention head provides a pair-wise score function for spans. The tree score decomposes as:

s_tree(T) = Σ_{(i,j)∈T} s_span(i,j)

where span scores are recursively defined as s_span(i,j) = s_comp(i,j) + min_{i≤k<j} s_split(i,k,j) for i < j, and s_span(i,i) = 0. The split score further decomposes as s_split(i,k,j) = s_span(i,k) + s_span(k+1,j). Parse trees are then found via the CKY algorithm by minimizing the total tree score: T̂ = argmin_T s_tree(T).

Two distance functions are used to measure divergence between attention distributions: Jensen-Shannon divergence (JSD) and Hellinger distance (HEL), with Hellinger distance favored for its simplicity and comparable effectiveness.
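To make the decoding step concrete, here is a minimal sketch (not the authors' implementation): it assumes a precomputed composition-score matrix s_comp for one attention head and recovers the minimum-score binary tree by dynamic programming. hellinger implements the HEL distance mentioned above (JSD could be substituted), and all function names are hypothetical.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two attention distributions (f = HEL)."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def decode_tree(s_comp):
    """Chart decoding: return the binary tree minimizing s_tree(T), where
    s_span(i,j) = s_comp(i,j) + min_{i<=k<j} [s_span(i,k) + s_span(k+1,j)]
    and s_span(i,i) = 0."""
    n = s_comp.shape[0]
    s_span = np.zeros((n, n))                 # diagonal stays 0: s_span(i,i) = 0
    best_split = np.zeros((n, n), dtype=int)
    for width in range(1, n):                 # span width = j - i
        for i in range(n - width):
            j = i + width
            splits = [s_span[i, k] + s_span[k + 1, j] for k in range(i, j)]
            k = int(np.argmin(splits))
            best_split[i, j] = i + k
            s_span[i, j] = s_comp[i, j] + splits[k]

    def build(i, j):                          # backtrace the chart into a nested tuple
        if i == j:
            return i                          # leaf: word index
        k = best_split[i, j]
        return (build(i, k), build(k + 1, j))

    return build(0, n - 1), s_span[0, n - 1]
```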

1. Mathematical Reformulation

The pair score function is defined as s_pair(i,j) := C(j−i+1, 2)^{-1} Σ_{i≤x<y≤j} f(g(w_x), g(w_y)), i.e., the average distance over all word pairs within the span, where g^{(m,n)} extracts a word's attention distribution from the n-th head at layer m, and f measures the distance between the attention distributions of a word pair. This formulation unifies prior approaches under a single framework and enables principled ensemble strategies. Each attention head in a PLM with l layers and a heads per layer produces a candidate tree, yielding l × a candidate trees per model.
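A rough sketch of this pair score, under the assumption that it plays the role of s_comp in the chart decoder above; pair_score and comp_matrix are hypothetical names, and hellinger / decode_tree refer to the previous snippet.

```python
import numpy as np
from itertools import combinations
from math import comb

def pair_score(attn, i, j, dist):
    """s_pair(i,j): average distance f between the attention distributions
    of every word pair (x, y) with i <= x < y <= j, for one attention head.
    attn is an (n_words, n_words) row-stochastic matrix, so attn[x] = g(w_x)."""
    total = sum(dist(attn[x], attn[y]) for x, y in combinations(range(i, j + 1), 2))
    return total / comb(j - i + 1, 2)

def comp_matrix(attn, dist):
    """Fill the composition-score matrix consumed by decode_tree above."""
    n = attn.shape[0]
    s_comp = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            s_comp[i, j] = pair_score(attn, i, j, dist)
    return s_comp

# usage (hellinger and decode_tree come from the previous sketch):
# tree, score = decode_tree(comp_matrix(attention_matrix, hellinger))
```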
2. Greedy Ensemble

Starting from the best-performing single attention head, the method sequentially adds further heads to the ensemble if they improve overall parsing performance on a validation set. The algorithm iterates over a sorted set of candidate heads G_sorted and retains each head only if it improves the validation metric. There is no fixed limit on the number of participants, allowing the method to adaptively determine the optimal ensemble size.
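A minimal sketch of the greedy selection loop; the evaluate callback (which would parse the validation set with the pooled heads and return F1) is a placeholder, not the paper's code.

```python
def greedy_ensemble(candidate_heads, evaluate):
    """Greedy ensemble: rank heads by their individual validation F1,
    start from the best one, and keep each further head only if adding
    it improves validation F1. No fixed cap on the ensemble size."""
    ranked = sorted(candidate_heads, key=lambda h: evaluate([h]), reverse=True)
    ensemble, best_f1 = [ranked[0]], evaluate([ranked[0]])
    for head in ranked[1:]:
        f1 = evaluate(ensemble + [head])
        if f1 > best_f1:                     # keep the head only if it helps
            ensemble.append(head)
            best_f1 = f1
    return ensemble, best_f1
```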
3. Beam Ensemble

Inspired by beam search in neural text generation, this method maintains the b best hypotheses (head combinations) at each expansion step. At each iteration, it expands all current hypotheses by adding one more candidate head, evaluates all expansions, and keeps only the top-b combinations. Each head is selected at most once (no replacement). The beam size is set to b=5 for single-PLM and b=30 for multi-PLM setups, avoiding greedy local optima at a modest computational cost.
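A corresponding sketch of the beam variant, again with a hypothetical evaluate callback; the beam size defaults to 5 as in the single-PLM setup, and the max_size cap is an illustrative assumption rather than the paper's stopping criterion.

```python
def beam_ensemble(candidate_heads, evaluate, beam_size=5, max_size=10):
    """Beam ensemble: keep the beam_size best head combinations at each step,
    expanding every hypothesis by one unused head (selection without
    replacement), and return the best combination found overall."""
    beam = [frozenset()]
    best_combo, best_f1 = frozenset(), float("-inf")
    for _ in range(max_size):                      # cap on ensemble size (sketch only)
        scored = {}
        for combo in beam:
            for head in candidate_heads:
                if head in combo:
                    continue
                new = combo | {head}
                if new not in scored:
                    scored[new] = evaluate(list(new))
        if not scored:
            break
        ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
        beam = [combo for combo, _ in ranked[:beam_size]]
        if ranked[0][1] > best_f1:
            best_combo, best_f1 = ranked[0]
    return list(best_combo), best_f1
```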
4. Multi-PLM Extension

Expands the candidate pool of attention heads beyond a single PLM to include heads from P heterogeneous models, forming a combined tree pool: τ_multi := {T̂^{(p,m,n)} | p ∈ {1,…,P}, m ∈ {1,…,l}, n ∈ {1,…,a}}. The ensemble selects the best combination across all available heads from all PLMs, exploiting complementary syntactic knowledge encoded in different architectures. The paper evaluates 16 PLMs in total: 12 English (BERT-base/large, RoBERTa-base/large, ELECTRA-base/large, GPT-2, GPT-2-medium, CTRL, BART-large, XLNet-base/large) and 4 multilingual (MBERT, XLM, XLM-R, XLM-R-large).
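A tiny sketch of how the multi-PLM candidate pool could be enumerated; the model names and layer/head counts shown are standard configurations used only for illustration, not the paper's exact 16-model setup.

```python
def build_head_pool(plm_shapes):
    """Enumerate every (plm, layer, head) triple across the participating
    PLMs; the greedy / beam ensembles above then search over this pool."""
    return [(plm, layer, head)
            for plm, (n_layers, n_heads) in plm_shapes.items()
            for layer in range(n_layers)
            for head in range(n_heads)]

# e.g. BERT-base has 12 layers x 12 heads, RoBERTa-large 24 layers x 16 heads
pool = build_head_pool({"bert-base-cased": (12, 12), "roberta-large": (24, 16)})
print(len(pool))  # 144 + 384 = 528 candidate heads
```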

Experimental Results

Single-PLM Performance on PTB (Selected Models)

| PLM | Best Single Head | Greedy Ensemble | Beam Ensemble |
|---|---|---|---|
| BERT-base | 42.7 | 43.0 | -- |
| BERT-large | 44.2 | 45.0 | -- |
| RoBERTa-large | 41.9 | 47.2 | -- |
| ELECTRA-large | 44.3 | 47.9 | -- |
| XLNet-large | 46.4 | 47.2 | -- |
| XLM-R | 46.7 | 48.5 | -- |

Prior best (Kim et al., 2021): 47.7 F1.

CPE-PLM vs. Unsupervised Parsers (PTB Test Set)

| Model | Type | F1 | NP | VP | PP | ADVP |
|---|---|---|---|---|---|---|
| PRPN | Unsupervised | 47.3 | 59 | 46 | 57 | 32 |
| ON-LSTM | Unsupervised | 48.1 | 64 | 41 | 54 | 31 |
| Neural PCFG | Unsupervised | 50.8 | 71 | 33 | 58 | 45 |
| Compound PCFG | Unsupervised | 55.2 | 74 | 41 | 68 | 52 |
| Neural L-PCFG | Unsupervised | 55.3 | 67 | 48 | 65 | 58 |
| XLM-R + Greedy | CPE-PLM (single) | 48.5 | 69 | 29 | 62 | 73 |
| All PLMs + Greedy | CPE-PLM (multi) | 55.3 | 75 | 36 | 76 | 76 |
| All PLMs + Beam | CPE-PLM (multi) | 55.7 | 74 | 42 | 75 | 72 |

Phrase-level analysis: CPE-PLM excels at recognizing NP (75%), PP (76%), and ADVP (76%) constituents, substantially outperforming unsupervised parsers on PP and ADVP. However, VP recall (42%) remains a relative weakness compared to Neural L-PCFG (48%), suggesting that verb phrase boundaries are harder to capture from attention patterns alone.

Multilingual Results (9 Languages, F1)

MethodAvgENEUFRDEHEHUKOPLSV
Top-K (MBERT)39.844.639.335.935.937.833.247.551.132.6
Greedy (MBERT)40.447.140.236.937.538.630.249.152.431.9
Greedy (All MLMs)47.551.944.041.947.348.140.153.761.439.0

Combining four multilingual PLMs (MBERT, XLM, XLM-R, XLM-R-large) boosts the average F1 from 40.4 to 47.5 (+7.1 points). Polish achieves the highest score at 61.4 F1, followed by Korean (53.7) and English (51.9). The consistent gains across all nine typologically diverse languages confirm that multi-PLM synergy is a language-agnostic phenomenon.

Few-Shot Parsing: CPE-PLM vs. Supervised Parser (Benepar)

| Annotations | CPE-PLM Greedy | CPE-PLM Beam | Benepar (Supervised) |
|---|---|---|---|
| 1 | 46.2 | 45.4 | 11.6 |
| 2 | 48.4 | 45.9 | 12.5 |
| 5 | 49.9 | 47.7 | 12.5 |
| 10 | 49.1 | 49.6 | 14.0 |
| 17 (1% of data) | 49.4 | 51.3 | 31.1 |
| 100% of validation set | 55.3 | 55.7 | 92.2 |

Validation Set Dependency

| Validation Data | Greedy F1 | Relative Loss | Beam F1 | Relative Loss |
|---|---|---|---|---|
| 1% | 49.4 | -5.9% | 51.3 | -4.5% |
| 2% | 49.9 | -5.3% | 49.8 | -6.0% |
| 5% | 52.7 | -2.5% | 51.8 | -4.0% |
| 10% | 54.3 | -0.9% | 52.9 | -2.9% |
| 100% | 55.3 | -- | 55.7 | -- |

Even with only 1% of the validation data (17 annotated trees), CPE-PLM retains roughly 90% of its full performance, confirming its extreme data efficiency.

Downstream Task: URNNG Training

| Configuration | Perplexity | F1 |
|---|---|---|
| Compound PCFG | 85.4 | 57.8 |
| CPE-PLM (All + Greedy) → URNNG | 81.3 | 57.2 |
| CPE-PLM (All + Beam) → URNNG | 82.0 | 60.7 |

Using CPE-PLM-induced trees to train URNNG boosts parsing F1 from 55.7 to 60.7 (+5 points), demonstrating that CPE-PLM trees serve as effective supervision for training dedicated parsers.

Downstream Task: TreeLSTM Classification Accuracy

| Parse Source | SST-2 | MR | SUBJ | TREC |
|---|---|---|---|---|
| Right-branching | 85.72 | 83.37 | 94.80 | 94.50 |
| CPE-PLM (All + Beam) | 86.10 | 83.62 | 94.85 | 94.75 |
| Supervised Parser | 86.70 | 83.62 | 95.12 | 95.05 |

CPE-PLM trees consistently outperform the right-branching baseline and nearly match supervised parser trees on all four text classification benchmarks, confirming that the induced syntactic structure carries meaningful information for downstream applications.

Inference Time Comparison

| Approach | F1 | Time |
|---|---|---|
| Compound PCFG | 55.2 | 31 min |
| CPE-PLM (Greedy) | 55.3 | 27 min |
| Distance parser trained on CPE-PLM | 55.0 | 36 sec |
| Benepar trained on CPE-PLM | 59.3 | 32 sec |

Knowledge distillation from CPE-PLM to a supervised parser (Benepar) reduces inference time from 27 minutes to just 32 seconds while actually improving F1 (59.3 vs. 55.3 for the greedy CPE-PLM ensemble), offering a practical deployment path.

Why It Matters

This paper makes several important contributions to the field of syntactic analysis and PLM interpretability:

  • Training-free parsing quality: the proposed greedy and beam ensembles lift CPE-PLM to 55.7 F1 on PTB, competitive with dedicated unsupervised parsers, without any fine-tuning.
  • Extreme data efficiency: with as few as 1-17 annotated trees used for head selection, CPE-PLM clearly outperforms a supervised parser (Benepar) trained on the same amount of data.
  • Language-agnostic gains: combining multilingual PLMs improves parsing on all nine evaluated languages, showing that multi-PLM synergy is not an English-only effect.
  • Practical deployment paths: CPE-PLM-induced trees can supervise URNNG and TreeLSTM models and can be distilled into a fast supervised parser, cutting inference time from tens of minutes to seconds while improving F1.
