Revisiting the Practical Effectiveness of Constituency Parse Extraction from Pre-trained Language Models
COLING 2022
Taeuk Kim
One-Line Summary
A rigorous re-examination of constituency parse extraction from pre-trained language models (CPE-PLM) that introduces novel ensemble techniques combining multiple heterogeneous PLMs, achieving 55.7 F1 on PTB -- competitive with unsupervised parsers -- and demonstrating clear advantages over supervised parsers in few-shot settings.
Figure 1. Concept diagram of CPE-PLM (Constituency Parse Extraction from PLM) with various ensemble methods including single attention head, top-K ensemble, and layer-wise ensemble.
Background & Motivation
Constituency Parse Extraction from Pre-trained Language Models (CPE-PLM) is a recent paradigm that attempts to induce constituency parse trees relying only on the internal knowledge of pre-trained language models, without any task-specific fine-tuning. The key idea is that PLMs such as BERT and RoBERTa encode syntactic structure in their attention patterns and hidden representations, which can be decoded into parse trees using chart-based algorithms.
Key Challenges with Prior Work:
Inconsistent evaluation: Previous studies used different PLMs, layer selection strategies, distance metrics, and decoding algorithms, making fair comparison impossible.
Limited scope: Most evaluations focused on a single PLM or a narrow set of configurations, missing the bigger picture of what CPE-PLM can truly achieve.
Unexplored ensemble potential: While individual attention heads show limited parsing ability, the potential of combining information across multiple heads, layers, and even multiple PLMs had not been systematically explored.
Unknown practical value: It was unclear whether CPE-PLM -- a training-free approach -- could offer any practical advantage over established unsupervised or supervised parsers in real-world scenarios.
This paper addresses all of these gaps by providing a mathematical reformulation of CPE-PLM, proposing novel ensemble methods, and conducting comprehensive experiments across multiple languages, downstream tasks, and data regimes. The study systematically evaluates 16 PLMs (12 English and 4 multilingual), covering encoder-based (BERT, RoBERTa, ELECTRA), decoder-based (GPT-2, CTRL), and hybrid architectures (XLNet, BART), providing the most thorough assessment of CPE-PLM to date.
Proposed Method
The paper first reformulates CPE-PLM in a unified mathematical framework that clarifies the relationship between prior approaches and enables principled ensemble strategies.
Core Formulation
Given a sentence, each attention head provides a pair-wise score function for spans. The tree score decomposes as:
$s_{\text{tree}}(T) = \sum_{(i,j) \in T} s_{\text{span}}(i, j)$
where span scores are defined recursively: $s_{\text{span}}(i, j) = s_{\text{comp}}(i, j) + \min_{i \le k < j} s_{\text{split}}(i, k, j)$ for $i < j$, and $s_{\text{span}}(i, i) = 0$. The split score further decomposes as $s_{\text{split}}(i, k, j) = s_{\text{span}}(i, k) + s_{\text{span}}(k+1, j)$. Parse trees are then found with a CKY-style chart algorithm that minimizes the total tree score: $\hat{T} = \arg\min_{T} s_{\text{tree}}(T)$.
Two distance functions are used to measure divergence between attention distributions: Jensen-Shannon divergence (JSD) and Hellinger distance (HEL), with Hellinger distance favored for its simplicity and comparable effectiveness.
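To make the decoding concrete, the following is a minimal Python sketch (assuming NumPy; `hellinger`, `comp_score`, `decode_tree`, and `backtrack` are illustrative names, not code from the paper) of the Hellinger distance and the recursive span-score minimization, with `comp_score[i, j]` standing in for the attention-derived composition score formalized in item 1 below.

```python
import numpy as np
from functools import lru_cache

def hellinger(p, q):
    """Hellinger distance (HEL) between two attention distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def decode_tree(comp_score):
    """Chart decoding: recursively minimize s_span, then backtrack the best splits.

    comp_score[i, j] plays the role of s_comp(i, j) for the span of words
    i..j (inclusive), e.g. the attention-based pair score sketched in item 1 below.
    Returns the spans (i, j) of the minimum-score binary tree.
    """
    n = comp_score.shape[0]

    @lru_cache(maxsize=None)
    def s_span(i, j):
        if i == j:                                   # s_span(i, i) = 0
            return 0.0, None
        # s_split(i, k, j) = s_span(i, k) + s_span(k+1, j)
        best_split, best_k = min(
            (s_span(i, k)[0] + s_span(k + 1, j)[0], k) for k in range(i, j)
        )
        return comp_score[i, j] + best_split, best_k

    def backtrack(i, j):
        """Collect the bracketing of the minimum-score tree."""
        if i == j:
            return []
        _, k = s_span(i, j)
        return [(i, j)] + backtrack(i, k) + backtrack(k + 1, j)

    return backtrack(0, n - 1)
```

For a sentence of n words, `comp_score` would be an n x n matrix of span scores built from the pair score below; calling `decode_tree` on it yields the predicted bracketing.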
1. Mathematical Reformulation
The pair score function is defined as $s_p(i, j) := \binom{j-i+1}{2}^{-1} \sum_{i \le x < y \le j} f(g_{(m,n)}(w_x), g_{(m,n)}(w_y))$, where $g_{(m,n)}$ extracts a word's attention distribution from the $n$-th head at layer $m$, and $f$ measures the distance between the attention distributions of a word pair. This formulation unifies prior approaches under a single framework and enables principled ensemble strategies. Each attention head in a PLM with $l$ layers and $a$ heads per layer produces a candidate tree, yielding $l \times a$ candidate trees per model.
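For a single head, the pair score can be sketched as follows (a rough illustration, assuming the head's attention matrix is available as an array whose row x is the distribution $g_{(m,n)}(w_x)$; `pair_score` and `f` are illustrative names):

```python
from itertools import combinations

def pair_score(attn, i, j, f):
    """s_p(i, j): average of distance f over all word pairs inside span i..j.

    attn: (seq_len, seq_len) attention matrix of one head (layer m, head n),
          where row attn[x] plays the role of g_(m,n)(w_x);
    f:    a distance between attention distributions, e.g. the Hellinger
          distance sketched above.
    """
    if i == j:
        return 0.0                                   # single-word span
    pairs = list(combinations(range(i, j + 1), 2))   # C(j-i+1, 2) word pairs
    return sum(f(attn[x], attn[y]) for x, y in pairs) / len(pairs)
```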
2. Greedy Ensemble
Starting from the best-performing single attention head, this method sequentially adds further heads to the ensemble if they improve overall parsing performance on a validation set. The algorithm iterates over a sorted set of candidate heads $G_{\text{sorted}}$ and retains each head only if it improves the validation metric. There is no fixed limit on the number of participants, allowing the method to adaptively determine the optimal ensemble size.
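A compact sketch of this selection loop, under stated assumptions: `candidate_heads` is $G_{\text{sorted}}$ (head handles ordered by individual validation F1), `parse_with(heads, sentence)` decodes a tree from the averaged pair scores of the given heads (e.g. with `decode_tree` above), and `f1` computes bracketing F1 against gold trees. All three are hypothetical stand-ins, not the paper's actual interfaces.

```python
def greedy_ensemble(candidate_heads, val_sentences, val_gold, parse_with, f1):
    """Greedy ensemble: keep a candidate head only if it improves validation F1."""
    ensemble = [candidate_heads[0]]                  # start from the best single head
    best = f1([parse_with(ensemble, s) for s in val_sentences], val_gold)
    for head in candidate_heads[1:]:                 # iterate over G_sorted
        trial = ensemble + [head]
        score = f1([parse_with(trial, s) for s in val_sentences], val_gold)
        if score > best:                             # retained only if it helps
            ensemble, best = trial, score            # no fixed limit on ensemble size
    return ensemble, best
```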
3. Beam Ensemble
Inspired by beam search in neural text generation, this method maintains the b best hypotheses (head combinations) at each expansion step. At each iteration, it expands every current hypothesis by adding one more candidate head, evaluates all expansions, and keeps only the top-b combinations. Each head is selected at most once (no replacement). The beam size is set to b = 5 for single-PLM and b = 30 for multi-PLM setups, avoiding greedy local optima at a modest computational cost.
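A sketch of the beam variant under the same assumptions, with `evaluate(heads)` standing in for "parse the validation set with these heads and return bracketing F1"; the stopping rule here (a bounded number of expansion steps while remembering the best combination seen) is an assumption, since the description above does not specify one.

```python
def beam_ensemble(candidate_heads, evaluate, beam_size=5, max_steps=None):
    """Beam ensemble: keep the beam_size best head combinations at each step.

    candidate_heads: pool of head handles, e.g. (plm, layer, head) tuples;
    evaluate(heads) -> validation F1 of that combination;
    beam_size: 5 for single-PLM, 30 for multi-PLM setups (per the paper).
    """
    beams = [((), float("-inf"))]                    # (combination, score)
    best_combo, best_score = (), float("-inf")
    steps = max_steps if max_steps is not None else len(candidate_heads)
    for _ in range(steps):
        expansions = {}
        for combo, _ in beams:
            for head in candidate_heads:
                if head in combo:                    # each head used at most once
                    continue
                new = tuple(sorted(combo + (head,)))  # canonical order, no duplicates
                if new not in expansions:
                    expansions[new] = evaluate(list(new))
        if not expansions:                           # pool exhausted
            break
        beams = sorted(expansions.items(), key=lambda kv: kv[1], reverse=True)[:beam_size]
        if beams[0][1] > best_score:                 # remember the best combination seen
            best_combo, best_score = beams[0]
    return list(best_combo), best_score
```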
4. Multi-PLM Extension
This extension expands the candidate pool of attention heads beyond a single PLM to heads from $P$ heterogeneous models, forming a combined tree pool $\mathcal{T}_{\text{multi}} := \{\hat{T}^{(p,m,n)} \mid p \in \{1,\dots,P\},\ m \in \{1,\dots,l\},\ n \in \{1,\dots,a\}\}$. The ensemble selects the best combination across all available heads from all PLMs, exploiting complementary syntactic knowledge encoded in different architectures. The paper evaluates 16 PLMs total: 12 English (BERT-base/large, RoBERTa-base/large, ELECTRA-base/large, GPT-2, GPT-2-medium, CTRL, BART-large, XLNet-base/large) and 4 multilingual (MBERT, XLM, XLM-R, XLM-R-large).
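In code, the multi-PLM extension only changes how the candidate pool is assembled; a minimal sketch, assuming each PLM handle exposes hypothetical `name`, `num_layers`, and `num_heads` attributes:

```python
def build_multi_plm_pool(plms):
    """Candidate pool spanning P heterogeneous PLMs: one handle (p, m, n)
    per model p, layer m, and attention head n, as in tau_multi above."""
    return [(plm.name, m, n)
            for plm in plms
            for m in range(plm.num_layers)
            for n in range(plm.num_heads)]
```

The greedy or beam selection above then runs unchanged over this larger pool (with beam size 30 in the multi-PLM setup).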
Experimental Results
Single-PLM Performance on PTB (Selected Models)
| PLM | Best Single Head | Greedy Ensemble | Beam Ensemble |
|---|---|---|---|
| BERT-base | 42.7 | 43.0 | -- |
| BERT-large | 44.2 | 45.0 | -- |
| RoBERTa-large | 41.9 | 47.2 | -- |
| ELECTRA-large | 44.3 | 47.9 | -- |
| XLNet-large | 46.4 | 47.2 | -- |
| XLM-R | 46.7 | 48.5 | -- |

Prior best (Kim et al., 2021): 47.7
CPE-PLM vs. Unsupervised Parsers (PTB Test Set)
| Model | Type | F1 | NP | VP | PP | ADVP |
|---|---|---|---|---|---|---|
| PRPN | Unsupervised | 47.3 | 59 | 46 | 57 | 32 |
| ON-LSTM | Unsupervised | 48.1 | 64 | 41 | 54 | 31 |
| Neural PCFG | Unsupervised | 50.8 | 71 | 33 | 58 | 45 |
| Compound PCFG | Unsupervised | 55.2 | 74 | 41 | 68 | 52 |
| Neural L-PCFG | Unsupervised | 55.3 | 67 | 48 | 65 | 58 |
| XLM-R + Greedy | CPE-PLM (single) | 48.5 | 69 | 29 | 62 | 73 |
| All PLMs + Greedy | CPE-PLM (multi) | 55.3 | 75 | 36 | 76 | 76 |
| All PLMs + Beam | CPE-PLM (multi) | 55.7 | 74 | 42 | 75 | 72 |
Phrase-level analysis: CPE-PLM excels at recognizing NP (75%), PP (76%), and ADVP (76%) constituents, substantially outperforming unsupervised parsers on PP and ADVP. However, VP recall (42%) remains a relative weakness compared to Neural L-PCFG (48%), suggesting that verb phrase boundaries are harder to capture from attention patterns alone.
Multilingual Results (9 Languages, F1)
| Method | Avg | EN | EU | FR | DE | HE | HU | KO | PL | SV |
|---|---|---|---|---|---|---|---|---|---|---|
| Top-K (MBERT) | 39.8 | 44.6 | 39.3 | 35.9 | 35.9 | 37.8 | 33.2 | 47.5 | 51.1 | 32.6 |
| Greedy (MBERT) | 40.4 | 47.1 | 40.2 | 36.9 | 37.5 | 38.6 | 30.2 | 49.1 | 52.4 | 31.9 |
| Greedy (All MLMs) | 47.5 | 51.9 | 44.0 | 41.9 | 47.3 | 48.1 | 40.1 | 53.7 | 61.4 | 39.0 |
Combining four multilingual PLMs (MBERT, XLM, XLM-R, XLM-R-large) boosts the average F1 from 40.4 to 47.5 (+7.1 points). Polish achieves the highest score at 61.4 F1, followed by Korean (53.7) and English (51.9). The consistent gains across all nine typologically diverse languages confirm that multi-PLM synergy is a language-agnostic phenomenon.
Few-Shot Parsing: CPE-PLM vs. Supervised Parser (Benepar)
| Annotations | CPE-PLM Greedy | CPE-PLM Beam | Benepar (Supervised) |
|---|---|---|---|
| 1 | 46.2 | 45.4 | 11.6 |
| 2 | 48.4 | 45.9 | 12.5 |
| 5 | 49.9 | 47.7 | 12.5 |
| 10 | 49.1 | 49.6 | 14.0 |
| 17 (1% of data) | 49.4 | 51.3 | 31.1 |
| 100% validation | 55.3 | 55.7 | 92.2 |
Validation Set Dependency
| Validation Data | Greedy F1 | Relative Loss (Greedy) | Beam F1 | Relative Loss (Beam) |
|---|---|---|---|---|
| 1% | 49.4 | -5.9% | 51.3 | -4.5% |
| 2% | 49.9 | -5.3% | 49.8 | -6.0% |
| 5% | 52.7 | -2.5% | 51.8 | -4.0% |
| 10% | 54.3 | -0.9% | 52.9 | -2.9% |
| 100% | 55.3 | -- | 55.7 | -- |
Even with only 1% of the validation data (17 annotated trees), CPE-PLM retains roughly 90% of its full performance, confirming its extreme data efficiency.
Downstream Task: URNNG Training
| Configuration | Perplexity | F1 |
|---|---|---|
| Compound PCFG | 85.4 | 57.8 |
| CPE-PLM (All + Greedy) → URNNG | 81.3 | 57.2 |
| CPE-PLM (All + Beam) → URNNG | 82.0 | 60.7 |
Using CPE-PLM-induced trees to train URNNG boosts parsing F1 from 55.7 to 60.7 (+5 points), demonstrating that CPE-PLM trees serve as effective supervision for training dedicated parsers.
Downstream Task: TreeLSTM Classification Accuracy
| Parse Source | SST-2 | MR | SUBJ | TREC |
|---|---|---|---|---|
| Right-branching | 85.72 | 83.37 | 94.80 | 94.50 |
| CPE-PLM (All + Beam) | 86.10 | 83.62 | 94.85 | 94.75 |
| Supervised Parser | 86.70 | 83.62 | 95.12 | 95.05 |
CPE-PLM trees consistently outperform the right-branching baseline and nearly match supervised parser trees on all four text classification benchmarks, confirming that the induced syntactic structure carries meaningful information for downstream applications.
Inference Time Comparison
| Approach | F1 | Time |
|---|---|---|
| Compound PCFG | 55.2 | 31 min |
| CPE-PLM (Greedy) | 55.3 | 27 min |
| Distance parser trained on CPE-PLM | 55.0 | 36 sec |
| Benepar trained on CPE-PLM | 59.3 | 32 sec |
Knowledge distillation from CPE-PLM to a supervised parser (Benepar) reduces inference time from 27 minutes to just 32 seconds while improving F1 from 55.7 (the best direct CPE-PLM result) to 59.3, offering a practical deployment path.
Key Findings
Competitive with unsupervised parsers: The multi-PLM beam ensemble achieves 55.7 F1, surpassing Compound PCFG (55.2) and matching Neural L-PCFG (55.3) -- all without any training.
Massive few-shot advantage: With only 1 annotated example, CPE-PLM scores 46.2 F1 versus Benepar's 11.6 F1. Even with 17 examples (1% of data), CPE-PLM outperforms supervised parsing by 20+ F1 points, making it highly practical in low-resource settings.
Ensemble diversity is key: Single-PLM extraction tops out around 47-48 F1, but combining heads across heterogeneous models (BERT, RoBERTa, ELECTRA, multilingual variants) boosts performance by +8.5 F1, demonstrating that different PLMs encode complementary syntactic knowledge.
Multi-PLM synergy is consistent: The benefit of heterogeneous ensembles holds across all 9 evaluated languages, with the multi-MLM greedy ensemble improving average F1 from 40.4 to 47.5 (+7.1 points).
Useful for downstream tasks: CPE-PLM-induced parse trees used in URNNG training yield 60.7 F1 (+5 points above standalone), and TreeLSTM classifiers trained with CPE-PLM parses rank second only to supervised parsers across four benchmarks.
Knowledge distillation enables fast deployment: Distilling CPE-PLM's output into Benepar achieves 59.3 F1 in just 32 seconds -- a 50x speedup over direct CPE-PLM while improving accuracy by 3.6 points.
Encoder PLMs dominate: Encoder-based models (BERT, RoBERTa, ELECTRA) consistently outperform decoder-based (GPT-2: 37.2-40.8 F1) and hybrid architectures (BART: 38.5 F1) for parse extraction, suggesting that bidirectional attention is crucial for capturing syntactic structure.
Why It Matters
This paper makes several important contributions to the field of syntactic analysis and PLM interpretability:
Training-free parsing that works: CPE-PLM requires no training data or model fine-tuning, yet achieves competitive results with unsupervised parsers that require extensive training. This makes it an attractive option when computational resources or training data are limited.
Clear practical niche in few-shot settings: The dramatic advantage over supervised parsers in low-data regimes (46.2 vs. 11.6 F1 with 1 example) establishes CPE-PLM as the method of choice when only a handful of annotated parse trees are available.
Insights into PLM syntactic knowledge: The success of multi-PLM ensembles reveals that different pre-trained models encode complementary aspects of syntactic structure, advancing our understanding of what linguistic knowledge PLMs capture and how it can be effectively extracted.
Practical deployment pathway: The knowledge distillation results (distilling CPE-PLM to Benepar: 59.3 F1 in 32 seconds) show that CPE-PLM is not just a research curiosity but can be deployed efficiently in production systems, bridging the gap between training-free extraction and practical inference speed.
Unified theoretical framework: The mathematical reformulation brings clarity to a previously fragmented research area, providing a principled foundation for future work on extracting structured linguistic knowledge from pre-trained models.