Revisiting the Practical Effectiveness of Constituency Parse Extraction from Pre-trained Language Models
COLING 2022
Taeuk Kim
One-Line Summary
A rigorous re-examination of constituency parse extraction from pre-trained language models (CPE-PLM) that introduces novel ensemble techniques combining multiple heterogeneous PLMs, achieving 55.7 F1 on PTB -- competitive with unsupervised parsers -- and demonstrating clear advantages over supervised parsers in few-shot settings.
Figure 1. Concept diagram of CPE-PLM (Constituency Parse Extraction from PLM) with various ensemble methods including single attention head, top-K ensemble, and layer-wise ensemble.
Background & Motivation
Constituency Parse Extraction from Pre-trained Language Models (CPE-PLM) is a recent paradigm that attempts to induce constituency parse trees relying only on the internal knowledge of pre-trained language models, without any task-specific fine-tuning. The key idea is that PLMs such as BERT and RoBERTa encode syntactic structure in their attention patterns and hidden representations, which can be decoded into parse trees using chart-based algorithms.
Key Challenges with Prior Work:
Inconsistent evaluation: Previous studies used different PLMs, layer selection strategies, distance metrics, and decoding algorithms, making fair comparison impossible.
Limited scope: Most evaluations focused on a single PLM or a narrow set of configurations, missing the bigger picture of what CPE-PLM can truly achieve.
Unexplored ensemble potential: While individual attention heads show limited parsing ability, the potential of combining information across multiple heads, layers, and even multiple PLMs had not been systematically explored.
Unknown practical value: It was unclear whether CPE-PLM -- a training-free approach -- could offer any practical advantage over established unsupervised or supervised parsers in real-world scenarios.
This paper addresses all of these gaps by providing a mathematical reformulation of CPE-PLM, proposing novel ensemble methods, and conducting comprehensive experiments across multiple languages, downstream tasks, and data regimes. The study systematically evaluates 16 PLMs (12 English and 4 multilingual), covering encoder-based (BERT, RoBERTa, ELECTRA), decoder-based (GPT-2, CTRL), and hybrid architectures (XLNet, BART), providing the most thorough assessment of CPE-PLM to date.
Proposed Method
The paper first reformulates CPE-PLM in a unified mathematical framework that clarifies the relationship between prior approaches and enables principled ensemble strategies.
Core Formulation
Given a sentence, each attention head provides a pair-wise score function for spans. The tree score decomposes as:
$s_{\text{tree}}(T) = \sum_{(i,j) \in T} s_{\text{span}}(i, j)$
where span scores are defined recursively: $s_{\text{span}}(i, j) = s_{\text{comp}}(i, j) + \min_{i \le k < j} s_{\text{split}}(i, k, j)$ for $i < j$, and $s_{\text{span}}(i, i) = 0$. The split score further decomposes as $s_{\text{split}}(i, k, j) = s_{\text{span}}(i, k) + s_{\text{span}}(k+1, j)$. Parse trees are then found with a CKY-style chart algorithm that minimizes the total tree score: $\hat{T} = \arg\min_{T} s_{\text{tree}}(T)$.
Two distance functions are used to measure divergence between attention distributions: Jensen-Shannon divergence (JSD) and Hellinger distance (HEL), with Hellinger distance favored for its simplicity and comparable effectiveness.
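To make the decoding concrete, the following is a minimal Python sketch (assuming NumPy; `hellinger`, `comp_score`, `decode_tree`, and `backtrack` are illustrative names, not code from the paper) of the Hellinger distance and the recursive span-score minimization, with `comp_score[i, j]` standing in for the attention-derived composition score formalized in item 1 below.

```python
import numpy as np
from functools import lru_cache

def hellinger(p, q):
    """Hellinger distance (HEL) between two attention distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def decode_tree(comp_score):
    """Chart decoding: recursively minimize s_span, then backtrack the best splits.

    comp_score[i, j] plays the role of s_comp(i, j) for the span of words
    i..j (inclusive), e.g. the attention-based pair score sketched in item 1 below.
    Returns the spans (i, j) of the minimum-score binary tree.
    """
    n = comp_score.shape[0]

    @lru_cache(maxsize=None)
    def s_span(i, j):
        if i == j:                                   # s_span(i, i) = 0
            return 0.0, None
        # s_split(i, k, j) = s_span(i, k) + s_span(k+1, j)
        best_split, best_k = min(
            (s_span(i, k)[0] + s_span(k + 1, j)[0], k) for k in range(i, j)
        )
        return comp_score[i, j] + best_split, best_k

    def backtrack(i, j):
        """Collect the bracketing of the minimum-score tree."""
        if i == j:
            return []
        _, k = s_span(i, j)
        return [(i, j)] + backtrack(i, k) + backtrack(k + 1, j)

    return backtrack(0, n - 1)
```

For a sentence of n words, `comp_score` would be an n x n matrix of span scores built from the pair score below; calling `decode_tree` on it yields the predicted bracketing.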
1. Mathematical Reformulation
The pair score function is defined as $s_p(i, j) := \binom{j-i+1}{2}^{-1} \sum_{i \le x < y \le j} f(g_{(m,n)}(w_x), g_{(m,n)}(w_y))$, where $g_{(m,n)}$ extracts a word's attention distribution from the $n$-th head at layer $m$, and $f$ measures the distance between the attention distributions of a word pair. This formulation unifies prior approaches under a single framework and enables principled ensemble strategies. Each attention head in a PLM with $l$ layers and $a$ heads per layer produces a candidate tree, yielding $l \times a$ candidate trees per model.
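For a single head, the pair score can be sketched as follows (a rough illustration, assuming the head's attention matrix is available as an array whose row x is the distribution $g_{(m,n)}(w_x)$; `pair_score` and `f` are illustrative names):

```python
from itertools import combinations

def pair_score(attn, i, j, f):
    """s_p(i, j): average of distance f over all word pairs inside span i..j.

    attn: (seq_len, seq_len) attention matrix of one head (layer m, head n),
          where row attn[x] plays the role of g_(m,n)(w_x);
    f:    a distance between attention distributions, e.g. the Hellinger
          distance sketched above.
    """
    if i == j:
        return 0.0                                   # single-word span
    pairs = list(combinations(range(i, j + 1), 2))   # C(j-i+1, 2) word pairs
    return sum(f(attn[x], attn[y]) for x, y in pairs) / len(pairs)
```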
2. Greedy Ensemble
Starting from the best-performing single attention head, this method sequentially adds further heads to the ensemble if they improve overall parsing performance on a validation set. The algorithm iterates over a sorted set of candidate heads $G_{\text{sorted}}$ and retains each head only if it improves the validation metric. There is no fixed limit on the number of participants, allowing the method to adaptively determine the optimal ensemble size.
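A compact sketch of this selection loop, under stated assumptions: `candidate_heads` is $G_{\text{sorted}}$ (head handles ordered by individual validation F1), `parse_with(heads, sentence)` decodes a tree from the averaged pair scores of the given heads (e.g. with `decode_tree` above), and `f1` computes bracketing F1 against gold trees. All three are hypothetical stand-ins, not the paper's actual interfaces.

```python
def greedy_ensemble(candidate_heads, val_sentences, val_gold, parse_with, f1):
    """Greedy ensemble: keep a candidate head only if it improves validation F1."""
    ensemble = [candidate_heads[0]]                  # start from the best single head
    best = f1([parse_with(ensemble, s) for s in val_sentences], val_gold)
    for head in candidate_heads[1:]:                 # iterate over G_sorted
        trial = ensemble + [head]
        score = f1([parse_with(trial, s) for s in val_sentences], val_gold)
        if score > best:                             # retained only if it helps
            ensemble, best = trial, score            # no fixed limit on ensemble size
    return ensemble, best
```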
3. Beam Ensemble
Inspired by beam search in neural text generation, this method maintains the b best hypotheses (head combinations) at each expansion step. At each iteration, it expands every current hypothesis by adding one more candidate head, evaluates all expansions, and keeps only the top-b combinations. Each head is selected at most once (no replacement). The beam size is set to b = 5 for single-PLM and b = 30 for multi-PLM setups, avoiding greedy local optima at a modest computational cost.
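A sketch of the beam variant under the same assumptions, with `evaluate(heads)` standing in for "parse the validation set with these heads and return bracketing F1"; the stopping rule here (a bounded number of expansion steps while remembering the best combination seen) is an assumption, since the description above does not specify one.

```python
def beam_ensemble(candidate_heads, evaluate, beam_size=5, max_steps=None):
    """Beam ensemble: keep the beam_size best head combinations at each step.

    candidate_heads: pool of head handles, e.g. (plm, layer, head) tuples;
    evaluate(heads) -> validation F1 of that combination;
    beam_size: 5 for single-PLM, 30 for multi-PLM setups (per the paper).
    """
    beams = [((), float("-inf"))]                    # (combination, score)
    best_combo, best_score = (), float("-inf")
    steps = max_steps if max_steps is not None else len(candidate_heads)
    for _ in range(steps):
        expansions = {}
        for combo, _ in beams:
            for head in candidate_heads:
                if head in combo:                    # each head used at most once
                    continue
                new = tuple(sorted(combo + (head,)))  # canonical order, no duplicates
                if new not in expansions:
                    expansions[new] = evaluate(list(new))
        if not expansions:                           # pool exhausted
            break
        beams = sorted(expansions.items(), key=lambda kv: kv[1], reverse=True)[:beam_size]
        if beams[0][1] > best_score:                 # remember the best combination seen
            best_combo, best_score = beams[0]
    return list(best_combo), best_score
```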
4. Multi-PLM Extension
This extension expands the candidate pool of attention heads beyond a single PLM to heads from $P$ heterogeneous models, forming a combined tree pool $\mathcal{T}_{\text{multi}} := \{\hat{T}^{(p,m,n)} \mid p \in \{1,\dots,P\},\ m \in \{1,\dots,l\},\ n \in \{1,\dots,a\}\}$. The ensemble selects the best combination across all available heads from all PLMs, exploiting complementary syntactic knowledge encoded in different architectures. The paper evaluates 16 PLMs total: 12 English (BERT-base/large, RoBERTa-base/large, ELECTRA-base/large, GPT-2, GPT-2-medium, CTRL, BART-large, XLNet-base/large) and 4 multilingual (MBERT, XLM, XLM-R, XLM-R-large).
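In code, the multi-PLM extension only changes how the candidate pool is assembled; a minimal sketch, assuming each PLM handle exposes hypothetical `name`, `num_layers`, and `num_heads` attributes:

```python
def build_multi_plm_pool(plms):
    """Candidate pool spanning P heterogeneous PLMs: one handle (p, m, n)
    per model p, layer m, and attention head n, as in tau_multi above."""
    return [(plm.name, m, n)
            for plm in plms
            for m in range(plm.num_layers)
            for n in range(plm.num_heads)]
```

The greedy or beam selection above then runs unchanged over this larger pool (with beam size 30 in the multi-PLM setup).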
Experimental Results
Single-PLM Performance on PTB (Selected Models)
| PLM | Best Single Head | Greedy Ensemble | Beam Ensemble |
|---|---|---|---|
| BERT-base | 42.7 | 43.0 | -- |
| BERT-large | 44.2 | 45.0 | -- |
| RoBERTa-large | 41.9 | 47.2 | -- |
| ELECTRA-large | 44.3 | 47.9 | -- |
| XLNet-large | 46.4 | 47.2 | -- |
| XLM-R | 46.7 | 48.5 | -- |

Prior best (Kim et al., 2021): 47.7
CPE-PLM vs. Unsupervised Parsers (PTB Test Set)
| Model | Type | F1 | NP | VP | PP | ADVP |
|---|---|---|---|---|---|---|
| PRPN | Unsupervised | 47.3 | 59 | 46 | 57 | 32 |
| ON-LSTM | Unsupervised | 48.1 | 64 | 41 | 54 | 31 |
| Neural PCFG | Unsupervised | 50.8 | 71 | 33 | 58 | 45 |
| Compound PCFG | Unsupervised | 55.2 | 74 | 41 | 68 | 52 |
| Neural L-PCFG | Unsupervised | 55.3 | 67 | 48 | 65 | 58 |
| XLM-R + Greedy | CPE-PLM (single) | 48.5 | 69 | 29 | 62 | 73 |
| All PLMs + Greedy | CPE-PLM (multi) | 55.3 | 75 | 36 | 76 | 76 |
| All PLMs + Beam | CPE-PLM (multi) | 55.7 | 74 | 42 | 75 | 72 |
Phrase-level analysis: CPE-PLM excels at recognizing NP (75%), PP (76%), and ADVP (76%) constituents, substantially outperforming unsupervised parsers on PP and ADVP. However, VP recall (42%) remains a relative weakness compared to Neural L-PCFG (48%), suggesting that verb phrase boundaries are harder to capture from attention patterns alone.
Multilingual Results (9 Languages, F1)
| Method | Avg | EN | EU | FR | DE | HE | HU | KO | PL | SV |
|---|---|---|---|---|---|---|---|---|---|---|
| Top-K (MBERT) | 39.8 | 44.6 | 39.3 | 35.9 | 35.9 | 37.8 | 33.2 | 47.5 | 51.1 | 32.6 |
| Greedy (MBERT) | 40.4 | 47.1 | 40.2 | 36.9 | 37.5 | 38.6 | 30.2 | 49.1 | 52.4 | 31.9 |
| Greedy (All MLMs) | 47.5 | 51.9 | 44.0 | 41.9 | 47.3 | 48.1 | 40.1 | 53.7 | 61.4 | 39.0 |
Combining four multilingual PLMs (MBERT, XLM, XLM-R, XLM-R-large) boosts the average F1 from 40.4 to 47.5 (+7.1 points). Polish achieves the highest score at 61.4 F1, followed by Korean (53.7) and English (51.9). The consistent gains across all nine typologically diverse languages confirm that multi-PLM synergy is a language-agnostic phenomenon.
Few-Shot Parsing: CPE-PLM vs. Supervised Parser (Benepar)
| Annotations | CPE-PLM Greedy | CPE-PLM Beam | Benepar (Supervised) |
|---|---|---|---|
| 1 | 46.2 | 45.4 | 11.6 |
| 2 | 48.4 | 45.9 | 12.5 |
| 5 | 49.9 | 47.7 | 12.5 |
| 10 | 49.1 | 49.6 | 14.0 |
| 17 (1% of data) | 49.4 | 51.3 | 31.1 |
| 100% validation | 55.3 | 55.7 | 92.2 |
Validation Set Dependency
| Validation Data | Greedy F1 | Relative Loss (Greedy) | Beam F1 | Relative Loss (Beam) |
|---|---|---|---|---|
| 1% | 49.4 | -5.9% | 51.3 | -4.5% |
| 2% | 49.9 | -5.3% | 49.8 | -6.0% |
| 5% | 52.7 | -2.5% | 51.8 | -4.0% |
| 10% | 54.3 | -0.9% | 52.9 | -2.9% |
| 100% | 55.3 | -- | 55.7 | -- |
Even with only 1% of the validation data (17 annotated trees), CPE-PLM retains roughly 90% of its full performance, confirming its extreme data efficiency.
Downstream Task: URNNG Training
| Configuration | Perplexity | F1 |
|---|---|---|
| Compound PCFG | 85.4 | 57.8 |
| CPE-PLM (All + Greedy) → URNNG | 81.3 | 57.2 |
| CPE-PLM (All + Beam) → URNNG | 82.0 | 60.7 |
Using CPE-PLM-induced trees to train URNNG boosts parsing F1 from 55.7 to 60.7 (+5 points), demonstrating that CPE-PLM trees serve as effective supervision for training dedicated parsers.
Downstream Task: TreeLSTM Classification Accuracy
| Parse Source | SST-2 | MR | SUBJ | TREC |
|---|---|---|---|---|
| Right-branching | 85.72 | 83.37 | 94.80 | 94.50 |
| CPE-PLM (All + Beam) | 86.10 | 83.62 | 94.85 | 94.75 |
| Supervised Parser | 86.70 | 83.62 | 95.12 | 95.05 |
CPE-PLM trees consistently outperform the right-branching baseline and nearly match supervised parser trees on all four text classification benchmarks, confirming that the induced syntactic structure carries meaningful information for downstream applications.
Inference Time Comparison
| Approach | F1 | Time |
|---|---|---|
| Compound PCFG | 55.2 | 31 min |
| CPE-PLM (Greedy) | 55.3 | 27 min |
| Distance parser trained on CPE-PLM | 55.0 | 36 sec |
| Benepar trained on CPE-PLM | 59.3 | 32 sec |
Knowledge distillation from CPE-PLM to a supervised parser (Benepar) reduces inference time from 27 minutes to just 32 seconds while improving F1 from 55.7 (the best direct CPE-PLM result) to 59.3, offering a practical deployment path.
Key Findings
Competitive with unsupervised parsers: The multi-PLM beam ensemble achieves 55.7 F1, surpassing Compound PCFG (55.2) and matching Neural L-PCFG (55.3) -- all without any training.
Massive few-shot advantage: With only 1 annotated example, CPE-PLM scores 46.2 F1 versus Benepar's 11.6 F1. Even with 17 examples (1% of data), CPE-PLM outperforms supervised parsing by 20+ F1 points, making it highly practical in low-resource settings.
Ensemble diversity is key: Single-PLM extraction tops out around 47-48 F1, but combining heads across heterogeneous models (BERT, RoBERTa, ELECTRA, multilingual variants) boosts performance by +8.5 F1, demonstrating that different PLMs encode complementary syntactic knowledge.
Multi-PLM synergy is consistent: The benefit of heterogeneous ensembles holds across all 9 evaluated languages, with the multi-MLM greedy ensemble improving average F1 from 40.4 to 47.5 (+7.1 points).
Useful for downstream tasks: CPE-PLM-induced parse trees used in URNNG training yield 60.7 F1 (+5 points above standalone), and TreeLSTM classifiers trained with CPE-PLM parses rank second only to supervised parsers across four benchmarks.
Knowledge distillation enables fast deployment: Distilling CPE-PLM's output into Benepar achieves 59.3 F1 in just 32 seconds -- a 50x speedup over direct CPE-PLM while improving accuracy by 3.6 points.
Encoder PLMs dominate: Encoder-based models (BERT, RoBERTa, ELECTRA) consistently outperform decoder-based (GPT-2: 37.2-40.8 F1) and hybrid architectures (BART: 38.5 F1) for parse extraction, suggesting that bidirectional attention is crucial for capturing syntactic structure.
Why It Matters
This paper makes several important contributions to the field of syntactic analysis and PLM interpretability:
Training-free parsing that works: CPE-PLM requires no training data or model fine-tuning, yet achieves competitive results with unsupervised parsers that require extensive training. This makes it an attractive option when computational resources or training data are limited.
Clear practical niche in few-shot settings: The dramatic advantage over supervised parsers in low-data regimes (46.2 vs. 11.6 F1 with 1 example) establishes CPE-PLM as the method of choice when only a handful of annotated parse trees are available.
Insights into PLM syntactic knowledge: The success of multi-PLM ensembles reveals that different pre-trained models encode complementary aspects of syntactic structure, advancing our understanding of what linguistic knowledge PLMs capture and how it can be effectively extracted.
Practical deployment pathway: The knowledge distillation results (distilling CPE-PLM to Benepar: 59.3 F1 in 32 seconds) show that CPE-PLM is not just a research curiosity but can be deployed efficiently in production systems, bridging the gap between training-free extraction and practical inference speed.
Unified theoretical framework: The mathematical reformulation brings clarity to a previously fragmented research area, providing a principled foundation for future work on extracting structured linguistic knowledge from pre-trained models.