
Summary Level Training of Sentence Rewriting for Abstractive Summarization

The Second Workshop on New Frontiers in Summarization (NewSum 2019) at EMNLP-IJCNLP 2019
Sanghwan Bae, Taeuk Kim, Jihoon Kim, Sang-goo Lee

One-Line Summary

An improved Sentence Rewriting framework for abstractive summarization that introduces summary-level ROUGE optimization through reinforcement learning and a BERT-based extractor, achieving state-of-the-art results on the CNN/Daily Mail and New York Times datasets.

Figure 1. Overview of the extractor network architecture with BERT and interval segment embeddings.

Background & Motivation

Abstractive summarization aims to produce concise, natural-language summaries of documents. The Sentence Rewriting paradigm (Chen & Bansal, 2018) is a two-stage approach that bridges extractive and abstractive methods: first, an extractor selects the most salient sentences from the source document; then, an abstractor rewrites each extracted sentence into a more concise form. The final summary is the concatenation of these rewritten sentences.

While this decomposition is elegant, existing Sentence Rewriting models suffer from two critical limitations:

Training-Evaluation Mismatch: The extractor is trained with sentence-level ROUGE rewards -- each extracted sentence is independently matched to a reference sentence and rewarded on its individual ROUGE score. The final model, however, is evaluated with summary-level ROUGE, which compares the full generated summary against the full reference. Greedily selecting the sentences with the highest individual scores can therefore produce redundant summaries with overlapping information and suboptimal summary-level performance (see the toy sketch after this list).

Limited Contextual Understanding: Prior extractors (e.g., based on temporal convolutional networks) have limited capacity to capture long-range dependencies and rich semantic relationships across sentences, hindering their ability to identify truly salient content in the context of the entire document.
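
To make the mismatch concrete, here is a toy illustration (not the paper's implementation): a simple unigram-F1 function stands in for ROUGE, and the example sentences and scores are invented for the demonstration. The two sentences with the best individual scores overlap heavily, so selecting them greedily hurts the summary-level score.

```python
def unigram_f1(candidate, reference):
    """Simple unigram-overlap F1, standing in for ROUGE-1 F1 in this toy example."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the city opened a new bridge to reduce traffic congestion"
doc = [
    "the city opened a new bridge on monday morning",   # salient
    "a new bridge opened in the city",                  # redundant with the first
    "it should reduce traffic congestion downtown",     # adds new information
]

# Sentence-level view: score each sentence on its own.
for i, sent in enumerate(doc):
    print(f"sentence {i}: {unigram_f1(sent, reference):.3f}")

# Summary-level view: the greedy top-2 by individual score (sentences 0 and 1)
# is redundant and scores worse than combining sentences 1 and 2.
print("greedy top-2:", f"{unigram_f1(doc[0] + ' ' + doc[1], reference):.3f}")
print("better pair :", f"{unigram_f1(doc[1] + ' ' + doc[2], reference):.3f}")
```
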

Proposed Method

The proposed model retains the two-module architecture -- an extractor and an abstractor -- but introduces substantial improvements to both components and the training procedure:

1. BERT-based Extractor Encoder
BERT is adapted as the document encoder. A [CLS] token is inserted before each sentence, and its final hidden state serves as that sentence's representation. Because BERT is pretrained on inputs with at most two segments, interval segment embeddings are introduced for multi-sentence documents: sentences are alternately assigned segment A and segment B embeddings, allowing the model to mark sentence boundaries while encoding the full document. The resulting [CLS] representations capture rich, contextualized sentence semantics.
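
A minimal sketch of this encoding scheme, assuming the Hugging Face `transformers` library and `bert-base-uncased`; the paper's exact preprocessing (e.g. its use of [SEP]) may differ.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentences = [
    "The city opened a new bridge on Monday.",
    "Officials expect it to reduce traffic congestion.",
    "Construction took three years.",
]

input_ids, segment_ids, cls_positions = [], [], []
for i, sent in enumerate(sentences):
    tokens = ["[CLS]"] + tokenizer.tokenize(sent) + ["[SEP]"]
    cls_positions.append(len(input_ids))        # where this sentence's [CLS] lands
    input_ids += tokenizer.convert_tokens_to_ids(tokens)
    segment_ids += [i % 2] * len(tokens)        # interval segments: A, B, A, B, ...

with torch.no_grad():
    out = bert(
        input_ids=torch.tensor([input_ids]),
        token_type_ids=torch.tensor([segment_ids]),
    )

# One contextualized vector per sentence, read off at its [CLS] position.
sentence_reprs = out.last_hidden_state[0, cls_positions]   # (num_sentences, 768)
print(sentence_reprs.shape)
```
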
2. LSTM Pointer Network Decoder
An LSTM-based pointer network serves as the extraction decoder. At each time step, it attends over all sentence representations from the BERT encoder and selects one sentence. The LSTM hidden state carries information about previously extracted sentences, enabling the decoder to avoid redundant selections and make contextually informed extraction decisions sequentially.
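
A sketch of such a decoder, with hypothetical module and parameter names rather than the authors' exact attention parameterization; `sent_reprs` would be the per-sentence [CLS] vectors from the encoder sketch above, and greedy argmax selection stands in for the sampling used during RL training.

```python
import torch
import torch.nn as nn

class PointerDecoder(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.cell = nn.LSTMCell(hidden_size, hidden_size)
        self.attn = nn.Bilinear(hidden_size, hidden_size, 1)  # pointer scoring
        self.start = nn.Parameter(torch.zeros(hidden_size))   # learned start input

    def forward(self, sent_reprs, num_steps):
        # sent_reprs: (num_sentences, hidden_size) from the BERT encoder
        h = torch.zeros(1, sent_reprs.size(1))
        c = torch.zeros_like(h)
        inp = self.start.unsqueeze(0)
        selected = []
        mask = torch.zeros(sent_reprs.size(0), dtype=torch.bool)
        for _ in range(num_steps):
            h, c = self.cell(inp, (h, c))
            # Score every sentence against the current decoder state.
            scores = self.attn(h.expand_as(sent_reprs), sent_reprs).squeeze(-1)
            scores = scores.masked_fill(mask, float("-inf"))   # forbid re-selection
            idx = int(scores.argmax())                         # greedy; sample for RL
            selected.append(idx)
            mask[idx] = True
            inp = sent_reprs[idx].unsqueeze(0)  # feed the chosen sentence back in
        return selected
```
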
3. Summary-Level RL Training (A2C)
The extractor is trained using the Advantage Actor-Critic (A2C) algorithm to directly maximize summary-level ROUGE-L F1 scores. The key insight is that the reward is computed on the full summary (all extracted-then-rewritten sentences concatenated), aligning the training objective with the evaluation metric. To address the sparse reward problem (reward is only available after all sentences are extracted), reward shaping provides dense intermediate signals: at each extraction step t, the agent receives an incremental reward equal to the difference in summary-level ROUGE when adding the t-th sentence.
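
A sketch of the shaped reward and the A2C losses; `rouge_l_f1` is a placeholder for a real ROUGE-L F1 implementation, and discounting, batching, and the critic architecture are omitted.

```python
import torch

def shaped_rewards(rewritten_sents, reference_summary, rouge_l_f1):
    """Per-step reward: the gain in summary-level ROUGE-L F1 from adding sentence t.
    The rewards telescope, so their sum equals the final summary-level score."""
    rewards, prev, partial = [], 0.0, []
    for sent in rewritten_sents:
        partial.append(sent)
        score = rouge_l_f1(" ".join(partial), reference_summary)
        rewards.append(score - prev)   # dense, incremental signal
        prev = score
    return rewards

def a2c_losses(log_probs, values, rewards):
    """Advantage actor-critic losses for one extraction trajectory.
    log_probs, values: tensors of shape (T,); rewards: list of length T."""
    returns = torch.tensor(rewards).flip(0).cumsum(0).flip(0)   # reward-to-go
    advantages = returns - values.detach()                      # critic as baseline
    actor_loss = -(log_probs * advantages).sum()
    critic_loss = (values - returns).pow(2).sum()
    return actor_loss, critic_loss
```
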
4. Abstractor with Copy Mechanism
The abstractor is a standard sequence-to-sequence model with attention and a copy mechanism (pointer-generator). It is trained independently on (extracted sentence, reference sentence) pairs using maximum likelihood. At inference time, it rewrites each extracted sentence into a more concise abstractive form.
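
The copy mechanism's output distribution can be sketched as a See et al.-style pointer-generator mixture; the function name and tensor shapes below are assumptions for illustration, not the authors' code.

```python
import torch

def copy_distribution(vocab_dist, attn_dist, p_gen, src_ids):
    """Pointer-generator mixture: p(w) = p_gen * P_vocab(w)
    + (1 - p_gen) * sum of attention over source positions where w occurs.

    vocab_dist: (batch, vocab_size)  softmax over the output vocabulary
    attn_dist:  (batch, src_len)     attention over the source sentence
    p_gen:      (batch, 1)           generation probability in [0, 1]
    src_ids:    (batch, src_len)     vocabulary ids of the source tokens
    """
    generated = p_gen * vocab_dist
    copied = (1.0 - p_gen) * attn_dist
    # Scatter copy probabilities onto the vocabulary slots of the source tokens.
    return generated.scatter_add(1, src_ids, copied)
```
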
5. Redundancy Control: Trigram Blocking & Reranking
Two mechanisms reduce redundancy: (1) Trigram blocking at the extractor level prevents selecting sentences that share trigrams with already-selected sentences; (2) Reranking at the abstractor level generates multiple candidate rewrites via beam search, then selects the candidate that maximizes ROUGE with respect to the other summary sentences while minimizing repetition.
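
A minimal sketch of the trigram-blocking step (the abstractor-level reranking of beam-search candidates is omitted for brevity):

```python
def trigrams(text):
    """Set of word trigrams in a sentence."""
    toks = text.lower().split()
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def block_redundant(ranked_sentences, max_sents=3):
    """Keep sentences in the extractor's ranked order, skipping any sentence
    that shares a trigram with an already-selected one."""
    selected, seen = [], set()
    for sent in ranked_sentences:
        tri = trigrams(sent)
        if tri & seen:          # shared trigram -> treated as redundant
            continue
        selected.append(sent)
        seen |= tri
        if len(selected) == max_sents:
            break
    return selected
```
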

Experimental Results

The model is evaluated on three benchmark datasets: CNN/Daily Mail (non-anonymized version), New York Times (NYT50), and DUC-2002. Results demonstrate consistent improvements from both the BERT-based extractor and summary-level RL training.

CNN/Daily Mail

Model (CNN/Daily Mail)                     ROUGE-1   ROUGE-2   ROUGE-L
Sentence Rewrite (Chen & Bansal, 2018)       40.88     17.80     38.54
Bottom-Up (Gehrmann et al., 2018)            41.22     18.68     38.34
BERTSUM (Liu, 2019) -- extractive            43.25     20.24     39.63
BERT-ext + abs (ours)                        40.14     17.87     37.83
BERT-ext + abs + RL (ours)                   41.58     18.87     39.34
BERT-ext + abs + RL + rerank (ours)          41.90     19.08     39.64

NYT50 & DUC-2002

Model                                  Dataset     ROUGE-1   ROUGE-2   ROUGE-L
BERT-ext + abs + RL + rerank (ours)    NYT50         46.63     26.76     43.38
BERT-ext + abs + RL + rerank (ours)    DUC-2002      43.39     19.38     40.14

Ablation & Analysis

Why It Matters

This work addresses a fundamental problem in extractive-abstractive summarization: the disconnect between how models are trained (sentence-level optimization) and how they are evaluated (summary-level metrics). The key contributions are threefold: a BERT-based extractor that provides contextualized sentence representations, a summary-level reinforcement learning objective (with reward shaping) that aligns the training signal with the evaluation metric, and redundancy-control mechanisms that together achieve state-of-the-art results on CNN/Daily Mail and NYT50.
