
Cell-aware Stacked LSTMs for Modeling Sentences

ACML 2019
Jihun Choi, Taeuk Kim, Sang-goo Lee

One-Line Summary

A novel stacked LSTM architecture that passes both hidden states and cell states between layers via soft gating, enabling richer inter-layer information flow for improved sentence modeling across NLI, paraphrase detection, sentiment classification, and machine translation.

Paper Overview
Figure 2. Schematic diagram of a CAS-LSTM block, showing how cell states from the lower layer (red) are incorporated alongside hidden states for inter-layer communication.

Background & Motivation

Long Short-Term Memory (LSTM) networks are the workhorses of sequence modeling, and stacking multiple LSTM layers has become a standard technique for building more powerful models. In a conventional stacked LSTM, each layer takes the hidden state sequence of the layer below as its input and produces a new hidden state sequence. This vertical composition allows the network to learn increasingly abstract representations at higher layers.

However, there is an asymmetry in how information flows in these architectures. Within a single layer, two types of state are maintained: the hidden state h (used as the layer's output) and the cell state c (the internal memory responsible for capturing long-range dependencies). When stacking layers, only the hidden state is passed upward — the cell state is kept entirely private to each layer. This means the rich, unfiltered memory accumulated in a lower layer's cell state is invisible to all layers above it, creating an information bottleneck at every layer boundary.

Key Insight: In standard stacked LSTMs, upper layers can only see the output-gated hidden states from lower layers, not their raw cell states. Since the output gate selectively filters what information is exposed, potentially useful long-term memory signals stored in lower-layer cell states are lost during vertical propagation. The Cell-aware Stacked LSTM (CAS-LSTM) addresses this by explicitly incorporating lower-layer cell states into upper-layer computations through a learned soft gating mechanism.
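
To make the bottleneck concrete, recall the standard LSTM recurrence (textbook notation, not specific to this paper); the last line is the filtering step at issue:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

In a conventional stack, layer l+1 receives only h_t, so any component of c_t that the output gate o_t attenuates never leaves the layer.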

Prior approaches to improving multi-layer RNNs focused on residual connections or highway connections between layers, but these operate only on hidden states. The authors argue that cell states carry information that is complementary to, and qualitatively different from, hidden states, and that sharing this information across layers can yield richer representations at minimal additional cost.

Proposed Method: Cell-aware Stacked LSTM (CAS-LSTM)

The core idea is to modify the standard stacked LSTM so that each layer receives both the hidden state and the cell state from the layer below, fusing the two sources of information via a soft gating mechanism before computing its own gates. The method introduces two key components:

1. Cell-aware Inter-layer Connection
In a standard stacked LSTM, the input to layer l at time step t is simply the hidden state h_t^(l-1) from the layer below. In CAS-LSTM, the input is augmented to include the cell state c_t^(l-1) as well. Specifically, the gate computations in layer l (input gate, forget gate, output gate, and candidate cell) are modified to accept a concatenation of the previous layer's hidden state and cell state. A learned transformation fuses these two signals, allowing the model to modulate how much lower-layer memory flows into the current layer's computations.
2. Soft Gating Fusion Mechanism
Rather than simply concatenating or adding the lower-layer cell state to the input, CAS-LSTM uses a soft gate that learns to dynamically control how much cell-state information is delivered to the upper layer. The gate produces a value between 0 and 1 for each dimension, determining how much of the lower-layer cell state is blended with the hidden state. This prevents the model from being overwhelmed by raw memory signals and lets it selectively extract the most relevant information (a minimal code sketch of this fusion appears after this list).
3. Bidirectional Information Flow
With this modification, information in CAS-LSTM flows in two directions: horizontally, through the recurrent connections within each layer (as in standard LSTMs), and vertically, through both the hidden-state and cell-state channels across layers. This richer pathway lets upper layers access both the filtered output (hidden state) and the raw memory (cell state) of lower layers, yielding more expressive multi-layer representations.
4. Drop-in Replacement with Minimal Overhead
The modification is architecturally lightweight: it adds only one extra weight matrix per gate per layer (to process the incoming cell state), a modest parameter increase over the base model. Crucially, CAS-LSTM can serve as a drop-in replacement for standard stacked LSTMs in any existing pipeline, with no changes to the training procedure, loss function, or downstream architecture.
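
The following PyTorch sketch shows one time step of an upper CAS-LSTM layer. It is a hypothetical reconstruction from the description above, not the authors' reference code: the class and variable names, the single fused projection, and the exact placement of the extra fusion gate `a` are assumptions, and all layers are assumed to share one hidden size.

```python
import torch
import torch.nn as nn

class CASLSTMCell(nn.Module):
    """One time step of an upper CAS-LSTM layer (l > 1) -- a sketch.

    Assumption-laden reconstruction: the fusion gate `a` and the way
    c_below enters every gate pre-activation follow the prose above,
    not the paper's exact equations.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        # Five gate pre-activations: input i, forget f, output o,
        # fusion a, and candidate g. Each receives the lower layer's
        # hidden state, this layer's previous hidden state, and the
        # lower layer's cell state -- the latter projection being the
        # "one additional weight matrix per gate" noted above.
        self.from_h_below = nn.Linear(hidden_size, 5 * hidden_size)
        self.from_h_prev = nn.Linear(hidden_size, 5 * hidden_size, bias=False)
        self.from_c_below = nn.Linear(hidden_size, 5 * hidden_size, bias=False)

    def forward(self, h_below, c_below, h_prev, c_prev):
        pre = (self.from_h_below(h_below)
               + self.from_h_prev(h_prev)
               + self.from_c_below(c_below))
        i, f, o, a, g = pre.chunk(5, dim=-1)
        i, f, o, a = (torch.sigmoid(x) for x in (i, f, o, a))
        # Standard LSTM memory update plus a soft-gated copy of the
        # lower layer's raw memory; a is in (0, 1) per dimension.
        c = f * c_prev + i * torch.tanh(g) + a * c_below
        h = o * torch.tanh(c)
        return h, c

# Toy usage: batch of 8, hidden size 300, zero states everywhere.
cell = CASLSTMCell(300)
zeros = torch.zeros(8, 300)
h, c = cell(zeros, zeros, zeros, zeros)
print(h.shape, c.shape)  # torch.Size([8, 300]) torch.Size([8, 300])
```

Driving the fusion gate `a` to zero recovers the standard stacked LSTM update, which is consistent with the drop-in-replacement property described in point 4.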

Experimental Results

CAS-LSTM is evaluated on four NLP tasks against standard stacked LSTMs and other competitive baselines. Within each task, the compared models share hyperparameters and training procedure, differing only in the LSTM architecture.

Natural Language Inference (SNLI)

Model | Test Accuracy (%)
300D Stacked LSTM | 86.0
300D Gumbel TreeLSTM | 86.0
300D SPINN-PI | 86.6
300D CAS-LSTM (Ours) | 86.8
600D Stacked LSTM | 86.6
600D Residual Stacked LSTM | 86.4
600D CAS-LSTM (Ours) | 87.1

Paraphrase Detection (Quora Question Pairs)

Model | Test Accuracy (%)
Stacked LSTM | 86.5
BiMPM | 88.2
CAS-LSTM (Ours) | 87.2

Sentiment Classification (SST)

Model | Fine-grained (%) | Binary (%)
Stacked LSTM | 50.3 | 87.8
Tree-LSTM | 51.0 | 88.0
CAS-LSTM (Ours) | 51.5 | 88.8

Machine Translation (WMT English-German)

Model | BLEU
Standard LSTM Encoder | 24.6
CAS-LSTM Encoder (Ours) | 25.2

Why It Matters

This work identifies and addresses a fundamental but often overlooked asymmetry in multi-layer LSTM design: hidden states flow freely between layers, but cell states — which carry crucial long-term memory — are siloed within each layer. By introducing a simple soft gating mechanism to share cell states across layers, CAS-LSTM achieves consistent improvements across diverse NLP tasks with minimal additional parameters.

The significance extends beyond the specific architecture. The paper demonstrates that careful attention to information flow in deep recurrent networks can yield meaningful gains without resorting to fundamentally different architectures (such as Transformers or tree-structured models). The drop-in nature of CAS-LSTM makes it immediately practical: any system using stacked LSTMs can benefit from this modification with no changes to the training pipeline. The approach also provides insights into what information is lost at layer boundaries in deep RNNs, contributing to our understanding of how to design better recurrent architectures.
