Element-wise Bilinear Interaction for Sentence Matching
*SEM 2018
Jihun Choi, Taeuk Kim, Sang-goo Lee
One-Line Summary
An element-wise bilinear interaction function that captures fine-grained multiplicative relationships between sentence pair representations with only O(d) parameters, achieving competitive performance on natural language inference and semantic similarity tasks while being orders of magnitude more parameter-efficient than full bilinear models.
Figure 1. Element-wise bilinear interaction for sentence matching. Each dimension of the two sentence vectors interacts through a compact bilinear form, enabling multiplicative cross-sentence signals without the cubic parameter cost of a full bilinear tensor.
Background & Motivation
Sentence matching, determining the semantic relationship between two sentences, is a core task underlying natural language inference (NLI), paraphrase detection, and answer selection in question answering. Given two sentences, the goal is to predict a label such as entailment, contradiction, or neutral (in NLI), or a continuous similarity score (in semantic textual similarity, STS). The quality of this prediction depends critically on how the model compares the two sentence representations.
At the heart of the problem lies a fundamental question: what is the best way to compare two fixed-length sentence vectors? Ideally, the comparison function should (1) capture both symmetric and asymmetric relationships, (2) model interactions across dimensions of the two vectors, and (3) remain computationally efficient. Prior to this work, the dominant approaches either satisfied only a subset of these desiderata or required prohibitively many parameters.
Limitations of Existing Interaction Methods (circa 2018):
Independent encoding + simple comparison: Siamese-style models encode each sentence separately and compare using cosine similarity or element-wise difference. These miss complex cross-sentence interactions because the comparison function is too shallow.
Full bilinear interaction: Computing uᵀWv with a full weight matrix W ∈ ℝ^(d×d) captures rich interactions but requires O(d²) parameters, or O(d³) when multiple bilinear maps are stacked into a tensor, leading to overfitting and high computational cost.
Cross-attention networks: Models like ESIM and decomposable attention compute word-level alignments between sentences, achieving strong results but at significantly higher computational cost and architectural complexity.
Concatenation-based approaches: Simply concatenating the two sentence vectors and passing through an MLP treats the interaction implicitly, often failing to capture explicit multiplicative relationships between dimensions.
A key insight motivates this work: multiplicative interactions between sentence representations are crucial for capturing semantic relationships, but the standard bilinear formulation is wasteful because most off-diagonal entries in the weight matrix contribute little. An element-wise approach can preserve the essential multiplicative expressiveness while reducing parameters by orders of magnitude.
Comparison of Interaction Functions
To understand where the element-wise bilinear sits in the design space, it is helpful to compare the key interaction functions available for sentence matching:
| Interaction Type | Formulation | Parameters | Captures Asymmetry? | Cross-Dim Interaction? |
| --- | --- | --- | --- | --- |
| Cosine Similarity | u · v / (‖u‖ ‖v‖) | 0 | No | No (dot product only) |
| Element-wise Difference | \|u − v\| | 0 | No (symmetric) | No |
| Element-wise Product | u ⊙ v | 0 | No (symmetric) | No |
| Full Bilinear | uᵀWv | O(d²) | Yes | Yes (all pairs) |
| Element-wise Bilinear | w ⊙ u ⊙ v + b (+ linear terms) | O(d) | Yes (via linear terms) | No (diagonal only) |
The element-wise bilinear occupies a unique position in this design space: it is the most expressive O(d) interaction, and with its per-dimension linear terms it can also distinguish the direction of comparison, while avoiding the parameter explosion of full bilinear models. The empirical finding that off-diagonal interactions contribute minimally validates this design choice.
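To ground the table, the following PyTorch snippet evaluates each interaction on random vectors. The dimensions and tensors are toy values for illustration, not the paper's configuration:

```python
import torch

d = 300
u, v = torch.randn(d), torch.randn(d)

# Parameter-free comparisons (all symmetric in u and v):
cosine = torch.dot(u, v) / (u.norm() * v.norm())  # scalar similarity
diff = (u - v).abs()                              # d-dim difference features
prod = u * v                                      # d-dim product features

# Full bilinear: a d x d weight matrix, O(d^2) parameters.
W = torch.randn(d, d)
full_bilinear = u @ W @ v                         # scalar; 90,000 weights at d = 300

# Element-wise (diagonal) bilinear: O(d) parameters.
w, b = torch.randn(d), torch.randn(d)
ew_bilinear = w * u * v + b                       # d-dim; 600 weights at d = 300
```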
Proposed Method
The paper introduces an element-wise bilinear interaction mechanism that operates on the dimension-aligned components of two sentence vectors. Rather than learning a single large interaction matrix, the method decomposes the bilinear form into per-dimension interactions, each governed by a small set of learnable parameters. Formally, for each dimension i, the interaction output is fᵢ = wᵢ · uᵢ · vᵢ + bᵢ, where wᵢ and bᵢ are learnable scalars. This can be viewed as constraining the full bilinear weight matrix W to be diagonal (uᵀ diag(w) v = Σᵢ wᵢuᵢvᵢ), hence the O(d) parameter count. Note that the product term by itself is symmetric in u and v; the general per-dimension bilinear form adds linear terms, fᵢ = wᵢuᵢvᵢ + pᵢuᵢ + qᵢvᵢ + bᵢ, which makes the comparison direction-sensitive (when pᵢ ≠ qᵢ) while keeping the total parameter count at O(d).
1. Sentence Encoding
Each sentence in the pair is encoded into a fixed-length vector using a shared sentence encoder (e.g., BiLSTM). Let u and v denote the resulting d-dimensional representations for the premise and hypothesis, respectively.
2. Element-wise Bilinear Interaction
Instead of computing a full bilinear product uᵀWv (requiring a d×d weight matrix), the method computes the interaction for each dimension independently. For each dimension i, a small bilinear form models the interaction between uᵢ and vᵢ using a learnable weight and bias. The resulting interaction vector has the same dimensionality d, with each entry capturing how the corresponding dimensions of the two sentences relate multiplicatively. This reduces the parameter count from O(d²) to O(d). A code sketch of the full pipeline follows the step list.
3. Heuristic Feature Augmentation
Following standard practice, the element-wise bilinear output is concatenated with the element-wise difference |u − v| and element-wise product u ⊙ v to form a comprehensive matching vector. These hand-crafted heuristic features complement the learned bilinear interaction by providing additive (difference) and unweighted multiplicative (product) comparison signals.
4. Classification
The combined matching vector is passed through a multi-layer perceptron (MLP) with ReLU activations to produce the final classification logits. The entire pipeline — encoder, bilinear interaction, and classifier — is trained end-to-end with cross-entropy loss.
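To show how the pieces fit together, here is a minimal PyTorch sketch of the pipeline described in steps 1 through 4, assuming the sentence encoder's outputs u and v are already computed. The class and parameter names (ElementwiseBilinearMatcher, w, b, hidden) are illustrative choices, not the paper's code; the matching vector follows the concatenation described in step 3.

```python
import torch
import torch.nn as nn

class ElementwiseBilinearMatcher(nn.Module):
    def __init__(self, d: int, hidden: int, num_classes: int = 3):
        super().__init__()
        # Per-dimension bilinear parameters: O(d) instead of O(d^2).
        self.w = nn.Parameter(torch.ones(d))
        self.b = nn.Parameter(torch.zeros(d))
        # Matching vector = [bilinear; |u - v|; u * v] -> 3d features.
        self.classifier = nn.Sequential(
            nn.Linear(3 * d, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # u, v: (batch, d) sentence vectors from a shared encoder (e.g., a BiLSTM).
        bilinear = self.w * u * v + self.b        # element-wise bilinear interaction
        match = torch.cat([bilinear, (u - v).abs(), u * v], dim=-1)
        return self.classifier(match)             # logits for a cross-entropy loss

# Usage with dummy encoder outputs:
matcher = ElementwiseBilinearMatcher(d=300, hidden=300)
u, v = torch.randn(8, 300), torch.randn(8, 300)
logits = matcher(u, v)                            # shape: (8, 3)
```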
Parameter efficiency: The element-wise formulation requires only O(d) additional parameters for the bilinear interaction (a per-dimension weight and bias), compared to d² for a standard bilinear layer, making it feasible even for high-dimensional representations (e.g., d = 300 or 600); see the back-of-the-envelope count after this list.
Drop-in compatibility: The interaction layer can replace any comparison function in existing sentence-encoding architectures without requiring changes to the encoder or classifier.
Expressiveness beyond additive models: Unlike element-wise difference or the unweighted product, the bilinear form learns per-dimension weights and biases (and, with linear terms, direction-sensitive comparisons), providing a richer comparison signal than either heuristic alone.
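As a quick sanity check on the efficiency claim above, the snippet below computes interaction-layer parameter counts, counting one weight and one bias per dimension for the element-wise case:

```python
# Interaction-layer parameter counts only; encoder and classifier excluded.
for d in (300, 600):
    full = d * d           # full bilinear weight matrix
    elementwise = 2 * d    # per-dimension weight + bias
    print(f"d={d}: full bilinear {full:,} vs element-wise {elementwise:,}")
# d=300: full bilinear 90,000 vs element-wise 600
# d=600: full bilinear 360,000 vs element-wise 1,200
```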
Experimental Results
The method is evaluated on the Stanford Natural Language Inference (SNLI) benchmark, one of the most widely used sentence matching datasets with 570k human-annotated sentence pairs labeled as entailment, contradiction, or neutral. Results are compared against both sentence-encoding models (which encode sentences independently) and more complex cross-attention models. Additional ablation studies isolate the contribution of the bilinear interaction from the heuristic features.
SNLI Test Accuracy
| Model | Category | Parameters | Accuracy (%) |
| --- | --- | --- | --- |
| 300D BiLSTM + simple comparison | Sentence Encoding | Baseline | ~84.0 |
| 300D BiLSTM + full bilinear | Sentence Encoding | O(d²) | ~84.8 |
| 300D BiLSTM + element-wise bilinear | Sentence Encoding | O(d) | ~85.0 |
| ESIM (cross-attention) | Cross-Attention | Much larger | ~88.0 |
Matches or exceeds full bilinear models: Despite using orders of magnitude fewer parameters in the interaction layer, the element-wise bilinear achieves comparable or slightly higher accuracy than the full bilinear formulation, suggesting that off-diagonal interactions contribute minimally to performance.
Best among sentence-encoding models: Within the class of models that encode sentences independently (without cross-attention), the proposed method achieves the strongest results, demonstrating the value of learned multiplicative interactions.
Meaningful gap to cross-attention models: While cross-attention models like ESIM achieve higher absolute accuracy, they require word-level alignment computation and significantly more parameters. The element-wise bilinear occupies a favorable point on the efficiency-performance tradeoff.
Consistent across matching heuristics: The bilinear interaction provides complementary information to standard heuristic features (difference, product). Combining all three consistently outperforms any individual comparison function.
Robust to hyperparameter choices: The method shows stable performance across different encoder dimensions and training configurations, confirming that the gains are not sensitive to specific tuning.
Ablation Insight: When each comparison function is used in isolation, the element-wise bilinear outperforms both element-wise difference and element-wise product. More importantly, combining all three yields the best results, confirming that the bilinear interaction captures complementary information: specifically, a learned, per-dimension-weighted multiplicative signal that neither difference (symmetric, additive) nor the unweighted product (symmetric, multiplicative) can provide. This decomposition validates the theoretical motivation that diagonal bilinear interactions occupy a distinct and valuable point in the design space.
Why It Matters
This work makes a compelling case that sophisticated sentence-pair interactions do not require heavyweight architectures or massive parameter budgets. Its contributions are significant in several ways:
Principled parameter reduction: By decomposing the bilinear interaction into element-wise operations, the paper demonstrates that the essential multiplicative signal in sentence matching can be captured with O(d) instead of O(d²) parameters, a reduction by a factor of roughly d (more than two orders of magnitude at typical embedding dimensions such as d = 300).
Practical building block: The element-wise bilinear layer serves as a drop-in replacement for simple comparison functions in any sentence-encoding pipeline, offering immediate improvements without architectural changes.
Design principle for future work: The finding that diagonal bilinear interactions suffice for competitive performance influenced subsequent research on lightweight interaction mechanisms in NLP, including later work on efficient attention and matching networks.
Efficiency for deployment: In resource-constrained settings where cross-attention models are too expensive, the element-wise bilinear provides a strong alternative with minimal computational overhead, making it suitable for real-time applications and edge deployment.