
Element-wise Bilinear Interaction for Sentence Matching

*SEM 2018
Jihun Choi, Taeuk Kim, Sang-goo Lee

One-Line Summary

An element-wise bilinear interaction function that captures fine-grained multiplicative relationships between sentence pair representations with only O(d) parameters, achieving competitive performance on natural language inference and semantic similarity tasks while being orders of magnitude more parameter-efficient than full bilinear models.

Paper overview
Figure 1. Element-wise bilinear interaction for sentence matching. Each dimension of the two sentence vectors interacts through a compact bilinear form, enabling multiplicative cross-sentence signals without the cubic parameter cost of a full bilinear tensor.

Background & Motivation

Sentence matching — determining the semantic relationship between two sentences — is a core task underlying natural language inference (NLI), paraphrase detection, and answer selection in question answering. Given two sentences, the goal is to predict a label such as entailment, contradiction, or neutral (in NLI), or a continuous similarity score (in STS). The quality of this prediction depends critically on how the model compares the two sentence representations.

At the heart of the problem lies a fundamental question: what is the best way to compare two fixed-length sentence vectors? Ideally, the comparison function should (1) capture both symmetric and asymmetric relationships, (2) model interactions across dimensions of the two vectors, and (3) remain computationally efficient. Prior to this work, the dominant approaches either satisfied only a subset of these desiderata or required prohibitively many parameters.

Limitations of Existing Interaction Methods (circa 2018):

  • Independent encoding + simple comparison: Siamese-style models encode each sentence separately and compare using cosine similarity or element-wise difference. These miss complex cross-sentence interactions because the comparison function is too shallow.
  • Full bilinear interaction: Computing uᵀWv with a full weight matrix W ∈ ℝ^{d×d} captures rich interactions but requires O(d²) parameters, or O(d³) when multiple bilinear maps are used, leading to overfitting and high computational cost.
  • Cross-attention networks: Models like ESIM and decomposable attention compute word-level alignments between sentences, achieving strong results but at significantly higher computational cost and architectural complexity.
  • Concatenation-based approaches: Simply concatenating the two sentence vectors and passing through an MLP treats the interaction implicitly, often failing to capture explicit multiplicative relationships between dimensions.

A key insight motivates this work: multiplicative interactions between sentence representations are crucial for capturing semantic relationships, but the standard bilinear formulation is wasteful because most off-diagonal entries in the weight matrix contribute little. An element-wise approach can preserve the essential multiplicative expressiveness while reducing parameters by orders of magnitude.

Comparison of Interaction Functions

To understand where the element-wise bilinear sits in the design space, it is helpful to compare the key interaction functions available for sentence matching:

| Interaction Type | Formulation | Parameters | Captures Asymmetry? | Cross-Dim Interaction? |
| --- | --- | --- | --- | --- |
| Cosine similarity | u · v / (‖u‖ ‖v‖) | 0 | No | No (dot product only) |
| Element-wise difference | \|u − v\| | 0 | No (symmetric) | No |
| Element-wise product | u ⊙ v | 0 | No (symmetric) | No |
| Full bilinear | uᵀWv | O(d²) | Yes | Yes (all pairs) |
| Element-wise bilinear | w ⊙ u ⊙ v + b | O(d) | Yes | No (diagonal only) |

The element-wise bilinear occupies a unique position: it is the most expressive O(d) interaction that supports asymmetry (via learned per-dimension weights), while avoiding the parameter explosion of full bilinear models. The empirical finding that off-diagonal interactions contribute minimally validates this design choice.
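As a concrete illustration (not from the paper), the interaction functions in the table above can each be sketched in a few lines of NumPy; the shapes and parameter counts follow the table:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
u, v = rng.normal(size=d), rng.normal(size=d)  # two encoded sentence vectors

# Parameter-free comparisons.
cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))  # scalar
diff = np.abs(u - v)                                      # d-vector
prod = u * v                                              # d-vector

# Full bilinear: a dense d x d weight matrix, O(d^2) parameters.
W = rng.normal(size=(d, d))
full_bilinear = u @ W @ v                                 # scalar

# Element-wise bilinear: one weight and one bias per dimension, O(d) parameters.
w, b = rng.normal(size=d), rng.normal(size=d)
elem_bilinear = w * u * v + b                             # d-vector

print(W.size, w.size + b.size)  # parameter counts: 36 vs. 12
```

Even at this toy dimensionality the parameter gap (d² vs. 2d) is visible; at d = 300 it becomes 90,000 vs. 600.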

Proposed Method

The paper introduces an element-wise bilinear interaction mechanism that operates on the dimension-aligned components of two sentence vectors. Rather than learning a single large interaction matrix, the method decomposes the bilinear form into per-dimension interactions, each governed by a small set of learnable parameters. Formally, for each dimension i, the interaction output is f_i = w_i · u_i · v_i + b_i, where w_i and b_i are learnable scalars. This can be viewed as constraining the full bilinear weight matrix W to be diagonal, hence the O(d) parameter count.
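To make the diagonal-constraint view concrete, here is a minimal NumPy sketch (an illustration, not the paper's code) showing that the multiplicative part of the element-wise form is exactly a full bilinear form whose weight matrix is restricted to a diagonal:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
u, v = rng.normal(size=d), rng.normal(size=d)
w, b = rng.normal(size=d), rng.normal(size=d)  # per-dimension scalar parameters

# Element-wise bilinear interaction: f_i = w_i * u_i * v_i + b_i.
f = w * u * v + b                              # d-vector, O(d) parameters

# Summing the multiplicative part recovers u^T diag(w) v, i.e. a full
# bilinear form whose weight matrix has been constrained to a diagonal.
W_diag = np.diag(w)                            # d x d, but only d free entries
assert np.isclose(u @ W_diag @ v, np.sum(w * u * v))
```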

1. Sentence Encoding
Each sentence in the pair is encoded into a fixed-length vector using a shared sentence encoder (e.g., BiLSTM). Let u and v denote the resulting d-dimensional representations for the premise and hypothesis, respectively.
2. Element-wise Bilinear Interaction
Instead of computing a full bilinear product uᵀWv (requiring a d×d weight matrix), the method computes the interaction for each dimension independently. For each dimension i, a small bilinear form models the interaction between u_i and v_i using a learnable weight and bias. The resulting interaction vector has the same dimensionality d, with each entry capturing how the corresponding dimensions of the two sentences relate multiplicatively. This reduces the parameter count from O(d²) to O(d).
3. Heuristic Feature Augmentation
Following standard practice, the element-wise bilinear output is concatenated with the element-wise difference |u − v| and the element-wise product u ⊙ v to form a comprehensive matching vector. These hand-crafted heuristic features complement the learned bilinear interaction by providing symmetric and asymmetric comparison signals.
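Assembling the matching vector of step 3 is a single concatenation; a minimal sketch with illustrative dimensions (d = 300 matches the encoders in the experiments):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 300
u, v = rng.normal(size=d), rng.normal(size=d)  # encoded premise / hypothesis
w, b = rng.normal(size=d), rng.normal(size=d)  # element-wise bilinear parameters

# Concatenate the learned interaction with the two heuristic features.
matching = np.concatenate([
    w * u * v + b,    # element-wise bilinear (learned, multiplicative)
    np.abs(u - v),    # element-wise difference (symmetric)
    u * v,            # element-wise product (symmetric)
])
print(matching.shape)  # (900,)
```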
4. Classification
The combined matching vector is passed through a multi-layer perceptron (MLP) with ReLU activations to produce the final classification logits. The entire pipeline — encoder, bilinear interaction, and classifier — is trained end-to-end with cross-entropy loss.
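Step 4 amounts to a standard MLP head over the matching vector. The following NumPy sketch uses illustrative dimensions and randomly initialized weights (an untrained stand-in, not the paper's trained model):

```python
import numpy as np

rng = np.random.default_rng(3)
d, hidden, n_classes = 300, 200, 3              # three NLI labels
matching = rng.normal(size=3 * d)               # matching vector from step 3

# Two-layer MLP with ReLU; small random init keeps logits in a sane range.
W1, b1 = 0.01 * rng.normal(size=(hidden, 3 * d)), np.zeros(hidden)
W2, b2 = 0.01 * rng.normal(size=(n_classes, hidden)), np.zeros(n_classes)

h = np.maximum(0.0, W1 @ matching + b1)         # hidden layer with ReLU
logits = W2 @ h + b2
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax over the 3 labels
print(probs.shape)  # (3,)
```

At train time the softmax output would feed a cross-entropy loss, with gradients flowing back through the interaction into the encoder.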

Experimental Results

The method is evaluated on the Stanford Natural Language Inference (SNLI) benchmark, one of the most widely used sentence matching datasets with 570k human-annotated sentence pairs labeled as entailment, contradiction, or neutral. Results are compared against both sentence-encoding models (which encode sentences independently) and more complex cross-attention models. Additional ablation studies isolate the contribution of the bilinear interaction from the heuristic features.

SNLI Test Accuracy

| Model | Category | Parameters | Accuracy (%) |
| --- | --- | --- | --- |
| 300D BiLSTM + simple comparison | Sentence encoding | Baseline | ~84.0 |
| 300D BiLSTM + full bilinear | Sentence encoding | O(d²) | ~84.8 |
| 300D BiLSTM + element-wise bilinear | Sentence encoding | O(d) | ~85.0 |
| ESIM (cross-attention) | Cross-attention | Much larger | ~88.0 |

Ablation Insight: When each comparison function is used in isolation, the element-wise bilinear outperforms both element-wise difference and element-wise product. More importantly, combining all three yields the best results, confirming that the bilinear interaction captures complementary information — specifically, the learned asymmetric multiplicative signal that neither difference (symmetric, additive) nor product (symmetric, multiplicative) can provide. This decomposition validates the theoretical motivation that diagonal bilinear interactions occupy a distinct and valuable point in the design space.

Why It Matters

This work makes a compelling case that sophisticated sentence-pair interactions do not require heavyweight architectures or massive parameter budgets: a diagonal bilinear form recovers much of the benefit of full bilinear interaction at a fraction of the parameter cost, and it combines cleanly with standard heuristic matching features.

Links

Representation Learning