
Hyper-CL: Conditioning Sentence Representations with Hypernetworks

ACL 2024
Young Hyun Yoo, Jii Cha, Changhyeon Kim, Taeuk Kim

One-Line Summary

A hypernetwork-based contrastive learning framework that dynamically projects sentence embeddings into condition-specific subspaces, bridging the performance-efficiency gap between bi-encoders and tri-encoders for conditional similarity tasks.

Figure 1. Illustration of the proposed approach, Hyper-CL. It demonstrates the effectiveness of dynamically conditioning pre-computed sentence representations with different perspectives.

Background & Motivation

Sentence embeddings are central to many NLP applications, and contrastive learning methods like SimCSE have driven major advances in their quality. However, standard sentence representations capture a single, fixed view of a sentence's meaning, whereas in practice the similarity between two sentences often depends on a specific perspective or condition. For example, sentences about a cyclist and a hiker should appear similar under the condition "mode of transportation" but different under "speed of travel." This task is formalized as Conditional Semantic Textual Similarity (C-STS), and the same need arises naturally in Knowledge Graph Completion (KGC), where entity similarity depends on the relation.
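To make the task format concrete, here is a hypothetical C-STS-style instance. The field names and scores are illustrative, not the dataset's literal schema; C-STS annotates each sentence pair with 1-5 similarity scores per condition.

```python
# A hypothetical C-STS-style instance: one sentence pair, scored
# differently under different conditions (scores are invented).
example = {
    "sentence1": "A cyclist rides along a mountain trail.",
    "sentence2": "A hiker walks along a mountain trail.",
    "annotations": [
        {"condition": "mode of transportation", "similarity": 4.0},
        {"condition": "speed of travel", "similarity": 1.5},
    ],
}
```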

Three main architectures exist for computing conditioned similarity, each with distinct trade-offs. The cross-encoder concatenates both sentences and the condition into a single input [s_1; s_2; c], achieving strong performance but requiring a forward pass for every unique triplet -- impractical for retrieval. The bi-encoder concatenates each sentence with the condition, [s_i; c], separately, requiring |S| × |C| encoder passes. The tri-encoder encodes sentences and conditions independently (|S| + |C| passes) and then combines them with a lightweight composition function, enabling efficient caching but sacrificing accuracy because it cannot model explicit sentence-condition interactions.
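As a quick illustration of these scaling differences, the following back-of-envelope sketch counts encoder forward passes for a hypothetical corpus; the sizes are made up for illustration.

```python
# Hypothetical corpus sizes, chosen only to illustrate the scaling.
num_sentences = 10_000   # |S|
num_conditions = 100     # |C|
num_query_triplets = 1_000  # (s1, s2, c) triplets actually scored

# Cross-encoder: one pass per scored triplet; nothing is cacheable.
cross_passes = num_query_triplets

# Bi-encoder: every (sentence, condition) pair needs its own pass,
# so cached embeddings are rarely reusable across conditions.
bi_passes = num_sentences * num_conditions      # 1,000,000

# Tri-encoder (and Hyper-CL): sentences and conditions are each
# encoded once, then combined with a cheap composition function.
tri_passes = num_sentences + num_conditions     # 10,100

print(cross_passes, bi_passes, tri_passes)
```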

Core Problem: Tri-encoders are highly efficient thanks to independent encoding and caching, but their simple composition functions (e.g., Hadamard product, concatenation) cannot capture the rich sentence-condition interactions that bi-encoders achieve. Hyper-CL addresses this by using a hypernetwork to generate expressive, condition-specific projection matrices -- maintaining tri-encoder efficiency while substantially closing the performance gap with bi-encoders.

Figure 3. Training procedure of Hyper-CL. Every embedding is the output of the same encoder f. The hypernetwork constructs projection matrices to condition sentence representations based on the embedding of the condition.

Proposed Method

Hyper-CL builds on the tri-encoder architecture by introducing a hypernetwork that generates condition-sensitive linear transformation matrices on-the-fly. This enables dynamic, expressive conditioning of sentence embeddings without sacrificing the caching benefits inherent to tri-encoders.

1. Independent Embedding Computation
A shared encoder f (e.g., SimCSE-RoBERTa) independently encodes sentences and conditions: h_s = f(s) for the sentence and h_c = f(c) for the condition. Because encoding is independent, all embeddings can be pre-computed and cached.

2. Hypernetwork-Generated Projection Matrix
A hypernetwork q takes the condition embedding h_c and generates a full linear transformation matrix W_c = q(h_c), where q maps from R^(N_h) to R^(N_h x N_h). To avoid the cubic parameter explosion (N_h^3), the method uses a low-rank decomposition: two smaller hypernetworks q_1 and q_2 each generate a low-rank factor (mapping R^(N_h) to R^(N_h x N_K)), and the final matrix is W_c = W_c1 * W_c2^T, where N_K << N_h (N_K = 64 for base models, N_K = 85 for large). A code sketch of this construction follows the list below.

3. Condition-Aware Subspace Projection
The sentence embedding is projected into the condition-specific subspace: h_sc = W_c * h_s. The same sentence embedding is thus projected differently under each condition, yielding fine-grained, perspective-dependent representations. Crucially, W_c depends only on h_c, so transformation matrices can be cached alongside condition embeddings.

4. Task-Specific Contrastive Training
For C-STS: a combined loss of MSE (matching predicted similarity to gold scores) and InfoNCE (a contrastive loss pulling high-similarity condition pairs together and pushing low-similarity pairs apart). For KGC: an adapted SimKGC objective with an additive margin, a learnable temperature, and self-, pre-batch, and in-batch negatives.
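To make steps 1-3 concrete, here is a minimal PyTorch sketch of the low-rank hypernetwork and the condition-aware projection. It is an illustration under stated assumptions: the encoder is replaced by random stand-in embeddings, module and variable names (LowRankHypernet, q1, q2) are mine rather than the authors' implementation, and N_h = 768, N_K = 64 follow the base-model setting above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

Nh, Nk = 768, 64  # encoder hidden size N_h; low rank N_K (base setting)

class LowRankHypernet(nn.Module):
    """Generates a condition-specific projection W_c = W_c1 @ W_c2^T
    from a condition embedding h_c (a sketch, not the authors' code)."""

    def __init__(self, nh: int = Nh, nk: int = Nk):
        super().__init__()
        # q_1 and q_2 each map h_c to one low-rank factor of shape (nh, nk).
        self.q1 = nn.Linear(nh, nh * nk)
        self.q2 = nn.Linear(nh, nh * nk)
        self.nh, self.nk = nh, nk

    def forward(self, h_c: torch.Tensor) -> torch.Tensor:
        w1 = self.q1(h_c).view(-1, self.nh, self.nk)   # W_c1
        w2 = self.q2(h_c).view(-1, self.nh, self.nk)   # W_c2
        return w1 @ w2.transpose(1, 2)                 # (batch, nh, nh)

hypernet = LowRankHypernet()

# Stand-ins for cached encoder outputs h_s = f(s), h_c = f(c).
h_s1, h_s2 = torch.randn(Nh), torch.randn(Nh)
h_c = torch.randn(Nh)

# W_c depends only on h_c, so it can be cached per condition (step 3).
W_c = hypernet(h_c.unsqueeze(0))[0]        # (Nh, Nh)

# Project both sentences into the condition-specific subspace.
h_s1c, h_s2c = W_c @ h_s1, W_c @ h_s2

# Conditioned similarity between the two sentences.
sim = F.cosine_similarity(h_s1c, h_s2c, dim=0)
```

In training (step 4), a score derived from sim would feed both the MSE term against the gold similarity and the InfoNCE term over condition pairs; the sketch stops at inference.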

Why low-rank decomposition matters: A full-rank hypernetwork for a 768-dimensional encoder would require ~453M parameters per layer. Low-rank factorization with N_K = 64 reduces this to ~98K while sacrificing only 0.39 Spearman points on C-STS. The rank is selected as the best trade-off between expressiveness and efficiency (N_K = 64 for base models, N_K = 85 for large models).
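A back-of-envelope check of the figures above; the accounting here is my assumption rather than something the text spells out (the ~453M figure matches the weights of a single linear layer mapping R^(N_h) to R^(N_h x N_h), and the ~98K figure matches the entries of the two cached low-rank factors).

```python
Nh, Nk = 768, 64

# A single linear layer emitting a full (Nh x Nh) matrix from an
# Nh-dimensional input carries Nh * Nh^2 = Nh^3 weights.
full_rank_weights = Nh ** 3        # 452,984,832  (~453M)

# The two generated factors W_c1, W_c2 are each (Nh x Nk), so a cached
# per-condition projection stores only 2 * Nh * Nk numbers.
low_rank_entries = 2 * Nh * Nk     # 98,304  (~98K)

print(f"{full_rank_weights:,} vs {low_rank_entries:,}")
```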

Expressiveness advantage: Compared to simpler Hadamard product conditioning (equivalent to a diagonal matrix), Hyper-CL's full transformation matrices exhibit 24.8x higher variance in Frobenius norm across conditions, indicating far more diverse and expressive condition-specific projections.
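To see why the Hadamard product is the weaker special case: elementwise multiplication by h_c is exactly multiplication by the diagonal matrix diag(h_c), which has only N_h free entries and cannot mix dimensions. A tiny NumPy check of this equivalence (illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
h_s = rng.normal(size=768)  # sentence embedding
h_c = rng.normal(size=768)  # condition embedding

# Hadamard-product conditioning...
hadamard = h_c * h_s
# ...is identical to projecting with the diagonal matrix diag(h_c):
diagonal = np.diag(h_c) @ h_s
assert np.allclose(hadamard, diagonal)

# A full (or low-rank) matrix, by contrast, can mix dimensions.
W_c = rng.normal(size=(768, 768))
projected = W_c @ h_s
```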

Experimental Results

Hyper-CL is evaluated on two tasks: Conditional Semantic Textual Similarity (C-STS) using Spearman/Pearson correlation, and Knowledge Graph Completion (KGC) on WN18RR and FB15K-237 using MRR and Hits@K.

C-STS Results

Method | Architecture | Spearman | Pearson
DiffCSE_base | Tri-encoder | 28.9 | 27.8
SimCSE_base | Tri-encoder | 31.5 | 31.0
SimCSE_large | Tri-encoder | 35.3 | 35.6
SimCSE_base + Hyper-CL | Tri-encoder | 38.75 | 38.38
SimCSE_large + Hyper-CL | Tri-encoder | 39.60 | 39.96
SimCSE_base | Bi-encoder | 44.8 | 44.9
SimCSE_large | Bi-encoder | 47.5 | 47.6

KGC Results (WN18RR)

Method | MRR | Hits@1 | Hits@3 | Hits@10
SimKGC (bi-encoder) | 0.666 | 0.587 | 0.717 | 0.800
SimKGC + Hadamard | 0.164 | 0.004 | 0.243 | 0.481
SimKGC + Concatenation | 0.335 | 0.226 | 0.382 | 0.550
SimKGC + Hyper-CL | 0.616 | 0.506 | 0.690 | 0.810

KGC Results (FB15K-237)

Method | MRR | Hits@1 | Hits@3 | Hits@10
SimKGC (bi-encoder) | 0.336 | 0.249 | 0.362 | 0.511
SimKGC + Hyper-CL | 0.318 | 0.231 | 0.344 | 0.496

Efficiency Comparison

Architecture | Dataset | Inference Time | Cache Hit Rate
SimCSE_base (bi-encoder) | C-STS | 791.71s | 1.46%
SimCSE_base + Hyper-CL | C-STS | 541.55s | 64.11%
SimCSE_large (bi-encoder) | C-STS | 1498.65s | 1.46%
SimCSE_large + Hyper-CL | C-STS | 960.84s | 64.11%
SimKGC_base (bi-encoder) | WN18RR | 994.41s | 46.65%
SimKGC_base + Hyper-CL | WN18RR | 448.95s | 85.32%

Generalization to Unseen Conditions

Method | Overall | Unseen Conditions | Seen Conditions
SimCSE_large (baseline) | 32.13 | 13.93 | 25.02
SimCSE_large + Hyper-CL | 38.59 | 36.25 | 41.14

Why It Matters

Hyper-CL demonstrates that sentence embeddings do not need to be static, one-size-fits-all vectors. By introducing hypernetwork-generated projections, the same pre-computed sentence embedding can be dynamically adapted to different perspectives at near-zero additional inference cost. The key insight is that condition-specific linear transformations, generated on-the-fly by a lightweight hypernetwork, provide far more expressive conditioning than simple element-wise operations -- while remaining fully compatible with the caching and pre-computation strategies that make tri-encoders practical.

This opens the door to more nuanced NLP applications -- from conditional retrieval to knowledge graph reasoning -- where the notion of "similarity" fundamentally depends on context. The strong generalization to unseen conditions (a +22-point improvement over the baseline) suggests that Hyper-CL learns transferable subspace projections rather than memorizing specific conditions, making it promising for real-world deployment where conditions are open-ended and unpredictable.
