
Hyper-CL: Conditioning Sentence Representations with Hypernetworks

ACL 2024
Young Hyun Yoo, Jii Cha, Changhyeon Kim, Taeuk Kim

One-Line Summary

A hypernetwork-based contrastive learning framework that dynamically projects sentence embeddings into condition-specific subspaces, bridging the performance-efficiency gap between bi-encoders and tri-encoders for conditional similarity tasks.

Figure 1. Illustration of the proposed approach, Hyper-CL. It demonstrates the effectiveness of dynamically conditioning pre-computed sentence representations with different perspectives.

Background & Motivation

Sentence embeddings are central to many NLP applications, and contrastive learning methods like SimCSE have driven major advances in their quality. However, standard sentence representations capture a single, fixed view of a sentence's meaning, whereas in practice the similarity between two sentences often depends on a specific perspective or condition. For example, sentences about a cyclist and a hiker should appear similar under the condition "mode of transportation" but different under "speed of travel." This task is formalized as Conditional Semantic Textual Similarity (C-STS), and the same need arises naturally in Knowledge Graph Completion (KGC), where entity similarity depends on the relation.
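To make the task format concrete, here is a hypothetical C-STS-style instance. The field names and scores are illustrative, not the dataset's literal schema; C-STS annotates each sentence pair with 1-5 similarity scores per condition.

```python
# A hypothetical C-STS-style instance: one sentence pair, scored
# differently under different conditions (scores are invented).
example = {
    "sentence1": "A cyclist rides along a mountain trail.",
    "sentence2": "A hiker walks along a mountain trail.",
    "annotations": [
        {"condition": "mode of transportation", "similarity": 4.0},
        {"condition": "speed of travel", "similarity": 1.5},
    ],
}
```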

Three main architectures exist for computing conditioned similarity, each with distinct trade-offs. The cross-encoder concatenates both sentences and the condition into a single input [s_1; s_2; c], achieving strong performance but requiring a forward pass for every unique triplet -- impractical for retrieval. The bi-encoder concatenates each sentence with the condition, [s_i; c], separately, requiring |S| × |C| encoder passes. The tri-encoder encodes sentences and conditions independently (|S| + |C| passes) and then combines them with a lightweight composition function, enabling efficient caching but sacrificing accuracy because it cannot model explicit sentence-condition interactions.
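As a quick illustration of these scaling differences, the following back-of-envelope sketch counts encoder forward passes for a hypothetical corpus; the sizes are made up for illustration.

```python
# Hypothetical corpus sizes, chosen only to illustrate the scaling.
num_sentences = 10_000   # |S|
num_conditions = 100     # |C|
num_query_triplets = 1_000  # (s1, s2, c) triplets actually scored

# Cross-encoder: one pass per scored triplet; nothing is cacheable.
cross_passes = num_query_triplets

# Bi-encoder: every (sentence, condition) pair needs its own pass,
# so cached embeddings are rarely reusable across conditions.
bi_passes = num_sentences * num_conditions      # 1,000,000

# Tri-encoder (and Hyper-CL): sentences and conditions are each
# encoded once, then combined with a cheap composition function.
tri_passes = num_sentences + num_conditions     # 10,100

print(cross_passes, bi_passes, tri_passes)
```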

Core Problem: Tri-encoders are highly efficient thanks to independent encoding and caching, but their simple composition functions (e.g., Hadamard product, concatenation) cannot capture the rich sentence-condition interactions that bi-encoders achieve. Hyper-CL addresses this by using a hypernetwork to generate expressive, condition-specific projection matrices -- maintaining tri-encoder efficiency while substantially closing the performance gap with bi-encoders.

Figure 3. Training procedure of Hyper-CL. Every embedding is the output of the same encoder f. The hypernetwork constructs projection matrices to condition sentence representations based on the embedding of the condition.

Proposed Method

Hyper-CL builds on the tri-encoder architecture by introducing a hypernetwork that generates condition-sensitive linear transformation matrices on-the-fly. This enables dynamic, expressive conditioning of sentence embeddings without sacrificing the caching benefits inherent to tri-encoders.

1. Independent Embedding Computation
A shared encoder f (e.g., SimCSE-RoBERTa) independently encodes sentences and conditions: h_s = f(s) for the sentence and h_c = f(c) for the condition. Because encoding is independent, all embeddings can be pre-computed and cached.

2. Hypernetwork-Generated Projection Matrix
A hypernetwork q takes the condition embedding h_c and generates a full linear transformation matrix W_c = q(h_c), where q maps from R^(N_h) to R^(N_h x N_h). To avoid the cubic parameter explosion (N_h^3), the method uses a low-rank decomposition: two smaller hypernetworks q_1 and q_2 each generate a low-rank factor (mapping R^(N_h) to R^(N_h x N_K)), and the final matrix is W_c = W_c1 * W_c2^T, where N_K << N_h (N_K = 64 for base models, N_K = 85 for large). A code sketch of this construction follows the list below.

3. Condition-Aware Subspace Projection
The sentence embedding is projected into the condition-specific subspace: h_sc = W_c * h_s. The same sentence embedding is thus projected differently under each condition, yielding fine-grained, perspective-dependent representations. Crucially, W_c depends only on h_c, so transformation matrices can be cached alongside condition embeddings.

4. Task-Specific Contrastive Training
For C-STS: a combined loss of MSE (matching predicted similarity to gold scores) and InfoNCE (a contrastive loss pulling high-similarity condition pairs together and pushing low-similarity pairs apart). For KGC: an adapted SimKGC objective with an additive margin, a learnable temperature, and self-, pre-batch, and in-batch negatives.
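To make steps 1-3 concrete, here is a minimal PyTorch sketch of the low-rank hypernetwork and the condition-aware projection. It is an illustration under stated assumptions: the encoder is replaced by random stand-in embeddings, module and variable names (LowRankHypernet, q1, q2) are mine rather than the authors' implementation, and N_h = 768, N_K = 64 follow the base-model setting above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

Nh, Nk = 768, 64  # encoder hidden size N_h; low rank N_K (base setting)

class LowRankHypernet(nn.Module):
    """Generates a condition-specific projection W_c = W_c1 @ W_c2^T
    from a condition embedding h_c (a sketch, not the authors' code)."""

    def __init__(self, nh: int = Nh, nk: int = Nk):
        super().__init__()
        # q_1 and q_2 each map h_c to one low-rank factor of shape (nh, nk).
        self.q1 = nn.Linear(nh, nh * nk)
        self.q2 = nn.Linear(nh, nh * nk)
        self.nh, self.nk = nh, nk

    def forward(self, h_c: torch.Tensor) -> torch.Tensor:
        w1 = self.q1(h_c).view(-1, self.nh, self.nk)   # W_c1
        w2 = self.q2(h_c).view(-1, self.nh, self.nk)   # W_c2
        return w1 @ w2.transpose(1, 2)                 # (batch, nh, nh)

hypernet = LowRankHypernet()

# Stand-ins for cached encoder outputs h_s = f(s), h_c = f(c).
h_s1, h_s2 = torch.randn(Nh), torch.randn(Nh)
h_c = torch.randn(Nh)

# W_c depends only on h_c, so it can be cached per condition (step 3).
W_c = hypernet(h_c.unsqueeze(0))[0]        # (Nh, Nh)

# Project both sentences into the condition-specific subspace.
h_s1c, h_s2c = W_c @ h_s1, W_c @ h_s2

# Conditioned similarity between the two sentences.
sim = F.cosine_similarity(h_s1c, h_s2c, dim=0)
```

In training (step 4), a score derived from sim would feed both the MSE term against the gold similarity and the InfoNCE term over condition pairs; the sketch stops at inference.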

Why low-rank decomposition matters: A full-rank hypernetwork for a 768-dimensional encoder would require ~453M parameters per layer. Low-rank factorization with N_K = 64 reduces this to ~98K while sacrificing only 0.39 Spearman points on C-STS. The rank is selected as the best trade-off between expressiveness and efficiency (N_K = 64 for base models, N_K = 85 for large models).
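A back-of-envelope check of the figures above; the accounting here is my assumption rather than something the text spells out (the ~453M figure matches the weights of a single linear layer mapping R^(N_h) to R^(N_h x N_h), and the ~98K figure matches the entries of the two cached low-rank factors).

```python
Nh, Nk = 768, 64

# A single linear layer emitting a full (Nh x Nh) matrix from an
# Nh-dimensional input carries Nh * Nh^2 = Nh^3 weights.
full_rank_weights = Nh ** 3        # 452,984,832  (~453M)

# The two generated factors W_c1, W_c2 are each (Nh x Nk), so a cached
# per-condition projection stores only 2 * Nh * Nk numbers.
low_rank_entries = 2 * Nh * Nk     # 98,304  (~98K)

print(f"{full_rank_weights:,} vs {low_rank_entries:,}")
```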

Expressiveness advantage: Compared to simpler Hadamard product conditioning (equivalent to a diagonal matrix), Hyper-CL's full transformation matrices exhibit 24.8x higher variance in Frobenius norm across conditions, indicating far more diverse and expressive condition-specific projections.
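To see why the Hadamard product is the weaker special case: elementwise multiplication by h_c is exactly multiplication by the diagonal matrix diag(h_c), which has only N_h free entries and cannot mix dimensions. A tiny NumPy check of this equivalence (illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
h_s = rng.normal(size=768)  # sentence embedding
h_c = rng.normal(size=768)  # condition embedding

# Hadamard-product conditioning...
hadamard = h_c * h_s
# ...is identical to projecting with the diagonal matrix diag(h_c):
diagonal = np.diag(h_c) @ h_s
assert np.allclose(hadamard, diagonal)

# A full (or low-rank) matrix, by contrast, can mix dimensions.
W_c = rng.normal(size=(768, 768))
projected = W_c @ h_s
```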

Experimental Results

Hyper-CL is evaluated on two tasks: Conditional Semantic Textual Similarity (C-STS) using Spearman/Pearson correlation, and Knowledge Graph Completion (KGC) on WN18RR and FB15K-237 using MRR and Hits@K.

C-STS Results

Method | Architecture | Spearman | Pearson
DiffCSE_base | Tri-encoder | 28.9 | 27.8
SimCSE_base | Tri-encoder | 31.5 | 31.0
SimCSE_large | Tri-encoder | 35.3 | 35.6
SimCSE_base + Hyper-CL | Tri-encoder | 38.75 | 38.38
SimCSE_large + Hyper-CL | Tri-encoder | 39.60 | 39.96
SimCSE_base | Bi-encoder | 44.8 | 44.9
SimCSE_large | Bi-encoder | 47.5 | 47.6

KGC Results (WN18RR)

Method | MRR | Hits@1 | Hits@3 | Hits@10
SimKGC (bi-encoder) | 0.666 | 0.587 | 0.717 | 0.800
SimKGC + Hadamard | 0.164 | 0.004 | 0.243 | 0.481
SimKGC + Concatenation | 0.335 | 0.226 | 0.382 | 0.550
SimKGC + Hyper-CL | 0.616 | 0.506 | 0.690 | 0.810

KGC Results (FB15K-237)

Method | MRR | Hits@1 | Hits@3 | Hits@10
SimKGC (bi-encoder) | 0.336 | 0.249 | 0.362 | 0.511
SimKGC + Hyper-CL | 0.318 | 0.231 | 0.344 | 0.496

Efficiency Comparison

Architecture | Dataset | Inference Time | Cache Hit Rate
SimCSE_base (bi-encoder) | C-STS | 791.71s | 1.46%
SimCSE_base + Hyper-CL | C-STS | 541.55s | 64.11%
SimCSE_large (bi-encoder) | C-STS | 1498.65s | 1.46%
SimCSE_large + Hyper-CL | C-STS | 960.84s | 64.11%
SimKGC_base (bi-encoder) | WN18RR | 994.41s | 46.65%
SimKGC_base + Hyper-CL | WN18RR | 448.95s | 85.32%

Generalization to Unseen Conditions

Method | Overall | Unseen Conditions | Seen Conditions
SimCSE_large (baseline) | 32.13 | 13.93 | 25.02
SimCSE_large + Hyper-CL | 38.59 | 36.25 | 41.14

Why It Matters

Hyper-CL demonstrates that sentence embeddings do not need to be static, one-size-fits-all vectors. By introducing hypernetwork-generated projections, the same pre-computed sentence embedding can be dynamically adapted to different perspectives at near-zero additional inference cost. The key insight is that condition-specific linear transformations, generated on-the-fly by a lightweight hypernetwork, provide far more expressive conditioning than simple element-wise operations -- while remaining fully compatible with the caching and pre-computation strategies that make tri-encoders practical.

This opens the door to more nuanced NLP applications -- from conditional retrieval to knowledge graph reasoning -- where the notion of "similarity" fundamentally depends on context. The strong generalization to unseen conditions (a +22-point improvement over the baseline) suggests that Hyper-CL learns transferable subspace projections rather than memorizing specific conditions, making it promising for real-world deployment where conditions are open-ended and unpredictable.
