A hypernetwork-based contrastive learning framework that dynamically projects sentence embeddings into condition-specific subspaces, bridging the performance-efficiency gap between bi-encoders and tri-encoders for conditional similarity tasks.
Sentence embeddings are central to many NLP applications, and contrastive learning methods like SimCSE have driven major advances in their quality. However, standard sentence representations capture a single, fixed view of a sentence's meaning, while in practice the similarity between two sentences often depends on a specific perspective or condition. For example, sentences about a cyclist and a hiker should appear similar under the condition "mode of transportation" but different under "speed of travel." This setting is formalized as Conditional Semantic Textual Similarity (C-STS), and it also arises naturally in Knowledge Graph Completion (KGC), where the similarity of two entities depends on the relation connecting them.
Three main architectures exist for computing conditioned similarity, each with distinct trade-offs. The cross-encoder concatenates both sentences and the condition into a single input [s1; s2; c], achieving strong performance but requiring a forward pass for every unique triplet -- impractical for retrieval. The bi-encoder concatenates each sentence with the condition [si; c] separately, so a corpus of |S| sentences under |C| conditions requires |S| × |C| encoder passes, since every sentence must be re-encoded for every condition. The tri-encoder encodes sentences and conditions independently (|S| + |C| passes) and combines them with a lightweight composition function, enabling efficient caching but sacrificing accuracy because it cannot model explicit sentence-condition interactions.
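To make these trade-offs concrete, here is a minimal sketch of the three scoring interfaces. The `encode` and `cross_score` callables are hypothetical stand-ins for a fine-tuned encoder and regression head, not the exact models used in the paper.

```python
import torch.nn.functional as F

def cross_encoder_score(cross_score, s1: str, s2: str, c: str):
    # One forward pass per unique (s1, s2, c) triplet: nothing can be cached.
    return cross_score(f"{s1} [SEP] {s2} [SEP] {c}")

def bi_encoder_score(encode, s1: str, s2: str, c: str):
    # Each sentence is fused with the condition before encoding, so a corpus
    # needs |S| x |C| encoder passes and caching is nearly useless.
    e1 = encode(f"{s1} [SEP] {c}")
    e2 = encode(f"{s2} [SEP] {c}")
    return F.cosine_similarity(e1, e2, dim=-1)

def tri_encoder_score(encode, s1: str, s2: str, c: str):
    # Sentences and conditions are encoded independently (|S| + |C| passes),
    # so all embeddings can be pre-computed; only the cheap composition step
    # runs per pair. Shown here: Hadamard-product composition.
    e1, e2, ec = encode(s1), encode(s2), encode(c)
    return F.cosine_similarity(e1 * ec, e2 * ec, dim=-1)
```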
Core Problem: Tri-encoders are highly efficient thanks to independent encoding and caching, but their simple composition functions (e.g., Hadamard product, concatenation) cannot capture the rich sentence-condition interactions that bi-encoders achieve. Hyper-CL addresses this by using a hypernetwork to generate expressive, condition-specific projection matrices -- maintaining tri-encoder efficiency while substantially closing the performance gap with bi-encoders.
Hyper-CL builds on the tri-encoder architecture by introducing a hypernetwork that generates condition-sensitive linear transformation matrices on-the-fly. This enables dynamic, expressive conditioning of sentence embeddings without sacrificing the caching benefits inherent to tri-encoders.
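A minimal PyTorch sketch of that idea, assuming 768-dim embeddings; the class and function names are ours, and the single-linear-layer hypernetwork head is a simplification rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionHypernetwork(nn.Module):
    """Maps a condition embedding to a condition-specific projection matrix W_c."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.dim = dim
        # Full-rank head: emits all dim * dim entries of W_c at once.
        self.head = nn.Linear(dim, dim * dim)

    def forward(self, cond_emb: torch.Tensor) -> torch.Tensor:
        # cond_emb: (batch, dim) -> W_c: (batch, dim, dim)
        return self.head(cond_emb).view(-1, self.dim, self.dim)

def conditional_similarity(hyper, s1_emb, s2_emb, cond_emb):
    # Project the cached sentence embeddings into the condition's subspace,
    # then compare them there.
    W_c = hyper(cond_emb)                                  # (batch, dim, dim)
    p1 = torch.bmm(W_c, s1_emb.unsqueeze(-1)).squeeze(-1)  # (batch, dim)
    p2 = torch.bmm(W_c, s2_emb.unsqueeze(-1)).squeeze(-1)
    return F.cosine_similarity(p1, p2, dim=-1)
```

Note that this full-rank head alone holds 768 × 768² ≈ 453M weights, which is exactly what motivates the low-rank factorization discussed next.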
Why low-rank decomposition matters: A full-rank hypernetwork head that maps a 768-dim condition embedding to a full 768 × 768 projection matrix requires 768³ ≈ 453M parameters per layer. Low-rank factorization with K=64 reduces this to ~98K parameters (the two generated factors hold 2 × 768 × 64 = 98,304 entries) while sacrificing only 0.39 Spearman points on C-STS. The rank K is chosen as the best trade-off between expressiveness and efficiency (K=64 for base models, K=85 for large ones).
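One plausible way to realize this, sketched under the assumption that the hypernetwork emits two low-rank factors with W_c = B_c A_c (the paper's exact head design may differ):

```python
import torch
import torch.nn as nn

class LowRankHypernetwork(nn.Module):
    """Generates W_c = B_c @ A_c, so rank(W_c) <= K."""

    def __init__(self, dim: int = 768, rank: int = 64):
        super().__init__()
        self.dim, self.rank = dim, rank
        # Each condition now yields two thin factors instead of a full matrix:
        # 2 * dim * rank = 98,304 generated entries for dim=768, K=64,
        # versus dim * dim = 589,824 for the full-rank head above.
        self.head_a = nn.Linear(dim, rank * dim)
        self.head_b = nn.Linear(dim, dim * rank)

    def forward(self, cond_emb: torch.Tensor) -> torch.Tensor:
        A = self.head_a(cond_emb).view(-1, self.rank, self.dim)  # (B, K, dim)
        B = self.head_b(cond_emb).view(-1, self.dim, self.rank)  # (B, dim, K)
        return torch.bmm(B, A)                                    # (B, dim, dim)
```

In practice one would apply A_c and then B_c as two thin matrix-vector products (cost 2dK) rather than materializing the full W_c (cost d²).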
Expressiveness advantage: Compared to simpler Hadamard-product conditioning (equivalent to applying a diagonal matrix), Hyper-CL's full transformation matrices exhibit 24.8× higher variance in Frobenius norm across conditions, indicating far more diverse and expressive condition-specific projections.
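The diagonal equivalence is easy to verify, and the comparison statistic is straightforward to compute; a small sketch (the variance function illustrates the measurement, it is not the paper's exact analysis script):

```python
import torch

# Hadamard conditioning is the diagonal special case of a full projection:
# elementwise multiplication by c is exactly applying diag(c).
d = 768
s, c = torch.randn(d), torch.randn(d)
assert torch.allclose(c * s, torch.diag(c) @ s)

def frobenius_norm_variance(mats: torch.Tensor) -> torch.Tensor:
    # mats: (num_conditions, d, d) stack of condition-specific matrices.
    norms = torch.linalg.matrix_norm(mats, ord='fro')  # (num_conditions,)
    return norms.var()
```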
Hyper-CL is evaluated on two tasks: Conditional Semantic Textual Similarity (C-STS) using Spearman/Pearson correlation, and Knowledge Graph Completion (KGC) on WN18RR and FB15K-237 using MRR and Hits@K.
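For reference, these metrics can be computed as follows (a minimal sketch; the helper names are ours):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlation_metrics(pred, gold):
    # C-STS: rank (Spearman) and linear (Pearson) correlation between
    # predicted and gold similarity scores.
    return spearmanr(pred, gold)[0], pearsonr(pred, gold)[0]

def ranking_metrics(ranks, ks=(1, 3, 10)):
    # KGC: `ranks` holds the 1-based rank of the correct entity per query.
    ranks = np.asarray(ranks, dtype=float)
    mrr = float(np.mean(1.0 / ranks))
    hits = {f"Hits@{k}": float(np.mean(ranks <= k)) for k in ks}
    return mrr, hits
```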
C-STS results (Spearman and Pearson correlation):

| Method | Architecture | Spearman | Pearson |
|---|---|---|---|
| DiffCSE_base | Tri-encoder | 28.9 | 27.8 |
| SimCSE_base | Tri-encoder | 31.5 | 31.0 |
| SimCSE_large | Tri-encoder | 35.3 | 35.6 |
| SimCSE_base + Hyper-CL | Tri-encoder | 38.75 | 38.38 |
| SimCSE_large + Hyper-CL | Tri-encoder | 39.60 | 39.96 |
| SimCSE_base | Bi-encoder | 44.8 | 44.9 |
| SimCSE_large | Bi-encoder | 47.5 | 47.6 |
KGC results on WN18RR:

| Method | MRR | Hits@1 | Hits@3 | Hits@10 |
|---|---|---|---|---|
| SimKGC (bi-encoder) | 0.666 | 0.587 | 0.717 | 0.800 |
| SimKGC + Hadamard | 0.164 | 0.004 | 0.243 | 0.481 |
| SimKGC + Concatenation | 0.335 | 0.226 | 0.382 | 0.550 |
| SimKGC + Hyper-CL | 0.616 | 0.506 | 0.690 | 0.810 |
KGC results on FB15K-237:

| Method | MRR | Hits@1 | Hits@3 | Hits@10 |
|---|---|---|---|---|
| SimKGC (bi-encoder) | 0.336 | 0.249 | 0.362 | 0.511 |
| SimKGC + Hyper-CL | 0.318 | 0.231 | 0.344 | 0.496 |
Inference efficiency (total inference time and embedding cache hit rate):

| Architecture | Dataset | Inference Time | Cache Hit Rate |
|---|---|---|---|
| SimCSE_base (bi-encoder) | C-STS | 791.71s | 1.46% |
| SimCSE_base + Hyper-CL | C-STS | 541.55s | 64.11% |
| SimCSE_large (bi-encoder) | C-STS | 1498.65s | 1.46% |
| SimCSE_large + Hyper-CL | C-STS | 960.84s | 64.11% |
| SimKGC_base (bi-encoder) | WN18RR | 994.41s | 46.65% |
| SimKGC_base + Hyper-CL | WN18RR | 448.95s | 85.32% |
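The cache-hit rates above come directly from this reuse pattern: every sentence and condition is encoded once, and the generated projection matrices are likewise cached per condition. A minimal sketch (cache layout and helper names are illustrative), reusing `encode` and the hypernetwork from the earlier sketches:

```python
import torch

sentence_cache: dict[str, torch.Tensor] = {}
matrix_cache: dict[str, torch.Tensor] = {}

def cached_embedding(encode, text: str) -> torch.Tensor:
    # One encoder pass per unique sentence or condition, ever.
    if text not in sentence_cache:
        sentence_cache[text] = encode(text)
    return sentence_cache[text]

def cached_projection(hyper, encode, condition: str) -> torch.Tensor:
    # W_c is generated once per condition and reused across all sentence pairs.
    if condition not in matrix_cache:
        matrix_cache[condition] = hyper(cached_embedding(encode, condition))
    return matrix_cache[condition]
```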
Generalization to unseen conditions (C-STS):

| Method | Overall | Unseen Conditions | Seen Conditions |
|---|---|---|---|
| SimCSE_large (baseline) | 32.13 | 13.93 | 25.02 |
| SimCSE_large + Hyper-CL | 38.59 | 36.25 | 41.14 |
Hyper-CL demonstrates that sentence embeddings do not need to be static, one-size-fits-all vectors. By introducing hypernetwork-generated projections, the same pre-computed sentence embedding can be dynamically adapted to different perspectives at near-zero additional inference cost. The key insight is that condition-specific linear transformations, generated on-the-fly by a lightweight hypernetwork, provide far more expressive conditioning than simple element-wise operations -- while remaining fully compatible with the caching and pre-computation strategies that make tri-encoders practical.
This opens the door to more nuanced NLP applications -- from conditional retrieval to knowledge graph reasoning -- where the notion of "similarity" fundamentally depends on context. The strong generalization to unseen conditions (a gain of more than 22 points, from 13.93 to 36.25) suggests that Hyper-CL learns transferable subspace projections rather than memorizing specific conditions, making it promising for real-world deployment where conditions are open-ended and unpredictable.