
Enhancing Sentence Representations with Common Error Analysis and Ensemble Techniques in Contrastive Learning

Korea Software Congress 2023 (KSC 2023)
Jii Cha, Taeuk Kim

One-Line Summary

A systematic analysis of common failure patterns across contrastive learning-based sentence embedding models (SimCSE, DiffCSE, etc.) coupled with error-informed ensemble techniques that exploit complementary error profiles to achieve improved sentence representations on STS benchmarks.

Background & Motivation

Contrastive learning has become the dominant paradigm for learning sentence embeddings, with methods like SimCSE and DiffCSE achieving strong performance on semantic textual similarity (STS) tasks. However, even state-of-the-art models exhibit systematic errors on specific types of sentence pairs, limiting their reliability in downstream applications.

Key Observations Motivating This Work:

  • Persistent error patterns: Individual contrastive learning models consistently fail on certain sentence pair types -- such as those involving negation, numerical reasoning, or fine-grained semantic distinctions -- despite strong overall STS scores.
  • Complementary weaknesses: Different model variants (SimCSE, DiffCSE, and others) tend to fail on different subsets of examples, suggesting their errors are complementary rather than identical.
  • Untapped diagnostic potential: Prior work focused on improving individual model architectures or training objectives, but lacked a systematic framework for understanding where and why these models fail.
  • Ensemble opportunity: If models fail on different examples, combining them strategically should compensate for individual weaknesses -- but naive ensemble methods do not exploit error structure.

This work addresses the gap by first building a comprehensive error taxonomy for contrastive sentence embeddings, then using those insights to design informed ensemble strategies that outperform both individual models and naive combination approaches.

Contrastive Learning for Sentence Embeddings

Contrastive learning trains models to produce similar representations for semantically equivalent sentences and dissimilar representations for unrelated ones. The key distinction among methods lies in how they construct positive and negative pairs:

| Model | Positive Pair Strategy | Training Signal |
| --- | --- | --- |
| SimCSE (unsup.) | Dropout-augmented copies of the same sentence | In-batch negatives |
| SimCSE (sup.) | NLI entailment pairs | NLI contradictions as hard negatives |
| DiffCSE | Difference-aware augmentation via conditional MLM | Equivariant contrastive objective |

Despite their differing strategies, all these models achieve similarly strong aggregate STS scores -- yet they fail on different sentence pair types, creating the complementary error landscape that this work exploits.
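
To ground the shared training principle, below is a minimal PyTorch sketch of the unsupervised SimCSE-style objective (InfoNCE with dropout-based positives); the temperature value and function names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def simcse_unsup_loss(z1: torch.Tensor, z2: torch.Tensor,
                      temperature: float = 0.05) -> torch.Tensor:
    """z1, z2: (batch, dim) embeddings of the same sentences under two
    independent dropout masks (i.e., the batch encoded twice in train mode)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    # Pairwise cosine similarities, scaled by the temperature.
    sim = z1 @ z2.T / temperature                     # (batch, batch)
    # Diagonal entries are the positive pairs; all off-diagonal entries
    # act as in-batch negatives.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```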

Proposed Method

The approach consists of two complementary stages: a diagnostic analysis phase that categorizes model errors, followed by an ensemble phase that leverages the diagnostic findings to combine models effectively.

1. Model Evaluation & Error Collection
Multiple contrastive learning models (e.g., SimCSE, DiffCSE, and other variants) are evaluated on standard STS benchmarks (STS-B, SICK-R, etc.). For each model, sentence pairs where the predicted similarity score deviates significantly from the gold score are collected as error instances. A deviation threshold is applied to distinguish genuine errors from minor scoring noise, ensuring that only meaningful failures are included in the analysis.
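
A hedged sketch of this collection step is shown below; `encode_fn`, the rescaling of cosine similarities to [0, 1], and the 0.2 deviation threshold are illustrative assumptions, since the paper's exact threshold is not reproduced here.

```python
import numpy as np

def collect_errors(pairs, gold_scores, encode_fn, threshold=0.2):
    """pairs: list of (sent_a, sent_b); gold_scores: gold STS scores rescaled
    to [0, 1]; encode_fn: maps a list of sentences to an (n, dim) array."""
    a = encode_fn([p[0] for p in pairs])
    b = encode_fn([p[1] for p in pairs])
    # Cosine similarity per pair, mapped from [-1, 1] to [0, 1].
    cos = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    pred = (cos + 1) / 2
    deviation = np.abs(pred - np.asarray(gold_scores))
    # Keep only pairs whose deviation exceeds the threshold, filtering out
    # minor scoring noise.
    return [(pairs[i], float(deviation[i]))
            for i in np.where(deviation > threshold)[0]]
```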
2. Common Error Analysis & Taxonomy
The collected error instances are systematically categorized into an error taxonomy. Key categories include: (a) negation handling -- failure to recognize that negation reverses sentence meaning, (b) lexical overlap bias -- over-reliance on surface word overlap while ignoring semantic differences, (c) length sensitivity -- performance degradation for sentence pairs with large length discrepancies, and (d) numerical reasoning -- inability to distinguish sentences differing only in numbers or quantities. Errors are further classified as shared (common across models) vs. model-specific.
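
One way such a bucketing could be implemented with surface heuristics is sketched below; the negation cue list, overlap threshold, and length ratio are assumptions for illustration, not the paper's annotation procedure.

```python
import re

# Illustrative negation cues; a real annotation pass would be more thorough.
NEGATION_CUES = {"not", "no", "never", "none", "nothing", "without"}

def categorize(sent_a: str, sent_b: str) -> list[str]:
    tok_a, tok_b = sent_a.lower().split(), sent_b.lower().split()
    cats = []
    # (a) Negation handling: one sentence carries a negation cue the other lacks.
    if bool(NEGATION_CUES & set(tok_a)) != bool(NEGATION_CUES & set(tok_b)):
        cats.append("negation")
    # (b) Lexical overlap bias: high surface overlap can mask semantic differences.
    overlap = len(set(tok_a) & set(tok_b)) / max(len(set(tok_a) | set(tok_b)), 1)
    if overlap > 0.8:
        cats.append("lexical_overlap")
    # (c) Length sensitivity: large length discrepancy between the pair.
    if max(len(tok_a), len(tok_b)) > 2 * min(len(tok_a), len(tok_b)):
        cats.append("length")
    # (d) Numerical reasoning: the two sentences mention different numbers.
    if set(re.findall(r"\d+", sent_a)) != set(re.findall(r"\d+", sent_b)):
        cats.append("numerical")
    return cats
```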
3. Error-Informed Ensemble Design
Based on the error analysis, ensemble strategies are designed to maximize complementarity. Multiple techniques are compared: (a) simple averaging of embedding similarity scores, (b) weighted combination where weights reflect each model's reliability on specific error categories, and (c) selective ensemble that dynamically chooses which models to include based on input characteristics. The analysis-guided approach focuses on pairing models with complementary error profiles -- for instance, if Model A struggles with length sensitivity but handles negation well, and Model B shows the opposite pattern, they form an ideal ensemble pair.
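
A minimal sketch of the category-weighted variant (b), assuming per-category reliability weights estimated offline during the diagnostic phase; the `RELIABILITY` numbers below are placeholders, not values from the paper.

```python
import numpy as np

# Hypothetical per-category reliability of each model (e.g., one minus its
# error rate on that category), learned offline from the error analysis.
RELIABILITY = {
    "model_a": {"negation": 0.8, "lexical_overlap": 0.5, "length": 0.3, "numerical": 0.6},
    "model_b": {"negation": 0.4, "lexical_overlap": 0.7, "length": 0.9, "numerical": 0.5},
}

def ensemble_score(scores: dict[str, float], categories: list[str]) -> float:
    """scores: per-model similarity predictions for one sentence pair;
    categories: taxonomy categories triggered by the pair (see `categorize`)."""
    if not categories:
        # No diagnostic signal for this pair: fall back to simple averaging.
        return float(np.mean(list(scores.values())))
    weights = {m: np.mean([RELIABILITY[m][c] for c in categories]) for m in scores}
    total = sum(weights.values())
    # Weighted combination: models reliable on the triggered categories dominate.
    return sum(scores[m] * weights[m] for m in scores) / total
```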
4. Validation & Comparison
The error-informed ensemble is evaluated against individual models and naive (non-informed) ensemble baselines on STS benchmarks. Per-category error analysis is repeated on ensemble outputs to verify that targeted error types are effectively mitigated. This closed-loop validation confirms that the diagnostic insights translate into actual performance improvements, not just theoretical complementarity.
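
A small sketch of this closed-loop check, reusing the hypothetical `categorize` heuristic from above: per-category Spearman correlations are computed for a given set of predictions, so ensemble outputs can be compared against individual models category by category.

```python
from collections import defaultdict
from scipy.stats import spearmanr

def per_category_spearman(pairs, gold, preds):
    """pairs: list of (sent_a, sent_b); gold/preds: aligned score lists."""
    buckets = defaultdict(list)
    for (a, b), g, p in zip(pairs, gold, preds):
        for cat in categorize(a, b) or ["uncategorized"]:
            buckets[cat].append((g, p))
    # Spearman correlation per error category; a rise after ensembling means
    # the targeted category was actually mitigated.
    return {cat: spearmanr([g for g, _ in v], [p for _, p in v]).correlation
            for cat, v in buckets.items() if len(v) > 1}
```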

Why Error-Informed Ensemble Differs from Naive Ensemble:

  • Naive averaging treats all models equally regardless of their error profiles, diluting the contribution of models that excel on particular error types.
  • Uniform weighting assigns fixed weights based on overall performance, ignoring that a model might be the best choice for negation errors but the worst for length sensitivity.
  • Error-informed selection uses the diagnostic taxonomy to strategically weight or select models based on which error categories are most relevant for a given input, achieving targeted error reduction.

Experimental Results

Experiments are conducted on standard STS benchmarks, comparing individual contrastive learning models, naive ensemble baselines, and the proposed error-informed ensemble approach.

Error Distribution Across Models

| Error Category | Shared Across Models | Model-Specific | Ensemble Reducibility |
| --- | --- | --- | --- |
| Negation handling | High | Low | Limited (fundamental limitation) |
| Lexical overlap bias | Medium | Medium | Moderate |
| Length sensitivity | Low | High | High (strong complementarity) |
| Numerical reasoning | Medium | Medium | Moderate |

The error distribution reveals a critical insight: error categories with high model-specificity (such as length sensitivity) are precisely those where ensemble approaches provide the greatest benefit, since different models can compensate for each other's weaknesses. In contrast, shared errors like negation handling represent a fundamental limitation of current contrastive learning paradigms that ensembling alone cannot address.

Ensemble Performance Comparison

| Approach | STS Performance | Error Reduction |
| --- | --- | --- |
| Individual models (best single) | Baseline | -- |
| Simple averaging ensemble | Improved over baseline | Moderate (uniform across categories) |
| Weighted ensemble (uniform) | Improved over simple averaging | Moderate (overall-weighted) |
| Error-informed ensemble | Best overall performance | Targeted (category-specific) |

Why It Matters

This work makes contributions along two complementary dimensions: diagnostic understanding of where and why contrastive sentence embeddings fail, and practical improvement of those embeddings through error-informed ensembling.

Practical Takeaway: For practitioners using contrastive sentence embeddings, the error-informed ensemble can be deployed as a lightweight inference-time enhancement without any model retraining. The diagnostic taxonomy also serves as a guide for selecting which models to combine: prioritize models with complementary error profiles (e.g., one strong on length-varied pairs, another on lexically overlapping pairs) rather than simply choosing the top-N highest-scoring models.
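
A hypothetical end-to-end use of the sketches above at inference time; the sentences, scores, and model names are made up, and no retraining is involved:

```python
# Hypothetical inference-time flow; all values are illustrative only.
pair = ("The cat is not on the mat.", "The cat is on the mat.")
cats = categorize(*pair)                         # taxonomy categories triggered by the pair
scores = {"model_a": 0.41, "model_b": 0.78}      # per-model similarity predictions
final = ensemble_score(scores, cats)             # reliable models get more weight
```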
