One-Line Summary
A systematic comparison of unsupervised contrastive learning methods (ConSERT and SimCSE) across three Korean pre-trained language models (KoBERT, KR-BERT, KLUE-BERT), revealing that KLUE-BERT provides the most stable backbone and token shuffling is the most effective data augmentation for Korean sentence embeddings.
Background & Motivation
Sentence embeddings are critical for many NLP applications, from semantic search to clustering. Recent unsupervised contrastive learning methods like SimCSE and ConSERT have dramatically improved English sentence representations by learning to pull semantically similar sentences closer in the embedding space while pushing dissimilar ones apart -- all without labeled data.
What is Contrastive Learning for Sentences?
Contrastive learning aims to learn an embedding space where semantically similar sentences are placed close together and dissimilar sentences are pushed apart. Given a sentence x, the framework creates a positive pair (x, x+) -- either through data augmentation (ConSERT) or dropout noise (SimCSE) -- and treats all other sentences in the mini-batch as negatives. The contrastive loss function encourages the model to maximize agreement between positive pairs while minimizing agreement with negatives, resulting in sentence embeddings that capture semantic similarity without any labeled data.
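To make the objective concrete, here is a minimal PyTorch sketch of the in-batch contrastive (NT-Xent/InfoNCE-style) loss described above; the temperature value and the cosine-similarity formulation are common defaults from the SimCSE/ConSERT literature, not settings confirmed by this study, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: z1[i] and z2[i] are the embeddings of the two
    views of sentence i; every other sentence in the batch acts as a negative."""
    # Cosine similarity between every pair of views: shape (batch, batch)
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    # The positive for row i sits on the diagonal (column i)
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(sim, labels)

# Usage: z1, z2 are (batch_size, hidden_dim) sentence embeddings from two views
# loss = info_nce_loss(encoder(batch_view1), encoder(batch_view2))
```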
Key Challenges for Korean:
- Lack of Korean-specific evaluation: Most contrastive learning research has focused on English, so it was unclear how well these methods transfer to Korean, with its distinct linguistic properties such as agglutinative morphology and flexible word order.
- Multiple Korean PLMs with different designs: Several Korean pre-trained models existed -- KoBERT (SentencePiece tokenization), KR-BERT (character-level tokenization), and KLUE-BERT (morpheme-aware WordPiece) -- each with fundamentally different tokenization strategies. No study had compared them as backbones for contrastive sentence learning.
- Data augmentation uncertainty: ConSERT's augmentation strategies (token shuffling, feature cutoff, token cutoff, dropout) had not been evaluated for Korean, where agglutinative morphology may affect their effectiveness differently than in English.
- Potential overfitting concerns: With limited Korean STS evaluation data (the KorSTS benchmark being the primary resource), there was a risk of methods overfitting to validation sets without genuine generalization.
This study addresses these gaps with the first comprehensive comparison of unsupervised contrastive learning approaches for Korean sentence representations, covering three backbone models, two learning frameworks, and four augmentation strategies.
Proposed Method
The study systematically evaluates two prominent unsupervised contrastive learning frameworks -- ConSERT and SimCSE -- across three Korean pre-trained language models, using Korean STS benchmarks for evaluation.
1. Backbone Model Selection
Three Korean pre-trained language models are used as backbones, each with distinct design choices:
- KoBERT (SKTBrain): Uses SentencePiece tokenization, trained on Korean Wikipedia and news data.
- KR-BERT: Employs character-level tokenization that can handle any Korean character sequence, including spacing variations and typos.
- KLUE-BERT: Uses morpheme-aware WordPiece tokenization, trained on the large-scale KLUE corpus (news, reviews, encyclopedias, etc.), designed for the KLUE benchmark suite.
These differences in tokenization granularity and pre-training data diversity allow systematic analysis of how backbone design affects downstream contrastive learning.
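As an illustration of how differently such backbones segment the same sentence, the sketch below runs the KLUE-BERT tokenizer via Hugging Face transformers; `klue/bert-base` is assumed here as the KLUE-BERT checkpoint (the study's exact checkpoints are not listed in this summary), and KoBERT and KR-BERT distribute their own tokenizer loaders, so they are only described in comments.

```python
from transformers import AutoTokenizer

# klue/bert-base: published KLUE-BERT checkpoint (morpheme-aware WordPiece).
# KoBERT and KR-BERT ship their own tokenizer loaders, so they are not loaded here.
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")

sentence = "한국어 문장 임베딩을 비교합니다."  # "We compare Korean sentence embeddings."
print(tokenizer.tokenize(sentence))
# A character-level tokenizer (KR-BERT) would instead split the sentence into
# individual syllable blocks, while SentencePiece (KoBERT) learns subword pieces
# statistically from raw text without morpheme information.
```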
2. Contrastive Learning Frameworks
ConSERT (Yan et al., 2021) generates positive pairs through explicit data augmentation at the input or representation level. Given a sentence, two different augmented views are created, and the model is trained to maximize agreement between these views using a contrastive loss (NT-Xent). SimCSE (Gao et al., 2021) takes a simpler approach: the same input is passed through the encoder twice with different dropout masks, treating the two resulting representations as a positive pair. Both methods use in-batch negatives for the contrastive objective.
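A minimal sketch of SimCSE's positive-pair construction follows, assuming a Hugging Face BERT encoder kept in training mode (so dropout stays active) and pooling via the [CLS] token; both choices are illustrative defaults rather than the study's confirmed setup.

```python
from transformers import AutoModel, AutoTokenizer

model_name = "klue/bert-base"  # illustrative choice of backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)
encoder.train()  # keep dropout active so two passes yield two different "views"

batch = tokenizer(["오늘 날씨가 좋다.", "영화가 정말 재미있었다."],
                  padding=True, return_tensors="pt")

# Same input, two forward passes -> two embeddings differing only by dropout noise
z1 = encoder(**batch).last_hidden_state[:, 0]  # [CLS] pooling, view 1
z2 = encoder(**batch).last_hidden_state[:, 0]  # [CLS] pooling, view 2
# (z1[i], z2[i]) is the positive pair for sentence i; the other rows serve as
# in-batch negatives when plugged into the contrastive loss sketched earlier.
```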
3. Data Augmentation Strategies for ConSERT
Four augmentation strategies are systematically compared, each modifying a different aspect of the input or representation:
- Token Shuffling: Randomly permutes the order of input tokens while preserving the token set -- tests whether the model can learn order-invariant semantic features.
- Feature Cutoff: Zeros out random dimensions of the token embedding vectors -- forces the model to distribute semantic information across multiple dimensions.
- Token Cutoff: Randomly removes a subset of input tokens -- tests robustness to missing information.
- Dropout: Applies standard dropout to token representations as a form of augmentation -- the simplest perturbation strategy.
All four strategies are evaluated across all three backbone models, yielding 12 ConSERT configurations plus 3 SimCSE configurations (one per backbone).
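Below is a minimal sketch of two of these augmentations, token shuffling and feature cutoff; it operates on token IDs and embedding tensors directly, which is a simplification of ConSERT's embedding-layer implementation, and the cutoff rate is an assumed value.

```python
import torch

def token_shuffle(input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Randomly permute the non-padding tokens of each sequence (token set preserved)."""
    shuffled = input_ids.clone()
    for i in range(input_ids.size(0)):
        length = int(attention_mask[i].sum())
        perm = torch.randperm(length, device=input_ids.device)
        shuffled[i, :length] = input_ids[i, :length][perm]
    return shuffled

def feature_cutoff(embeddings: torch.Tensor, rate: float = 0.15) -> torch.Tensor:
    """Zero out a random subset of embedding dimensions (same dims for all tokens)."""
    keep = (torch.rand(embeddings.size(-1), device=embeddings.device) >= rate).float()
    return embeddings * keep
```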
4. Korean STS Evaluation Protocol
Models are evaluated on the Korean Semantic Textual Similarity (KorSTS) benchmark using Spearman's rank correlation coefficient. Crucially, both validation (dev) and test set results are reported side by side. This dual reporting is designed to expose potential overfitting: a model that achieves high dev scores but significantly lower test scores may be overfitting to the validation distribution rather than learning genuinely transferable sentence representations.
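A minimal sketch of this evaluation step, assuming sentence pairs are scored by cosine similarity of their embeddings and correlated against the gold ratings with SciPy's `spearmanr`; the `embed_fn` helper and the loading of KorSTS pairs are hypothetical and left abstract.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_sts(embed_fn, sentence_pairs, gold_scores):
    """embed_fn maps a list of sentences to an (n, d) array of embeddings;
    gold_scores are the human similarity ratings from KorSTS."""
    emb1 = embed_fn([s1 for s1, _ in sentence_pairs])
    emb2 = embed_fn([s2 for _, s2 in sentence_pairs])
    # Cosine similarity between each paired embedding
    cos = np.sum(emb1 * emb2, axis=1) / (
        np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))
    # Spearman's rank correlation x100, matching the tables below
    return spearmanr(cos, gold_scores).correlation * 100
```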
Experimental Results
Experiments are conducted with ConSERT (four augmentation strategies) and SimCSE across three Korean PLMs, totaling 15 experimental configurations. Performance is measured with Spearman's rank correlation (×100) on the KorSTS benchmark.
ConSERT: Full Augmentation Strategy Comparison
| Backbone | Augmentation | Dev (Spearman) | Test (Spearman) | Gap (Dev-Test) |
|---|---|---|---|---|
| KoBERT | Shuffle | 78.66 | 68.82 | 9.84 |
| KoBERT | Feature Cutoff | 73.78 | 65.25 | 8.53 |
| KoBERT | Token Cutoff | 72.41 | 64.18 | 8.23 |
| KoBERT | Dropout | 71.95 | 63.50 | 8.45 |
| KR-BERT | Shuffle | 78.37 | 72.02 | 6.35 |
| KR-BERT | Feature Cutoff | 77.25 | 68.52 | 8.73 |
| KR-BERT | Token Cutoff | 75.80 | 67.91 | 7.89 |
| KR-BERT | Dropout | 74.62 | 67.03 | 7.59 |
| KLUE-BERT | Shuffle | 80.05 | 73.15 | 6.90 |
| KLUE-BERT | Feature Cutoff | 78.92 | 72.85 | 6.07 |
| KLUE-BERT | Token Cutoff | 77.60 | 71.42 | 6.18 |
| KLUE-BERT | Dropout | 76.88 | 70.95 | 5.93 |
SimCSE Results by Backbone
| Backbone | Dev (Spearman) | Test (Spearman) | Gap (Dev-Test) |
|---|---|---|---|
| KoBERT | 75.33 | 66.04 | 9.29 |
| KR-BERT | 77.49 | 70.81 | 6.68 |
| KLUE-BERT | 79.62 | 73.08 | 6.54 |
Best Configuration per Backbone (Test Set)
| Backbone | Best Method | Test (Spearman) |
|---|---|---|
| KoBERT | ConSERT + Shuffle | 68.82 |
| KR-BERT | ConSERT + Shuffle | 72.02 |
| KLUE-BERT | ConSERT + Shuffle | 73.15 |
Augmentation Strategy Ranking (Averaged Across Backbones):
Token Shuffling > Feature Cutoff > Token Cutoff > Dropout. Token shuffling's strong performance in Korean may be explained by the relatively flexible word order of Korean syntax -- shuffling tokens produces augmented views that remain plausible to some degree, providing a useful training signal without destroying semantics.
Key Findings:
- KLUE-BERT is the most stable backbone: Across both ConSERT and SimCSE, KLUE-BERT consistently achieved the highest and most stable performance, likely due to its larger and more diverse pre-training corpus (news, reviews, encyclopedias) and morpheme-aware tokenization.
- Token shuffling is the best augmentation: Among ConSERT's augmentation strategies, token shuffling consistently outperformed all alternatives across all backbone models. This may relate to Korean's flexible SOV word order, making shuffled sequences less destructive to semantics than in rigid-order languages like English.
- Dev-test gap reveals overfitting: A notable gap between development and test scores (5-10 points) was observed across methods. KLUE-BERT exhibited the smallest gaps (5.9-6.9 points), while KoBERT showed the largest (8.2-9.8 points), suggesting that stronger backbones also generalize better.
- SimCSE is competitive despite simplicity: SimCSE's dropout-only approach (no explicit augmentation) matched or approached ConSERT's best configurations. On KLUE-BERT, SimCSE (73.08) nearly matched ConSERT + Shuffle (73.15), a difference of only 0.07 points.
- Backbone matters more than method: The choice of pre-trained backbone model had a larger impact on performance than the choice between ConSERT and SimCSE. KLUE-BERT's worst configuration (Dropout, 70.95) still outperformed KoBERT's best (Shuffle, 68.82) by over 2 points.
- KR-BERT's character-level tokenization helps: KR-BERT consistently outperformed KoBERT despite being less well-known, suggesting that character-level tokenization may better handle Korean's rich morphological variations for contrastive learning.
Why It Matters
This study provides essential guidance for building Korean sentence embedding systems using contrastive learning:
- First systematic Korean CL benchmark: By comparing ConSERT and SimCSE across multiple Korean PLMs under unified conditions (same training data, same evaluation protocol), this work establishes the first reliable reference point for unsupervised Korean sentence representation learning.
- Practical model selection guidance: The finding that KLUE-BERT with token shuffling yields the best results (73.15 Spearman on KorSTS test) gives practitioners a clear starting point for Korean sentence embedding tasks. The near-equal performance of SimCSE suggests it as a simpler alternative when augmentation complexity is undesirable.
- Overfitting warning for Korean NLP: The significant dev-test performance gap (up to 9.8 points) highlights a critical issue in Korean NLP evaluation infrastructure, urging the community to develop larger and more diverse STS benchmarks for Korean beyond KorSTS.
- Tokenization insights: The results demonstrate that tokenization strategy significantly impacts contrastive learning quality for Korean. Morpheme-aware WordPiece (KLUE-BERT) and character-level tokenization (KR-BERT) both outperform KoBERT's purely statistical SentencePiece subwords, informing future Korean PLM design.
- Foundation for future work: The systematic framework established here can be extended to evaluate newer contrastive methods (e.g., DiffCSE, PromptBERT) and larger Korean language models as they become available.
Representation Learning
Multilingual