
Comparison and Analysis of Unsupervised Contrastive Learning Approaches for Korean Sentence Representations

The 34th Annual Conference on Human and Cognitive Language Technology (HCLT 2022)
Young Hyun Yoo, Kyumin Lee, Minjin Jeon, Jii Cha, Kangsan Kim, Taeuk Kim

One-Line Summary

A systematic comparison of unsupervised contrastive learning methods (ConSERT and SimCSE) across three Korean pre-trained language models (KoBERT, KR-BERT, KLUE-BERT), revealing that KLUE-BERT provides the most stable backbone and token shuffling is the most effective data augmentation for Korean sentence embeddings.

Background & Motivation

Sentence embeddings are critical for many NLP applications, from semantic search to clustering. Recent unsupervised contrastive learning methods like SimCSE and ConSERT have dramatically improved English sentence representations by learning to pull semantically similar sentences closer in the embedding space while pushing dissimilar ones apart -- all without labeled data.

What is Contrastive Learning for Sentences?

Contrastive learning aims to learn an embedding space where semantically similar sentences are placed close together and dissimilar sentences are pushed apart. Given a sentence x, the framework creates a positive pair (x, x+) -- either through data augmentation (ConSERT) or dropout noise (SimCSE) -- and treats all other sentences in the mini-batch as negatives. The contrastive loss function encourages the model to maximize agreement between positive pairs while minimizing agreement with negatives, resulting in sentence embeddings that capture semantic similarity without any labeled data.
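The contrastive objective described above (NT-Xent, also known as InfoNCE) can be sketched in a few lines of pure Python. This is a minimal illustrative sketch, not the paper's code; the function names, toy vectors, and default temperature are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent_loss(anchors, positives, temperature=0.05):
    """NT-Xent / InfoNCE over a mini-batch: for anchor i, positives[i] is the
    positive view and every other positives[j] acts as an in-batch negative."""
    losses = []
    for i, a in enumerate(anchors):
        sims = [cosine(a, p) / temperature for p in positives]
        log_denominator = math.log(sum(math.exp(s) for s in sims))
        losses.append(log_denominator - sims[i])  # -log softmax of the positive
    return sum(losses) / len(losses)

# Perfectly aligned positive pairs with orthogonal negatives -> loss near zero.
batch = [[1.0, 0.0], [0.0, 1.0]]
loss = nt_xent_loss(batch, batch)
```

Minimizing this loss pulls each positive pair together while pushing the anchor away from the other sentences in the batch.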

Key Challenges for Korean:

  • Lack of Korean-specific evaluation: Most contrastive learning research focused on English; it was unclear how well these methods transfer to Korean with its distinct linguistic properties such as agglutinative morphology and flexible word order.
  • Multiple Korean PLMs with different designs: Several Korean pre-trained models existed -- KoBERT (SentencePiece tokenization), KR-BERT (character-level tokenization), and KLUE-BERT (morpheme-aware WordPiece) -- each with fundamentally different tokenization strategies. No study had compared them as backbones for contrastive sentence learning.
  • Data augmentation uncertainty: ConSERT's augmentation strategies (token shuffling, feature cutoff, token cutoff, dropout) had not been evaluated for Korean, where agglutinative morphology may affect their effectiveness differently than in English.
  • Potential overfitting concerns: With limited Korean STS evaluation data (the KorSTS benchmark being the primary resource), there was a risk of methods overfitting to validation sets without genuine generalization.

This study addresses these gaps by providing the first comprehensive comparison of unsupervised contrastive learning approaches specifically designed for Korean sentence representations, covering 3 backbone models, 2 learning frameworks, and 4 augmentation strategies.

Proposed Method

The study systematically evaluates two prominent unsupervised contrastive learning frameworks -- ConSERT and SimCSE -- across three Korean pre-trained language models, using Korean STS benchmarks for evaluation.

1. Backbone Model Selection
Three Korean pre-trained language models are used as backbones, each with distinct design choices:
  • KoBERT (SKTBrain): Uses SentencePiece tokenization, trained on Korean Wikipedia and news data.
  • KR-BERT: Employs character-level tokenization that can handle any Korean character sequence, including spacing variations and typos.
  • KLUE-BERT: Uses morpheme-aware WordPiece tokenization, trained on the large-scale KLUE corpus (news, reviews, encyclopedias, etc.), designed for the KLUE benchmark suite.
These differences in tokenization granularity and pre-training data diversity allow systematic analysis of how backbone design affects downstream contrastive learning.
2. Contrastive Learning Frameworks
ConSERT (Yan et al., 2021) generates positive pairs through explicit data augmentation at the input or representation level. Given a sentence, two different augmented views are created, and the model is trained to maximize agreement between these views using a contrastive loss (NT-Xent). SimCSE (Gao et al., 2021) takes a simpler approach: the same input is passed through the encoder twice with different dropout masks, treating the two resulting representations as a positive pair. Both methods use in-batch negatives for the contrastive objective.
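SimCSE's dropout trick can be illustrated with a toy stand-in for the encoder. This is a pure-Python sketch under stated assumptions (identity "encoder" plus inverted dropout); a real implementation would run a transformer forward pass with dropout active.

```python
import random

def encode_with_dropout(x, p=0.1, rng=random):
    """Toy stand-in for an encoder forward pass: identity plus inverted
    dropout. In SimCSE the same sentence is encoded twice; because the two
    dropout masks differ, the two outputs form a 'free' positive pair."""
    scale = 1.0 / (1.0 - p)  # inverted-dropout rescaling of surviving units
    return [0.0 if rng.random() < p else v * scale for v in x]

random.seed(0)
x = [0.2, -0.5, 0.9, 0.4]      # embedding of one sentence
z1 = encode_with_dropout(x)    # first forward pass
z2 = encode_with_dropout(x)    # second pass, different dropout mask
# (z1, z2) is the positive pair; other sentences' encodings are the negatives.
```

The two views differ only by dropout noise, which is exactly the "minimal augmentation" that makes SimCSE simpler than ConSERT's explicit input perturbations.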
3. Data Augmentation Strategies for ConSERT
Four augmentation strategies are systematically compared, each modifying a different aspect of the input or representation:
  • Token Shuffling: Randomly permutes the order of input tokens while preserving the token set -- tests whether the model can learn order-invariant semantic features.
  • Feature Cutoff: Zeros out random dimensions of the token embedding vectors -- forces the model to distribute semantic information across multiple dimensions.
  • Token Cutoff: Randomly removes a subset of input tokens -- tests robustness to missing information.
  • Dropout: Applies standard dropout to token representations as a form of augmentation -- the simplest perturbation strategy.
All four strategies are evaluated across all three backbone models, yielding 12 ConSERT configurations plus 3 SimCSE configurations (one per backbone).
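The four strategies can be sketched as follows. This is a minimal pure-Python sketch operating on token lists and per-token embedding rows; the function names and cutoff rates are illustrative, not ConSERT's actual implementation.

```python
import random

def token_shuffle(tokens, rng):
    """Randomly permute token order while preserving the token set."""
    out = tokens[:]
    rng.shuffle(out)
    return out

def token_cutoff(tokens, rate, rng):
    """Randomly remove a subset of input tokens."""
    kept = [t for t in tokens if rng.random() >= rate]
    return kept or tokens[:1]  # never return an empty sequence

def feature_cutoff(embeddings, rate, rng):
    """Zero out random embedding dimensions (same dims across all tokens)."""
    dims = len(embeddings[0])
    mask = [0.0 if rng.random() < rate else 1.0 for _ in range(dims)]
    return [[v * m for v, m in zip(row, mask)] for row in embeddings]

def dropout_aug(embeddings, rate, rng):
    """Standard elementwise dropout on token representations."""
    return [[0.0 if rng.random() < rate else v for v in row] for row in embeddings]
```

Note that token shuffling and token cutoff act on the input token sequence, while feature cutoff and dropout act on the embedding matrix, matching the input-level vs. representation-level distinction above.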
4. Korean STS Evaluation Protocol
Models are evaluated on the Korean Semantic Textual Similarity (KorSTS) benchmark using Spearman's rank correlation coefficient. Crucially, both validation (dev) and test set results are reported side by side. This dual reporting is designed to expose potential overfitting: a model that achieves high dev scores but significantly lower test scores may be overfitting to the validation distribution rather than learning genuinely transferable sentence representations.
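The metric can be reproduced with a small self-contained Spearman implementation; this is a sketch for illustration, not the paper's evaluation code. The dev-test gap is then simply the difference of two such scores.

```python
def average_ranks(values):
    """1-based ranks, with ties assigned the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Gap = dev - test, e.g. KLUE-BERT + Shuffle: 80.05 - 73.15 = 6.90
```

Because the metric is rank-based, it rewards getting the relative ordering of sentence-pair similarities right rather than their absolute scores.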

Experimental Results

Experiments are conducted with ConSERT (4 augmentation strategies) and SimCSE across three Korean PLMs, totaling 15 experimental configurations. Performance is measured via Spearman's rank correlation (×100) on the KorSTS benchmark.

ConSERT: Full Augmentation Strategy Comparison

| Backbone  | Augmentation   | Dev (Spearman) | Test (Spearman) | Gap (Dev-Test) |
|-----------|----------------|----------------|-----------------|----------------|
| KoBERT    | Shuffle        | 78.66          | 68.82           | 9.84           |
| KoBERT    | Feature Cutoff | 73.78          | 65.25           | 8.53           |
| KoBERT    | Token Cutoff   | 72.41          | 64.18           | 8.23           |
| KoBERT    | Dropout        | 71.95          | 63.50           | 8.45           |
| KR-BERT   | Shuffle        | 78.37          | 72.02           | 6.35           |
| KR-BERT   | Feature Cutoff | 77.25          | 68.52           | 8.73           |
| KR-BERT   | Token Cutoff   | 75.80          | 67.91           | 7.89           |
| KR-BERT   | Dropout        | 74.62          | 67.03           | 7.59           |
| KLUE-BERT | Shuffle        | 80.05          | 73.15           | 6.90           |
| KLUE-BERT | Feature Cutoff | 78.92          | 72.85           | 6.07           |
| KLUE-BERT | Token Cutoff   | 77.60          | 71.42           | 6.18           |
| KLUE-BERT | Dropout        | 76.88          | 70.95           | 5.93           |

SimCSE Results by Backbone

| Backbone  | Dev (Spearman) | Test (Spearman) | Gap (Dev-Test) |
|-----------|----------------|-----------------|----------------|
| KoBERT    | 75.33          | 66.04           | 9.29           |
| KR-BERT   | 77.49          | 70.81           | 6.68           |
| KLUE-BERT | 79.62          | 73.08           | 6.54           |

Best Configuration per Backbone (Test Set)

| Backbone  | Best Method       | Test (Spearman) |
|-----------|-------------------|-----------------|
| KoBERT    | ConSERT + Shuffle | 68.82           |
| KR-BERT   | ConSERT + Shuffle | 72.02           |
| KLUE-BERT | ConSERT + Shuffle | 73.15           |

Augmentation Strategy Ranking (Averaged Across Backbones):

Token Shuffling > Feature Cutoff > Token Cutoff > Dropout. Token shuffling's strong performance in Korean may be explained by the relatively flexible word order of Korean syntax -- shuffling tokens produces augmented views that remain plausible to some degree, providing a useful training signal without destroying semantics.

Why It Matters

This study provides essential guidance for building Korean sentence embedding systems with contrastive learning: KLUE-BERT is the most stable backbone across both frameworks, token shuffling is the most effective ConSERT augmentation for Korean, and reporting dev and test scores side by side helps expose overfitting to the limited KorSTS evaluation data.
