
Don't Just Scratch the Surface: Enhancing Word Representations for Korean with Hanja

EMNLP-IJCNLP 2019
Taeuk Kim, Kang Min Yoo, Sang-goo Lee

One-Line Summary

Hanja-level SISG enriches Korean word embeddings by incorporating Hanja (Chinese character) n-grams into the Skip-gram scoring function, optionally initializing them with pre-trained Chinese embeddings for cross-lingual transfer, and demonstrates gains on word analogy, word similarity, news headline generation, and sentiment analysis.

Paper overview
Figure 1. A Korean word showing its form and multi-level meanings. The Sino-Korean word "사회맞춤형" consists of Hangul phonograms (KR) and Hanja logograms (HJ). Although Hanja annotation is optional, it offers deeper insight into the word meaning due to its association with Chinese characters (CN).

Background & Motivation

Korean and Chinese share a deep historical and cultural connection. A set of logograms very similar to Chinese characters, called Hanja, served as the sole medium for written Korean until Hangul was created in 1443. As a result, a substantial portion of Korean words are Sino-Korean (한자어) -- words of Chinese origin that can be written in both Hanja and Hangul, with the latter now commonplace in modern Korean.

Phonograms vs. Logograms: Korean Hangul characters are phonograms -- they encode pronunciation but not meaning. In contrast, Hanja characters are logograms, where each character carries its own lexical meaning. For example, the word "사회맞춤형" contains the Hanja sequence 社會 (society) and 型 (type/style), while the Hangul syllables "맞춤" (customized) have no Hanja equivalent. This semantic richness of Hanja is invisible to standard word embeddings that only operate on Hangul surface forms.

Limitation of Existing Subword Methods: Prior approaches to Korean word representations -- character-level (syllable) n-grams (Bojanowski et al., 2017) and jamo-level n-grams (Park et al., 2018) -- capture orthographic patterns at the surface level but miss the deeper semantic structure encoded in a word's Hanja origins. The agglutinative nature of Korean compounds the problem: the sub-character and inter-character cues needed to understand Korean go uncaptured by surface-only methods.
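To make these two baselines concrete, here is a minimal, self-contained Python sketch (not the authors' code; the boundary tokens and the empty-coda placeholder are assumed conventions) of syllable-level and jamo-level n-gram extraction:

```python
# Sketch of the surface-level subword units used by prior work:
# syllable n-grams (Bojanowski et al., 2017) and jamo n-grams
# (Park et al., 2018). Conventions here are illustrative assumptions.

CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"                  # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"              # 21 vowels
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 codas + empty

def ngrams(s, n_min, n_max):
    """All contiguous n-grams of s with lengths n_min..n_max."""
    return [s[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)]

def syllable_ngrams(word, n_min=1, n_max=6):
    """Character (syllable) n-grams, with '<' and '>' boundary tokens."""
    return ngrams(f"<{word}>", n_min, n_max)

def to_jamo(word):
    """Decompose precomposed Hangul syllables (U+AC00..U+D7A3) into jamo."""
    out = []
    for ch in word:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:
            out.append(CHO[code // 588])
            out.append(JUNG[(code % 588) // 28])
            out.append(JONG[code % 28] or "e")  # "e": empty-coda placeholder
        else:
            out.append(ch)
    return "".join(out)

def jamo_ngrams(word, n_min=3, n_max=5):
    """Jamo-level n-grams, again with boundary tokens."""
    return ngrams(f"<{to_jamo(word)}>", n_min, n_max)

print(syllable_ngrams("사회", 1, 2))  # ['<', '사', '회', '>', '<사', '사회', '회>']
print(jamo_ngrams("사회", 3, 3))      # ['<ㅅㅏ', 'ㅅㅏe', 'ㅏeㅎ', ...]
```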

Key Insight -- Cross-Lingual Transfer via Hanja: Since Hanja characters share deep roots with Chinese characters (with many having one-to-one correspondence), mapping Korean words to their Hanja equivalents creates a bridge to Chinese. This enables character-level cross-lingual knowledge transfer: pre-trained Chinese character embeddings (Li et al., 2018) can be used to initialize Hanja n-gram vectors, injecting semantic knowledge across languages without any parallel corpus. This is the first work to introduce character-level cross-lingual transfer learning based on etymological grounds.

The paper's core hypothesis is simple yet powerful: native Koreans intuitively use Hanja to resolve the ambiguity of Sino-Korean words, because each Hanja logogram contains more lexical meaning than its Hangul phonogram counterpart. Can we replicate this human heuristic in word embeddings?

Proposed Method

The proposed model, Hanja-level SISG (Hanja-level Subword Information Skip-Gram), extends the Skip-gram framework by progressively incorporating three levels of subword information into the scoring function. The architecture builds on existing work -- SG (Mikolov et al., 2013), SISG (Bojanowski et al., 2017), and Jamo-level SISG (Park et al., 2018) -- adding a new Hanja n-gram level on top.

1. Automatic Hanja Annotation
The entire Korean corpus is automatically annotated with Hanja using the Hanjaro tagger, the state-of-the-art public Hanja tagger. Each Sino-Korean word is decomposed into its constituent Hanja sequences. For example, in "사회맞춤형", the tagger identifies two Hanja sequences: 社會 (from "사회") and 型 (from "형"), while "맞춤" (pure Korean) has no Hanja mapping.
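Hanjaro is an external tagger, so none of its API appears below; this toy sketch only illustrates the shape of the word-to-Hanja alignment the step produces, with a hand-written lookup standing in for the tagger:

```python
# Illustrative stand-in for the annotation step; HANJA_MAP is a toy
# lookup, not the Hanjaro tagger's interface.
HANJA_MAP = {"사회": "社會", "형": "型"}  # "맞춤" is pure Korean: no entry

def annotate(segments):
    """Attach a Hanja sequence to each Hangul segment, where one exists."""
    return [(seg, HANJA_MAP.get(seg)) for seg in segments]

print(annotate(["사회", "맞춤", "형"]))
# -> [('사회', '社會'), ('맞춤', None), ('형', '型')]
```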
2. Multi-Level Scoring Function
The Skip-gram scoring function is extended hierarchically. Level 1 (SISG): character-level (syllable) n-grams of length 1 to 6 are added to the base word vector. Level 2 (Jamo-level SISG): jamo-level n-grams of length 3 to 5 are added, capturing sub-character structure (e.g., ㄱ, ㅏ, ㄴ). Level 3 (Hanja-level SISG): Hanja n-grams extracted from the annotated Hanja sequences are added; each Hanja sequence is bounded by special begin/end tokens, and n-grams of configurable length (1-3 or 1-4) are extracted. The final score sums the dot products of all these n-gram vectors (plus the word vector itself) with the context word vector.
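The score keeps the form of subword-information Skip-gram (Bojanowski et al., 2017), only with a larger n-gram inventory; a reconstruction consistent with that formulation (notation mine, not copied from the paper):

```latex
% s(w, c): score of target word w against context word c.
% \mathcal{G}_w: the word vector itself plus syllable n-grams (1-6),
% jamo n-grams (3-5), and Hanja n-grams (1-3 or 1-4).
s(w, c) = \sum_{g \in \mathcal{G}_w} \mathbf{z}_g^{\top} \mathbf{v}_c
```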
3. Cross-Lingual Initialization (Optional)
Since Hanja characters map largely one-to-one to simplified Chinese characters, the Hanja n-gram vectors can be initialized with state-of-the-art pre-trained Chinese character embeddings (Li et al., 2018) before Skip-gram training begins. This transfers semantic knowledge from large Chinese corpora into the Korean embedding space. The variant without Chinese initialization (randomly initialized Hanja) is denoted SISG(cjhr), while the full model is SISG(cjh).
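A rough sketch of that initialization, under explicit assumptions: `chinese_vecs` holds pre-trained Chinese character vectors in a dict, `hanja_to_simplified` is a one-to-one conversion table, and only length-1 Hanja n-grams are seeded directly (the paper's exact handling of longer n-grams may differ):

```python
# Hedged sketch: seed Hanja n-gram vectors from pre-trained Chinese
# character embeddings before Skip-gram training. All input names are
# assumptions, not the paper's code.
import numpy as np

def init_hanja_vectors(hanja_ngrams, chinese_vecs, hanja_to_simplified, dim=300):
    vecs = {}
    for g in hanja_ngrams:
        zh = hanja_to_simplified.get(g) if len(g) == 1 else None
        if zh is not None and zh in chinese_vecs:
            vecs[g] = chinese_vecs[zh].copy()        # cross-lingual seed
        else:
            # Multi-character n-grams and unmapped characters fall back
            # to the usual random initialization (fastText-style range).
            vecs[g] = np.random.uniform(-1.0 / dim, 1.0 / dim, dim)
    return vecs
```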

Experimental Results

The method is evaluated on intrinsic tasks (word analogy and word similarity) and two downstream tasks (Korean news headline generation and sentiment analysis). The training corpus is based on the dataset from Park et al. (2018) with additional data cleansing (removing non-Korean sentences, unifying number tags).

Word Analogy Test

Using the Korean word analogy dataset (10,000 quadruples with semantic and syntactic categories from Park et al., 2018), the metric is cosine distance (lower is better) between predicted and target analogy vectors.
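A minimal sketch of that protocol (`emb` is an assumed word-to-vector lookup; for a quadruple a:b :: c:d, the prediction is b - a + c):

```python
# Sketch of the analogy metric described above: cosine distance between
# the predicted vector and the gold answer's vector (lower is better).
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy_distance(emb, a, b, c, d):
    pred = emb[b] - emb[a] + emb[c]   # a:b :: c:? -> predict d
    return cosine_distance(pred, emb[d])
```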

Model                         Semantic  Syntactic  All (Avg.)
SG                            0.42      0.49       0.45
SISG(c) -- syllable n-grams   0.45      0.59       0.52
SISG(cj) -- + jamo n-grams    0.39      0.48       0.44
SISG(cjh3) -- + Hanja (1-3)   0.34      0.45       0.39
SISG(cjh4) -- + Hanja (1-4)   0.34      0.45       0.40
SISG(cjhr) -- random init     0.35      0.46       0.40

Word Similarity Test (Korean WS353)

Evaluates correlation between word vector distances and human-annotated similarity scores. Higher Pearson/Spearman correlations are better.
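A small sketch of that evaluation, assuming `emb` maps words to vectors and `pairs` is a list of (word1, word2, human_score) triples:

```python
# Sketch: correlate model cosine similarity with human similarity scores.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def cosine_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def ws_eval(emb, pairs):
    model = [cosine_sim(emb[w1], emb[w2]) for w1, w2, _ in pairs]
    human = [score for _, _, score in pairs]
    return pearsonr(model, human)[0], spearmanr(model, human)[0]
```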

Model                         Pearson  Spearman
SG                            0.60     0.62
SISG(c) -- syllable n-grams   0.62     0.61
SISG(cj) -- + jamo n-grams    0.66     0.67
SISG(cjh3) -- + Hanja (1-3)   0.63     0.63
SISG(cjh4) -- + Hanja (1-4)   0.62     0.61
SISG(cjhr) -- random init     0.65     0.64

Korean News Headline Generation

A novel downstream task using 840,205 Korean news articles published in January--February 2017, balanced across categories such as politics, sports, and world news. An encoder-decoder model (bidirectional LSTM encoder + LSTM decoder, hidden size 512, with Bahdanau attention) generates a headline from the first three sentences of the article body; the pre-trained word embeddings initialize the encoder. The data are split 8:1:1 into train/validation/test.
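How the embeddings enter the model is easy to show in code; a hedged PyTorch sketch of the encoder side only (the class name and the freeze=False choice are assumptions; hidden size and bidirectionality follow the description above):

```python
# Sketch: pre-trained word vectors initialize the encoder's embedding
# layer of the headline-generation model. Not the authors' code.
import torch
import torch.nn as nn

class HeadlineEncoder(nn.Module):
    def __init__(self, pretrained, hidden=512):
        super().__init__()
        # pretrained: FloatTensor of shape (vocab_size, emb_dim)
        self.emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.lstm = nn.LSTM(pretrained.size(1), hidden,
                            bidirectional=True, batch_first=True)

    def forward(self, token_ids):
        # Returns per-token states (for attention) plus the final states.
        return self.lstm(self.emb(token_ids))
```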

Embeddings     BLEU-1  BLEU-2  BLEU-3  BLEU-4  PPL
None (random)  26.02   7.76    3.08    1.38    5.335
SG             30.33   10.20   4.29    1.98    4.122
SISG(c)        31.34   10.96   4.69    2.19    3.942
SISG(cj)       31.78   11.17   4.80    2.25    3.938
SISG(cjh3)     32.03   11.25   4.83    2.27    3.941
SISG(cjh4)     32.02   11.34   4.92    2.30    3.909

Sentiment Analysis (NSMC)

Evaluated on the Naver Sentiment Movie Corpus (NSMC; 200K movie reviews with positive/negative labels, split 100K/50K/50K). A basic LSTM encoder (hidden size 300) with a feed-forward + softmax classifier is used, so that performance differences can be attributed to the input embeddings. Results are averaged over 3 runs with different random seeds.
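A matching PyTorch sketch of that classifier (last-hidden-state pooling and the softmax-vs-logits detail are assumptions):

```python
# Sketch of the NSMC classifier described above: one LSTM encoder
# (hidden size 300) feeding a feed-forward layer; softmax over 2 labels.
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, pretrained, hidden=300, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.lstm = nn.LSTM(pretrained.size(1), hidden, batch_first=True)
        self.ff = nn.Linear(hidden, n_classes)

    def forward(self, token_ids):
        _, (h, _) = self.lstm(self.emb(token_ids))
        return torch.softmax(self.ff(h[-1]), dim=-1)  # class probabilities
```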

Embeddings   Accuracy  Precision  Recall  F1
SISG(c)      77.43     75.89      80.41   78.08
SISG(cj)     83.16     82.36      84.66   83.50
SISG(cjh3)   81.61     81.23      82.28   81.75
SISG(cjh4)   82.25     82.57      81.77   82.17

Why It Matters

This paper makes a compelling case that language-specific etymological knowledge is a valuable and underexploited resource for building better word representations. The key contributions and implications: it introduces, on etymological grounds, the first character-level cross-lingual transfer learning between Korean and Chinese; it shows that Hanja annotation, although optional in modern Korean, can be produced automatically at corpus scale and exploited as an extra subword level; and the gains hold not only on intrinsic benchmarks (word analogy, word similarity) but also on downstream tasks (news headline generation, sentiment analysis).
