Hanja-level SISG enriches Korean word embeddings by incorporating Hanja (Chinese-character) n-grams into the Skip-gram scoring function and, optionally, initializing those n-gram vectors with pre-trained Chinese embeddings for cross-lingual transfer. It is evaluated on word analogy, word similarity, news headline generation, and sentiment analysis, with the clearest gains on analogy and headline generation.
Korean and Chinese share a deep historical and cultural connection. A set of logograms very similar to Chinese characters, called Hanja, served as the sole medium for written Korean until Hangul was created in 1443. As a result, a substantial portion of Korean words are Sino-Korean (한자어) -- words of Chinese origin that can be written in both Hanja and Hangul, with the latter now commonplace in modern Korean.
Phonograms vs. Logograms: Korean Hangul characters are phonograms -- they encode pronunciation but not meaning. In contrast, Hanja characters are logograms: each character carries its own lexical meaning. For example, the word "사회맞춤형" contains the Hanja 社會 (society) and 型 (type/style), while the Hangul syllables "맞춤" (customized) in between have no Hanja equivalent. This semantic richness of Hanja is invisible to standard word embeddings that operate only on Hangul surface forms.
Limitation of Existing Subword Methods: Prior approaches to Korean word representations -- character-level (syllable) n-grams (Bojanowski et al., 2017) and jamo-level n-grams (Park et al., 2018) -- capture orthographic patterns at the surface level but miss the deeper semantic structure encoded in Hanja origins. The agglutinative nature of Korean compounds the problem: these surface-only methods leave Korean-specific sub-character and inter-character information uncaptured.
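To make the distinction concrete, here is a minimal sketch of syllable-level versus jamo-level n-gram extraction for a Hangul word. The function names are illustrative, not taken from either paper's code, and the empty-final marker and n-gram ranges follow common practice rather than exact published settings:

```python
# Syllable- vs. jamo-level subword extraction for a Korean word (sketch).

def to_jamo(word):
    """Decompose each Hangul syllable into initial/medial/final jamo using
    the standard Unicode arithmetic (composed syllables start at U+AC00)."""
    INITIALS = [chr(0x1100 + i) for i in range(19)]
    MEDIALS = [chr(0x1161 + i) for i in range(21)]
    FINALS = [""] + [chr(0x11A8 + i) for i in range(27)]
    jamo = []
    for ch in word:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:                       # 19 * 21 * 28 syllables
            jamo.append(INITIALS[code // 588])      # 588 = 21 * 28
            jamo.append(MEDIALS[(code % 588) // 28])
            jamo.append(FINALS[code % 28] or "e")   # "e" marks an empty final
        else:
            jamo.append(ch)                         # pass non-Hangul through
    return "".join(jamo)

def ngrams(seq, n_min, n_max):
    """All character n-grams of the boundary-marked sequence."""
    seq = "<" + seq + ">"
    return [seq[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(seq) - n + 1)]

word = "맞춤"
syllable_grams = ngrams(word, 1, 3)        # surface-level syllable n-grams
jamo_grams = ngrams(to_jamo(word), 3, 5)   # finer-grained jamo n-grams
```

The jamo view exposes individual consonants and vowels (ㅁ, ㅏ, ㅈ, ...), but both views stop at the surface: neither reveals the meaning carried by Hanja such as 社會 or 型.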
Key Insight -- Cross-Lingual Transfer via Hanja: Since Hanja characters share deep roots with Chinese characters (with many having one-to-one correspondence), mapping Korean words to their Hanja equivalents creates a bridge to Chinese. This enables character-level cross-lingual knowledge transfer: pre-trained Chinese character embeddings (Li et al., 2018) can be used to initialize Hanja n-gram vectors, injecting semantic knowledge across languages without any parallel corpus. This is the first work to introduce character-level cross-lingual transfer learning based on etymological grounds.
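A hedged sketch of how such an initialization could look, assuming a dictionary of pre-trained Chinese character vectors (e.g., from Li et al., 2018). Averaging the character vectors for multi-character n-grams is an assumption, not a detail confirmed here; `chinese_vecs` and `hanja_ngram_ids` are illustrative names:

```python
# Initialize Hanja n-gram vectors from pre-trained Chinese character
# embeddings; n-grams without Chinese coverage stay randomly initialized.
import numpy as np

def init_hanja_vectors(hanja_ngram_ids, chinese_vecs, dim, rng=np.random):
    """hanja_ngram_ids: dict mapping a Hanja n-gram string -> row index.
    chinese_vecs: dict mapping a single Chinese character -> np.ndarray(dim)."""
    table = rng.uniform(-0.5 / dim, 0.5 / dim, size=(len(hanja_ngram_ids), dim))
    for gram, row in hanja_ngram_ids.items():
        known = [chinese_vecs[ch] for ch in gram if ch in chinese_vecs]
        if known:  # assumption: average character vectors for multi-char grams
            table[row] = np.mean(known, axis=0)
    return table
```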
The paper's core hypothesis is simple yet powerful: native Koreans intuitively use Hanja to resolve the ambiguity of Sino-Korean words, because each Hanja logogram contains more lexical meaning than its Hangul phonogram counterpart. Can we replicate this human heuristic in word embeddings?
The proposed model, Hanja-level SISG (Hanja-level Subword Information Skip-Gram), extends the Skip-gram framework by progressively incorporating three levels of subword information into the scoring function. The architecture builds on existing work -- SG (Mikolov et al., 2013), SISG (Bojanowski et al., 2017), and Jamo-level SISG (Park et al., 2018) -- adding a new Hanja n-gram level on top.
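The scoring function follows the fastText pattern: the score of a (word, context) pair is the sum of dot products between the context vector and the vectors of all of the word's subword units, which here additionally include Hanja n-grams. A minimal sketch with illustrative names, where `subword_ids(w)` is assumed to return indices into the subword table:

```python
# Extended Skip-gram score: sum dot products over the word's subword units
# (the word itself plus syllable, jamo, and Hanja n-grams).
import numpy as np

def sisg_score(word, context_id, z, v, subword_ids):
    """z: (num_subwords, dim) input subword vectors,
    v: (vocab_size, dim) output context vectors."""
    grams = subword_ids(word)  # word id + syllable/jamo/Hanja n-gram ids
    return float(np.sum(z[grams] @ v[context_id]))
```

Training then maximizes this score for observed pairs and minimizes it for negative samples, exactly as in Skip-gram with negative sampling.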
The method is evaluated on intrinsic tasks (word analogy and word similarity) and two downstream tasks (Korean news headline generation and sentiment analysis). The training corpus is based on the dataset from Park et al. (2018) with additional data cleansing (removing non-Korean sentences, unifying number tags).
Using the Korean word analogy dataset from Park et al. (2018) (10,000 quadruples spanning semantic and syntactic categories), the metric is the cosine distance (lower is better) between the predicted analogy vector and the target word vector; a sketch of the metric follows the results table.
| Model | Semantic | Syntactic | All (Avg.) |
|---|---|---|---|
| SG | 0.42 | 0.49 | 0.45 |
| SISG(c) -- syllable n-grams | 0.45 | 0.59 | 0.52 |
| SISG(cj) -- + jamo n-grams | 0.39 | 0.48 | 0.44 |
| SISG(cjh3) -- + Hanja (1-3) | 0.34 | 0.45 | 0.39 |
| SISG(cjh4) -- + Hanja (1-4) | 0.34 | 0.45 | 0.40 |
| SISG(cjhr) -- random init | 0.35 | 0.46 | 0.40 |
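A sketch of the analogy metric described above, assuming the standard vector-offset (3CosAdd) formulation of the predicted vector; `emb` is an illustrative word-to-vector mapping:

```python
# For a quadruple (a, b, c, d), compute vec(b) - vec(a) + vec(c) and report
# its cosine distance to vec(d); lower means the analogy is better captured.
import numpy as np

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy_distance(emb, a, b, c, d):
    """emb: dict word -> vector. Quadruple reads a : b :: c : d."""
    pred = emb[b] - emb[a] + emb[c]
    return cosine_distance(pred, emb[d])
```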
This task measures the correlation between word-vector similarities and human-annotated similarity scores; higher Pearson and Spearman correlations are better. A sketch of the computation follows the table.
| Model | Pearson | Spearman |
|---|---|---|
| SG | 0.60 | 0.62 |
| SISG(c) -- syllable n-grams | 0.62 | 0.61 |
| SISG(cj) -- + jamo n-grams | 0.66 | 0.67 |
| SISG(cjh3) -- + Hanja (1-3) | 0.63 | 0.63 |
| SISG(cjh4) -- + Hanja (1-4) | 0.62 | 0.61 |
| SISG(cjhr) -- random init | 0.65 | 0.64 |
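A minimal sketch of the similarity evaluation, assuming `pairs` holds (word1, word2, human_score) tuples; the correlations come from scipy:

```python
# Correlate model cosine similarities with human similarity judgments.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def similarity_correlations(emb, pairs):
    model, human = [], []
    for w1, w2, score in pairs:
        u, v = emb[w1], emb[w2]
        model.append((u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))
        human.append(score)
    return pearsonr(model, human)[0], spearmanr(model, human)[0]
```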
A novel downstream task using 840,205 Korean news articles (published Jan--Feb 2017, balanced across categories such as politics, sports, and world news). An encoder-decoder model (bidirectional LSTM encoder + LSTM decoder, hidden size 512, with Bahdanau attention) generates a headline from the first three sentences of the article body; the pre-trained word embeddings under comparison initialize the encoder. The data split is 8:1:1 for train/validation/test. A sketch of the attention module follows the results table.
| Embeddings | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | PPL |
|---|---|---|---|---|---|
| None (random) | 26.02 | 7.76 | 3.08 | 1.38 | 5.335 |
| SG | 30.33 | 10.20 | 4.29 | 1.98 | 4.122 |
| SISG(c) | 31.34 | 10.96 | 4.69 | 2.19 | 3.942 |
| SISG(cj) | 31.78 | 11.17 | 4.80 | 2.25 | 3.938 |
| SISG(cjh3) | 32.03 | 11.25 | 4.83 | 2.27 | 3.941 |
| SISG(cjh4) | 32.02 | 11.34 | 4.92 | 2.30 | 3.909 |
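For reference, a hedged PyTorch sketch of the additive (Bahdanau) attention at the heart of this setup, sized for the stated bidirectional encoder (2 × 512 encoder features, 512 decoder hidden units); the layer names and attention dimension are illustrative assumptions:

```python
# Additive (Bahdanau) attention over bidirectional-LSTM encoder states.
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    def __init__(self, enc_dim=2 * 512, dec_dim=512, attn_dim=512):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dec_dim); enc_states: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(
            self.W_enc(enc_states) + self.W_dec(dec_state).unsqueeze(1)
        )).squeeze(-1)                           # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)  # attention distribution
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
        return context, weights                  # context: (batch, enc_dim)
```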
Evaluated on the Naver Sentiment Movie Corpus (200K movie reviews with positive/negative labels; 100K/50K/50K train/validation/test split). A basic LSTM encoder (hidden size 300) with a feed-forward + softmax classifier is used to isolate the effect of the input embeddings. Results are averaged over 3 runs with different random seeds; a sketch of the classifier follows the table.
| Embeddings | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| SISG(c) | 77.43 | 75.89 | 80.41 | 78.08 |
| SISG(cj) | 83.16 | 82.36 | 84.66 | 83.50 |
| SISG(cjh3) | 81.61 | 81.23 | 82.28 | 81.75 |
| SISG(cjh4) | 82.25 | 82.57 | 81.77 | 82.17 |
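A minimal PyTorch sketch of the classifier described above, with the embedding layer initialized from whichever pre-trained vectors are being compared; everything beyond the stated hidden size of 300 is an assumption:

```python
# LSTM sentiment classifier whose only varying ingredient is the
# pre-trained embedding matrix used to initialize the embedding layer.
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, pretrained_emb, hidden=300, num_classes=2):
        super().__init__()
        vocab_size, dim = pretrained_emb.shape
        self.embed = nn.Embedding(vocab_size, dim)
        self.embed.weight.data.copy_(pretrained_emb)  # inject SISG vectors
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); use the final hidden state as summary
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return self.out(h_n[-1])  # logits; softmax is applied in the loss
```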
This paper makes a compelling case that language-specific etymological knowledge is a valuable and underexploited resource for building better word representations. Its key contributions are the Hanja-level subword extension itself, the first etymologically grounded character-level cross-lingual transfer (via initialization from pre-trained Chinese embeddings), and a new Korean headline-generation benchmark for evaluating embeddings downstream.