
A Syllable-based Technique for Word Embeddings of Korean Words

The 1st Workshop on Subword and Character Level Models in NLP (SCLeM 2017) at EMNLP 2017
Sanghyuk Choi, Taeuk Kim, Jinseok Seol, Sang-goo Lee

One-Line Summary

A CNN-based word representation model that uses Korean syllables as the basic embedding unit, producing morphologically meaningful word vectors that are robust to out-of-vocabulary words and capture the agglutinative structure of Korean.

Figure 1. Syllable-based word embedding model with convolutional and max pooling layers. Each word is decomposed into its constituent syllables, embedded, passed through multi-width convolution filters, and max-pooled to produce a fixed-dimensional word vector.

Background & Motivation

Word embeddings have become a foundational component in NLP tasks such as named entity recognition, machine translation, and sentiment analysis. Standard models like Word2Vec (Skip-gram, CBOW) and GloVe treat each word as an atomic unit, mapping it to a single vector. This works reasonably well for languages with limited morphological variation, but becomes problematic for morphologically rich and agglutinative languages like Korean.

The Korean Vocabulary Explosion Problem: In Korean, a single root morpheme can combine with approximately 60 different bound morphemes (postpositions, verb endings, honorifics, etc.), each producing a distinct surface form. For example, the noun "학교" (school) can appear as "학교가", "학교를", "학교에서", "학교에서의" and dozens more. Traditional word-level embeddings treat each of these as a completely separate vocabulary entry, leading to massive vocabularies, sparse training data per word, and frequent out-of-vocabulary (OOV) failures.

Prior subword approaches offer partial solutions but have significant drawbacks for Korean: character (jamo)-level units are often too fine-grained to carry meaning on their own, while morpheme-level models depend on an external morphological analyzer whose segmentation errors propagate into the learned embeddings.

The key insight is that Korean Hangul syllable blocks are linguistically meaningful units that exist between the too-fine character level and the too-coarse word level. Moreover, the total number of distinct syllables in practical use is only around 1,000 -- orders of magnitude smaller than the word vocabulary -- making syllable-level representation both semantically rich and computationally efficient.
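To make the unit of representation concrete: precomposed Hangul syllable blocks occupy a single contiguous Unicode range, so splitting a Korean word into syllables is just splitting it into characters and checking that range. A minimal sketch (the helper name is illustrative, not from the paper):

```python
# Minimal sketch: treat each precomposed Hangul syllable block as one unit.
# Hangul syllable blocks occupy the Unicode range U+AC00..U+D7A3.
HANGUL_START, HANGUL_END = 0xAC00, 0xD7A3

def to_syllables(word):
    """Split a Korean word into its syllable blocks, the model's basic unit."""
    syllables = list(word)
    assert all(HANGUL_START <= ord(s) <= HANGUL_END for s in syllables), \
        "expected only Hangul syllable blocks"
    return syllables

print(to_syllables("학교에서"))  # ['학', '교', '에', '서']
```

Note the contrast with jamo-level decomposition, which would split "학" further into ㅎ + ㅏ + ㄱ.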

Proposed Method: Syllable-CNN

The model constructs word representations by composing trained syllable vectors through a convolutional neural network. The architecture draws on insights from character-level CNN models (such as CharCNN by Kim et al., 2016) but adapts the approach to Korean's syllabic writing system:

1. Syllable Embedding Layer
Each Korean syllable is mapped to a d-dimensional dense vector via a learnable embedding matrix. Given a word consisting of n syllables (s1, s2, ..., sn), the corresponding syllable vectors are looked up and concatenated in sequence to form an n × d matrix that serves as the input to the convolutional layers. With ~1,000 unique syllables in the corpus, this embedding table is compact yet expressive.
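The lookup-and-stack step can be sketched as follows; the toy vocabulary, the dimension d = 8, and the random initialization are illustrative assumptions, not the paper's trained parameters:

```python
import numpy as np

# Sketch of the syllable embedding layer: each syllable indexes a row of a
# learnable table E, and the rows are stacked into the CNN's input matrix.
rng = np.random.default_rng(0)
syllable_vocab = {s: i for i, s in enumerate(["학", "교", "가", "를"])}
d = 8                                              # embedding dim (illustrative)
E = rng.standard_normal((len(syllable_vocab), d))  # learnable embedding table

def embed_word(word):
    """Look up each syllable vector and stack them into an n x d input matrix."""
    ids = [syllable_vocab[s] for s in word]
    return E[ids]                                  # shape (n, d)

X = embed_word("학교가")
print(X.shape)  # (3, 8)
```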
2. Multi-Width Convolutional Layers
Multiple 1D convolution filters with varying window widths (1, 2, 3, and 4 syllables) are applied over the syllable embedding matrix. Width-1 filters capture individual syllable features; width-2 filters capture bi-syllable patterns (common in Sino-Korean compound words); width-3 and width-4 filters detect longer compositional patterns. Each width uses 80 filters, each producing a feature map over syllable positions. A max-over-time pooling operation then selects the most activated value from each feature map, yielding a fixed-length vector of 320 dimensions (80 filters × 4 widths) regardless of word length.
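The convolution-and-pool step above can be sketched in numpy. The filter counts and widths follow the paper (80 filters per width, widths 1 to 4); the random weights and d = 8 are illustrative stand-ins for trained parameters:

```python
import numpy as np

# Sketch of multi-width 1D convolution + max-over-time pooling over the
# syllable matrix. Random weights here are illustrative, not trained.
rng = np.random.default_rng(0)
d, n_filters, widths = 8, 80, (1, 2, 3, 4)
filters = {w: rng.standard_normal((n_filters, w, d)) * 0.1 for w in widths}

def word_vector(X):
    """X: (n, d) syllable matrix -> fixed 320-dim word vector, for any n >= 4."""
    n = X.shape[0]
    pooled = []
    for w in widths:
        # Valid convolution over syllable positions: (n - w + 1) windows.
        windows = np.stack([X[i:i + w] for i in range(n - w + 1)])  # (n-w+1, w, d)
        feats = np.einsum("twd,fwd->tf", windows, filters[w])       # (n-w+1, 80)
        pooled.append(feats.max(axis=0))                            # max over time
    return np.concatenate(pooled)      # shape (320,) = 80 filters x 4 widths

v = word_vector(rng.standard_normal((5, d)))
print(v.shape)  # (320,)
```

The max-over-time pooling is what makes the output length-independent: however many syllables the word has, each filter contributes exactly one value.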
3. Skip-gram Joint Training
The entire model -- syllable embeddings, convolutional filters, and pooling -- is trained end-to-end using the Skip-gram objective with negative sampling (k=5 negative samples). Given a target word, the CNN composes its syllable vectors into a word-level representation, which is then used to predict surrounding context words within a fixed window. Backpropagation through the CNN updates both the convolution parameters and the syllable embedding table jointly, allowing the model to learn syllable representations that are optimized for word-level distributional semantics.
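The negative-sampling objective can be written out as a short sketch. In the actual model the target-word vector would come from the syllable CNN; here all vectors are illustrative random values:

```python
import numpy as np

# Sketch of the Skip-gram negative-sampling (SGNS) loss for one training pair.
# Maximizes similarity to the true context word and dissimilarity to k
# randomly sampled negative words.
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(target_vec, context_vec, negative_vecs):
    """Loss for one (target, context) pair with k negative samples."""
    pos = np.log(sigmoid(context_vec @ target_vec))
    neg = np.log(sigmoid(-negative_vecs @ target_vec)).sum()
    return -(pos + neg)

dim, k = 320, 5                         # k = 5 negative samples, as in the paper
loss = sgns_loss(rng.standard_normal(dim),
                 rng.standard_normal(dim),
                 rng.standard_normal((k, dim)))
print(loss > 0)  # True
```

Gradients of this loss flow back through the CNN into the syllable table, which is what "joint training" refers to above.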

Key Architectural Advantage: Because the word representation is computed from syllable vectors rather than stored as a fixed lookup, the model can generate embeddings for any word composed of known syllables -- including words never seen during training. This compositional property directly addresses the OOV problem that plagues word-level models in Korean.

Experimental Results

The model was evaluated on a Korean News corpus spanning 2012-2014, containing approximately 2.7 million tokens, an 11,000-word vocabulary, and only ~1,000 unique syllables. The word vector dimension was set to 320 (80 filters × 4 widths). The baseline is the standard Skip-gram model trained on the same corpus.

Word Similarity Evaluation

Model                    | Pearson Correlation (WS353-Sim)
Skip-gram (baseline)     | 0.583
Syllable-CNN (proposed)  | 0.634

On the Korean-translated WordSim-353 Similarity subset, Syllable-CNN achieves a +0.051 improvement in Pearson correlation over the Skip-gram baseline. This improvement is attributed to the model's ability to exploit shared syllables between semantically related words -- for example, "경제" (economy) and "경영" (management) share the syllable "경" (managing/governing), which the CNN captures through its compositional process.
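The evaluation protocol itself is simple to sketch: cosine similarities from the model are correlated (Pearson) against human similarity judgments over the same word pairs. The scores and vectors below are made-up toy values, not the paper's data:

```python
import numpy as np

# Sketch of the word-similarity evaluation: Pearson correlation between
# human judgments and model cosine similarities. All values are toy data.
def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
vectors = {w: rng.standard_normal(320) for w in ["경제", "경영", "학교", "노트"]}
pairs = [("경제", "경영"), ("경제", "학교"), ("학교", "노트")]
human_scores = [8.5, 3.2, 2.1]          # toy human similarity judgments

model_scores = [cosine(vectors[a], vectors[b]) for a, b in pairs]
pearson = np.corrcoef(human_scores, model_scores)[0, 1]
print(-1.0 <= pearson <= 1.0)  # True
```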

OOV Robustness: Handling Unseen Words

A critical advantage of the syllable-based approach is its ability to produce meaningful embeddings for out-of-vocabulary words. The paper demonstrates this with neologisms and compound words absent from training data:

OOV Query Word                      | Nearest Neighbors (Syllable-CNN)
구글신 ("God Google")                | 구글 (Google), and semantically related terms
갤노트 ("Galaxy Note", abbreviated)  | 갤럭시 (Galaxy), 노트 (Note), and related tech terms

Because these OOV words share syllables with known words (e.g., "구글신" contains "구글" = Google), the CNN composes their syllable vectors into representations that are close to the expected semantic neighborhood. The standard Skip-gram model simply cannot produce vectors for these words at all.
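This lookup can be sketched end to end. The `compose` function below is a hypothetical placeholder for the syllable CNN (it just averages random per-syllable vectors), but it is enough to show why syllable overlap pulls an unseen word toward its relatives:

```python
import numpy as np

# Sketch of OOV nearest-neighbor lookup. `compose` is a hypothetical stand-in
# for the syllable CNN: here it averages random per-syllable vectors, which
# already makes words sharing syllables land near each other.
rng = np.random.default_rng(0)
syllable_vecs = {s: rng.standard_normal(320) for s in "구글신갤럭시노트"}

def compose(word):
    """Placeholder for CNN composition: average the syllable vectors."""
    return np.mean([syllable_vecs[s] for s in word], axis=0)

def nearest(query, vocab_words):
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    q = compose(query)
    return max(vocab_words, key=lambda w: cos(q, compose(w)))

# "구글신" shares the syllables 구 + 글 with "구글", so they land close together.
print(nearest("구글신", ["구글", "노트"]))  # 구글
```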

Morphological Structure Analysis

Why It Matters

This work pioneered the use of syllables as the fundamental unit for Korean word representation, offering a linguistically motivated alternative to both character-level and morpheme-level approaches, and demonstrating that compositional subword models can address the vocabulary explosion and OOV problems inherent to agglutinative languages.
