
Research on Features for Effective Cross-Lingual Transfer in Korean

The 35th Annual Conference on Human and Cognitive Language Technology (HCLT 2023)
Taejun Yun, Taeuk Kim

One-Line Summary

Through regression analysis on diverse linguistic features across multiple NLP tasks, this study identifies which characteristics of source languages most effectively predict cross-lingual transfer performance to Korean, revealing that structural properties like word order and morphological type outweigh genealogical relatedness, and proposes improved source-language selection strategies over existing methods such as LANGRANK.

Background & Motivation

Cross-lingual transfer learning leverages multilingual pretrained language models (e.g., mBERT, XLM-R) to build NLP systems for low-resource languages by training on data-rich source languages. However, transfer performance to Korean varies significantly depending on the choice of source language, raising a fundamental question: what characteristics make a source language effective for cross-lingual transfer to Korean?

Why Korean Presents Unique Challenges:

  • Agglutinative morphology: Korean attaches multiple grammatical markers as suffixes (e.g., "학교에서부터" = school + from + starting), creating complex word forms that differ fundamentally from those of analytic languages like English or Chinese. This complicates subword tokenization and cross-lingual alignment in multilingual models (see the tokenizer sketch after this list).
  • SOV word order: Korean follows Subject-Object-Verb order, shared by languages such as Japanese and Turkish, making word-order alignment with SVO languages like English problematic. Since syntactic structure influences how multilingual models align representations, this mismatch directly affects transfer quality.
  • Topic-prominent structure: Korean marks discourse topics explicitly with particles like "은/는," a feature absent in many Indo-European languages, affecting how information is structured in text and complicating cross-lingual sentence alignment.
  • Unique script: Korean uses Hangul, a featural alphabet that encodes phonological information systematically. While this provides internal consistency, it limits character-level overlap with most other languages, making subword vocabulary overlap a particularly important factor.
  • Limited prior work: While cross-lingual transfer features have been studied broadly (e.g., Lin et al., 2019; Lauscher et al., 2020), no prior work had conducted a focused, systematic analysis with Korean specifically as the target language, despite Korean being one of the most widely spoken languages with growing NLP demand.
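
The tokenization challenge above is easy to inspect directly. The following is a minimal sketch, assuming the Hugging Face transformers library (nothing the paper prescribes), that prints how mBERT's subword tokenizer segments progressively suffixed forms of the example word; the exact segmentation depends on the tokenizer and is not asserted here.

```python
# Sketch of how a multilingual subword tokenizer segments an agglutinative
# Korean word form. Assumes the Hugging Face `transformers` package; the
# paper itself does not prescribe this code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# "학교에서부터" = "학교" (school) + "에서" (from) + "부터" (starting):
# several grammatical suffixes stacked onto a single orthographic word.
for word in ["학교", "학교에서", "학교에서부터"]:
    print(f"{word} -> {tokenizer.tokenize(word)}")
```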

Understanding which linguistic and typological features drive transfer success is essential for practitioners who need to select optimal source languages and configurations when building Korean NLP systems, rather than defaulting to English -- a choice that may be significantly suboptimal given the substantial typological distance between English and Korean.

Proposed Method

The study employs a systematic regression-based framework to quantify the relationship between source-language features and cross-lingual transfer performance to Korean, using multilingual pretrained models across multiple NLP tasks. By treating source-language selection as a feature prediction problem, the approach moves beyond heuristic language choice toward data-driven, quantitative guidance.

1. Feature Extraction
A comprehensive set of linguistic features is extracted for each candidate source language, spanning multiple dimensions: (a) typological features from WALS (World Atlas of Language Structures) -- including word order (SVO/SOV/VSO), morphological type (analytic/agglutinative/fusional), case marking system, and agreement patterns; (b) subword vocabulary overlap computed between each source language and Korean using the tokenizer vocabulary of multilingual models; (c) lexical similarity measures based on cognate overlap and loanword frequency; (d) geographic and genetic distance -- physical distance between language communities and language-family tree distance; and (e) dataset-level statistics such as training set size and label distribution.
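
As a concrete illustration of feature (b), the overlap can be computed roughly as below. This is a sketch under the assumption that overlap is measured as Jaccard similarity over subword types from a shared multilingual tokenizer; the paper's exact formulation may differ, and the function names are illustrative.

```python
# Sketch of feature (b): subword vocabulary overlap between a candidate
# source language and Korean, measured here as Jaccard similarity over the
# subword types a shared multilingual tokenizer produces on text samples.
# The paper's exact metric may differ; function names are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def subword_types(sentences):
    """Collect the set of subword types produced for a list of sentences."""
    types = set()
    for sentence in sentences:
        types.update(tokenizer.tokenize(sentence))
    return types

def vocab_overlap(source_sentences, korean_sentences):
    """Jaccard overlap between source-language and Korean subword types."""
    source = subword_types(source_sentences)
    korean = subword_types(korean_sentences)
    return len(source & korean) / len(source | korean)
```
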
2. Cross-Lingual Transfer Experiments
Controlled zero-shot cross-lingual transfer experiments are conducted from multiple source languages to Korean using multilingual pretrained models (mBERT, XLM-R). Each experiment isolates the effect of individual features by keeping other variables (model architecture, hyperparameters, data size) constant across languages. Multiple downstream NLP tasks are evaluated: Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and text classification, chosen to span different levels of linguistic analysis from token-level to document-level.
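
Schematically, the protocol looks like the loop below. Here `load_task_dataset`, `finetune`, and `evaluate` are hypothetical helpers standing in for an ordinary fine-tuning pipeline, and the languages and hyperparameters shown are illustrative placeholders, not the paper's exact configuration.

```python
# Schematic sketch of the zero-shot transfer protocol: fine-tune one
# multilingual encoder per (task, source language) pair on source-language
# data only, then evaluate on Korean test data with no Korean training.
# `load_task_dataset`, `finetune`, and `evaluate` are hypothetical helpers,
# and the languages/hyperparameters are illustrative placeholders.
SOURCES = ["en", "ja", "tr", "fi", "zh"]
TASKS = ["ner", "pos", "cls"]

results = {}
for task in TASKS:
    for src in SOURCES:
        train = load_task_dataset(task, lang=src)       # source-language training set
        model = finetune("xlm-roberta-base", train,     # identical architecture and
                         lr=2e-5, epochs=3, seed=42)    # hyperparameters for every language
        test = load_task_dataset(task, lang="ko", split="test")
        results[(task, src)] = evaluate(model, test)    # zero-shot Korean score
```
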
3. Regression Analysis
Statistical regression models are fitted to quantify the predictive power of each feature on transfer performance. Both simple linear regression (per-feature) and multiple regression (combined features) are used. Correlation coefficients, R-squared values, and feature importance scores are computed to rank which characteristics are most indicative of successful transfer to Korean. This analysis reveals not just which features matter, but their relative contribution and potential interactions.
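
A minimal, runnable sketch of this analysis, assuming scikit-learn and SciPy; every feature value and score below is a made-up placeholder that exists only to make the snippet executable, not a number from the paper.

```python
# Minimal sketch of the regression analysis with scikit-learn and SciPy.
# All feature values and transfer scores are made-up placeholders to keep
# the snippet runnable; they are NOT results from the paper.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

feature_names = ["subword_overlap", "sov_word_order", "agglutinative"]
X = np.array([
    [0.31, 0.0, 0.0],   # en (placeholder values)
    [0.42, 1.0, 1.0],   # ja
    [0.28, 1.0, 1.0],   # tr
    [0.22, 0.0, 1.0],   # fi
    [0.25, 0.0, 0.0],   # zh
])
y = np.array([61.0, 74.5, 70.2, 66.8, 58.3])  # placeholder Korean transfer scores

# Simple per-feature regression: Pearson r and R^2 for each feature alone.
for j, name in enumerate(feature_names):
    r, p = pearsonr(X[:, j], y)
    print(f"{name}: r={r:+.3f}  R^2={r * r:.3f}  p={p:.3g}")

# Multiple regression over the combined feature set.
model = LinearRegression().fit(X, y)
print("combined R^2:", model.score(X, y))
print("coefficients:", dict(zip(feature_names, np.round(model.coef_, 3))))
```
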
4. Source Language Selection Comparison
Existing automated methodologies for selecting optimal source languages -- including LANGRANK (a learned ranking model) and other feature-based selection approaches from the literature -- are compared against the regression-based findings. The study evaluates whether these existing methods accurately identify the best source languages for Korean, and proposes improved selection strategies that better account for Korean-specific linguistic properties such as agglutinative morphology and SOV word order.
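
One way to frame the comparison, continuing the regression sketch from step 3, is as rank agreement between each selector's scores and the observed transfer results; the baseline numbers below are made-up stand-ins for what an existing selector such as LANGRANK might output, not its actual predictions.

```python
# Rank-agreement sketch, reusing `model`, `X`, `y`, and the language order
# from the regression sketch in step 3. `baseline` holds made-up stand-in
# scores for an existing selector such as LANGRANK, NOT its actual output.
from scipy.stats import spearmanr

langs = ["en", "ja", "tr", "fi", "zh"]    # matches the rows of X above

predicted = model.predict(X)              # regression-based selector scores
baseline = [0.9, 0.6, 0.5, 0.4, 0.7]      # placeholder existing-selector scores

rho_reg, _ = spearmanr(predicted, y)      # regression ranking vs. actual transfer
rho_base, _ = spearmanr(baseline, y)      # baseline ranking vs. actual transfer
print(f"regression selector agreement: rho = {rho_reg:.3f}")
print(f"baseline selector agreement:   rho = {rho_base:.3f}")

best = max(range(len(langs)), key=lambda i: predicted[i])
print("regression's top-ranked source language:", langs[best])
```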

Experimental Results

The study systematically evaluates which features best predict transfer performance to Korean and compares source-language selection strategies across multiple NLP tasks, providing quantitative evidence for principled source-language selection.

Feature Predictiveness for Transfer to Korean

| Feature Category | Predictive Strength | Key Observation |
| --- | --- | --- |
| Subword Vocabulary Overlap | High | Strong positive correlation with transfer performance across all tasks; most universally predictive single feature |
| Word Order Similarity (SOV) | High | Especially predictive for syntactic tasks (POS tagging); SOV languages consistently outperform SVO/VSO sources |
| Morphological Type | Moderate-High | Agglutinative languages (Japanese, Turkish, Finnish) transfer better than analytic or fusional ones |
| Language Family | Low-Moderate | Surprisingly weak predictor when isolated from structural features; genetic relatedness is confounded with typological similarity |
| Geographic Distance | Low | Proximate languages benefit mainly through shared structural features and cultural borrowing, not proximity itself |

Key Findings by Task

| Task | Most Important Feature | Observation |
| --- | --- | --- |
| NER | Subword Overlap | Entity recognition benefits most from shared subword vocabulary, as named entities often share surface forms across languages |
| POS Tagging | Word Order | Syntactic tasks are highly sensitive to structural alignment; SOV source languages yield significantly higher accuracy |
| Text Classification | Subword Overlap | Semantic tasks rely more on representation-level similarity, with subword overlap serving as a proxy for embedding alignment |

Why It Matters

This work provides actionable insights for building better Korean NLP systems through cross-lingual transfer, with a broader implication for multilingual NLP methodology: structural and lexical properties of the source language, rather than genealogical relatedness, should guide source-language selection.
