One-Line Summary
Through regression analysis on diverse linguistic features across multiple NLP tasks, this study identifies which characteristics of source languages most effectively predict cross-lingual transfer performance to Korean, revealing that structural properties like word order and morphological type outweigh genealogical relatedness, and proposes improved source-language selection strategies over existing methods such as LANGRANK.
Background & Motivation
Cross-lingual transfer learning leverages multilingual pretrained language models (e.g., mBERT, XLM-R) to build NLP systems for low-resource languages by training on data-rich source languages. However, transfer performance to Korean varies significantly depending on the choice of source language, raising a fundamental question: what characteristics make a source language effective for cross-lingual transfer to Korean?
Why Korean Presents Unique Challenges:
- Agglutinative morphology: Korean attaches multiple grammatical markers as suffixes (e.g., "학교에서부터" = school + from + starting), creating complex word forms that differ fundamentally from analytic languages like English or Chinese. This complicates subword tokenization and cross-lingual alignment in multilingual models.
- SOV word order: Korean follows Subject-Object-Verb order, shared with languages such as Japanese and Turkish, which makes word-order alignment with SVO languages like English problematic. Since syntactic structure influences how multilingual models align representations, this mismatch directly affects transfer quality.
- Topic-prominent structure: Korean marks discourse topics explicitly with particles like "은/는," a feature absent in many Indo-European languages, affecting how information is structured in text and complicating cross-lingual sentence alignment.
- Unique script: Korean uses Hangul, a featural alphabet that encodes phonological information systematically. While this provides internal consistency, it limits character-level overlap with most other languages, making subword vocabulary overlap a particularly important factor.
- Limited prior work: While cross-lingual transfer features have been studied broadly (e.g., Lin et al., 2019; Lauscher et al., 2020), no prior work had conducted a focused, systematic analysis with Korean specifically as the target language, despite Korean being one of the most widely spoken languages with growing NLP demand.
Understanding which linguistic and typological features drive transfer success is essential for practitioners who need to select optimal source languages and configurations when building Korean NLP systems, rather than defaulting to English -- a choice that may be significantly suboptimal given the substantial typological distance between English and Korean.
Proposed Method
The study employs a systematic regression-based framework to quantify the relationship between source-language features and cross-lingual transfer performance to Korean, using multilingual pretrained models across multiple NLP tasks. By treating source-language selection as a feature prediction problem, the approach moves beyond heuristic language choice toward data-driven, quantitative guidance.
1. Feature Extraction
A comprehensive set of linguistic features is extracted for each candidate source language, spanning multiple dimensions: (a) typological features from WALS (World Atlas of Language Structures) -- including word order (SVO/SOV/VSO), morphological type (analytic/agglutinative/fusional), case marking system, and agreement patterns; (b) subword vocabulary overlap computed between each source language and Korean using the tokenizer vocabulary of multilingual models; (c) lexical similarity measures based on cognate overlap and loanword frequency; (d) geographic and genetic distance -- physical distance between language communities and language-family tree distance; and (e) dataset-level statistics such as training set size and label distribution.
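The subword-overlap feature (b) can be sketched with a simple set comparison. The paper does not specify the exact overlap metric, so Jaccard similarity is assumed here; the toy token sets are illustrative stand-ins for vocabularies extracted by running a multilingual tokenizer (e.g., mBERT's WordPiece) over corpora of each language, not real data.

```python
def subword_overlap(source_tokens, target_tokens):
    """Jaccard overlap between two subword vocabularies (assumed metric)."""
    source, target = set(source_tokens), set(target_tokens)
    if not source or not target:
        return 0.0
    return len(source & target) / len(source | target)

# Toy vocabularies: Japanese shares some Sino-Korean loanword subwords
# with Korean, while English shares essentially none.
korean = {"##학", "##교", "학교", "도서", "##관"}
japanese = {"学校", "図書", "##관", "도서"}
english = {"school", "##ing", "library"}

print(subword_overlap(japanese, korean))  # nonzero overlap
print(subword_overlap(english, korean))  # zero overlap
```

In practice the token sets would come from tokenizing comparable corpora, and overlap could be frequency-weighted rather than set-based; the set version above is the minimal form of the idea.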
2. Cross-Lingual Transfer Experiments
Controlled zero-shot cross-lingual transfer experiments are conducted from multiple source languages to Korean using multilingual pretrained models (mBERT, XLM-R). Each experiment isolates the effect of individual features by keeping other variables (model architecture, hyperparameters, data size) constant across languages. Multiple downstream NLP tasks are evaluated: Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and text classification, chosen to span different levels of linguistic analysis from token-level to document-level.
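The controlled design above (vary only the source language; hold model, task, hyperparameters, and data size fixed) can be sketched as an experiment grid. The language list and hyperparameter values below are illustrative, not the study's actual settings.

```python
from itertools import product

# Settings shared by every run so that only the source language varies
# (illustrative values, not the paper's actual hyperparameters).
FIXED_HPARAMS = {"lr": 2e-5, "epochs": 3, "batch_size": 32, "train_size": 10_000}

SOURCE_LANGUAGES = ["en", "ja", "tr", "fi", "zh", "de"]  # illustrative subset
MODELS = ["mbert", "xlm-r"]
TASKS = ["ner", "pos", "classification"]

def build_experiment_grid():
    """One zero-shot run per (model, task, source): fine-tune on the source
    language, evaluate on Korean, holding every other variable constant."""
    return [
        {"model": m, "task": t, "source": s, "target": "ko", **FIXED_HPARAMS}
        for m, t, s in product(MODELS, TASKS, SOURCE_LANGUAGES)
    ]

grid = build_experiment_grid()
print(len(grid))  # 2 models x 3 tasks x 6 sources = 36 runs
```

Keeping the shared settings in one dictionary makes the "everything else constant" constraint explicit and auditable, which is the point of the controlled comparison.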
3. Regression Analysis
Statistical regression models are fitted to quantify the predictive power of each feature on transfer performance. Both simple linear regression (per-feature) and multiple regression (combined features) are used. Correlation coefficients, R-squared values, and feature importance scores are computed to rank which characteristics are most indicative of successful transfer to Korean. This analysis reveals not just which features matter, but their relative contribution and potential interactions.
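The per-feature versus combined-feature comparison can be sketched with ordinary least squares on synthetic data. The feature matrix and scores below are invented for illustration; only the analysis pattern (simple regressions to rank features, one multiple regression to measure their joint predictive power) follows the description above.

```python
import numpy as np

# Synthetic data: rows = source languages, columns = features (e.g., subword
# overlap, word-order match, morphology match); y = transfer score to Korean.
rng = np.random.default_rng(0)
n = 20
X = rng.random((n, 3))
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + 0.05 * rng.standard_normal(n)

def r_squared(X, y):
    """R^2 of an ordinary-least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Simple (per-feature) regressions rank individual predictive strength...
per_feature = [r_squared(X[:, [j]], y) for j in range(X.shape[1])]
# ...while the multiple regression measures combined predictive power,
# which is never below the best single feature's R^2.
combined = r_squared(X, y)
print(per_feature, combined)
```

Since each simple regression uses a subset of the full regressor set, the combined R^2 is mathematically guaranteed to be at least the best per-feature R^2, which mirrors the paper's finding that feature combinations predict transfer better than any single feature.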
4. Source Language Selection Comparison
Existing automated methodologies for selecting optimal source languages -- including LANGRANK (a learned ranking model) and other feature-based selection approaches from the literature -- are compared against the regression-based findings. The study evaluates whether these existing methods accurately identify the best source languages for Korean, and proposes improved selection strategies that better account for Korean-specific linguistic properties such as agglutinative morphology and SOV word order.
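One way to quantify whether a selector like LANGRANK ranks source languages well for Korean is to correlate its ranking with the observed transfer ranking. The scores below are hypothetical placeholders, and Spearman correlation (implemented directly, without tie handling) is an assumed choice of agreement measure.

```python
def rank(values):
    """Ranks (0 = best) for a list of scores; higher score = better."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks

def spearman(a, b):
    """Spearman rho between two score lists (no tie correction)."""
    ra, rb = rank(a), rank(b)
    n = len(ra)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores: a selector that favors English vs. observed transfer
# results that favor typologically similar languages.
selector_scores = {"en": 0.9, "ja": 0.7, "tr": 0.6, "fi": 0.5}
observed_scores = {"ja": 0.82, "tr": 0.79, "fi": 0.66, "en": 0.58}
langs = sorted(selector_scores)
rho = spearman([selector_scores[l] for l in langs],
               [observed_scores[l] for l in langs])
print(rho)  # low agreement would suggest the selector misranks sources for Korean
```

A low or negative rho under this kind of check is the signal that a general-purpose selector is underweighting Korean-specific properties, motivating the improved selection strategies proposed here.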
Experimental Results
The study systematically evaluates which features best predict transfer performance to Korean and compares source-language selection strategies across multiple NLP tasks, providing quantitative evidence for principled source-language selection.
Feature Predictiveness for Transfer to Korean
| Feature Category | Predictive Strength | Key Observation |
| --- | --- | --- |
| Subword Vocabulary Overlap | High | Strong positive correlation with transfer performance across all tasks; most universally predictive single feature |
| Word Order Similarity (SOV) | High | Especially predictive for syntactic tasks (POS tagging); SOV languages consistently outperform SVO/VSO sources |
| Morphological Type | Moderate-High | Agglutinative languages (Japanese, Turkish, Finnish) transfer better than analytic or fusional ones |
| Language Family | Low-Moderate | Surprisingly weak predictor when isolated from structural features; genetic relatedness is confounded with typological similarity |
| Geographic Distance | Low | Proximate languages benefit mainly through shared structural features and cultural borrowing, not proximity itself |
Key Findings by Task
| Task | Most Important Feature | Observation |
| --- | --- | --- |
| NER | Subword Overlap | Entity recognition benefits most from shared subword vocabulary, as named entities often share surface forms across languages |
| POS Tagging | Word Order | Syntactic tasks highly sensitive to structural alignment; SOV source languages yield significantly higher accuracy |
| Text Classification | Subword Overlap | Semantic tasks rely more on representation-level similarity, with subword overlap serving as a proxy for embedding alignment |
- Structural similarity > genealogical relatedness: Language family membership alone is not a strong predictor; languages that share structural properties with Korean (agglutinative morphology, SOV order) transfer better regardless of genetic distance. This finding challenges the intuitive assumption that related languages always transfer best.
- Japanese and Turkish as top source languages: These languages consistently rank among the best sources for Korean across all evaluated tasks, owing to shared SOV word order and agglutinative morphology. Japanese additionally benefits from significant lexical borrowing (Sino-Japanese vocabulary), while Turkish demonstrates that typological similarity alone -- without geographic proximity or shared vocabulary -- can drive strong transfer.
- Task-dependent feature importance: Syntactic tasks (POS tagging) are most sensitive to word order similarity, while semantic tasks (NER, classification) depend more on subword vocabulary overlap. This means there is no single "best" source language across all tasks.
- English is not optimal: Despite being the most common source language in practice, English (SVO, analytic morphology) is suboptimal for transfer to Korean compared to typologically closer alternatives. This finding has direct practical implications, as many multilingual NLP pipelines default to English as the training language.
- Improved source selection: The proposed regression-based approach identifies better source languages than existing automated selection methods like LANGRANK, particularly for Korean-specific transfer scenarios where the unique combination of SOV order and agglutinative morphology is underweighted by general-purpose selection tools.
- Feature interactions matter: Combining multiple features in a regression model yields better predictions than any single feature alone, suggesting that effective source-language selection requires considering a holistic profile of typological properties rather than relying on any one dimension.
Why It Matters
This work provides actionable insights for building better Korean NLP systems through cross-lingual transfer, with broader implications for multilingual NLP methodology:
- Practical source language guidance: Rather than defaulting to English, practitioners should prioritize typologically similar languages like Japanese and Turkish that share Korean's SOV word order and agglutinative morphology, leading to measurably better transfer performance. This is immediately applicable to any team building Korean NLP tools in low-resource scenarios.
- Task-aware selection: The finding that feature importance varies by task type means that the optimal source language may differ depending on whether one is building an NER system (prioritize subword overlap), a POS tagger (prioritize word order), or a text classifier -- one size does not fit all, and task-specific source selection can yield meaningful gains.
- Beyond language families: The demonstration that structural similarity outweighs genealogical relatedness challenges the common assumption that closely related languages are always the best transfer sources. Since Korean is a language isolate (or controversially linked to Altaic languages), this finding is especially relevant -- it shows that structural "neighbors" like Turkish can be more useful than any purported genetic relative.
- Methodological contribution: The regression-based analysis framework can be adapted to study cross-lingual transfer to other target languages, providing a reusable methodology for the multilingual NLP community to identify optimal source languages in a principled, data-driven manner.