
Enhanced Zero-Shot Cross-Lingual Transfer with the Diversification of Source Languages

Korea Computer Congress 2023 (KCC 2023)
Seong Hoon Lim, Taeuk Kim

One-Line Summary

Fine-tuning multilingual models on typologically diverse source languages rather than English alone significantly improves zero-shot cross-lingual transfer -- especially to distant target languages -- yielding more language-agnostic representations without any additional annotation effort.

Background & Motivation

Zero-shot cross-lingual transfer is the dominant paradigm for extending NLP capabilities to low-resource languages: a multilingual pretrained model (e.g., mBERT, XLM-R) is fine-tuned on labeled data in a single source language and then applied directly to unseen target languages. In practice, the source language is almost always English.

Why English-only training is problematic:

  • Structural bias: English has relatively fixed SVO word order and limited morphology; fine-tuning exclusively on English may reinforce these structural assumptions, hurting transfer to SOV, VSO, or morphologically rich languages.
  • Typological distance: Languages far from English (e.g., Korean, Turkish, Finnish) share fewer syntactic and morphological properties, making English a poor proxy for learning universal features.
  • Wasted multilingual data: Labeled datasets exist in many languages (e.g., XNLI covers 15 languages), yet the standard practice discards all non-English annotations during fine-tuning.
  • Representation collapse: Monolingual fine-tuning can degrade the cross-lingual alignment built during pretraining, a phenomenon sometimes called "catastrophic forgetting" of multilingual structure.

This work asks a simple but impactful question: Can we improve cross-lingual transfer by training on multiple, typologically diverse source languages instead of English alone? The hypothesis is that exposing the model to varied linguistic structures during fine-tuning will produce representations that generalize better across the full spectrum of target languages.

Typological Diversity Dimensions

Languages vary along multiple typological axes, each of which influences what structural patterns a model learns during fine-tuning:

| Dimension | English | Diverse Sources | Impact on Transfer |
| --- | --- | --- | --- |
| Word order | SVO (fixed) | SVO, SOV, VSO, VOS | Order-independent feature learning |
| Morphology | Analytic (limited inflection) | Agglutinative, fusional, isolating | Subword-level generalization |
| Case system | Minimal (pronoun case only) | Rich case marking (e.g., Finnish, Turkish) | Relational encoding beyond position |
| Script | Latin | Latin, Cyrillic, CJK, Arabic, etc. | Script-independent representations |

English covers only a narrow band of this typological space. By including languages from multiple regions of this space, the model is forced to learn features that are truly universal rather than English-specific.
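The notion of typological distance underlying the table can be sketched as a simple distance over coarse feature vectors. The feature values below are illustrative simplifications, not the paper's actual criteria:

```python
# A minimal sketch of typological distance over the dimensions above.
# Feature values are coarse illustrations, not the paper's data.
# Tuple order: (word order, morphology, case marking, script)
FEATURES = {
    "English": ("SVO", "analytic", "minimal", "Latin"),
    "German":  ("SVO", "fusional", "rich", "Latin"),
    "Korean":  ("SOV", "agglutinative", "rich", "Hangul"),
    "Turkish": ("SOV", "agglutinative", "rich", "Latin"),
}

def typological_distance(lang_a: str, lang_b: str) -> int:
    """Hamming distance over the coarse feature vectors."""
    a, b = FEATURES[lang_a], FEATURES[lang_b]
    return sum(x != y for x, y in zip(a, b))
```

Under these toy vectors, Korean differs from English on all four dimensions while German differs on only two, mirroring the close/distant distinction used throughout this summary.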

Proposed Method

The approach replaces the standard single-source (English-only) fine-tuning pipeline with a multi-source diversified training strategy. The key design decisions are as follows:

1. Typological Language Selection
Source languages are selected to maximize typological diversity across multiple dimensions: language family (e.g., Indo-European, Uralic, Altaic, Sino-Tibetan), dominant word order (SVO, SOV, VSO), morphological typology (isolating, agglutinative, fusional), and script system. This ensures the model encounters a broad range of linguistic phenomena during fine-tuning. The selection process prioritizes coverage -- choosing languages that fill gaps in the typological space rather than adding redundant representatives from the same family.
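The coverage-first selection idea can be approximated by a greedy procedure: repeatedly pick the candidate language that covers the most typological feature values not yet represented. This is an illustrative sketch, not the paper's selection algorithm, and the feature vectors are invented for the example:

```python
# Hypothetical greedy sketch of coverage-first language selection.
# Feature tuples (word order, morphology, script) are illustrative.
FEATURES = {
    "English": ("SVO", "analytic", "Latin"),
    "French":  ("SVO", "fusional", "Latin"),
    "Turkish": ("SOV", "agglutinative", "Latin"),
    "Korean":  ("SOV", "agglutinative", "Hangul"),
    "Arabic":  ("VSO", "fusional", "Arabic"),
}

def select_diverse(candidates: dict, k: int) -> list:
    """Greedily choose k languages maximizing typological coverage."""
    covered, chosen = set(), []
    pool = dict(candidates)
    for _ in range(k):
        # Pick the language adding the most unseen feature values.
        best = max(pool, key=lambda lang: len(set(pool[lang]) - covered))
        chosen.append(best)
        covered |= set(pool.pop(best))
    return chosen
```

Note how the greedy step skips redundant relatives: once English is in, French adds little new coverage, so a typologically distant language is preferred.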
2. Multi-Source Data Combination
Labeled training data from the selected source languages is combined into a single training set. Balancing strategies (e.g., proportional or equal sampling) are applied to prevent high-resource languages from dominating the training signal, ensuring that each language contributes meaningfully to the learned representations. This balancing is crucial: without it, data-rich languages like English or Chinese would overwhelm smaller-resource languages, negating the diversity benefit.
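The equal-sampling variant of this balancing step can be sketched as downsampling every language's labeled set to the size of the smallest one before concatenation. Dataset contents here are placeholders, not the paper's data:

```python
import random

# Minimal sketch of equal-sampling balancing: cap every source
# language at the size of the smallest labeled set, then shuffle
# the concatenation so no language dominates the training signal.
def balance_equal(datasets: dict, seed: int = 0) -> list:
    """datasets: dict mapping language -> list of labeled examples."""
    rng = random.Random(seed)
    n = min(len(examples) for examples in datasets.values())
    combined = []
    for lang, examples in datasets.items():
        combined.extend(rng.sample(examples, n))  # n per language
    rng.shuffle(combined)
    return combined
```

Proportional sampling would instead draw from each language in proportion to a smoothed share of its data; the equal variant is the simplest way to guarantee every language an identical footprint.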
3. Unified Multilingual Fine-Tuning
A single multilingual pretrained model (e.g., XLM-R) is fine-tuned on the combined multi-source data in a standard supervised fashion. This avoids the complexity of multi-model ensembles or language-pair-specific adapters, keeping the approach simple and practical for real-world deployment. The unified training encourages the model to find shared feature representations across languages, strengthening the cross-lingual alignment from pretraining rather than degrading it.
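One way unified training can keep every language in play is a stratified batch sampler that places an equal number of examples from each source language in every mini-batch, so a single model's updates always mix languages. This sampler is an illustrative detail, not the paper's training code; in a real pipeline each batch would feed a multilingual model such as XLM-R with a standard classification head and loss:

```python
import random

# Hedged sketch: stratified mini-batches drawn from the combined
# multi-source pool, with an equal per-language share in each batch.
def stratified_batches(datasets: dict, per_lang: int, seed: int = 0):
    """datasets: dict mapping language -> list of examples.
    Yields batches of size per_lang * len(datasets)."""
    rng = random.Random(seed)
    shuffled = {l: rng.sample(v, len(v)) for l, v in datasets.items()}
    n_batches = min(len(v) for v in datasets.values()) // per_lang
    for b in range(n_batches):
        batch = []
        for lang, examples in shuffled.items():
            batch.extend(examples[b * per_lang:(b + 1) * per_lang])
        yield batch
```

Because every gradient step sees several typologically distinct languages at once, the model cannot specialize its representations to any single source language.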

Design Principles:

  • No architectural changes: The method uses the same model architecture and training procedure as single-source fine-tuning; only the training data composition changes.
  • No additional annotation: It leverages existing multilingual labeled datasets (e.g., XNLI, PAWS-X) that are already available but typically ignored in favor of English-only training.
  • Scalable: Adding more source languages is straightforward and does not require any language-specific components or hyperparameter tuning.
  • Preserves cross-lingual alignment: Multi-source fine-tuning reinforces, rather than degrades, the multilingual structure built during pretraining -- mitigating the catastrophic forgetting problem that single-source fine-tuning can cause.

Experimental Results

Experiments compare single-source (English-only) and multi-source (diversified) fine-tuning on cross-lingual NLU benchmarks, evaluating zero-shot transfer to multiple target languages with a multilingual pretrained model.

Key Findings

| Comparison | Observation |
| --- | --- |
| Multi-source vs. English-only | Multi-source training consistently outperforms English-only fine-tuning across target languages |
| Distant target languages | The largest gains appear for typologically distant languages (e.g., Korean, Turkish) where English is a poor proxy |
| Close target languages | Languages closely related to English (e.g., German, French) still benefit, though improvements are more modest |
| Task complexity | Tasks requiring deeper syntactic/semantic understanding show larger improvements from diversification |

Impact by Typological Distance

| Target Language Group | English-Only Transfer | Diversified Transfer | Relative Gain |
| --- | --- | --- | --- |
| Close to English (Germanic, Romance) | Strong | Slightly improved | Small |
| Moderate distance (Slavic, Semitic) | Moderate | Improved | Moderate |
| Distant (Korean, Turkish, Finnish) | Weak | Substantially improved | Largest |

Why It Matters

This study challenges the default assumption in multilingual NLP that English is the optimal (or only necessary) source language for cross-lingual transfer, and offers a practical alternative:

Practical Recommendation: For researchers and practitioners building cross-lingual NLP systems, this work recommends a simple change to the standard pipeline: instead of fine-tuning on English alone, include training data from 3-5 typologically diverse languages covering different word orders, morphological types, and language families. The resulting model will transfer better to a wider range of target languages, with the greatest benefits for the languages that need it most -- those that are least similar to English.
