One-Line Summary
Fine-tuning multilingual models on typologically diverse source languages rather than English alone significantly improves zero-shot cross-lingual transfer -- especially to distant target languages -- yielding more language-agnostic representations without any additional annotation effort.
Background & Motivation
Zero-shot cross-lingual transfer is the dominant paradigm for extending NLP capabilities to low-resource languages: a multilingual pretrained model (e.g., mBERT, XLM-R) is fine-tuned on labeled data in a single source language and then applied directly to unseen target languages. In practice, the source language is almost always English.
Why English-only training is problematic:
- Structural bias: English has relatively fixed SVO word order and limited morphology; fine-tuning exclusively on English may reinforce these structural assumptions, hurting transfer to SOV, VSO, or morphologically rich languages.
- Typological distance: Languages far from English (e.g., Korean, Turkish, Finnish) share fewer syntactic and morphological properties, making English a poor proxy for learning universal features.
- Wasted multilingual data: Labeled datasets exist in many languages (e.g., XNLI covers 15 languages), yet the standard practice discards all non-English annotations during fine-tuning.
- Representation collapse: Monolingual fine-tuning can degrade the cross-lingual alignment built during pretraining, a phenomenon sometimes called "catastrophic forgetting" of multilingual structure.
This work asks a simple but impactful question: Can we improve cross-lingual transfer by training on multiple, typologically diverse source languages instead of English alone? The hypothesis is that exposing the model to varied linguistic structures during fine-tuning will produce representations that generalize better across the full spectrum of target languages.
Typological Diversity Dimensions
Languages vary along multiple typological axes, each of which influences what structural patterns a model learns during fine-tuning:
| Dimension | English | Diverse Sources | Impact on Transfer |
|---|---|---|---|
| Word order | SVO (fixed) | SVO, SOV, VSO, VOS | Order-independent feature learning |
| Morphology | Analytic (limited inflection) | Agglutinative, fusional, isolating | Subword-level generalization |
| Case system | Minimal (pronoun case only) | Rich case marking (e.g., Finnish, Turkish) | Relational encoding beyond position |
| Script | Latin | Latin, Cyrillic, CJK, Arabic, etc. | Script-independent representations |
English covers only a narrow band of this typological space. By including languages from multiple regions of this space, the model is forced to learn features that are truly universal rather than English-specific.
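To make "typological distance" concrete, below is a minimal sketch that scores how far a language sits from English along the dimensions in the table above. The feature codings are illustrative assumptions made for this example; in practice they might come from a typological database such as WALS or URIEL.

```python
# Minimal sketch: quantifying typological distance from English with a few
# hand-coded features. The feature values below are illustrative assumptions,
# not an authoritative typological coding.

FEATURES = ("word_order", "morphology", "case_marking", "script")

LANGS = {
    "en": ("SVO", "analytic",      "minimal", "Latin"),
    "de": ("SVO", "fusional",      "rich",    "Latin"),
    "ru": ("SVO", "fusional",      "rich",    "Cyrillic"),
    "ar": ("VSO", "fusional",      "rich",    "Arabic"),
    "tr": ("SOV", "agglutinative", "rich",    "Latin"),
    "ko": ("SOV", "agglutinative", "rich",    "Hangul"),
}

def distance(lang_a: str, lang_b: str) -> float:
    """Fraction of typological features on which two languages disagree."""
    a, b = LANGS[lang_a], LANGS[lang_b]
    return sum(x != y for x, y in zip(a, b)) / len(FEATURES)

if __name__ == "__main__":
    for lang in LANGS:
        print(f"distance(en, {lang}) = {distance('en', lang):.2f}")
```

Under this toy coding, German lands close to English while Korean and Turkish disagree on nearly every dimension, mirroring the distance groupings used in the results below.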
Proposed Method
The approach replaces the standard single-source (English-only) fine-tuning pipeline with a multi-source diversified training strategy. The key design decisions are as follows:
1. Typological Language Selection
Source languages are selected to maximize typological diversity across multiple dimensions: language family (e.g., Indo-European, Uralic, Altaic, Sino-Tibetan), dominant word order (SVO, SOV, VSO), morphological typology (isolating, agglutinative, fusional), and script system. This ensures the model encounters a broad range of linguistic phenomena during fine-tuning. The selection process prioritizes coverage -- choosing languages that fill gaps in the typological space rather than adding redundant representatives from the same family.
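As an illustration of coverage-oriented selection, the sketch below greedily picks the source language that fills the most uncovered cells of a small typological feature grid. The candidate set, feature codings, and the `select_sources` helper are hypothetical; a real selection would also weigh data availability and annotation quality.

```python
# Illustrative greedy selection of source languages maximizing typological coverage.
# Feature codings are placeholder assumptions, not an authoritative typology.

CANDIDATES = {
    #      (word order, morphology,     script)
    "en": ("SVO", "analytic",      "Latin"),
    "de": ("SVO", "fusional",      "Latin"),
    "fr": ("SVO", "fusional",      "Latin"),
    "ru": ("SVO", "fusional",      "Cyrillic"),
    "ar": ("VSO", "fusional",      "Arabic"),
    "zh": ("SVO", "isolating",     "Han"),
    "tr": ("SOV", "agglutinative", "Latin"),
    "ko": ("SOV", "agglutinative", "Hangul"),
}

def covered(selection):
    """Set of (dimension, value) cells covered by the selected languages."""
    return {(i, v) for lang in selection for i, v in enumerate(CANDIDATES[lang])}

def select_sources(k, seed="en"):
    """Greedily add the language that fills the most uncovered typological cells."""
    chosen = [seed]
    while len(chosen) < k:
        best = max(
            (lang for lang in CANDIDATES if lang not in chosen),
            key=lambda lang: len(covered(chosen + [lang]) - covered(chosen)),
        )
        chosen.append(best)
    return chosen

if __name__ == "__main__":
    print(select_sources(4))  # e.g. ['en', 'ar', 'ko', 'zh'] under these codings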
2. Multi-Source Data Combination
Labeled training data from the selected source languages is combined into a single training set. Balancing strategies (e.g., proportional or equal sampling) are applied to prevent high-resource languages from dominating the training signal, ensuring that each language contributes meaningfully to the learned representations. This balancing is crucial: without it, data-rich languages like English or Chinese would overwhelm smaller-resource languages, negating the diversity benefit.
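A minimal sketch of equal sampling is shown below, assuming the labeled data has already been loaded into per-language lists of example dicts. The `combine_balanced` helper and the toy data are illustrative, not part of the original pipeline.

```python
import random

def combine_balanced(datasets, per_lang=None, seed=0):
    """Equal sampling: draw the same number of examples from every source language."""
    rng = random.Random(seed)
    if per_lang is None:
        # Cap every language at the size of the smallest corpus so the combined
        # set stays balanced; total size can be held constant by fixing per_lang.
        per_lang = min(len(v) for v in datasets.values())
    combined = []
    for lang, examples in datasets.items():
        for ex in rng.sample(examples, min(per_lang, len(examples))):
            combined.append({**ex, "source_lang": lang})
    rng.shuffle(combined)
    return combined

# Toy usage; real inputs would be e.g. XNLI premise/hypothesis/label examples.
toy = {
    "en": [{"text": f"en-{i}"} for i in range(1000)],
    "tr": [{"text": f"tr-{i}"} for i in range(400)],
    "ar": [{"text": f"ar-{i}"} for i in range(700)],
}
train_set = combine_balanced(toy)  # 3 x 400 = 1,200 examples, equally mixed
```

Proportional or temperature-based sampling would follow the same pattern, only changing how `per_lang` is chosen for each language.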
3. Unified Multilingual Fine-Tuning
A single multilingual pretrained model (e.g., XLM-R) is fine-tuned on the combined multi-source data in a standard supervised fashion. This avoids the complexity of multi-model ensembles or language-pair-specific adapters, keeping the approach simple and practical for real-world deployment. The unified training encourages the model to find shared feature representations across languages, strengthening the cross-lingual alignment from pretraining rather than degrading it.
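Below is a hedged sketch of this unified fine-tuning step using Hugging Face Transformers, XLM-R, and the XNLI data mentioned earlier. The particular source/target language mix, per-language example cap, and hyperparameters are illustrative assumptions, not the exact experimental setup.

```python
# Sketch of unified multi-source fine-tuning with Hugging Face Transformers.
# Dataset/column names follow XNLI as hosted on the Hub; language mix and
# hyperparameters are illustrative choices.

from datasets import load_dataset, concatenate_datasets
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

SOURCE_LANGS = ["en", "ar", "zh", "hi"]  # typologically diverse sources (illustrative)
MODEL_NAME = "xlm-roberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

def encode(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

# Equal sampling: take the same number of examples per source language,
# then concatenate into a single multi-source training set.
per_lang = 20_000
train = concatenate_datasets(
    [load_dataset("xnli", lang, split=f"train[:{per_lang}]") for lang in SOURCE_LANGS]
).shuffle(seed=0).map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-multisource", num_train_epochs=2,
                           per_device_train_batch_size=32, learning_rate=2e-5),
    train_dataset=train,
    tokenizer=tokenizer,
)
trainer.train()

# Zero-shot evaluation: apply the fine-tuned model directly to unseen target
# languages (reports eval loss here; pass compute_metrics for accuracy).
for target in ["tr", "el", "sw", "ur"]:
    test = load_dataset("xnli", target, split="test").map(encode, batched=True)
    print(target, trainer.evaluate(eval_dataset=test))
```

Setting `SOURCE_LANGS = ["en"]` recovers the English-only baseline described in the Background, so the single-source vs. multi-source comparison reduces to a one-line change in the data composition.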
Design Principles:
- No architectural changes: The method uses the same model architecture and training procedure as single-source fine-tuning; only the training data composition changes.
- No additional annotation: It leverages existing multilingual labeled datasets (e.g., XNLI, PAWS-X) that are already available but typically ignored in favor of English-only training.
- Scalable: Adding more source languages is straightforward and does not require any language-specific components or hyperparameter tuning.
- Preserves cross-lingual alignment: Multi-source fine-tuning reinforces, rather than degrades, the multilingual structure built during pretraining -- mitigating the catastrophic forgetting problem that single-source fine-tuning can cause.
Experimental Results
Experiments compare single-source (English-only) and multi-source (diversified) fine-tuning on cross-lingual NLU benchmarks, evaluating zero-shot transfer to multiple target languages with a multilingual pretrained model.
Key Findings
| Comparison | Observation |
|---|---|
| Multi-source vs. English-only | Multi-source training consistently outperforms English-only fine-tuning across target languages |
| Distant target languages | The largest gains appear for typologically distant languages (e.g., Korean, Turkish) where English is a poor proxy |
| Close target languages | Languages closely related to English (e.g., German, French) still benefit, though improvements are more modest |
| Task complexity | Tasks requiring deeper syntactic/semantic understanding show larger improvements from diversification |
Impact by Typological Distance
| Target Language Group | English-Only Transfer | Diversified Transfer | Relative Gain |
|---|---|---|---|
| Close to English (Germanic, Romance) | Strong | Slightly improved | Small |
| Moderate distance (Slavic, Semitic) | Moderate | Improved | Moderate |
| Distant (Korean, Turkish, Finnish) | Weak | Substantially improved | Largest |
- Consistent improvement: Diversified source training yields gains across the board, not just for specific language pairs, confirming that the benefit stems from more language-agnostic representations rather than surface-level feature overlap.
- Typological distance matters most: The further a target language is from English, the greater the relative improvement from diversification -- exactly the scenario where English-only training is weakest.
- Minimal overhead: The multi-source approach adds negligible computational cost since training data size can be kept constant via balanced sampling, meaning the gains come essentially "for free."
- Syntactic and semantic gains: Tasks involving complex syntactic structures or cross-lingual semantic alignment benefit disproportionately, suggesting that diverse source exposure helps the model learn deeper structural patterns rather than relying on superficial lexical cues.
- Preserved alignment: Unlike English-only fine-tuning, multi-source training maintains or even strengthens the cross-lingual alignment established during pretraining, reducing the catastrophic forgetting effect.
Why It Matters
This study challenges the default assumption in multilingual NLP that English is the optimal (or only necessary) source language for cross-lingual transfer, and offers a practical alternative:
- Low-hanging fruit for practitioners: Simply changing which languages are included in the training set -- using already-available multilingual data -- can improve transfer performance without any model modifications or extra annotation.
- Better coverage of low-resource languages: Languages that are typologically distant from English have historically been the worst served by cross-lingual transfer; this approach specifically targets that gap.
- Rethinking the English-centric paradigm: The results provide concrete evidence that the field's heavy reliance on English-only fine-tuning leaves performance on the table, motivating more thoughtful source language selection in future multilingual research.
- Foundation for future work: The findings open directions for investigating optimal language selection criteria, adaptive sampling strategies, and the interaction between source diversity and model scale.
Practical Recommendation: For researchers and practitioners building cross-lingual NLP systems, this work recommends a simple change to the standard pipeline: instead of fine-tuning on English alone, include training data from 3-5 typologically diverse languages covering different word orders, morphological types, and language families. The resulting model will transfer better to a wider range of target languages, with the greatest benefits for the languages that need it most -- those that are least similar to English.