A comprehensive analysis showing that training multilingual models with three linguistically diverse source languages -- selected via typological features rather than data size -- significantly improves cross-lingual transfer to unseen target languages.
Cross-lingual transfer (XLT) is a crucial technique for bringing NLP capabilities to low-resource languages by leveraging labeled data from resource-rich source languages. The standard practice -- Single-Source Language Training (SSLT) -- typically fine-tunes a multilingual model on English data alone and then applies it to target languages. While effective, this approach leaves substantial room for improvement, particularly for typologically distant target languages like Thai, Finnish, or Korean.
Previous research has shown that multilingual language models can separate language-specific information from language-agnostic features in their internal representations. This raises a natural question: can training on multiple source languages simultaneously strengthen these language-agnostic features and improve transfer? Although some prior studies have used multiple source languages, they lacked systematic investigation into why certain combinations work, how many languages are optimal, and what criteria should guide language selection.
Key Hypothesis: Using multiple source languages in cross-lingual transfer leads to increased mingling of embedding spaces for different languages, producing more language-agnostic representations and thus stronger transfer to unseen target languages. However, arbitrary language combinations do not guarantee improvements -- careful selection based on linguistic diversity is essential.
The authors systematically investigate Multi-Source Language Training (MSLT) by controlling total data volume across conditions to ensure fair comparison. For example, SSLT with English uses 1,000 samples, while MSLT with English and Spanish uses 500 samples from each -- keeping the total training budget constant. This isolates the effect of language diversity from data quantity.
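A minimal sketch of this budget-matched sampling, assuming a hypothetical `load_fn(lang)` that returns the labeled training examples for a language (not part of the paper's released code):

```python
import random

def build_training_set(source_langs, load_fn, total_budget=1000, seed=0):
    """Split a fixed sample budget evenly across the chosen source languages.

    SSLT: build_training_set(["en"], load_fn)        -> 1000 English samples
    MSLT: build_training_set(["en", "es"], load_fn)  -> 500 English + 500 Spanish
    """
    rng = random.Random(seed)
    per_lang = total_budget // len(source_langs)
    mixed = []
    for lang in source_langs:
        pool = load_fn(lang)                  # hypothetical loader for labeled data in `lang`
        mixed.extend(rng.sample(pool, per_lang))
    rng.shuffle(mixed)                        # interleave languages before fine-tuning
    return mixed
```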
Experiments were conducted across six benchmarks covering eight target languages. The source-language pool consisted of Arabic (ar), German (de), English (en), Spanish (es), French (fr), Russian (ru), and Chinese (zh).
Regardless of task type or data quantity, performance improves markedly as the number of source languages increases from one to three. It then plateaus or slightly declines beyond three languages, establishing three source languages as the practical optimum.
How the source languages are chosen matters as much as how many are used. Ranking the combination picked by each selection criterion against all candidate three-language combinations (a lower rank means the selected combination performs better on that benchmark), Lang2Vec typological features consistently beat data-size and vocabulary-coverage heuristics:

| Selection Method | WikiANN | XNLI | XCOPA | XWinograd | XStoryCloze |
|---|---|---|---|---|---|
| Pretraining Data Size | Rank 31 | Rank 16 | Rank 12 | Rank 14 | Rank 22 |
| Vocabulary Coverage | Rank 31 | Rank 26 | Rank 18 | Rank 20 | Rank 15 |
| Lang2Vec - Syntax | Rank 3 | Rank 2 | Rank 4 | Rank 7 | Rank 2 |
| Lang2Vec - Phonology | Rank 8 | Rank 7 | Rank 27 | Rank 1 | Rank 18 |
| Lang2Vec - Inventory | Rank 2 | Rank 2 | Rank 11 | Rank 3 | Rank 10 |
Writing-system diversity among the three source languages also matters: performance rises as the source languages span more distinct scripts (a simple script check is sketched after the table):

| Writing System Configuration | WikiANN | XNLI |
|---|---|---|
| All same script (e.g., Latin + Latin + Latin) | 72.26% | 80.36% |
| Two different scripts | 72.68% | 82.19% |
| All different scripts (e.g., Latin + Arabic + Hanzi) | 73.07% | 84.02% |
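As a quick illustration (not from the paper), a candidate combination's script diversity can be checked with a small lookup over the source pool; the `SCRIPT` mapping below is an assumption based on each language's standard writing system:

```python
# Standard scripts of the seven source languages in the pool (assumed mapping).
SCRIPT = {
    "ar": "Arabic", "de": "Latin", "en": "Latin", "es": "Latin",
    "fr": "Latin", "ru": "Cyrillic", "zh": "Han",
}

def script_diversity(langs):
    """Number of distinct writing systems covered by a language combination."""
    return len({SCRIPT[lang] for lang in langs})

print(script_diversity(["en", "es", "fr"]))  # 1 -> all Latin
print(script_diversity(["en", "ar", "zh"]))  # 3 -> all different scripts
```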
Putting it together, the choice of selection criterion translates into large differences in end-task performance:

| Configuration | F1 Score |
|---|---|
| SSLT (English only) | 76.30 |
| MSLT (best pretraining-based selection) | 78.52 |
| MSLT (optimal Lang2Vec selection) | 87.05 |
This work provides the first comprehensive analysis of multi-source language training for cross-lingual transfer, transforming MSLT from an ad-hoc practice into a principled, evidence-based strategy. The key practical takeaway is clear: select three linguistically diverse source languages with different writing systems using Lang2Vec-based typological features, rather than relying on data-size or vocabulary-overlap heuristics.
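A minimal sketch of such typology-driven selection, assuming the `lang2vec` package and its `get_features` interface (the exact procedure in the paper may differ, and the ISO 639-3 codes below, e.g. "zho" for Chinese, may need adjusting to the URIEL release): it picks the three-language combination with the largest average pairwise cosine distance between syntax feature vectors.

```python
from itertools import combinations

import numpy as np
import lang2vec.lang2vec as l2v

# Candidate source pool (ISO 639-3 codes assumed by lang2vec / URIEL).
POOL = ["ara", "deu", "eng", "spa", "fra", "rus", "zho"]

# KNN-imputed syntactic typology vector for each candidate language.
feats = l2v.get_features(POOL, "syntax_knn")
vecs = {lang: np.asarray(feats[lang], dtype=float) for lang in POOL}

def avg_pairwise_distance(langs):
    """Mean cosine distance over all language pairs in the combination."""
    dists = []
    for a, b in combinations(langs, 2):
        va, vb = vecs[a], vecs[b]
        cos = va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))
        dists.append(1.0 - cos)
    return float(np.mean(dists))

# Pick the typologically most diverse 3-language combination.
best = max(combinations(POOL, 3), key=avg_pairwise_distance)
print(best)
```

Other URIEL feature sets (e.g. "inventory_knn" or "phonology_knn") can be swapped in for the syntax vectors, mirroring the Lang2Vec variants compared in the ranking table above.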
The findings generalize robustly across model architectures (encoder-only XLM-RoBERTa and decoder-only BLOOM-7B), diverse NLP tasks (NER, natural language inference, paraphrase detection, commonsense reasoning), and training paradigms (standard fine-tuning, instruction-tuning, and parameter-efficient QLoRA). This broad applicability makes the paper a practical reference for anyone building NLP systems for low-resource languages, offering a simple yet effective way to significantly improve cross-lingual transfer performance without requiring additional data collection.