A comprehensive analysis showing that training multilingual models with three linguistically diverse source languages -- selected via typological features rather than data size -- significantly improves cross-lingual transfer to unseen target languages.
Cross-lingual transfer (XLT) is a crucial technique for bringing NLP capabilities to low-resource languages by leveraging labeled data from resource-rich source languages. The standard practice -- Single-Source Language Training (SSLT) -- typically fine-tunes a multilingual model on English data alone and then applies it to target languages. While effective, this approach leaves substantial room for improvement, particularly for typologically distant target languages like Thai, Finnish, or Korean.
Previous research has shown that multilingual language models can separate language-specific information from language-agnostic features in their internal representations. This raises a natural question: can training on multiple source languages simultaneously strengthen these language-agnostic features and improve transfer? Although some prior studies have used multiple source languages, they lacked systematic investigation into why certain combinations work, how many languages are optimal, and what criteria should guide language selection.
Key Hypothesis: Using multiple source languages in cross-lingual transfer leads to increased mingling of embedding spaces for different languages, producing more language-agnostic representations and thus stronger transfer to unseen target languages. However, arbitrary language combinations do not guarantee improvements -- careful selection based on linguistic diversity is essential.
The authors systematically investigate Multi-Source Language Training (MSLT) by controlling total data volume across conditions to ensure fair comparison. For example, SSLT with English uses 1,000 samples, while MSLT with English and Spanish uses 500 samples from each -- keeping the total training budget constant. This isolates the effect of language diversity from data quantity.
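A minimal sketch of this budget-matched sampling, assuming a hypothetical `load_fn(lang)` that returns the labeled training examples for a language (not part of the paper's released code):

```python
import random

def build_training_set(source_langs, load_fn, total_budget=1000, seed=0):
    """Split a fixed sample budget evenly across the chosen source languages.

    SSLT: build_training_set(["en"], load_fn)        -> 1000 English samples
    MSLT: build_training_set(["en", "es"], load_fn)  -> 500 English + 500 Spanish
    """
    rng = random.Random(seed)
    per_lang = total_budget // len(source_langs)
    mixed = []
    for lang in source_langs:
        pool = load_fn(lang)                  # hypothetical loader for labeled data in `lang`
        mixed.extend(rng.sample(pool, per_lang))
    rng.shuffle(mixed)                        # interleave languages before fine-tuning
    return mixed
```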
Experiments were conducted across six benchmarks covering eight target languages. The source-language pool consisted of Arabic (ar), German (de), English (en), Spanish (es), French (fr), Russian (ru), and Chinese (zh).
Regardless of task type or data quantity, performance improves markedly as the number of source languages increases from one to three. It then plateaus or slightly declines beyond three languages, establishing three source languages as the practical optimum.
How the source languages are chosen matters as much as how many are used. Ranking the combination picked by each selection criterion against all candidate three-language combinations (a lower rank means the selected combination performs better on that benchmark), Lang2Vec typological features consistently beat data-size and vocabulary-coverage heuristics:

| Selection Method | WikiANN | XNLI | XCOPA | XWinograd | XStoryCloze |
|---|---|---|---|---|---|
| Pretraining Data Size | Rank 31 | Rank 16 | Rank 12 | Rank 14 | Rank 22 |
| Vocabulary Coverage | Rank 31 | Rank 26 | Rank 18 | Rank 20 | Rank 15 |
| Lang2Vec - Syntax | Rank 3 | Rank 2 | Rank 4 | Rank 7 | Rank 2 |
| Lang2Vec - Phonology | Rank 8 | Rank 7 | Rank 27 | Rank 1 | Rank 18 |
| Lang2Vec - Inventory | Rank 2 | Rank 2 | Rank 11 | Rank 3 | Rank 10 |
Writing-system diversity among the three source languages also matters: performance rises as the source languages span more distinct scripts (a simple script check is sketched after the table):

| Writing System Configuration | WikiANN | XNLI |
|---|---|---|
| All same script (e.g., Latin + Latin + Latin) | 72.26% | 80.36% |
| Two different scripts | 72.68% | 82.19% |
| All different scripts (e.g., Latin + Arabic + Hanzi) | 73.07% | 84.02% |
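As a quick illustration (not from the paper), a candidate combination's script diversity can be checked with a small lookup over the source pool; the `SCRIPT` mapping below is an assumption based on each language's standard writing system:

```python
# Standard scripts of the seven source languages in the pool (assumed mapping).
SCRIPT = {
    "ar": "Arabic", "de": "Latin", "en": "Latin", "es": "Latin",
    "fr": "Latin", "ru": "Cyrillic", "zh": "Han",
}

def script_diversity(langs):
    """Number of distinct writing systems covered by a language combination."""
    return len({SCRIPT[lang] for lang in langs})

print(script_diversity(["en", "es", "fr"]))  # 1 -> all Latin
print(script_diversity(["en", "ar", "zh"]))  # 3 -> all different scripts
```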
Putting it together, the choice of selection criterion translates into large differences in end-task performance:

| Configuration | F1 Score |
|---|---|
| SSLT (English only) | 76.30 |
| MSLT (best pretraining-based selection) | 78.52 |
| MSLT (optimal Lang2Vec selection) | 87.05 |
This work provides the first comprehensive analysis of multi-source language training for cross-lingual transfer, transforming MSLT from an ad-hoc practice into a principled, evidence-based strategy. The key practical takeaway is clear: select three linguistically diverse source languages with different writing systems using Lang2Vec-based typological features, rather than relying on data-size or vocabulary-overlap heuristics.
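A minimal sketch of such typology-driven selection, assuming the `lang2vec` package and its `get_features` interface (the exact procedure in the paper may differ, and the ISO 639-3 codes below, e.g. "zho" for Chinese, may need adjusting to the URIEL release): it picks the three-language combination with the largest average pairwise cosine distance between syntax feature vectors.

```python
from itertools import combinations

import numpy as np
import lang2vec.lang2vec as l2v

# Candidate source pool (ISO 639-3 codes assumed by lang2vec / URIEL).
POOL = ["ara", "deu", "eng", "spa", "fra", "rus", "zho"]

# KNN-imputed syntactic typology vector for each candidate language.
feats = l2v.get_features(POOL, "syntax_knn")
vecs = {lang: np.asarray(feats[lang], dtype=float) for lang in POOL}

def avg_pairwise_distance(langs):
    """Mean cosine distance over all language pairs in the combination."""
    dists = []
    for a, b in combinations(langs, 2):
        va, vb = vecs[a], vecs[b]
        cos = va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))
        dists.append(1.0 - cos)
    return float(np.mean(dists))

# Pick the typologically most diverse 3-language combination.
best = max(combinations(POOL, 3), key=avg_pairwise_distance)
print(best)
```

Other URIEL feature sets (e.g. "inventory_knn" or "phonology_knn") can be swapped in for the syntax vectors, mirroring the Lang2Vec variants compared in the ranking table above.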
The findings generalize robustly across model architectures (encoder-only XLM-RoBERTa and decoder-only BLOOM-7B), diverse NLP tasks (NER, natural language inference, paraphrase detection, commonsense reasoning), and training paradigms (standard fine-tuning, instruction-tuning, and parameter-efficient QLoRA). This broad applicability makes the paper a practical reference for anyone building NLP systems for low-resource languages, offering a simple yet effective way to significantly improve cross-lingual transfer performance without requiring additional data collection.