Analysis of Multi-Source Language Training in Cross-Lingual Transfer

ACL 2024
Seong Hoon Lim, Taejun Yun, Jinhyeon Kim, Jihun Choi, Taeuk Kim

One-Line Summary

A comprehensive analysis showing that training multilingual models with three linguistically diverse source languages -- selected via typological features rather than data size -- significantly improves cross-lingual transfer to unseen target languages.

Figure 1. Overview of the effectiveness of Multi-Source Language Training (MSLT) in cross-lingual transfer. As more sophisticated approaches for MSLT are adopted, improved performance can be expected (from bottom to top).

Background & Motivation

Cross-lingual transfer (XLT) is a crucial technique for bringing NLP capabilities to low-resource languages by leveraging labeled data from resource-rich source languages. The standard practice -- Single-Source Language Training (SSLT) -- typically fine-tunes a multilingual model on English data alone and then applies it to target languages. While effective, this approach leaves substantial room for improvement, particularly for typologically distant target languages like Thai, Finnish, or Korean.

Previous research has shown that multilingual language models can separate language-specific information from language-agnostic features in their internal representations. This raises a natural question: can training on multiple source languages simultaneously strengthen these language-agnostic features and improve transfer? Although some prior studies have used multiple source languages, they lacked systematic investigation into why certain combinations work, how many languages are optimal, and what criteria should guide language selection.

Key Hypothesis: Using multiple source languages in cross-lingual transfer leads to increased mingling of embedding spaces for different languages, producing more language-agnostic representations and thus stronger transfer to unseen target languages. However, arbitrary language combinations do not guarantee improvements -- careful selection based on linguistic diversity is essential.

Figure 3. A conceptual illustration of the advantages of MSLT over SSLT. Training with multiple source languages leads to more language-independent representations, enabling a more robust decision boundary applicable across different languages.

Proposed Method

The authors systematically investigate Multi-Source Language Training (MSLT) by controlling total data volume across conditions to ensure fair comparison. For example, SSLT with English uses 1,000 samples, while MSLT with English and Spanish uses 500 samples from each -- keeping the total training budget constant. This isolates the effect of language diversity from data quantity.
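As a concrete illustration, here is a minimal sketch of this fixed-budget sampling scheme; the function name and toy data are illustrative stand-ins, not the paper's code.

```python
import random

def build_training_set(datasets_by_lang, total_budget, seed=42):
    """Sample an equal share of a fixed budget from each source language.

    Adding source languages changes only the diversity of the training
    mix, never the total number of examples, so SSLT vs. MSLT
    comparisons isolate the effect of language diversity.
    """
    rng = random.Random(seed)
    per_lang = total_budget // len(datasets_by_lang)
    mixed = []
    for lang, examples in datasets_by_lang.items():
        mixed.extend(rng.sample(examples, per_lang))
    rng.shuffle(mixed)
    return mixed

# Toy data standing in for labeled task examples.
en_data = [f"en sentence {i}" for i in range(5000)]
es_data = [f"es sentence {i}" for i in range(5000)]

sslt = build_training_set({"en": en_data}, total_budget=1000)                 # 1,000 en
mslt = build_training_set({"en": en_data, "es": es_data}, total_budget=1000)  # 500 + 500
```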

1. MSLT vs. SSLT Comparison
Train XLM-RoBERTa (Base and Large) and BLOOM-7B (with QLoRA) on six diverse tasks -- WikiANN NER (282 languages), XNLI (15 languages), PAWS-X (6 languages), XCOPA (11 languages), XWinograd (6 languages), and XStoryCloze (10 languages) -- while varying the number of source languages from 1 to 7. Visualize embeddings via t-SNE and measure cross-lingual alignment with CKA (Centered Kernel Alignment) similarity to assess how MSLT affects the formation of language-agnostic representations (a minimal CKA sketch follows this list).

2. Language Selection Criteria
Systematically test and compare multiple language selection heuristics over a pool of 7 source languages (Arabic, German, English, Spanish, French, Russian, Chinese). Criteria include: (a) pretraining data size -- select the languages most frequent in the pretraining corpus; (b) vocabulary coverage -- choose the languages maximizing lexical overlap with the target; (c) Lang2Vec-based linguistic diversity using typological feature vectors across five dimensions: syntax, phonology, phonological inventory, language family, and geographic proximity; (d) embedding-based diversity derived from pretrained model representations.

3. Writing System Diversity Analysis
Analyze how the diversity of writing systems among the selected source languages correlates with transfer performance. Categorize language combinations by whether they share the same script (e.g., all Latin) or use distinct writing systems (e.g., Latin + Arabic + Hanzi), and measure the performance difference across all tasks and target languages (Indonesian, Greek, Hebrew, Finnish, Thai, Turkish, Japanese, Korean).

4. Generalization Across Architectures
Validate the findings across model architectures (encoder-only XLM-R vs. decoder-only BLOOM-7B), training paradigms (standard fine-tuning vs. instruction-tuning), and adaptation methods (full fine-tuning vs. parameter-efficient QLoRA), ensuring the conclusions are not tied to one particular setup (a QLoRA loading sketch also follows this list).
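As referenced in step 1, the following is a minimal sketch of linear CKA between two representation matrices (e.g., mean-pooled encoder states for the same sentences rendered in two languages). It implements the standard linear-CKA formula, not necessarily the authors' exact code.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between representation matrices.

    X: (n, d1) and Y: (n, d2) hold representations of the same n inputs,
    e.g., parallel sentences encoded in two different languages.
    Returns a similarity in [0, 1]; higher means better-aligned spaces.
    """
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

# Sanity check: CKA is invariant to rotation, so a rotated copy of the
# same representations should score ~1.0.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64))
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # random orthogonal matrix
print(linear_cka(X, X @ Q))  # ~1.0
```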
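And as referenced in step 4, here is a sketch of loading BLOOM-7B for QLoRA-style fine-tuning with the Hugging Face transformers and peft libraries; the checkpoint name and all hyperparameters are illustrative assumptions rather than the paper's reported settings.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model -- the core of QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1",          # decoder-only multilingual LM
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on BLOOM's fused attention projection; rank and
# scaling values here are illustrative, not the paper's settings.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```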

Experimental Results

Experiments were conducted across six benchmarks with 8 target languages. The source language pool consisted of Arabic (ar), German (de), English (en), Spanish (es), French (fr), Russian (ru), and Chinese (zh).

Optimal Number of Source Languages

Regardless of task type or data quantity, performance improves markedly as the number of source languages increases from 1 to 3. Beyond 3 languages, performance plateaus or declines slightly, establishing 3 source languages as the practical optimum.

Language Selection Criteria Ranking

Selection Method        WikiANN    XNLI       XCOPA      XWinograd  XStoryCloze
Pretraining Data Size   Rank 31    Rank 16    Rank 12    Rank 14    Rank 22
Vocabulary Coverage     Rank 31    Rank 26    Rank 18    Rank 20    Rank 15
Lang2Vec - Syntax       Rank 3     Rank 2     Rank 4     Rank 7     Rank 2
Lang2Vec - Phonology    Rank 8     Rank 7     Rank 27    Rank 1     Rank 18
Lang2Vec - Inventory    Rank 2     Rank 2     Rank 11    Rank 3     Rank 10

Lower rank indicates better transfer performance under that selection criterion.
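To make the Lang2Vec criterion concrete, here is a minimal sketch that scores every 3-language subset of the source pool by summed pairwise cosine distance over lang2vec's imputed syntax features ("syntax_knn"). This is one straightforward instantiation of the diversity criterion, not necessarily the authors' exact procedure, and the ISO 639-3 codes may need adjusting for your lang2vec version.

```python
from itertools import combinations

import numpy as np
import lang2vec.lang2vec as l2v  # pip install lang2vec

# ISO 639-3 codes for the paper's source pool
# (Standard Arabic, German, English, Spanish, French, Russian, Mandarin).
POOL = ["arb", "deu", "eng", "spa", "fra", "rus", "cmn"]

# The *_knn feature sets impute missing typological values, so each
# language maps to a dense numeric vector.
feats = l2v.get_features(" ".join(POOL), "syntax_knn")
vecs = {lang: np.asarray(feats[lang], dtype=float) for lang in POOL}

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def most_diverse_triple(vecs):
    """Pick the 3-language subset with the largest summed pairwise distance."""
    def spread(triple):
        return sum(cosine_distance(vecs[a], vecs[b])
                   for a, b in combinations(triple, 2))
    return max(combinations(vecs, 3), key=spread)

print(most_diverse_triple(vecs))  # the most typologically diverse triple
```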

Writing System Diversity

Writing System Configuration                          WikiANN   XNLI
All same script (e.g., Latin + Latin + Latin)         72.26%    80.36%
Two different scripts                                 72.68%    82.19%
All different scripts (e.g., Latin + Arabic + Hanzi)  73.07%    84.02%

MSLT vs. SSLT on WikiANN NER

Configuration                            F1 Score
SSLT (English only)                      76.30
MSLT (best pretraining-based selection)  78.52
MSLT (optimal Lang2Vec selection)        87.05

Why It Matters

This work provides the first comprehensive analysis of multi-source language training for cross-lingual transfer, transforming MSLT from an ad-hoc practice into a principled, evidence-based strategy. The key practical takeaway is clear: select 3 linguistically diverse source languages with different writing systems using Lang2Vec-based typological features, rather than relying on data size or vocabulary overlap heuristics.

The findings generalize robustly across model architectures (encoder-only XLM-RoBERTa and decoder-only BLOOM-7B), diverse NLP tasks (NER, natural language inference, paraphrase detection, commonsense reasoning), and training paradigms (standard fine-tuning, instruction-tuning, and parameter-efficient QLoRA). This broad applicability makes the paper a practical reference for anyone building NLP systems for low-resource languages, offering a simple yet effective way to significantly improve cross-lingual transfer performance without requiring additional data collection.
