
X-SNS: Cross-Lingual Transfer Prediction through Sub-Network Similarity

EMNLP 2023 Findings
Taejun Yun, Jinhyeon Kim, Deokyeong Kang, Seong Hoon Lim, Jihoon Kim, Taeuk Kim

One-Line Summary

A model-oriented method that predicts cross-lingual transfer performance by measuring how much two languages share internal sub-network structure within a multilingual model, achieving a 4.6% average improvement in source language ranking without requiring any external linguistic resources.

Paper overview
Figure 1. X-SNS overview: For each language, a binary sub-network is extracted based on Fisher Information scores, and the Jaccard similarity between sub-networks serves as a proxy for cross-lingual transfer compatibility.

Background & Motivation

Cross-lingual transfer (XLT) enables multilingual language models to perform well on tasks in unseen languages without target-language labeled data. While English is typically used as the default source language, recent evidence shows that choosing the most suitable source language for a given target can substantially improve transfer performance -- indeed, X-SNS finds that non-English sources outperform English in 11 of 15 cases, with an average gain of 1.8 points.

Limitations of Existing Approaches: Prior methods for predicting transfer compatibility depend on external resources: Lang2Vec requires typological features from the WALS database, lexical divergence uses subword distribution statistics, and embedding similarity only captures surface-level representation overlap. None of these directly examine how the model internally processes different languages. X-SNS addresses this gap with a model-oriented approach that peers inside the network to measure structural language similarity.

Proposed Method

X-SNS proposes using sub-network similarity between language pairs as a proxy for predicting cross-lingual transfer compatibility. The core idea is that if two languages activate similar parameters within a multilingual model, knowledge learned from one should transfer well to the other.

1. Fisher-Based Sub-Network Extraction
For each language, compute the approximated Fisher Information for every model parameter using raw text. The Fisher score quantifies how sensitive each parameter is to a given language's data. Parameters in the top p% (default: 15%, matching the masked language modeling ratio) are selected to form a binary sub-network vector where 1 indicates an important parameter and 0 otherwise.
2. Jaccard Similarity Computation
Measure the structural overlap between two languages' binary sub-networks using the Jaccard similarity coefficient: |s_source ∩ s_target| / |s_source ∪ s_target|. Higher Jaccard similarity indicates that the model processes both languages through largely the same internal pathways.
3. Source Language Ranking
Given a target language, rank all candidate source languages by their sub-network similarity score. The top-ranked source is predicted to yield the best zero-shot transfer performance. This ranking can also be used in multi-source settings by selecting the top-k languages for disjoint multilingual training.
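The three steps above can be sketched in a few lines of pure Python. The per-parameter Fisher scores, languages, and tiny parameter count below are made-up toy values (the real method estimates Fisher Information from masked-language-modeling gradients over raw text); the top-p% masking and Jaccard comparison are the mechanism the paper describes.

```python
def top_p_mask(scores, p=0.15):
    """Keep the top p fraction of parameters (1 = important, 0 = not)."""
    k = max(1, int(len(scores) * p))
    threshold = sorted(scores, reverse=True)[k - 1]
    return [1 if s >= threshold else 0 for s in scores]

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two binary masks."""
    inter = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return inter / union if union else 0.0

# Hypothetical Fisher scores for 10 parameters in three languages.
fisher = {
    "de": [0.9, 0.1, 0.8, 0.2, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1],
    "fr": [0.8, 0.1, 0.9, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.1],
    "ja": [0.1, 0.9, 0.1, 0.8, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1],
}

# Binary sub-network per language (p = 0.3 here only because the toy
# model has just 10 parameters; the paper's default is 15%).
masks = {lang: top_p_mask(s, p=0.3) for lang, s in fisher.items()}

# Rank candidate sources for the target by sub-network similarity.
target = "de"
ranking = sorted(
    (lang for lang in masks if lang != target),
    key=lambda lang: jaccard(masks[lang], masks[target]),
    reverse=True,
)
```

With these toy scores, "fr" ranks above "ja" as a source for "de" because its important-parameter mask overlaps the target's (Jaccard 0.5 vs. 0.0), mirroring the intuition that shared internal pathways predict better transfer.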

Key advantage: The method requires only a moderate amount of raw text (as few as 256 examples suffice for near-optimal performance) from candidate languages -- no labeled data, external linguistic databases, or typological annotations needed. The sub-networks are extracted using masked language modeling, making the approach fully unsupervised.

Experimental Results

X-SNS is evaluated on five tasks from the XTREME benchmark using XLM-RoBERTa Base, covering 7 to 20 languages per task. NDCG@3 measures how well each method ranks source languages for zero-shot transfer.

Task (Dataset)              Lang2Vec   Embedding   X-SNS
NER (WikiANN, 17 langs)     62.35      76.06       78.12
POS (UD 2.8, 20 langs)      78.06      74.65       83.73
NLI (XNLI, 15 langs)        59.77      63.15       68.73
PI (PAWS-X, 7 langs)        86.81      83.51       89.82
QA (TyDiQA, 8 langs)        84.52      86.00       87.95
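The NDCG@3 metric used in this evaluation rewards rankings that place truly good source languages near the top. The sketch below uses one common formulation with linear gains and a log2 position discount; the candidate languages and transfer scores are hypothetical, chosen only to illustrate the computation.

```python
import math

def dcg_at_k(relevances, k=3):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(predicted_order, true_scores, k=3):
    """NDCG@k: DCG of the predicted ranking, normalized by the ideal DCG."""
    gains = [true_scores[lang] for lang in predicted_order]
    ideal = sorted(true_scores.values(), reverse=True)
    return dcg_at_k(gains, k) / dcg_at_k(ideal, k)

# Hypothetical zero-shot transfer scores for four candidate sources.
true_scores = {"en": 70.0, "de": 75.0, "fr": 73.0, "ja": 60.0}

perfect = ndcg_at_k(["de", "fr", "en", "ja"], true_scores)  # ideal order
swapped = ndcg_at_k(["en", "fr", "de", "ja"], true_scores)  # best source demoted
```

A perfect ranking scores 1.0; demoting the best source below weaker ones lowers the score, which is why NDCG@3 is a natural fit for the source-selection problem.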

In the regression framework, X-SNS as a single feature outperforms multiple linguistic features from typological databases:

Feature Set                 NER (RMSE)   QA (RMSE)
X-POS + MER (linguistic)    7.18         7.40
X-SNS + MER (ours)          5.12         5.80
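The regression framing can be illustrated with a single-feature least-squares fit: predict transfer performance from the sub-network similarity score alone and report RMSE. All numbers below are hypothetical, and the paper's actual regression combines X-SNS with further features (e.g. MER); this is only a minimal sketch of the evaluation protocol.

```python
import math

def fit_line(xs, ys):
    """Closed-form least-squares fit y ≈ a * x + b for one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def rmse(xs, ys, a, b):
    """Root-mean-square error of the fitted line on (xs, ys)."""
    return math.sqrt(sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Hypothetical (sub-network similarity, zero-shot transfer F1) pairs.
sims = [0.20, 0.35, 0.50, 0.65, 0.80]
f1s  = [55.0, 61.0, 66.0, 72.0, 79.0]

a, b = fit_line(sims, f1s)
err = rmse(sims, f1s, a, b)
```

A positive slope with low RMSE is what a useful predictive feature looks like in this framing; lower RMSE than a feature set built from typological databases is the result the table above reports for X-SNS.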

Why It Matters

X-SNS provides a practical, model-grounded mechanism for source language selection in cross-lingual transfer. Unlike methods that depend on external typological knowledge (which may be incomplete or unavailable for many languages), X-SNS looks directly at how the model internally represents languages, making it applicable to any language with raw text data. The method's data efficiency (near-optimal with just 256 examples) and fully unsupervised nature make it particularly valuable for deploying multilingual systems to low-resource languages, where choosing the right source language can make the difference between successful and failed cross-lingual transfer. The findings also provide a deeper understanding of how multilingual models organize linguistic knowledge internally -- languages that share more sub-network structure genuinely transfer knowledge more effectively.
