Budget-Xfer: Budget-Constrained Source Language Selection for Cross-Lingual Transfer to African Languages

arXiv cs.CL · March 31, 2026


Key Points

  • The paper introduces Budget-Xfer, a framework for selecting multiple source languages and allocating a fixed annotation budget for cross-lingual transfer to low-resource African languages.
  • By modeling source selection as a budget-constrained resource allocation problem, the study aims to disentangle language-selection effects from the confounding impact of total training data.
  • Experiments on named entity recognition and sentiment analysis for Hausa, Yoruba, and Swahili (288 runs using two multilingual models) show that multi-source transfer substantially outperforms single-source transfer, with Cohen's d ranging from 0.80 to 1.98.
  • The authors find that, among multi-source allocation strategies, performance differences are generally modest and statistically non-significant.
  • They also report that the value of embedding similarity as a selection proxy is task-dependent: random source selection outperforms similarity-based selection for NER, but not for sentiment analysis.
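The effect sizes cited above are Cohen's d for independent samples. As a reminder of what that metric measures, here is a minimal computation using made-up F1 scores, not the paper's data:

```python
import statistics

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation for two independent samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

# Illustrative (fabricated) F1 scores for multi-source vs. single-source runs.
multi = [0.72, 0.74, 0.71, 0.75]
single = [0.66, 0.68, 0.65, 0.69]
d = cohens_d(multi, single)
```

By convention, d around 0.8 is considered a large effect, so the reported range of 0.80 to 1.98 indicates a consistently large advantage for multi-source transfer.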

Abstract

Cross-lingual transfer learning enables NLP for low-resource languages by leveraging labeled data from higher-resource sources, yet existing comparisons of source language selection strategies do not control for total training data, confounding language selection effects with data quantity effects. We introduce Budget-Xfer, a framework that formulates multi-source cross-lingual transfer as a budget-constrained resource allocation problem. Given a fixed annotation budget B, our framework jointly optimizes which source languages to include and how much data to allocate from each. We evaluate four allocation strategies across named entity recognition and sentiment analysis for three African target languages (Hausa, Yoruba, Swahili) using two multilingual models, conducting 288 experiments. Our results show that (1) multi-source transfer significantly outperforms single-source transfer (Cohen's d = 0.80 to 1.98), driven by a structural budget underutilization bottleneck; (2) among multi-source strategies, differences are modest and non-significant; and (3) the value of embedding similarity as a selection proxy is task-dependent, with random selection outperforming similarity-based selection for NER but not sentiment analysis.
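The paper does not publish its allocation algorithms in this summary, but the core formulation — jointly choosing source languages and splitting a fixed budget B among them — can be sketched with one illustrative strategy. Everything here is a hypothetical stand-in: the similarity scores, the unit annotation cost per example, and the proportional-split rule are assumptions, not the authors' method.

```python
def allocate_budget(similarity, budget, k):
    """One illustrative strategy: keep the k sources most similar to the
    target language, then split the annotation budget proportionally to
    their similarity scores (assumes unit cost per labeled example)."""
    chosen = sorted(similarity, key=similarity.get, reverse=True)[:k]
    total = sum(similarity[lang] for lang in chosen)
    return {lang: int(budget * similarity[lang] / total) for lang in chosen}

# Hypothetical embedding-similarity scores of candidate sources to a target.
sims = {"amh": 0.42, "arb": 0.55, "eng": 0.38, "swa": 0.61}
alloc = allocate_budget(sims, budget=10_000, k=2)
```

Competing strategies in this framing would differ in how `chosen` is picked (e.g., random selection, the baseline the paper finds competitive for NER) and how the budget is divided (e.g., uniform vs. similarity-weighted), while holding the total budget fixed so data quantity no longer confounds the comparison.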