Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque

arXiv cs.CL / 3/16/2026

Key Points

The paper investigates instruction tuning for Basque, a low-resource language, under a realistic constraint: the only resources are target-language corpora, open-weight multilingual base and instructed backbones, and synthetic instructions sampled from the instructed backbone.
It presents a comprehensive set of experiments that systematically combine these components, evaluated on benchmarks and on human preferences from 1,680 participants.
Key findings: target-language corpora are essential, synthetic instructions yield robust models, and starting from an instruction-tuned backbone outperforms starting from a base (non-instructed) model.
With Llama 3.1 Instruct 70B as the backbone, the resulting model comes close to much larger frontier models on Basque, without using any Basque instructions.
The work releases code, models, instruction datasets, and human preferences to enable full reproducibility in future low-resource language research.

Abstract

Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components evaluated on benchmarks and human preferences from 1,680 participants. Our conclusions show that target language corpora are essential, with synthetic instructions yielding robust models, and, most importantly, that using as backbone an instruction-tuned model outperforms using a base non-instructed model. Scaling up to Llama 3.1 Instruct 70B as backbone, our model comes near frontier models of much larger sizes for Basque, without using any Basque instructions. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation. https://github.com/hitz-zentroa/latxa-instruct
