A Catalog of Basque Dialectal Resources: Online Collections and Standard-to-Dialectal Adaptations

arXiv cs.CL / 3/27/2026

Key Points

The paper presents a systematic catalog of contemporary Basque dialectal NLP resources. It addresses data scarcity by compiling the dialectal data currently available online together with data adapted from standard Basque into dialectal varieties.
It distinguishes two types of data source: data originally written in a dialect (e.g., news and radio sites, informal tweets, and reference resources such as dictionaries, atlases, grammar rules, and videos) and data adapted from standard Basque into dialects, either manually or automatically.
For manual adaptation, the authors adapted the XNLI test split into three Basque dialects (Western, Central, and Navarrese-Lapurdian), yielding a high-quality parallel gold-standard evaluation dataset.
For automatic adaptation, native speakers manually evaluated the automatically adapted physical commonsense dataset (BasPhyCowest) to assess its quality and to judge whether such automatically created silver data can substitute for full manual adaptation.

Abstract

Recent research on dialectal NLP has identified data scarcity as a primary limitation. To address this limitation, this paper presents a catalog of contemporary Basque dialectal data and resources, offering a systematic and comprehensive compilation of the dialectal data currently available in Basque. Two types of data sources have been distinguished: online data originally written in some dialect, and standard-to-dialect adapted data. The former includes all dialectal data that can be found online, such as news and radio sites, informal tweets, as well as online resources such as dictionaries, atlases, grammar rules, or videos. The latter consists of data that has been adapted from the standard variety to dialectal varieties, either manually or automatically. Regarding the manual adaptation, the test split of the XNLI Natural Language Inference dataset was manually adapted into three Basque dialects: Western, Central, and Navarrese-Lapurdian, yielding a high-quality parallel gold standard evaluation dataset. With respect to the automatic dialectal adaptation, the automatically adapted physical commonsense dataset (BasPhyCowest) underwent additional manual evaluation by native speakers to assess its quality and determine whether it could serve as a viable substitute for full manual adaptation (i.e., silver data creation).
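
The abstract does not say how the native-speaker review of the automatically adapted data was scored. As a minimal sketch of one possible measure, the Python below computes the share of automatic adaptations that a reviewer left unchanged. The `ReviewedItem` schema, the `unchanged_rate` function, and the sample strings are all hypothetical placeholders, not the authors' data or method.

```python
from dataclasses import dataclass


@dataclass
class ReviewedItem:
    """One automatically adapted sentence plus its reviewed version (hypothetical schema)."""
    standard: str   # sentence in standard Basque
    automatic: str  # automatic dialectal adaptation
    reviewed: str   # version after native-speaker correction


def unchanged_rate(items):
    """Return the fraction of automatic adaptations the reviewer left untouched."""
    if not items:
        raise ValueError("no reviewed items")
    kept = sum(1 for it in items if it.automatic.strip() == it.reviewed.strip())
    return kept / len(items)


# Placeholder strings only; these are not real Basque sentences.
sample = [
    ReviewedItem("std-1", "auto-1", "auto-1"),
    ReviewedItem("std-2", "auto-2", "fixed-2"),
    ReviewedItem("std-3", "auto-3", "auto-3"),
    ReviewedItem("std-4", "auto-4", "auto-4"),
]

print(f"{unchanged_rate(sample):.2f}")  # prints 0.75
```

A high unchanged rate would suggest the automatic output needs little correction. Whether that makes it an acceptable substitute for manual adaptation is a judgment for native speakers, which the paper's review addresses.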
