L-ReLF: A Framework for Lexical Dataset Creation

arXiv cs.CL / 4/1/2026

Key Points

The paper introduces L-ReLF, a reproducible framework for creating high-quality, structured lexical datasets for underserved languages that lack standardized terminology.
It targets key low-resource challenges by detailing steps for source identification, applying OCR (noting its bias toward Modern Standard Arabic), and performing rigorous post-processing for error correction and data-model standardization.
The output dataset is designed to be fully compatible with Wikidata Lexemes, enabling consistent lexical data integration for collaborative knowledge platforms.
The methodology is presented as generalizable so other language communities can follow the same pipeline to generate foundational datasets for downstream NLP tasks like machine translation and morphological analysis.

Abstract

This paper introduces L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity in platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to create new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including identifying sources, using Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and applying rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to build foundational lexical data for downstream NLP applications, such as Machine Translation and morphological analysis.
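
The post-processing and data-model steps described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the paper's actual pipeline: the cleanup rules (Unicode NFKC normalization, removing tatweel and short-vowel marks, collapsing whitespace), the function names, the `ary` language code for Moroccan Darija, and the placeholder item IDs are all hypothetical. The record shape only loosely mirrors the lemma and sense layout of Wikidata Lexemes.

```python
import re
import unicodedata

TATWEEL = "\u0640"                      # Arabic elongation character
HARAKAT = re.compile("[\u064B-\u0652]")  # Arabic short-vowel and related marks


def normalize_term(raw: str) -> str:
    """Clean one OCR'd term (assumed rules, not the paper's)."""
    text = unicodedata.normalize("NFKC", raw)  # fold presentation forms
    text = text.replace(TATWEEL, "")           # drop elongation marks
    text = HARAKAT.sub("", text)               # drop optional diacritics
    return " ".join(text.split())              # collapse whitespace


def to_lexeme_record(lemma, gloss, *, lang_code="ary",
                     language_item="Q-LANGUAGE-PLACEHOLDER",
                     category_item="Q-CATEGORY-PLACEHOLDER",
                     gloss_lang="en"):
    """Build a dict loosely shaped like a Wikidata Lexeme.

    The item IDs are placeholders; real values would be looked up on Wikidata.
    """
    return {
        "type": "lexeme",
        "language": language_item,
        "lexicalCategory": category_item,
        "lemmas": {lang_code: {"language": lang_code, "value": lemma}},
        "senses": [
            {"glosses": {gloss_lang: {"language": gloss_lang, "value": gloss}}}
        ],
    }


def build_dataset(rows):
    """Normalize (term, gloss) pairs and drop empty or duplicate lemmas."""
    seen = set()
    records = []
    for raw_lemma, gloss in rows:
        lemma = normalize_term(raw_lemma)
        if not lemma or lemma in seen:
            continue
        seen.add(lemma)
        records.append(to_lexeme_record(lemma, gloss.strip()))
    return records
```

Deduplicating on the normalized lemma means OCR variants that differ only in diacritics or elongation collapse into one entry. Whether that is desirable depends on the language and source material, so a real pipeline would likely keep the original spelling alongside the normalized one.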