Mining Large Language Models for Low-Resource Language Data: Comparing Elicitation Strategies for Hausa and Fongbe

arXiv cs.CL / 4/15/2026


Key Points

  • The paper tests whether strategic prompting can elicit usable text data from commercial LLMs for low-resource languages, focusing on Hausa and Fongbe.
  • It compares six elicitation task types across GPT-4o Mini and Gemini 2.5 Flash, finding that GPT-4o Mini extracts 6–41x more usable target-language words per API call than Gemini.
  • The study shows that the best prompting strategy is language-dependent: Hausa responds better to functional-text and dialogue elicitation, while Fongbe requires constrained-generation prompts.
  • The authors publish the generated corpora and code, enabling other researchers and developers to reproduce and extend the elicitation approach.

Abstract

Large language models (LLMs) are trained on data contributed by low-resource language communities, yet the linguistic knowledge encoded in these models remains accessible only through commercial APIs. This paper investigates whether strategic prompting can extract usable text data from LLMs for two West African languages: Hausa (Afroasiatic, approximately 80 million speakers) and Fongbe (Niger-Congo, approximately 2 million speakers). We systematically compare six elicitation task types across two commercial LLMs (GPT-4o Mini and Gemini 2.5 Flash). GPT-4o Mini extracts 6–41 times more usable target-language words per API call than Gemini. Optimal strategies differ by language: Hausa benefits from functional text and dialogue, while Fongbe requires constrained generation prompts. We release all generated corpora and code.
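The headline comparison rests on a single metric: usable target-language words extracted per API call. A minimal sketch of that computation is below; the word-level filter here (alphabetic tokens only) is a hypothetical stand-in for the paper's actual usability and language-identification checks, which are in the authors' released code.

```python
def usable_words_per_call(responses, is_usable):
    """Average count of usable words across model responses.

    responses: list of strings, one per API call.
    is_usable: predicate deciding whether a token counts as a
               usable target-language word.
    """
    if not responses:
        return 0.0
    total = sum(
        sum(1 for token in reply.split() if is_usable(token))
        for reply in responses
    )
    return total / len(responses)

# Illustrative filter: keep purely alphabetic tokens. A real pipeline
# would instead verify the token is Hausa or Fongbe, e.g. with a
# language-identification model.
replies = ["Ina kwana lafiya", "Hello 123 world"]
rate = usable_words_per_call(replies, str.isalpha)
```

Comparing this rate between models (or between elicitation task types) is what yields per-language figures like the 6–41x gap reported above.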