LLM Probe: Evaluating LLMs for Low-Resource Languages

arXiv cs.CL / 4/1/2026


Key Points

  • The paper introduces LLM Probe, a lexicon-based framework for evaluating LLM capabilities in low-resource, morphologically rich languages using standardized linguistic probes.
  • It assesses models across four task areas: lexical alignment, part-of-speech recognition, morphosyntactic probing, and translation accuracy.
  • The authors create and release a manually annotated bilingual benchmark dataset for a low-resource Semitic language, including POS, grammatical gender, and morphosyntactic feature annotations with high inter-annotator agreement.
  • Experimental results across causal and sequence-to-sequence models reveal trade-offs: sequence-to-sequence models tend to perform better on morphosyntax and translation, while causal models are stronger on lexical alignment but weaker on translation.
  • The work argues that linguistically grounded evaluation is necessary to understand LLM limitations in under-resourced settings; the framework and dataset are released as open source for reproducible benchmarking.
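To make the probing setup concrete, here is a minimal sketch of what a lexicon-based POS probe could look like. Everything here is an assumption for illustration: the lexicon entries, the tag set, and the `model_predict` stub are placeholders, not the paper's actual data or the LLM Probe API.

```python
# Hypothetical lexicon-based POS probe: score a model's tags against a
# gold-annotated bilingual lexicon and report accuracy. All names and
# entries below are illustrative, not from the paper's released dataset.

GOLD_LEXICON = {
    "bet": "NOUN",    # illustrative entries; the real benchmark is
    "katav": "VERB",  # manually annotated with high agreement
    "gadol": "ADJ",
}

def model_predict(word: str) -> str:
    """Stand-in for an LLM prompted to tag `word`; replace with a real model call."""
    canned = {"bet": "NOUN", "katav": "VERB", "gadol": "NOUN"}
    return canned.get(word, "NOUN")

def pos_probe_accuracy(lexicon: dict[str, str]) -> float:
    """Fraction of lexicon entries the model tags correctly."""
    correct = sum(model_predict(w) == tag for w, tag in lexicon.items())
    return correct / len(lexicon)

print(pos_probe_accuracy(GOLD_LEXICON))  # with the stub above, 2 of 3 are correct
```

The same loop generalizes to the other probe types by swapping the gold annotation (grammatical gender, morphosyntactic features, or a reference translation) and the scoring function.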

Abstract

Despite rapid advances in large language models (LLMs), their linguistic abilities in low-resource and morphologically rich languages are still not well understood due to limited annotated resources and the absence of standardized evaluation frameworks. This paper presents LLM Probe, a lexicon-based assessment framework designed to systematically evaluate the linguistic skills of LLMs in low-resource language environments. The framework analyzes models across four areas of language understanding: lexical alignment, part-of-speech recognition, morphosyntactic probing, and translation accuracy. To illustrate the framework, we create a manually annotated benchmark dataset using a low-resource Semitic language as a case study. The dataset comprises bilingual lexicons with linguistic annotations, including part-of-speech tags, grammatical gender, and morphosyntactic features, which demonstrate high inter-annotator agreement to ensure reliable annotations. We test a variety of models, including causal language models and sequence-to-sequence architectures. The results reveal notable differences in performance across various linguistic tasks: sequence-to-sequence models generally excel in morphosyntactic analysis and translation quality, whereas causal models demonstrate strong performance in lexical alignment but exhibit weaker translation accuracy. Our results emphasize the need for linguistically grounded evaluation to better understand LLM limitations in low-resource settings. We release LLM Probe and the accompanying benchmark dataset as open-source tools to promote reproducible benchmarking and to support the development of more inclusive multilingual language technologies.
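The abstract highlights high inter-annotator agreement as the basis for trusting the benchmark's labels. A standard way to quantify agreement between two annotators is Cohen's kappa, which corrects raw agreement for chance; the sketch below is a self-contained illustration with made-up labels, not the paper's actual agreement computation or data.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa between two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    tags = set(labels_a) | set(labels_b)
    p_e = sum(count_a[t] * count_b[t] for t in tags) / n**2
    return (p_o - p_e) / (1 - p_e)

# Illustrative POS annotations from two hypothetical annotators.
a = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
b = ["NOUN", "VERB", "NOUN", "NOUN", "NOUN", "VERB"]
print(round(cohens_kappa(a, b), 3))  # → 0.7
```

Kappa values near 1 indicate near-perfect agreement; values around 0 mean agreement no better than chance, so a high kappa on POS, gender, and morphosyntactic annotations is what licenses treating the lexicon as a reliable gold standard.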