MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs

arXiv cs.CL / 5/1/2026

Key Points

  • MultiBLiMP 1.0 is a multilingual benchmark focused on linguistic minimal pairs, covering 101 languages and two types of subject–verb agreement.
  • The dataset contains over 128,000 automatically generated minimal pairs, built using an end-to-end pipeline grounded in Universal Dependencies and UniMorph resources.
  • The benchmark is designed to assess how well LLMs handle grammatical distinctions across a very large set of languages.
  • Results on the benchmark show that current state-of-the-art LLMs still fall short on low-resource languages.
  • The authors describe the benchmark's multilingual coverage as unprecedented for evaluating the linguistic abilities of LLMs.
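
To make the idea of a minimal-pair evaluation concrete, here is a minimal sketch of how such benchmarks are commonly scored: a model gets credit for a pair when it assigns a higher score to the grammatical sentence than to the ungrammatical one. This summary does not spell out the scoring rule used by MultiBLiMP 1.0, so the rule below is an assumption based on common practice. The names `minimal_pair_accuracy`, `toy_score`, and `TOY_BIGRAM_LOGPROB` are hypothetical, and the toy scorer is only a stand-in for a real LLM log-probability.

```python
from typing import Callable, Iterable, Tuple

# A minimal pair: (grammatical sentence, ungrammatical sentence).
MinimalPair = Tuple[str, str]


def minimal_pair_accuracy(
    pairs: Iterable[MinimalPair],
    score: Callable[[str], float],
) -> float:
    """Fraction of pairs where the grammatical sentence scores strictly higher.

    `score` is any function returning a sentence-level score, for example a
    summed token log-probability from a language model (assumption).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical toy "model": hand-written bigram log-probabilities.
# Unknown bigrams fall back to a default value of -2.0.
TOY_BIGRAM_LOGPROB = {
    ("dogs", "bark"): -1.0,
    ("dogs", "barks"): -4.0,
    ("dog", "barks"): -1.0,
    ("dog", "bark"): -4.0,
}


def toy_score(sentence: str) -> float:
    """Sum toy bigram log-probabilities over a lowercased sentence."""
    tokens = sentence.lower().rstrip(".").split()
    return sum(TOY_BIGRAM_LOGPROB.get(bg, -2.0) for bg in zip(tokens, tokens[1:]))


pairs = [
    ("The dogs bark.", "The dogs barks."),
    ("The dog barks.", "The dog bark."),
]
accuracy = minimal_pair_accuracy(pairs, toy_score)
print(f"accuracy = {accuracy:.2f}")
```

In practice, `toy_score` would be replaced by a function that queries the model under evaluation, and accuracy would be reported per language so that high- and low-resource languages can be compared.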

Abstract

We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages and 2 types of subject-verb agreement, containing more than 128,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.
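
The abstract says the pairs are generated automatically from Universal Dependencies and UniMorph but does not describe the steps. The sketch below illustrates one plausible way to build a pair: take an annotated sentence, find the verb, and swap it for the same lemma's form with a different agreement feature. It is not the authors' pipeline. The `Token` class, `TOY_LEXICON`, and `make_number_pair` are hypothetical, and grammatical number is used only as an illustrative feature, not as a claim about which two agreement types the benchmark covers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Token:
    """A simplified token with a few treebank-style annotations (assumption)."""
    form: str
    lemma: str
    upos: str
    number: Optional[str]  # "Sing", "Plur", or None if not marked


# Hypothetical inflection lexicon: (lemma, number) -> inflected form.
# A real pipeline would read such paradigms from a morphological resource.
TOY_LEXICON: Dict[Tuple[str, str], str] = {
    ("bark", "Sing"): "barks",
    ("bark", "Plur"): "bark",
}


def make_number_pair(
    tokens: List[Token],
    verb_index: int,
    lexicon: Dict[Tuple[str, str], str],
) -> Optional[Tuple[str, str]]:
    """Return (grammatical, ungrammatical) by flipping the verb's number.

    Returns None when the token is not a number-marked verb, or when the
    lexicon has no distinct form for the opposite number.
    """
    verb = tokens[verb_index]
    if verb.upos != "VERB" or verb.number not in ("Sing", "Plur"):
        return None
    opposite = "Plur" if verb.number == "Sing" else "Sing"
    swapped = lexicon.get((verb.lemma, opposite))
    if swapped is None or swapped == verb.form:
        return None  # no usable contrast, so no minimal pair
    good = [t.form for t in tokens]
    bad = list(good)
    bad[verb_index] = swapped
    return " ".join(good), " ".join(bad)


sentence = [
    Token("The", "the", "DET", None),
    Token("dogs", "dog", "NOUN", "Plur"),
    Token("bark", "bark", "VERB", "Plur"),
]
pair = make_number_pair(sentence, 2, TOY_LEXICON)
print(pair)
```

Skipping cases where the swapped form is identical to the original matters here: without a visible contrast between the two sentences there is nothing for a model to distinguish.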