The Russian Legislative Corpus

arXiv cs.CL / 4/29/2026

Opinion · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper introduces a comprehensive corpus of Russian primary and secondary legislation adopted between 1991 and 2025, totaling 304,382 legal texts and about 194.4 million tokens.
  • It provides two dataset versions: a basic release with simple metadata and a detailed release that includes original texts plus Universal Dependencies CoNLL-U conversions.
  • The detailed version enriches the data with linguistic annotations such as parts of speech, morphological features, and syntactic dependency relations.
  • The corpus is positioned as a resource for working with Russian legal language in downstream research and development tasks requiring structured, annotated text.

Abstract

We present a comprehensive corpus of Russian primary and secondary legislation adopted between 1991 and 2025, comprising 304,382 texts (194,425,905 tokens). The corpus is available in two versions: the basic version contains texts with simple metadata, while the detailed version includes both the original texts and their equivalents converted to the Universal Dependencies CoNLL-U format, annotated with parts of speech, morphological features, and syntactic dependencies.
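
For readers unfamiliar with the CoNLL-U format mentioned above, the following is a minimal sketch of how one annotated sentence could be read with the Python standard library. CoNLL-U stores one token per line in ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with blank lines between sentences and `#` lines for comments. The sample sentence, its annotations, and the names `Token`, `parse_feats`, and `parse_conllu` are illustrative assumptions; they are not taken from the corpus or from any tooling released with it.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """The subset of CoNLL-U columns matching the annotations in the abstract."""
    id: int
    form: str
    lemma: str
    upos: str      # part of speech
    feats: dict    # morphological features
    head: int      # index of the syntactic head (0 = root)
    deprel: str    # dependency relation to the head


def parse_feats(raw):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if raw == "_":
        return {}
    return dict(pair.split("=", 1) for pair in raw.split("|"))


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list of Token."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():            # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):        # sentence-level comment or metadata
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():       # skip multiword ranges (1-2) and empty nodes (1.1)
            continue
        head = int(cols[6]) if cols[6] != "_" else -1
        current.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                             parse_feats(cols[5]), head, cols[7]))
    if current:                         # the file may not end with a blank line
        sentences.append(current)
    return sentences


# Hypothetical sample ("The law enters into force."), hand-written for
# illustration only; it is NOT an excerpt from the corpus.
SAMPLE_ROWS = [
    ["1", "Закон", "закон", "NOUN", "_", "Case=Nom|Number=Sing", "2", "nsubj", "_", "_"],
    ["2", "вступает", "вступать", "VERB", "_", "Number=Sing|Person=3|Tense=Pres", "0", "root", "_", "_"],
    ["3", "в", "в", "ADP", "_", "_", "4", "case", "_", "_"],
    ["4", "силу", "сила", "NOUN", "_", "Case=Acc|Number=Sing", "2", "obl", "_", "SpaceAfter=No"],
    ["5", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
]
SAMPLE = ("# text = Закон вступает в силу.\n"
          + "\n".join("\t".join(row) for row in SAMPLE_ROWS) + "\n")

if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        print(token.id, token.form, token.upos, token.head, token.deprel)
```

The sketch keeps only the columns that correspond to the annotation layers named in the abstract (parts of speech, morphological features, syntactic dependencies) and skips multiword-token ranges and empty nodes. For work on a corpus of this size, an established CoNLL-U reader would likely be a better choice than a hand-rolled parser.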