Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation

arXiv cs.CL / 3/26/2026


Key Points

  • The paper introduces Samasāmayik, a new large-scale parallel dataset containing 92,196 Hindi–Sanskrit sentence pairs curated for machine translation research.
  • Unlike many existing Sanskrit resources that emphasize classical poetry or historical texts, the dataset compiles contemporary and diverse materials such as spoken tutorials, children’s magazines, radio conversations, and instructional content.
  • The authors evaluate the dataset’s usefulness by fine-tuning three translation models—ByT5, NLLB, and IndicTrans-v2—and show clear gains on in-domain test data.
  • They report that models trained with Samasāmayik achieve comparable performance on other standard test sets, positioning the dataset as a strong new baseline for Hindi–Sanskrit MT.
  • A comparison with existing corpora indicates low semantic and lexical overlap, suggesting the dataset is novel and non-redundant for low-resource Indic language translation.
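The overlap comparison in the last point can be illustrated with a simple sketch. The paper's exact overlap metrics are not specified in this summary; as a stand-in, the example below computes vocabulary-level Jaccard similarity between two corpora, one common way to quantify lexical overlap. The corpora and tokenization here are illustrative placeholders, not the paper's data or method.

```python
# Illustrative sketch of a lexical-overlap check between two corpora.
# Assumption: whitespace tokenization and Jaccard similarity over
# vocabularies; the paper's actual metric may differ.

def vocab(sentences):
    """Collect the set of whitespace-delimited tokens in a corpus."""
    return {tok for sent in sentences for tok in sent.split()}

def lexical_overlap(corpus_a, corpus_b):
    """Jaccard similarity of the two vocabularies (0 = disjoint, 1 = identical)."""
    va, vb = vocab(corpus_a), vocab(corpus_b)
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0

# Toy transliterated placeholder sentences (not real corpus data):
contemporary = ["bacchon ki patrika ka lekh", "radio varta ka ansh"]
classical = ["praachin kavya ki pankti", "shastriya shlok ka pada"]
print(round(lexical_overlap(contemporary, classical), 3))  # → 0.143
```

A low score like this (only function words shared) is the kind of signal that would support the paper's claim that a contemporary-domain corpus is non-redundant with classical-text resources.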

Abstract

We release Samasāmayik, a novel, meticulously curated, large-scale Hindi-Sanskrit corpus comprising 92,196 parallel sentences. Unlike most data available in Sanskrit, which focuses on classical-era text and poetry, this corpus aggregates data from diverse sources covering contemporary materials, including spoken tutorials, children's magazines, radio conversations, and instructional materials. We benchmark this new dataset by fine-tuning three complementary models (ByT5, NLLB, and IndicTrans-v2) to demonstrate its utility. Our experiments demonstrate that models trained on the Samasāmayik corpus achieve significant performance gains on in-domain test data, while achieving comparable performance on other widely used test sets, establishing a strong new performance baseline for contemporary Hindi-Sanskrit translation. Furthermore, a comparative analysis against existing corpora reveals minimal semantic and lexical overlap, confirming the novelty and non-redundancy of our dataset as a robust new resource for low-resource Indic language MT.