Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation

arXiv cs.CL / 3/26/2026


Key Points

  • The paper introduces Samasāmayik, a new large-scale parallel dataset containing 92,196 Hindi–Sanskrit sentence pairs curated for machine translation research.
  • Unlike many existing Sanskrit resources that emphasize classical poetry or historical texts, the dataset compiles contemporary and diverse materials such as spoken tutorials, children’s magazines, radio conversations, and instructional content.
  • The authors evaluate the dataset’s usefulness by fine-tuning three translation models—ByT5, NLLB, and IndicTrans-v2—and show clear gains on in-domain test data.
  • They report that models trained with Samasāmayik achieve comparable performance on other standard test sets, positioning the dataset as a strong new baseline for Hindi–Sanskrit MT.
  • A comparison with existing corpora indicates low semantic and lexical overlap, suggesting the dataset is novel and non-redundant for low-resource Indic language translation.
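The overlap comparison in the last point can be illustrated with a simple sketch. The paper's exact overlap metrics are not specified in this summary; as a stand-in, the example below computes vocabulary-level Jaccard similarity between two corpora, one common way to quantify lexical overlap. The corpora and tokenization here are illustrative placeholders, not the paper's data or method.

```python
# Illustrative sketch of a lexical-overlap check between two corpora.
# Assumption: whitespace tokenization and Jaccard similarity over
# vocabularies; the paper's actual metric may differ.

def vocab(sentences):
    """Collect the set of whitespace-delimited tokens in a corpus."""
    return {tok for sent in sentences for tok in sent.split()}

def lexical_overlap(corpus_a, corpus_b):
    """Jaccard similarity of the two vocabularies (0 = disjoint, 1 = identical)."""
    va, vb = vocab(corpus_a), vocab(corpus_b)
    union = va | vb
    return len(va & vb) / len(union) if union else 0.0

# Toy transliterated placeholder sentences (not real corpus data):
contemporary = ["bacchon ki patrika ka lekh", "radio varta ka ansh"]
classical = ["praachin kavya ki pankti", "shastriya shlok ka pada"]
print(round(lexical_overlap(contemporary, classical), 3))  # → 0.143
```

A low score like this (only function words shared) is the kind of signal that would support the paper's claim that a contemporary-domain corpus is non-redundant with classical-text resources.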

Abstract

We release Samasāmayik, a novel, meticulously curated, large-scale Hindi-Sanskrit corpus comprising 92,196 parallel sentences. Unlike most data available in Sanskrit, which focuses on classical-era text and poetry, this corpus aggregates data from diverse sources covering contemporary materials, including spoken tutorials, children's magazines, radio conversations, and instructional materials. We benchmark this new dataset by fine-tuning three complementary models (ByT5, NLLB, and IndicTrans-v2) to demonstrate its utility. Our experiments demonstrate that models trained on the Samasāmayik corpus achieve significant performance gains on in-domain test data, while achieving comparable performance on other widely used test sets, establishing a strong new performance baseline for contemporary Hindi-Sanskrit translation. Furthermore, a comparative analysis against existing corpora reveals minimal semantic and lexical overlap, confirming the novelty and non-redundancy of our dataset as a robust new resource for low-resource Indic language MT.