IWLV-Ramayana: A Sarga-Aligned Parallel Corpus of Valmiki's Ramayana Across Indian Languages

arXiv cs.CL / 4/16/2026

Key Points

The paper introduces the IWLV Ramayana Corpus, a sarga (chapter)-aligned parallel dataset of Valmiki’s Ramayana across multiple Indian languages.
It currently offers complete English and Malayalam layers, with Hindi, Tamil, Kannada, and Telugu layers actively being produced.
The corpus is released in structured JSONL and includes explicit provenance metadata to support traceability and scholarly reuse.
The authors position the dataset for comparative literature, corpus linguistics, digital humanities, and multilingual NLP applications.
They claim that, to their knowledge, it is the first sarga-aligned multilingual parallel corpus of the Valmiki Ramayana released in a machine-readable format with explicit provenance metadata.

Abstract

The Ramayana is among the most influential literary traditions of South and Southeast Asia, transmitted across numerous linguistic and cultural contexts over two millennia. Despite extensive scholarship on regional Ramayana traditions, computational resources enabling systematic cross-linguistic analysis remain limited. This paper introduces the IWLV Ramayana Corpus, a structured parallel corpus aligning Valmiki's Ramayana across multiple Indian languages at the level of the sarga (chapter). The corpus currently includes complete English and Malayalam layers, with Hindi, Tamil, Kannada, and Telugu layers in active production. The dataset is distributed in structured JSONL format with explicit provenance metadata, enabling applications in comparative literature, corpus linguistics, digital humanities, and multilingual natural language processing. To our knowledge, this is the first sarga-aligned multilingual parallel corpus of the Valmiki Ramayana with explicit provenance metadata and machine-readable format.
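As a rough illustration of what sarga-level alignment over JSONL could look like in practice, the Python sketch below groups records by sarga and pairs up two language layers. The abstract does not describe the schema, so every field name here (`kanda`, `sarga`, `lang`, `text`, `source`) and all sample values are assumptions made for this example, not the corpus's documented format.

```python
import json
from collections import defaultdict

# Hypothetical records. The field names and values are placeholders chosen
# for illustration; the real corpus schema may differ.
SAMPLE_JSONL = """\
{"kanda": "Bala", "sarga": 1, "lang": "en", "text": "placeholder English text", "source": "placeholder-en"}
{"kanda": "Bala", "sarga": 1, "lang": "ml", "text": "placeholder Malayalam text", "source": "placeholder-ml"}
{"kanda": "Bala", "sarga": 2, "lang": "en", "text": "placeholder English text 2", "source": "placeholder-en"}
"""


def align_by_sarga(lines):
    """Group JSONL records by (kanda, sarga), keyed by language code."""
    aligned = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        rec = json.loads(line)
        aligned[(rec["kanda"], rec["sarga"])][rec["lang"]] = rec
    return dict(aligned)


def parallel_pairs(aligned, src, tgt):
    """Yield (key, src_text, tgt_text) for sargas present in both layers."""
    for key in sorted(aligned):
        layers = aligned[key]
        if src in layers and tgt in layers:
            yield key, layers[src]["text"], layers[tgt]["text"]


aligned = align_by_sarga(SAMPLE_JSONL.splitlines())
pairs = list(parallel_pairs(aligned, "en", "ml"))
```

Keeping the whole record per language, rather than only its text, means the provenance field travels with each aligned unit. Sargas missing from one layer are skipped when pairing, which suits a corpus whose language layers are still being completed.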