SiPaKosa: A Comprehensive Corpus of Canonical and Classical Buddhist Texts in Sinhala and Pali

arXiv cs.CL / 4/1/2026

Key Points

SiPaKosa is a new bilingual (Sinhala and Pali) Buddhist text corpus with about 786K sentences and 9.25M words, combining 16 copyright-cleared historical documents with complete Tripitaka canonical texts web-scraped from repositories.
The corpus creation pipeline used high-quality OCR via Google Document AI for historical manuscripts, plus systematic web scraping for canonical material, followed by quality control and rich metadata annotation.
Data is organized into language-specific subcorpora, including Sinhala and Mixed Sinhala-Pali, enabling targeted research across linguistic variants.
The authors evaluate ten pretrained language models on the corpus and find perplexity ranges from 1.09 to 189.67, with proprietary models outperforming open-source models by roughly 3–6×.
The dataset is positioned to support domain-adapted language model pretraining, historical/linguistic analysis, and information-retrieval systems for Buddhist scholarship.
The work was released on arXiv as arXiv:2603.29221v1, adding to the available high-quality, culturally focused training and evaluation data.

Abstract

SiPaKosa is a comprehensive corpus of Sinhala and Pali doctrinal texts comprising approximately 786K sentences and 9.25M words, incorporating 16 copyright-cleared historical Buddhist documents alongside the complete web-scraped Tripitaka canonical texts. The corpus was created through high-quality OCR using Google Document AI on historical manuscripts, combined with systematic web scraping of canonical repositories, followed by rigorous quality control and metadata annotation. The corpus is organised into language-specific subcorpora: Sinhala and Mixed Sinhala-Pali. We evaluate ten pretrained language models on the corpus and obtain perplexity scores ranging from 1.09 to 189.67. This analysis shows that proprietary models outperform open-source alternatives by a factor of three to six. The corpus supports the pretraining of domain-adapted language models, facilitates historical language analysis, and aids the development of information retrieval systems for Buddhist scholarship while preserving Sinhala cultural heritage.
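
For readers unfamiliar with the metric, perplexity is commonly defined as the exponential of the negative mean log-probability a model assigns to each unit of text, so lower is better and 1.0 means every unit was predicted with certainty. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function name `perplexity` and the sample log-probabilities are hypothetical, and the summary does not state whether the reported scores are normalised per token, word, or character.

```python
import math


def perplexity(logprobs):
    """Return exp(-mean(logprobs)) for a list of natural-log probabilities.

    `logprobs` holds one log-probability per scored unit (assumed here to be
    tokens; the paper's normalisation unit is not given in this summary).
    """
    if not logprobs:
        raise ValueError("need at least one log-probability")
    return math.exp(-sum(logprobs) / len(logprobs))


# Hypothetical values for illustration only, not results from the paper.
confident_model = [math.log(0.9)] * 4   # assigns 0.9 to each observed token
uncertain_model = [math.log(0.01)] * 4  # assigns 0.01 to each observed token

print(perplexity(confident_model))
print(perplexity(uncertain_model))
```

Because token-level perplexity depends on how each model's tokenizer splits Sinhala and Pali text, scores from models with different tokenizers are only comparable when normalised to a shared unit such as words or characters.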