SiPaKosa: A Comprehensive Corpus of Canonical and Classical Buddhist Texts in Sinhala and Pali
arXiv cs.CL / 4/1/2026
Key Points
- SiPaKosa is a new bilingual (Sinhala and Pali) Buddhist text corpus with about 786K sentences and 9.25M words, combining 16 copyright-cleared historical documents with complete Tripitaka canonical texts web-scraped from repositories.
- The corpus creation pipeline used high-quality OCR via Google Document AI for historical manuscripts, plus systematic web scraping for canonical material, followed by quality control and rich metadata annotation.
- Data is organized into language-specific subcorpora, including Sinhala and Mixed Sinhala-Pali, enabling targeted research across linguistic variants.
- The authors evaluate ten pretrained language models on the corpus and report perplexities ranging from 1.09 to 189.67, with proprietary models outperforming open-source models by roughly 3–6×.
- The dataset is positioned to support domain-adapted language model pretraining, historical/linguistic analysis, and information-retrieval systems for Buddhist scholarship.
- The work is released on arXiv as arXiv:2603.29221v1, adding to the pool of high-quality, culturally focused training and evaluation data.
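
For readers unfamiliar with the perplexity figures quoted above, the following is a minimal sketch of the standard definition: the exponential of the negative mean log-probability a model assigns to each token. The function name `perplexity` and the sample log-probabilities are illustrative assumptions, not values or code from the paper, and the summary does not say how the authors normalized scores across models.

```python
import math

def perplexity(token_logprobs):
    """Return exp(-mean(log p)) for a sequence of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities (natural log), not taken from the paper.
confident = [math.log(0.9)] * 4    # model assigns 90% to every observed token
uncertain = [math.log(0.01)] * 4   # model assigns 1% to every observed token

ppl_confident = perplexity(confident)
ppl_uncertain = perplexity(uncertain)
print(ppl_confident, ppl_uncertain)
```

A value near 1 means the model predicts almost every token with high confidence, while larger values mean it is spreading probability over many alternatives. Because the mean is taken per token, scores from models with different tokenizers are only directly comparable after normalizing to a shared unit such as words or characters.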