SiDiaC-v.2.0: Sinhala Diachronic Corpus Version 2.0
arXiv cs.CL / 3/12/2026
Key Points
- SiDiaC-v.2.0 is the largest Sinhala diachronic corpus to date. Its texts were published between 1800 and 1955, and their written (composition) dates span the 5th to the 20th century.
- It contains 244k words across 185 literary works, all of which went through filtering, preprocessing, and copyright compliance checks. A subset of 59 documents (70k words) is additionally annotated by written date.
- Texts were digitised using Google Document AI OCR, then post-processed to fix formatting, address code-mixing, include special tokens, and repair malformed tokens. The syntactic annotation and text normalisation strategies were informed by FarPaHC, SiDiaC-v.1.0, and CCOHA.
- The corpus uses a two-layer genre categorisation: a primary layer (Non-Fiction vs Fiction) and a secondary layer (Religious, History, Poetry, Language, and Medical). It is intended to support Sinhala NLP research and extends prior work in a low-resource setting.
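
The metadata described in the key points (two genre layers, a publication date, and an optional written date) could be modelled roughly as follows. This is only an illustrative sketch: the class names, field names, and the choice to store the written date as a century number are assumptions made here, not the corpus's actual schema or file format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class PrimaryGenre(Enum):
    # Primary layer named in the summary.
    NON_FICTION = "Non-Fiction"
    FICTION = "Fiction"


class SecondaryGenre(Enum):
    # Secondary layer named in the summary.
    RELIGIOUS = "Religious"
    HISTORY = "History"
    POETRY = "Poetry"
    LANGUAGE = "Language"
    MEDICAL = "Medical"


@dataclass(frozen=True)
class CorpusDocument:
    """Hypothetical record for one work; not the corpus's real schema."""
    title: str
    primary_genre: PrimaryGenre
    secondary_genre: SecondaryGenre
    publication_year: int                  # summary gives 1800 to 1955
    written_century: Optional[int] = None  # assumed encoding; 5 to 20 if annotated

    def __post_init__(self) -> None:
        # Reject values outside the ranges stated in the summary.
        if not 1800 <= self.publication_year <= 1955:
            raise ValueError("publication_year outside 1800-1955")
        if self.written_century is not None and not 5 <= self.written_century <= 20:
            raise ValueError("written_century outside 5-20")


def dated_subset(docs: List[CorpusDocument]) -> List[CorpusDocument]:
    """Return only the documents that carry a written-date annotation."""
    return [d for d in docs if d.written_century is not None]
```

Keeping the written date optional mirrors the fact that only a subset of the documents is annotated with one, so a diachronic analysis would start from something like `dated_subset(...)` rather than from the full collection.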