Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models
arXiv cs.CL / 3/17/2026
Key Points
- The authors introduce Multilingual TinyStories, a synthetic corpus of children's stories in 17 Indian languages designed to train small language models.
- A hybrid curation pipeline combines the Sarvam-M language model with combinatorial prompt engineering, then uses the Google Translate API for broad cross-lingual expansion.
- The release comprises 132,942 stories totaling over 93.9 million tokens, strictly localized to native scripts.
- The dataset aims to address data scarcity in low-resource Indic languages by supporting multilingual modeling and transfer learning for SLMs.
- The authors position the corpus as a foundational resource for researchers and developers working on multilingual NLP for Indic languages.
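
To make "combinatorial prompt engineering" concrete, here is a minimal Python sketch of how story prompts could be built by crossing small attribute lists. It is not the authors' pipeline: the attribute lists, the prompt wording, and the `build_prompts` helper are all hypothetical, and the calls to Sarvam-M and the Google Translate API are deliberately left out.

```python
import itertools

# Hypothetical attribute lists; the paper's actual lists are not given in this summary.
LANGUAGES = ["Hindi", "Tamil", "Bengali"]
CHARACTERS = ["a curious girl", "a shy elephant"]
SETTINGS = ["a village fair", "a mango orchard"]
MORALS = ["honesty", "sharing"]


def build_prompts(languages, characters, settings, morals):
    """Return one prompt record per combination of the attribute lists."""
    prompts = []
    for lang, who, where, moral in itertools.product(
        languages, characters, settings, morals
    ):
        text = (
            f"Write a short children's story in {lang}, using only its native "
            f"script, about {who} at {where}. The story should teach {moral}."
        )
        prompts.append({"language": lang, "prompt": text})
    return prompts


prompts = build_prompts(LANGUAGES, CHARACTERS, SETTINGS, MORALS)
print(len(prompts))  # 3 languages x 2 characters x 2 settings x 2 morals = 24
```

Because the prompt count is the product of the list sizes, a few short lists are enough to yield a large and varied prompt set, each of which would then be sent to a generation model.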