Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models

arXiv cs.CL / 3/17/2026



Key Points

The authors introduce Multilingual TinyStories, a synthetic corpus of children's stories in 17 Indian languages designed to train small language models.
A hybrid curation pipeline combines native generation, using the Sarvam-M language model with combinatorial prompt engineering, with large-scale cross-lingual expansion through the Google Translate API.
The release comprises 132,942 stories totaling over 93.9 million tokens, strictly localized to native scripts.
The dataset aims to address data scarcity in low-resource Indic languages by supporting multilingual modeling and transfer learning for SLMs.
This resource serves as a foundational dataset for researchers and developers working on multilingual NLP in the Indic linguistic sphere.
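
To make the "combinatorial prompt engineering" idea concrete, here is a minimal sketch of how prompts could be built from the Cartesian product of a few story attributes. The attribute lists, the template wording, and the `build_prompts` helper are all hypothetical illustrations; the summary above does not specify the authors' actual attributes or template.

```python
from itertools import product

# Hypothetical attribute lists; the paper's real lists are not given here.
LANGUAGES = ["Hindi", "Tamil", "Bengali"]
CHARACTERS = ["a curious rabbit", "a kind grandmother"]
SETTINGS = ["a village fair", "a mango orchard"]
MORALS = ["sharing", "honesty"]

# Hypothetical prompt template, not the authors' wording.
TEMPLATE = (
    "Write a short children's story in {language}, using only its native script. "
    "The main character is {character}, the setting is {setting}, "
    "and the story should teach the value of {moral}."
)


def build_prompts(languages, characters, settings, morals):
    """Return one prompt for every combination of the attribute lists."""
    return [
        TEMPLATE.format(language=lang, character=char, setting=place, moral=moral)
        for lang, char, place, moral in product(languages, characters, settings, morals)
    ]


prompts = build_prompts(LANGUAGES, CHARACTERS, SETTINGS, MORALS)
print(len(prompts))  # 3 * 2 * 2 * 2 combinations
print(prompts[0])
```

The appeal of this design is that a handful of short lists multiplies into many distinct prompts, which helps keep synthetic stories varied without hand-writing each prompt.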

Abstract

The development of robust language models for low-resource languages is frequently bottlenecked by the scarcity of high-quality, coherent, and domain-appropriate training corpora. In this paper, we introduce the Multilingual TinyStories dataset, a large-scale, synthetically generated collection of children's stories encompassing 17 Indian languages. Designed specifically for the training and evaluation of Small Language Models (SLMs), the corpus provides simple, narrative-driven text strictly localized to native scripts. We detail our hybrid curation pipeline, which leverages the Sarvam-M language model and a novel combinatorial prompt engineering framework for native generation, coupled with the Google Translate API for large-scale cross-lingual expansion. Through strict programmatic filtering, we compiled 132,942 stories and over 93.9 million tokens in our release, serving as a foundational resource for multilingual language modeling and transfer learning in the Indic linguistic sphere.
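
The abstract mentions strict programmatic filtering and text localized to native scripts, but does not spell out the rules. One plausible check, sketched below under that assumption, is to measure what fraction of a story's letters and combining marks fall inside the Unicode block of the target script. The `SCRIPT_RANGES` table (three scripts only), the helper names, and the 0.95 threshold are illustrative choices, not the authors' implementation.

```python
import unicodedata

# Unicode block ranges for a few Indic scripts (illustrative subset).
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Tamil": (0x0B80, 0x0BFF),
}


def native_script_ratio(text, script):
    """Fraction of letters and combining marks that lie in the script's block."""
    low, high = SCRIPT_RANGES[script]
    # Count only letters (L*) and marks (M*); skip spaces, digits, punctuation.
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    inside = sum(1 for ch in relevant if low <= ord(ch) <= high)
    return inside / len(relevant)


def keep_story(text, script, threshold=0.95):
    """Keep a story only if it is almost entirely in the target script.

    The threshold is an assumed value chosen for illustration.
    """
    return native_script_ratio(text, script) >= threshold


print(keep_story("एक छोटा खरगोश था।", "Devanagari"))  # fully Devanagari
print(keep_story("एक rabbit था", "Devanagari"))        # mixed with Latin
```

Marks are counted alongside letters because vowel signs in Indic scripts are combining characters; ignoring them would undercount native-script content.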