Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation

arXiv cs.AI / 4/30/2026

Key Points

  • The paper addresses the lack of annotated Sanskrit resources for Named Entity Recognition (NER), which hinders digitisation of classical literature.
  • It introduces Naamah, a large-scale “silver standard” Sanskrit NER corpus containing 102,942 sentences, created via a pipeline that seeds entities from DBpedia and generates additional data with an LLM (see the seeding sketch after this list).
  • The generation uses a 24B-parameter hybrid reasoning model to produce grammatically natural and syntactically diverse training examples, aiming to improve over error-prone generic LLM augmentation.
  • The authors benchmark two transformer models—XLM-RoBERTa (multilingual) and IndicBERTv2 (parameter-efficient)—on the newly released dataset.
  • Overall, the work combines knowledge-base seeding and structured LLM generation to create higher-quality training data for classical-grammar-sensitive NLP tasks.
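
For flavour, here is a minimal sketch of what the DBpedia seeding step could look like. This is not the authors' pipeline: the entity class (dbo:Person), the use of Sanskrit ("sa") language tags, and the query shape are illustrative assumptions.

```python
import requests

# Public DBpedia SPARQL endpoint; the entity class and language filter
# below are illustrative assumptions, not the paper's actual queries.
SPARQL_ENDPOINT = "https://dbpedia.org/sparql"

SEED_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?entity ?label WHERE {
  ?entity a dbo:Person ;
          rdfs:label ?label .
  FILTER (lang(?label) = "sa")   # Sanskrit-tagged labels, where present
}
LIMIT 100
"""

def fetch_seed_entities(query: str) -> list[tuple[str, str]]:
    """Return (URI, surface form) pairs to seed an entity list."""
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return [(r["entity"]["value"], r["label"]["value"]) for r in rows]

if __name__ == "__main__":
    for uri, label in fetch_seed_entities(SEED_QUERY):
        print(f"{label}\t{uri}")
```

Seed lists like this would then be handed to the generation stage, where the LLM wraps each entity in natural sentence contexts.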

Abstract

The digitisation of classical Sanskrit literature is impeded by a scarcity of annotated resources, particularly for Named Entity Recognition (NER). While recent methodologies utilise generic Large Language Models (LLMs) for data augmentation, these approaches remain prone to error and often lack the reasoning depth required for classical grammar. In this work, we introduce Naamah, a high-quality silver-standard Sanskrit NER dataset comprising 102,942 sentences. We propose a methodology that combines entity extraction from DBpedia with the generative capabilities of a 24B-parameter hybrid reasoning model to create grammatically natural and syntactically diverse training data. We utilise this dataset to benchmark two transformer architectures: the massively multilingual XLM-RoBERTa and the parameter-efficient IndicBERTv2.
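
To give a sense of how such a benchmark is typically set up, here is a minimal token-classification sketch with Hugging Face transformers. It is not the authors' training configuration: the BIO tag set, the toy sentence, and everything except the xlm-roberta-base checkpoint name are assumptions.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumed BIO tag set; the actual Naamah label inventory is not given here.
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
label2id = {label: i for i, label in enumerate(LABELS)}

MODEL_NAME = "xlm-roberta-base"  # IndicBERTv2 would be slotted in analogously
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id=label2id,
)

def encode_example(words: list[str], tags: list[str]) -> dict:
    """Tokenise pre-split words and project word-level BIO tags onto
    subword tokens; -100 marks positions the loss should ignore."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, prev_word = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == prev_word:
            labels.append(-100)  # special token or subword continuation
        else:
            labels.append(label2id[tags[word_id]])
        prev_word = word_id
    enc["labels"] = labels
    return enc

# Toy usage with a transliterated placeholder sentence (not from the corpus).
example = encode_example(["rāmaḥ", "ayodhyāyāṁ", "vasati"],
                         ["B-PER", "B-LOC", "O"])
```

From here, the encoded examples would feed a standard fine-tuning loop; the same recipe applies to IndicBERTv2 by swapping in its checkpoint name (the exact hub ID is an assumption and is therefore left out).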