Naamah: A Large Scale Synthetic Sanskrit NER Corpus via DBpedia Seeding and LLM Generation
arXiv cs.AI / 30 Apr 2026
Key Points
- The paper addresses the lack of annotated Sanskrit resources for Named Entity Recognition (NER), which hinders digitisation of classical literature.
- It introduces Naamah, a large-scale “silver standard” Sanskrit NER corpus containing 102,942 sentences, created via a pipeline that seeds entities from DBpedia and generates additional data with an LLM.
- The generation uses a 24B-parameter hybrid reasoning model to produce grammatically natural and syntactically diverse training examples, aiming to improve over error-prone generic LLM augmentation.
- The authors benchmark two transformer models—XLM-RoBERTa (multilingual) and IndicBERTv2 (parameter-efficient)—on the newly released dataset.
- Overall, the work combines knowledge-base seeding with structured LLM generation to produce higher-quality training data for NLP on grammatically complex classical languages.
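To make the seeding step concrete, here is a minimal sketch (not the paper's actual pipeline) of how a "silver standard" annotation pass might work: an entity lexicon harvested from a knowledge base (a tiny hand-made stand-in for DBpedia below) is matched against tokenized sentences to emit BIO labels, which a downstream LLM-generation stage could then diversify. All entity names and types here are illustrative assumptions.

```python
# Silver-standard NER sketch: longest-match lexicon lookup producing BIO tags.
# LEXICON stands in for entities seeded from DBpedia; keys are token tuples.
LEXICON = {
    ("Rama",): "PER",       # hypothetical seeded person entity
    ("Ayodhya",): "LOC",    # hypothetical seeded location entity
}

def bio_tag(tokens, lexicon=LEXICON):
    """Return one BIO label per token via longest-match lexicon lookup."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest entity span starting at position i first.
        for span in sorted(lexicon, key=len, reverse=True):
            if tuple(tokens[i:i + len(span)]) == span:
                etype = lexicon[span]
                labels[i] = f"B-{etype}"
                for j in range(i + 1, i + len(span)):
                    labels[j] = f"I-{etype}"
                i += len(span)
                matched = True
                break
        if not matched:
            i += 1
    return labels

print(bio_tag(["Rama", "went", "to", "Ayodhya"]))
# → ['B-PER', 'O', 'O', 'B-LOC']
```

Lexicon matching alone is noisy (hence "silver" rather than gold standard); the paper's contribution is pairing such seeds with structured LLM generation to get grammatically natural, diverse training sentences.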