Creating and Evaluating Figurative Language Dataset for Sindhi
arXiv cs.CL / 5/5/2026
Key Points
- The paper introduces SiNFluD, a new benchmark dataset specifically designed for Sindhi figurative language classification.
- The dataset is built by collecting raw Sindhi text from blogs, social media, and literary sources, then preparing it for human annotation.
- Two native annotators label the data using Doccano, reaching an inter-annotator agreement of 0.81.
- Baseline experiments use 5-fold and 10-fold cross-validation and evaluate mBERT, XLM-RoBERTa, and XLM-RoBERTa-XL, along with SetFit for few-shot fine-tuning.
- The results show that the pretrained XLM-RoBERTa-XL model delivers the best overall performance on the benchmark.
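
The agreement and cross-validation steps mentioned above can be illustrated with a minimal, dependency-free sketch. It rests on several assumptions: the summary does not say which agreement statistic produced the 0.81 figure, so Cohen's kappa is used here only as a common choice for two annotators; the label names `fig` and `lit` are hypothetical placeholders, not SiNFluD's actual label set; and the stratified fold assignment is one plausible way to build 5-fold or 10-fold splits, not necessarily the authors' procedure.

```python
from collections import Counter, defaultdict
import random


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def stratified_folds(labels, k, seed=0):
    """Return k lists of item indices, spreading each class evenly over folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the last class stopped.
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % k].append(idx)
        offset += len(indices)
    return folds


if __name__ == "__main__":
    # Hypothetical toy annotations; "fig"/"lit" are placeholder labels.
    annotator_a = ["fig", "fig", "lit", "lit"]
    annotator_b = ["fig", "lit", "lit", "lit"]
    print("kappa:", cohen_kappa(annotator_a, annotator_b))

    gold = ["fig"] * 10 + ["lit"] * 10
    for fold_id, test_idx in enumerate(stratified_folds(gold, k=5)):
        train_idx = [i for i in range(len(gold)) if i not in set(test_idx)]
        print(fold_id, "train:", len(train_idx), "test:", len(test_idx))
```

Each fold serves once as the held-out test set while the remaining folds form the training set; changing `k` from 5 to 10 gives the second setting the summary mentions. In practice, library routines such as scikit-learn's `cohen_kappa_score` and `StratifiedKFold` would likely replace these hand-written helpers.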