Evaluating Large Language Models' Responses to Sexual and Reproductive Health Queries in Nepali
arXiv cs.CL · March 25, 2026
Key Points
- The paper argues that while LLMs are increasingly used for anonymous Sexual and Reproductive Health (SRH) questions, existing evaluations often miss usability and safety criteria—especially in low-resource languages like Nepali.
- It introduces the LLM Evaluation Framework (LEAF), which scores model responses on accuracy as well as usability gaps (relevance, adequacy, cultural appropriateness) and safety gaps (safety, sensitivity, confidentiality).
- Using LEAF, experts manually annotated model responses to 14K Nepali SRH queries from 9K+ users, finding that only 35.1% of responses were “proper” (accurate, adequate, and free of major usability or safety gaps).
- The results show that different ChatGPT versions can have similar accuracy but still vary meaningfully in usability and safety performance.
- The authors position LEAF as a reusable, cross-domain evaluation approach for sensitive, culturally dependent topics where usability and safety are critical.
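To make the "proper" criterion concrete, here is a minimal sketch of how per-response expert annotations along LEAF's dimensions could be aggregated into the reported proper-response rate. The field names and data structure are illustrative assumptions, not the paper's actual annotation schema:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """One expert annotation of an LLM response.

    Field names are hypothetical stand-ins for LEAF's dimensions:
    accuracy, usability gaps (relevance, adequacy, cultural
    appropriateness), and safety gaps (safety, sensitivity,
    confidentiality).
    """
    accurate: bool
    # usability gaps
    irrelevant: bool
    inadequate: bool
    culturally_inappropriate: bool
    # safety gaps
    unsafe: bool
    insensitive: bool
    breaches_confidentiality: bool

    def is_proper(self) -> bool:
        """A response counts as 'proper' only if it is accurate and
        flagged with no usability or safety gap."""
        return self.accurate and not any((
            self.irrelevant, self.inadequate, self.culturally_inappropriate,
            self.unsafe, self.insensitive, self.breaches_confidentiality,
        ))

def proper_rate(annotations: list[Annotation]) -> float:
    """Percentage of annotated responses judged proper."""
    return 100.0 * sum(a.is_proper() for a in annotations) / len(annotations)

# Tiny illustrative sample: one proper response, one accurate but unsafe.
sample = [
    Annotation(True, False, False, False, False, False, False),
    Annotation(True, False, False, False, True, False, False),
]
print(proper_rate(sample))  # 50.0
```

This all-dimensions-must-pass aggregation is why accuracy alone can look similar across model versions while the proper-response rate still differs: a single usability or safety gap disqualifies an otherwise accurate answer.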