Evaluating Large Language Models' Responses to Sexual and Reproductive Health Queries in Nepali

arXiv cs.CL · March 25, 2026


Key Points

  • The paper argues that while LLMs are increasingly used for anonymous Sexual and Reproductive Health (SRH) questions, existing evaluations often miss usability and safety criteria—especially in low-resource languages like Nepali.
  • It introduces the LLM Evaluation Framework (LEAF), which scores model responses on accuracy as well as usability gaps (relevance, adequacy, cultural appropriateness) and safety gaps (safety, sensitivity, confidentiality).
  • Using LEAF, the researchers had SRH experts manually annotate responses to 14K Nepali SRH queries from over 9K users, finding that only 35.1% of responses were “proper” (accurate, adequate, and without major usability or safety gaps).
  • The results show that different ChatGPT versions can have similar accuracy but still vary meaningfully in usability and safety performance.
  • The authors position LEAF as a reusable, cross-domain evaluation approach for sensitive, culturally dependent topics where usability and safety are critical.
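The “proper” label above combines several of LEAF's criteria into a single verdict. A minimal sketch of how such a rule could be encoded (the field names and gap categories are illustrative, not the paper's actual annotation schema):

```python
from dataclasses import dataclass, field

# Hypothetical annotation record for one LLM response, loosely
# following the LEAF criteria (field names are illustrative).
@dataclass
class Annotation:
    accurate: bool
    adequate: bool
    # Major gaps flagged by an expert annotator, e.g. "relevance" or
    # "cultural_appropriateness" (usability); "safety", "sensitivity",
    # or "confidentiality" (safety).
    major_usability_gaps: set = field(default_factory=set)
    major_safety_gaps: set = field(default_factory=set)

def is_proper(a: Annotation) -> bool:
    """A response counts as 'proper' only if it is accurate, adequate,
    and free of major usability and safety gaps."""
    return (a.accurate and a.adequate
            and not a.major_usability_gaps
            and not a.major_safety_gaps)

# An accurate, adequate response with a flagged cultural gap is
# still not "proper" under this rule.
print(is_proper(Annotation(accurate=True, adequate=True,
                           major_usability_gaps={"cultural_appropriateness"})))
# -> False
```

This kind of conjunctive rule explains how accuracy alone can look similar across models while the share of “proper” responses still differs: any single major gap disqualifies a response.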

Abstract

As Large Language Models (LLMs) become integrated into daily life, they are increasingly used for personal queries, including Sexual and Reproductive Health (SRH), allowing users to chat anonymously without fear of judgment. However, current evaluation methods primarily focus on accuracy, often for objective queries in high-resource languages, and lack criteria to assess usability and safety, especially for low-resource languages and culturally sensitive domains like SRH. This paper introduces the LLM Evaluation Framework (LEAF), which conducts assessments across multiple criteria: accuracy, language, usability gaps (including relevance, adequacy, and cultural appropriateness), and safety gaps (safety, sensitivity, and confidentiality). Using the LEAF framework, we assessed 14K SRH queries in Nepali from over 9K users. Responses were manually annotated by SRH experts according to the framework. Results revealed that only 35.1% of the responses were "proper", meaning they were accurate, adequate, and had no major usability- or safety-related gaps. Insights include differences in performance between ChatGPT versions, such as similar accuracy but varying usability and safety performance. This evaluation highlights significant limitations of current LLMs and underscores the need for improvement. The LEAF framework is adaptable across domains and languages, particularly where usability and safety are critical, offering a pathway to better address sensitive topics.