AI Navigate

Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings

arXiv cs.CL · March 16, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that current LLM benchmarks in healthcare often miss real-world user contexts and cultural practices, underscoring the need for contextually grounded evaluation.
  • It introduces Samiksha, a community-driven evaluation pipeline co-created with civil-society organizations (CSOs) and community members, in which community feedback guides what to evaluate, how the benchmark is built, and how outputs are scored.
  • The approach enables scalable benchmarking through automation while embedding cultural awareness and community feedback throughout the evaluation process.
  • The authors demonstrate the pipeline in India's health domain, showing how multilingual LLMs handle nuanced community health queries and outlining a scalable path for inclusive LLM evaluation.

Abstract

Large Language Models (LLMs) are typically evaluated through general or domain-specific benchmarks testing capabilities that often lack grounding in the lived realities of end users. Critical domains such as healthcare require evaluations that extend beyond artificial or simulated tasks to reflect the everyday needs, cultural practices, and nuanced contexts of communities. We propose Samiksha, a community-driven evaluation pipeline co-created with civil-society organizations (CSOs) and community members. Our approach enables scalable, automated benchmarking through a culturally aware, community-driven pipeline in which community feedback informs what to evaluate, how the benchmark is built, and how outputs are scored. We demonstrate this approach in the health domain in India. Our analysis highlights how current multilingual LLMs address nuanced community health queries, while also offering a scalable pathway for contextually grounded and inclusive LLM evaluation.