Developing and evaluating a chatbot to support maternal health care

arXiv cs.AI / 3/16/2026

📰 NewsDeveloper Stack & InfrastructureTools & Practical UsageModels & Research

共有:

Key Points

The paper introduces a chatbot for maternal health in India that combines stage-aware triage, hybrid retrieval over guidelines, and evidence-conditioned generation from an LLM to handle short, code-mixed multilingual queries.
It provides an evaluation workflow for high-stakes deployment, including a labeled triage benchmark (N=150) with emergency recall metrics, a synthetic multi-evidence retrieval benchmark (N=100) with evidence labels, an LLM-as-judge comparison on real queries (N=781), and expert validation.
Findings indicate that trustworthy medical assistants in multilingual, noisy settings require defense-in-depth design and multi-method evaluation rather than reliance on a single model or metric.
The work reflects a multi-stakeholder collaboration among academia, a health tech company, a public health nonprofit, and a hospital, highlighting real-world deployment considerations in low-resource settings.

Abstract

The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource settings where users have low health literacy and limited access to care. However, deploying such systems is technically challenging: user queries are short, underspecified, and code-mixed across languages, answers require regional context-specific grounding, and partial or missing symptom context makes safe routing decisions difficult. We present a chatbot for maternal health in India developed through a partnership between academic researchers, a health tech company, a public health nonprofit, and a hospital. The system combines (1) stage-aware triage, routing high-risk queries to expert templates, (2) hybrid retrieval over curated maternal/newborn guidelines, and (3) evidence-conditioned generation from an LLM. Our core contribution is an evaluation workflow for high-stakes deployment under limited expert supervision. Targeting both component-level and end-to-end testing, we introduce: (i) a labeled triage benchmark (N=150) achieving 86.7% emergency recall, explicitly reporting the missed-emergency vs. over-escalation trade-off; (ii) a synthetic multi-evidence retrieval benchmark (N=100) with chunk-level evidence labels; (iii) LLM-as-judge comparison on real queries (N=781) using clinician-codesigned criteria; and (iv) expert validation. Our findings show that trustworthy medical assistants in multilingual, noisy settings require defense-in-depth design paired with multi-method evaluation, rather than any single model and evaluation method choice.

State of MCP Security 2026: We Scanned 15,923 AI Tools. Here's What We Found.

Dev.to

I Built a Zombie Process Killer Because Claude Code Ate 14GB of My RAM

Dev.to

Data Augmentation Using GANs

Dev.to

Building Safety Guardrails for LLM Customer Service That Actually Work in Production

Dev.to

The New AI Agent Primitive: Why Policy Needs Its Own Language (And Why YAML and Rego Fall Short)

Dev.to

Developing and evaluating a chatbot to support maternal health care

Key Points

Abstract

Related Articles

State of MCP Security 2026: We Scanned 15,923 AI Tools. Here's What We Found.

I Built a Zombie Process Killer Because Claude Code Ate 14GB of My RAM

Data Augmentation Using GANs

Building Safety Guardrails for LLM Customer Service That Actually Work in Production

The New AI Agent Primitive: Why Policy Needs Its Own Language (And Why YAML and Rego Fall Short)

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer