Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Romanized Scripts in a Real World Setting

arXiv cs.CL / 4/1/2026

Key Points

  • The paper evaluates how romanized (Latin-script) versus native-script Indian-language inputs affect the reliability of leading LLMs in maternal and newborn healthcare triage.
  • Benchmarks on a real-world dataset of user-generated health queries spanning five Indian languages and Nepali show consistent performance degradation for romanized messages, with gaps of up to 24 points across languages and models.
  • The authors propose an uncertainty-based selective routing approach to mitigate the “script gap,” improving handling of low-confidence romanized queries.
  • The study estimates that the observed degradation could translate into nearly 2 million excess triage errors at their partner maternal health organization alone, underscoring safety risks.
  • Overall, the findings reveal a safety blind spot where LLMs may seem to understand romanized text but still fail to triage reliably in high-stakes clinical settings.

Abstract

Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. Speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely quantifies or evaluates this orthographic variation in real-world applications. We investigate how romanization impacts the reliability of LLMs in a critical domain: maternal and newborn healthcare triage. We benchmark leading LLMs on a real-world dataset of user-generated health queries spanning five Indian languages and Nepali. Our results reveal consistent degradation in performance for romanized messages, with gaps reaching up to 24 points across languages and models. We propose and evaluate an Uncertainty-based Selective Routing method to close this script gap. At our partner maternal health organization alone, this gap could cause nearly 2 million excess triage errors. Our findings highlight a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably.
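
The abstract does not spell out the routing mechanics, so below is a minimal sketch of what uncertainty-based selective routing could look like in practice. The `selective_route` function, the 0.8 confidence threshold, the triage label set, and the fallback handler are all illustrative assumptions, not the paper's actual implementation: the idea is simply that predictions whose confidence falls below a tuned threshold are diverted to a safer path, such as human review or a native-script pipeline.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical triage label set; the paper's labels may differ.
TRIAGE_LABELS = ["emergency", "urgent", "routine"]

@dataclass
class TriageResult:
    label: str
    confidence: float       # e.g., max softmax probability or a verbalized score
    routed_to_fallback: bool

def selective_route(
    query: str,
    triage_fn: Callable[[str], tuple[str, float]],
    fallback_fn: Callable[[str], str],
    threshold: float = 0.8,  # illustrative; would be tuned on held-out data
) -> TriageResult:
    """Route low-confidence triage predictions to a safer fallback.

    `triage_fn` stands in for an LLM call returning a label plus an
    uncertainty score; `fallback_fn` could be human review or a pipeline
    that first transliterates romanized input into native script.
    """
    label, confidence = triage_fn(query)
    if confidence < threshold:
        # Low confidence: defer instead of acting on an unreliable prediction.
        return TriageResult(fallback_fn(query), confidence, routed_to_fallback=True)
    return TriageResult(label, confidence, routed_to_fallback=False)

# Toy usage with stubbed model calls.
def _mock_llm(query: str) -> tuple[str, float]:
    # Romanized queries would often score lower confidence in this setting.
    return ("routine", 0.62)

def _mock_fallback(query: str) -> str:
    return "urgent"  # e.g., a human reviewer's decision

result = selective_route("pet me dard ho raha hai", _mock_llm, _mock_fallback)
print(result)
```

The design choice worth noting is that routing trades coverage for safety: the model still answers high-confidence queries automatically, while the subset where the script gap bites hardest is escalated rather than answered unreliably.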