Evaluating Large Language Models on Historical Health Crisis Knowledge in Resource-Limited Settings: A Hybrid Multi-Metric Study

arXiv cs.CL / 3/24/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper evaluates several major LLMs (GPT-4, Gemini Pro, Llama 3, and Mistral-7B) on health-crisis question answering for COVID-19, dengue, Nipah, and Chikungunya in Bangladesh’s low-resource setting.
  • The authors build a QA dataset sourced from authoritative materials and assess outputs with three complementary evaluation methods: semantic similarity, expert-model cross-evaluation, and Natural Language Inference (NLI).
  • Results show that LLMs can capture some epidemiological history and health-crisis knowledge, but they also exhibit notable reliability limitations.
  • The study concludes that while LLMs have promise for informing policy in resource-constrained environments, their risks must be carefully managed given variable performance.

Abstract

Large Language Models (LLMs) offer significant potential for delivering health information. However, their reliability in low-resource contexts remains uncertain. This study evaluates GPT-4, Gemini Pro, Llama 3, and Mistral-7B on health crisis-related enquiries concerning COVID-19, dengue, the Nipah virus, and Chikungunya in the low-resource context of Bangladesh. We constructed a question–answer dataset from authoritative sources and assessed model outputs through semantic similarity, expert-model cross-evaluation, and Natural Language Inference (NLI). Findings highlight both the strengths and limitations of LLMs in representing epidemiological history and health crisis knowledge, underscoring their promise and risks for informing policy in resource-constrained environments.
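To make the semantic-similarity leg of the evaluation concrete, here is a minimal toy sketch. The paper's actual pipeline would use embedding models (and an NLI model for entailment checks); this stand-in scores a model answer against a reference answer with bag-of-words cosine similarity, and the example texts are invented for illustration.

```python
from collections import Counter
import math

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two texts -- a toy
    stand-in for the embedding-based semantic similarity used to
    compare model answers against reference answers."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

# Hypothetical reference/candidate pair, not taken from the paper's dataset.
reference = "Dengue is transmitted by Aedes aegypti mosquitoes"
candidate = "Aedes aegypti mosquitoes transmit dengue in Bangladesh"
print(f"semantic-similarity score: {cosine_similarity(reference, candidate):.2f}")
```

A real evaluation would replace the token-overlap vectors with sentence embeddings, then aggregate this score with the cross-evaluation and NLI verdicts per question.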