How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts

arXiv cs.CL / 4/14/2026


Key Points

  • The study addresses a safety gap in deploying LLMs for clinical decision support by testing how robustly models handle patient measurements across heterogeneous clinical note formats rather than only arithmetic accuracy.
  • It introduces ClinicNumRobBench, a benchmark with 1,624 context-question instances covering four clinical numeracy skills—value retrieval, arithmetic computation, relational comparison, and aggregation—using 42 question templates and three semantically equivalent representations of longitudinal MIMIC-IV vital-sign records.
  • Experiments across 14 LLMs find that value retrieval is generally strong (most models >85% accuracy), whereas relational comparison and aggregation are much more difficult (some models <15%).
  • The results show that fine-tuning on medical data can substantially worsen numeracy (a reduction of more than 30% relative to base models) and that performance degrades when the note-style representation changes, indicating sensitivity to input format.
  • The authors provide the benchmark plus code/data for public use, positioning ClinicNumRobBench as a rigorous testbed for clinically reliable numerical reasoning.
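To make the benchmark's four skill types concrete, here is a hypothetical sketch of what a ClinicNumRobBench-style instance set could look like. The schema, field names, and vital-sign values below are illustrative assumptions, not the paper's actual data format:

```python
# Hypothetical sketch of benchmark instances covering the four numeracy
# skills; field names and values are illustrative, not the actual schema.
from statistics import mean

# A small longitudinal vital-sign record (toy values, not MIMIC-IV data)
context = [
    {"time": "08:00", "heart_rate": 88, "sbp": 132},
    {"time": "12:00", "heart_rate": 95, "sbp": 141},
    {"time": "16:00", "heart_rate": 79, "sbp": 128},
]

instances = [
    # value retrieval: read one measurement directly from the context
    {"skill": "retrieval",
     "question": "What was the heart rate at 12:00?",
     "answer": 95},
    # arithmetic computation: combine two measurements
    {"skill": "arithmetic",
     "question": "By how much did SBP change from 08:00 to 12:00?",
     "answer": 141 - 132},
    # relational comparison: compare measurements across time points
    {"skill": "comparison",
     "question": "Was the 16:00 heart rate lower than at 08:00?",
     "answer": True},
    # aggregation: summarize over the whole record
    {"skill": "aggregation",
     "question": "What was the mean heart rate, to one decimal place?",
     "answer": round(mean(r["heart_rate"] for r in context), 1)},
]
```

The ordering of difficulty reported in the paper maps onto this sketch: retrieval needs only a single lookup, while comparison and aggregation require the model to locate and relate several values at once.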

Abstract

Large Language Models (LLMs) are increasingly being explored for clinical question answering and decision support, yet safe deployment critically requires reliable handling of patient measurements in heterogeneous clinical notes. Existing evaluations of LLMs for clinical numerical reasoning provide limited operation-level coverage, restricted primarily to arithmetic computation, and rarely assess the robustness of numerical understanding across clinical note formats. We introduce ClinicNumRobBench, a benchmark of 1,624 context-question instances with ground-truth answers that evaluates four main types of clinical numeracy: value retrieval, arithmetic computation, relational comparison, and aggregation. To stress-test robustness, ClinicNumRobBench presents longitudinal MIMIC-IV vital-sign records in three semantically equivalent representations, including a real-world note-style variant derived from the Open Patients dataset, and instantiates queries using 42 question templates. Experiments on 14 LLMs show that value retrieval is generally strong, with most models exceeding 85% accuracy, while relational comparison and aggregation remain challenging, with some models scoring below 15%. Fine-tuning on medical data can reduce numeracy relative to base models by over 30%, and performance drops under note-style variation indicate LLM sensitivity to format. ClinicNumRobBench offers a rigorous testbed for clinically reliable numerical reasoning. Code and data are available at https://github.com/MinhVuong2000/ClinicNumRobBench.
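The robustness protocol described above can be sketched as follows. This is a minimal illustration under two assumptions not stated in the summary: exact-match scoring, and a caller-supplied `ask_model` function standing in for an LLM call; the paper's actual metric, prompting setup, and representation formats may differ:

```python
# Minimal sketch of format-robustness evaluation: the same vital-sign record
# is rendered in semantically equivalent representations and each is scored
# independently, so format sensitivity shows up as an accuracy gap.

record = [("08:00", 88), ("12:00", 95), ("16:00", 79)]  # (time, heart rate)

def as_table(rec):
    # Structured, table-like rendering of the record
    return "time | HR\n" + "\n".join(f"{t} | {hr}" for t, hr in rec)

def as_note(rec):
    # Free-text, note-style rendering of the same values
    return " ".join(f"At {t}, heart rate was {hr} bpm." for t, hr in rec)

representations = {"table": as_table(record), "note": as_note(record)}
question = "What was the maximum heart rate?"
gold = "95"

def evaluate(ask_model):
    # Per-representation exact-match score; a format-robust model
    # scores the same on every rendering of the same facts.
    return {name: float(ask_model(ctx, question).strip() == gold)
            for name, ctx in representations.items()}
```

A model that answers correctly only on the table variant, for example, would score `{"table": 1.0, "note": 0.0}`, which is exactly the kind of representation sensitivity the benchmark is built to expose.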