Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

arXiv cs.CL / April 22, 2026


Key Points

  • The paper argues that existing evaluation metrics for medical question answering with LLMs over-rely on semantic similarity, which can mask gaps in medical accuracy and obscure health equity risks.
  • It introduces VB-Score (Verification-Based Score), a framework that separately evaluates entity recognition, semantic similarity, factual consistency, and structured information completeness.
  • The authors rigorously test three widely used LLMs on 48 public-health topics from authoritative sources and find a major mismatch between semantic and entity-level accuracy.
  • They report broadly severe failure patterns across all three models under the VB-Score criteria, including condition-based algorithmic discrimination: models perform about 13.8% worse on chronic-condition topics tied to older and minority populations (a worked example of this gap follows the list).
  • The results suggest that prompt engineering cannot fix core limitations in medical entity extraction and raise concerns that semantic-only evaluation is insufficient for medical AI safety and equity.
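
The 13.8% figure above is a relative gap, and the paper's summary does not say exactly which means it compares, so the following arithmetic sketch is only one plausible reading: a subgroup mean measured against the overall mean. The two topic-level scores below are invented for illustration; only the 13.8% figure comes from the paper.

```python
# Illustrative arithmetic for the reported condition-based disparity.
# Both score values are hypothetical; they are chosen so the relative
# gap reproduces the paper's reported ~13.8%.
overall_mean = 0.650            # hypothetical mean score across all 48 topics
chronic_condition_mean = 0.560  # hypothetical mean on chronic-condition topics

relative_gap = (overall_mean - chronic_condition_mean) / overall_mean
print(f"relative performance gap: {relative_gap:.1%}")  # -> 13.8%
```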

Abstract

The use of Large Language Models (LLMs) to help patients answer medical questions is becoming increasingly prevalent. However, most measures currently used to evaluate these models in this context quantify only how closely a model's answers match a reference semantically, and therefore provide little indication of medical accuracy or of the associated health equity risks. To address these shortcomings, we present VB-Score (Verification-Based Score), an evaluation framework for medical question answering that scores four components separately: entity recognition, semantic similarity, factual consistency, and structured information completeness. We rigorously evaluate three widely used LLMs on 48 public health topics drawn from high-quality, authoritative information sources. Our analyses reveal a major discrepancy between the models' semantic and entity-level accuracy, and all three models exhibit almost uniformly severe failures when judged against our criteria. We also find alarming performance disparities across public health topics: models score 13.8% below their overall average on topics concerning chronic conditions that disproportionately affect older and minority populations, a pattern consistent with condition-based algorithmic discrimination. Finally, our findings demonstrate that prompt engineering alone does not compensate for architectural limitations in medical entity extraction, raising the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.
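
The abstract names the four components but does not give an aggregation formula, so the Python sketch below is purely illustrative of what a component-wise evaluation in the spirit of VB-Score might look like. The set-based entity F1, the placeholder values for the other three components, the unweighted mean, and all names here are assumptions, not the authors' definitions.

```python
# Hypothetical sketch of a component-wise evaluation in the spirit of
# VB-Score. Component functions, values, and aggregation are illustrative
# assumptions, not the paper's actual definitions.
from dataclasses import dataclass


@dataclass
class ComponentScores:
    entity_f1: float            # medical entity recognition
    semantic_similarity: float  # e.g., an embedding cosine similarity
    factual_consistency: float  # e.g., an entailment-based check
    completeness: float         # coverage of required structured fields


def entity_f1(reference: set[str], predicted: set[str]) -> float:
    """Set-based F1 over normalized medical entity strings."""
    if not reference and not predicted:
        return 1.0
    if not reference or not predicted:
        return 0.0
    tp = len(reference & predicted)
    precision = tp / len(predicted)
    recall = tp / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def aggregate(scores: ComponentScores) -> float:
    """Unweighted mean of the four components; the paper may weight them."""
    parts = [scores.entity_f1, scores.semantic_similarity,
             scores.factual_consistency, scores.completeness]
    return sum(parts) / len(parts)


# Example: high semantic similarity can coexist with low entity accuracy,
# which is exactly the mismatch the paper highlights.
scores = ComponentScores(
    entity_f1=entity_f1({"metformin", "hba1c"}, {"insulin"}),  # -> 0.0
    semantic_similarity=0.91,
    factual_consistency=0.55,
    completeness=0.60,
)
print(f"component-wise score: {aggregate(scores):.3f}")  # 0.515, far below 0.91
```

The point of scoring components separately, rather than collapsing them into one similarity number, is visible in the example output: a semantic-only metric would report 0.91 for an answer that names none of the reference entities.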