CQA-Eval: Designing Reliable Evaluations of Multi-paragraph Clinical QA under Resource Constraints

arXiv cs.CL · March 27, 2026


Key Points

  • The paper proposes CQA-Eval, an evaluation framework and set of recommendations for reliably assessing multi-paragraph clinical QA systems when resources are limited and expert input is scarce.
  • Using physician-annotated examples covering 300 real patient questions answered by clinicians and LLMs, the study compares coarse answer-level evaluation with fine-grained sentence-level evaluation across correctness, relevance, and risk-disclosure dimensions.
  • Results show that inter-annotator agreement depends on both evaluation granularity and dimension: fine-grained annotation improves agreement on correctness, coarse annotation improves agreement on relevance, while risk-disclosure judgments remain inconsistent under both schemes.
  • The authors also find that annotating only a small subset of sentences can achieve reliability comparable to coarse annotations, offering a cost-reduction strategy without substantially sacrificing evaluation consistency.
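The granularity comparison above rests on measuring inter-annotator agreement for each scheme. A minimal sketch of that measurement, using Cohen's kappa (one common IAA statistic; the paper does not specify which agreement metric is used, and the labels and data below are illustrative, not from the study):

```python
# Hedged sketch: comparing inter-annotator agreement (IAA) at two annotation
# granularities with Cohen's kappa. All labels and annotations are hypothetical.
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items where the annotators match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: expected match rate from each annotator's label marginals.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Coarse scheme: one correctness label per whole answer (4 answers).
coarse_1 = ["correct", "partially", "correct", "incorrect"]
coarse_2 = ["correct", "incorrect", "correct", "incorrect"]

# Fine-grained scheme: one correctness label per sentence of the same answers.
fine_1 = ["correct"] * 10 + ["incorrect"] * 2
fine_2 = ["correct"] * 9 + ["incorrect"] * 3

print(round(cohens_kappa(coarse_1, coarse_2), 3))
print(round(cohens_kappa(fine_1, fine_2), 3))
```

In this toy data the sentence-level kappa exceeds the answer-level one, mirroring the paper's finding for the correctness dimension; with real annotations the direction is an empirical question per dimension.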

Abstract

Evaluating multi-paragraph clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise, and achieving consistent human judgments over multi-paragraph text is difficult. We introduce CQA-Eval, an evaluation framework and set of evaluation recommendations for limited-resource, high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and risk disclosure. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse annotation improves agreement on relevance, and judgments on risk disclosure remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotations, reducing cost and effort.