CQA-Eval: Designing Reliable Evaluations of Multi-paragraph Clinical QA under Resource Constraints
arXiv cs.CL / 3/27/2026
Key Points
- The paper proposes CQA-Eval, an evaluation framework and set of recommendations for reliably assessing multi-paragraph clinical QA systems when resources are limited and expert input is scarce.
- Using physician-annotated examples covering 300 real patient questions answered by both clinicians and LLMs, the study compares coarse answer-level evaluation with fine-grained sentence-level evaluation along three dimensions: correctness, relevance, and risk disclosure.
- Results show that inter-annotator agreement depends on both the evaluation granularity and the dimension: fine-grained annotation improves agreement on correctness, coarse annotation improves agreement on relevance, and risk-disclosure judgments remain inconsistent at either granularity.
- The authors also find that annotating only a small subset of sentences can achieve reliability comparable to coarse annotations, offering a cost-reduction strategy without substantially sacrificing evaluation consistency.
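The granularity comparisons above hinge on measuring inter-annotator agreement. A minimal sketch of how such a comparison could be computed with Cohen's kappa; the binary sentence-level labels and the helper function here are illustrative assumptions, not the paper's actual data or metric:

```python
# Illustrative sketch: inter-annotator agreement via Cohen's kappa.
# The label sequences below are made-up binary correctness judgments
# (1 = sentence judged correct), NOT data from the paper.

from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label distribution.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical sentence-level correctness labels from two annotators.
ann1 = [1, 1, 0, 1, 0, 1, 1, 0]
ann2 = [1, 1, 0, 1, 1, 1, 1, 0]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.714
```

The same function would be applied per dimension (correctness, relevance, risk disclosure) and per granularity (answer-level vs. sentence-level), letting agreement scores be compared directly across conditions.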