When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation
arXiv cs.CL / 4/1/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper reports a LoRA domain adaptation of Llama-2-7B on real patient–physician transcripts, aiming to improve clinical dialogue responses while retaining the base model's general knowledge (a minimal adapter-setup sketch follows this list).
- Evaluation runs on two tracks: traditional lexical-overlap metrics (BLEU/ROUGE) and an "LLM-as-a-Judge" approach in which GPT-4 scores semantic quality (see the second sketch below).
- Results show the LoRA model improving substantially on the lexical metrics, while the GPT-4 judge disagrees, slightly favoring the baseline's conversational flow.
- The authors conclude that automatic metrics, whether lexical measures or LLM-based judges, may not reliably reflect clinical utility, underscoring the need for validation by human medical experts.
- The study frames metric disagreement as a safety-critical issue for healthcare LLM deployment and positions expert review as an indispensable final step.
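As a concrete illustration of the adaptation setup described above, here is a minimal LoRA sketch using Hugging Face PEFT. The paper does not disclose its rank, scaling factor, target modules, or data pipeline, so every hyperparameter below is a placeholder, not the authors' configuration:

```python
# Minimal LoRA adapter setup with Hugging Face PEFT (a sketch, not the
# paper's configuration: rank, alpha, and target modules are placeholders).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_id = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; requires HF access
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Low-rank adapters on the attention projections: the frozen base weights
# retain general knowledge while the small adapters absorb clinical phrasing.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # adapter rank (hypothetical)
    lora_alpha=32,                        # scaling factor (hypothetical)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # hypothetical choice
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

Training only the adapters while the base weights stay frozen is what makes the "retain general knowledge" claim plausible: the 7B-parameter backbone is never overwritten.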
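And here is a sketch of the two evaluation tracks: lexical overlap via `sacrebleu` and `rouge_score`, and an LLM judge calling the OpenAI API. The judging prompt and the 1–5 scale are assumptions for illustration; the paper's actual rubric is not reproduced here:

```python
# Two-track evaluation sketch: lexical overlap vs. an LLM judge.
# The judging prompt and 1-5 scale are assumptions, not the paper's rubric.
import sacrebleu
from rouge_score import rouge_scorer
from openai import OpenAI

def lexical_scores(hypothesis: str, reference: str) -> dict:
    """Surface-overlap metrics: these reward shared n-grams, not meaning."""
    bleu = sacrebleu.sentence_bleu(hypothesis, [reference]).score
    rouge = rouge_scorer.RougeScorer(["rougeL"]).score(reference, hypothesis)
    return {"bleu": bleu, "rougeL_f1": rouge["rougeL"].fmeasure}

def judge_score(client: OpenAI, question: str, answer: str) -> str:
    """LLM-as-a-Judge: ask GPT-4 to grade semantic and clinical quality."""
    prompt = (
        "Rate the following clinical response from 1 (poor) to 5 (excellent) "
        "for accuracy, safety, and conversational quality. "
        f"Reply with the number only.\n\nQuestion: {question}\nResponse: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep grading as deterministic as possible
    )
    return resp.choices[0].message.content.strip()
```

Running both tracks over the same held-out transcripts is what surfaces the disagreement the paper reports: n-gram overlap rewards the fine-tuned model's domain phrasing, while the judge weighs conversational qualities that overlap metrics cannot see.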