Leveraging Computerized Adaptive Testing for Cost-effective Evaluation of Large Language Models in Medical Benchmarking
arXiv cs.CL / 2026/3/26
Key Points
- The paper proposes a computerized adaptive testing (CAT) framework based on item response theory (IRT) to evaluate large language models' standardized medical knowledge more efficiently than static benchmarks (a minimal CAT loop is sketched after this list).
- The two-phase methodology first uses Monte Carlo simulation to tune the CAT settings (see the threshold sweep below), then empirically evaluates 38 LLMs with both the full item bank and an adaptive test that stops once reliability reaches a predefined threshold (standard error ≤ 0.3).
- CAT-based proficiency estimates closely match full-bank results, with a near-perfect correlation (r = 0.988) while requiring only about 1.3% of the items (see the comparison sketch below).
- The approach substantially reduces evaluation time (hours to minutes per model), token usage, and computational cost, while maintaining inter-model performance rankings.
- The authors position the method as a psychometrically grounded, low-cost benchmarking and monitoring tool rather than a replacement for real-world clinical validation or safety-focused prospective studies.
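The summary above describes CAT with IRT only at a high level. Below is a minimal sketch of how such a loop typically works, assuming a two-parameter logistic (2PL) item model, maximum-information item selection, expected a posteriori (EAP) ability estimation, and the SE ≤ 0.3 stopping rule mentioned above. The item bank and examinee simulation are synthetic placeholders, not the paper's calibrated data or exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic item bank (placeholders, not the paper's calibrated items):
# each item has a discrimination (disc) and a difficulty (diff) parameter.
N_ITEMS = 1000
disc = rng.uniform(0.5, 2.0, N_ITEMS)
diff = rng.normal(0.0, 1.0, N_ITEMS)


def p_correct(theta, a, b):
    """2PL item response function: probability of answering the item correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


def eap_estimate(items, responses, grid=np.linspace(-4, 4, 161)):
    """EAP ability estimate and its posterior SD (used here as the standard error)."""
    log_post = -0.5 * grid ** 2                      # standard-normal prior, log scale
    for i, x in zip(items, responses):
        p = p_correct(grid, disc[i], diff[i])
        log_post += np.log(p if x else 1.0 - p)
    post = np.exp(log_post - log_post.max())         # stabilise before normalising
    post /= post.sum()
    theta = np.sum(grid * post)
    se = np.sqrt(np.sum((grid - theta) ** 2 * post))
    return theta, se


def run_cat(true_theta, se_target=0.3, max_items=100):
    """Adaptive test: give the most informative unused item until SE <= se_target."""
    items, responses = [], []
    theta, se = 0.0, np.inf
    unused = np.ones(N_ITEMS, dtype=bool)
    while se > se_target and unused.any() and len(items) < max_items:
        # Fisher information of each remaining item at the current ability estimate.
        p = p_correct(theta, disc, diff)
        info = np.where(unused, disc ** 2 * p * (1.0 - p), -np.inf)
        i = int(np.argmax(info))
        unused[i] = False
        # Simulate the examinee's (here: the LLM's) response from its latent proficiency.
        x = int(rng.random() < p_correct(true_theta, disc[i], diff[i]))
        items.append(i)
        responses.append(x)
        theta, se = eap_estimate(items, responses)
    return theta, se, len(items)


theta_hat, se, n_used = run_cat(true_theta=1.2)
print(f"estimated theta = {theta_hat:.2f}, SE = {se:.2f}, items used = {n_used}")
```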
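The paper's first phase tunes CAT settings via Monte Carlo simulation. Continuing the synthetic sketch above (not reproducing the authors' actual configuration), one common way to do this is to sweep candidate stopping thresholds over simulated examinees and inspect the trade-off between test length and recovery error.

```python
# Sweep candidate stopping thresholds over simulated proficiencies and record the
# trade-off between test length and estimation error (reuses run_cat from above).
for se_target in (0.2, 0.3, 0.4):
    abs_err, lengths = [], []
    for t in rng.normal(0.0, 1.0, 200):              # 200 simulated examinees
        theta_hat, _, n = run_cat(t, se_target=se_target)
        abs_err.append(abs(theta_hat - t))
        lengths.append(n)
    print(f"SE <= {se_target}: mean |error| = {np.mean(abs_err):.2f}, "
          f"mean items = {np.mean(lengths):.1f}")
```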
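The reported agreement between full-bank and adaptive scoring (r = 0.988) can be mimicked in the same toy setting: score each simulated "model" on the entire bank and with the adaptive test, then correlate the two sets of estimates. Only the count of 38 models comes from the paper; the correlation and item savings printed here are illustrative, not the study's figures.

```python
from scipy.stats import pearsonr

# Score each simulated "model" with the full bank and with CAT, then correlate.
true_thetas = rng.normal(0.0, 1.0, 38)               # 38 models, as in the study
full_bank, adaptive, items_used = [], [], []
for t in true_thetas:
    # Full bank: every item administered once.
    x_all = (rng.random(N_ITEMS) < p_correct(t, disc, diff)).astype(int)
    theta_full, _ = eap_estimate(range(N_ITEMS), x_all)
    # Adaptive test: reuses the CAT loop from the first sketch.
    theta_cat, _, n = run_cat(t)
    full_bank.append(theta_full)
    adaptive.append(theta_cat)
    items_used.append(n)

r, _ = pearsonr(full_bank, adaptive)
print(f"r = {r:.3f}, mean items used = {np.mean(items_used):.1f} of {N_ITEMS}")
```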



