Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks
arXiv cs.CL / 5/5/2026
📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- Medmarks is a fully open-source LLM benchmark suite for medical tasks that addresses issues like benchmark saturation, restricted data access, and incomplete task coverage by providing 30 benchmarks across multiple medical capabilities.
- The authors systematically evaluate 61 models over 71 configurations using verifiable metrics and LLM-as-a-Judge, including tasks such as question answering, information extraction, medical calculations, and open-ended clinical reasoning.
- Results indicate that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, and GPT-5.2) achieve the best overall performance, while medically fine-tuned models outperform generalist models.
- The study finds that many frontier proprietary models are more token efficient than open-weight alternatives, and it documents notable answer-order bias effects, especially for smaller models and Grok 4.
- A subset of the benchmarks (Medmarks-T) can be used as reinforcement learning environments for post-training LLMs aimed at medical reasoning, with the code released on GitHub.
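The answer-order bias noted above can be probed by re-asking the same multiple-choice question under rotated option orderings and checking whether the model's pick tracks the content or the slot. Below is a minimal sketch of that idea; the rotation scheme, the `model(prompt) -> letter` interface, and the toy model are illustrative assumptions, not Medmarks' actual evaluation protocol:

```python
def permute_options(question, options):
    """Yield (prompt, label->original_index map) for each cyclic rotation of the options."""
    n = len(options)
    labels = [chr(ord("A") + i) for i in range(n)]
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        prompt = question + "\n" + "\n".join(f"{l}. {o}" for l, o in zip(labels, rotated))
        # Map each displayed label back to the index of the original option it shows.
        mapping = {labels[i]: (i + shift) % n for i in range(n)}
        yield prompt, mapping

def order_robust_accuracy(model, question, options, correct_idx):
    """Fraction of option orderings on which the model still picks the correct option."""
    hits, trials = 0, 0
    for prompt, mapping in permute_options(question, options):
        label = model(prompt)  # assumed: model returns a single letter like "A"
        hits += int(mapping.get(label) == correct_idx)
        trials += 1
    return hits / trials

# Toy "model" with maximal position bias: it always answers "A" regardless of content.
always_a = lambda prompt: "A"
rate = order_robust_accuracy(
    always_a,
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin C", "Vitamin D", "Vitamin B12", "Vitamin A"],
    correct_idx=0,
)
print(rate)  # 0.25: the biased model is right only when the correct option rotates into slot A
```

An order-robust model scores the same under every rotation, so a gap between best-case and permutation-averaged accuracy is a direct signature of the bias the paper documents for smaller models and Grok 4.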