Optimization before Evaluation: Evaluation with Unoptimised Prompts Can be Misleading
arXiv cs.AI / 5/1/2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- Current LLM evaluation frameworks typically use a single static prompt template for all models, which can diverge from real-world practice where prompts are optimized per model.
- The paper studies prompt optimization (PO) and finds that it can substantially change the evaluation outcomes and the resulting model rankings.
- Experiments on public academic benchmarks and internal industry benchmarks show that PO has a strong impact on which model appears best.
- The authors conclude that practitioners should optimize prompts separately for each model during evaluation so that comparisons are fair and task-relevant; a minimal sketch of this workflow follows the list.
- Overall, the study warns that evaluating with unoptimized prompts may lead to misleading conclusions about model quality.
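The recommendation is straightforward to operationalize. Below is a minimal Python sketch of the workflow: pick the best prompt template for each model on a held-out dev split, then compare models on the test split, each under its own optimized prompt. The candidate templates, exact-match scoring, and dev/test splits here are illustrative assumptions, not the paper's actual optimization method or benchmarks.

```python
"""Minimal sketch: optimize the prompt separately for each model before comparing them.

All names (the `Model` callables, candidate templates, dev/test splits, exact-match scoring)
are illustrative assumptions, not the paper's actual setup.
"""

from typing import Callable, Dict, List

Example = Dict[str, str]       # e.g. {"question": "...", "answer": "..."}
Model = Callable[[str], str]   # maps a fully formatted prompt to a model response


def score(model: Model, template: str, examples: List[Example]) -> float:
    """Exact-match accuracy of `model` on `examples` under one prompt template."""
    correct = sum(
        model(template.format(question=ex["question"])).strip() == ex["answer"]
        for ex in examples
    )
    return correct / len(examples)


def optimize_prompt(model: Model, candidates: List[str], dev_set: List[Example]) -> str:
    """Per-model prompt optimization: pick the template that scores best on a dev split."""
    return max(candidates, key=lambda template: score(model, template, dev_set))


def evaluate(models: Dict[str, Model], candidates: List[str],
             dev_set: List[Example], test_set: List[Example]) -> Dict[str, float]:
    """Rank models on the test split, each under its own optimized prompt."""
    results = {
        name: score(model, optimize_prompt(model, candidates, dev_set), test_set)
        for name, model in models.items()
    }
    return dict(sorted(results.items(), key=lambda item: item[1], reverse=True))
```

The key design choice is that prompt selection only sees the dev split, so the final test-set comparison stays unbiased while still reflecting each model under its best available prompt rather than a single static template.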