Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3
arXiv cs.CL · March 31, 2026
Key Points
- The paper finds that majority voting across multiple LLM attempts can improve mathematical reasoning, but correlated errors reduce the effective benefit of additional samples.
- It proposes “Diverse Prompt Mixer,” assigning structurally different reasoning strategies to different voters to decorrelate errors, and tests this approach in the AIMO 3 setting.
- Despite testing 3 models across 23+ experiments, evaluated on 50 IMO-level problems under a 5-hour limit on a single H100 80GB, none of the interventions beats the baseline approaches.
- High-temperature sampling alone already decorrelates errors sufficiently; prompt-diversity strategies cost more in per-attempt accuracy than they gain in decorrelation.
- Across a large model capability gap (~17 points) and the inference-time optimization methods tried, raw model capability dominates the outcome by roughly an order of magnitude.
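The correlated-errors point above can be illustrated with a minimal simulation (not from the paper; the accuracy and voter counts below are hypothetical). It compares majority-vote accuracy in two regimes: wrong voters each produce a unique mistake (fully decorrelated) versus all wrong voters agreeing on the same wrong answer (fully correlated).

```python
import random


def voted_accuracy(p_correct: float, n_voters: int, correlated: bool,
                   trials: int = 20000, seed: int = 0) -> float:
    """Estimate the accuracy of majority voting over n_voters attempts.

    Each voter is independently correct with probability p_correct.
    If `correlated`, all wrong voters converge on a single shared wrong
    answer; otherwise every wrong answer is unique. The vote succeeds
    only when the correct answer strictly outnumbers the top wrong one.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        n_correct = sum(rng.random() < p_correct for _ in range(n_voters))
        n_wrong = n_voters - n_correct
        # Count of the most popular wrong answer in each regime.
        top_wrong = n_wrong if correlated else (1 if n_wrong else 0)
        wins += n_correct > top_wrong
    return wins / trials


if __name__ == "__main__":
    indep = voted_accuracy(0.4, 5, correlated=False)
    corr = voted_accuracy(0.4, 5, correlated=True)
    print(f"decorrelated errors: {indep:.3f}")
    print(f"correlated errors:   {corr:.3f}")
```

With decorrelated errors, a 40%-accurate voter pool wins whenever at least two voters agree on the truth, lifting voted accuracy well above the per-attempt rate; with fully correlated errors, the correct answer must outnumber a unified wrong bloc, and much of the benefit of extra samples disappears.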