Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3

arXiv cs.CL / March 31, 2026


Key Points

  • The paper finds that majority voting across multiple LLM attempts can improve mathematical reasoning, but correlated errors shrink the effective sample size, limiting the benefit of additional attempts.
  • It proposes “Diverse Prompt Mixer,” assigning structurally different reasoning strategies to different voters to decorrelate errors, and tests this approach in the AIMO 3 setting.
  • Across 3 models, 23+ experiments, and 50 IMO-level problems under a 5-hour limit on a single H100 80 GB, every intervention fails to beat the baseline.
  • High-temperature sampling already provides enough error decorrelation, while weaker prompt diversity strategies harm per-attempt accuracy more than they reduce correlation.
  • Across a ~17-point model capability gap and every inference-time optimization method tried, raw model capability dominates the outcome by roughly an order of magnitude.
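
The majority-voting setup the paper studies can be sketched minimally. This is an illustration, not the paper's code; the answer strings are hypothetical final answers extracted from independent attempts:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent final answer across multiple attempts."""
    return Counter(answers).most_common(1)[0][0]

# Five hypothetical attempts at one problem. Voting succeeds as long as
# the correct answer outnumbers any single wrong answer, which is why
# correlated errors (many attempts agreeing on the same wrong value)
# are the failure mode that matters.
attempts = ["17", "17", "23", "17", "9"]
print(majority_vote(attempts))  # -> 17
```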

Abstract

Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix: assign structurally different reasoning strategies to different voters to decorrelate errors. We test this Diverse Prompt Mixer in the AIMO~3 competition: 3 models, 23+ experiments, and 50 IMO-level problems on a single H100 80 GB with a 5-hour limit. Every intervention fails. High-temperature sampling already decorrelates errors sufficiently; weaker prompt strategies reduce per-attempt accuracy more than they reduce correlation. Across a 17-point model capability gap and every inference-time optimization we tried, model capability dominates by an order of magnitude.
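
The core tension, correlated wrong answers blunting the benefit of extra samples, can be demonstrated with a toy Monte Carlo simulation. This sketch is not from the paper; `shared_wrong_prob` is an assumed knob modeling how often wrong attempts collapse onto the same wrong answer:

```python
import random
from collections import Counter

def vote_accuracy(p, n_voters, shared_wrong_prob, trials=20000, seed=0):
    """Estimate majority-vote accuracy under correlated errors.

    p: per-attempt probability of producing the correct answer.
    shared_wrong_prob: chance a wrong attempt lands on a single shared
    wrong answer (correlated error) rather than a fresh random one.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        answers = []
        for _ in range(n_voters):
            if rng.random() < p:
                answers.append("correct")
            elif rng.random() < shared_wrong_prob:
                answers.append("shared_wrong")
            else:
                answers.append(f"wrong_{rng.randrange(10**6)}")
        if Counter(answers).most_common(1)[0][0] == "correct":
            wins += 1
    return wins / trials

# With independent errors, 9 voters at 40% per-attempt accuracy vote
# correctly far more often than when wrong answers are highly correlated:
print(vote_accuracy(0.4, 9, 0.0))  # independent wrong answers
print(vote_accuracy(0.4, 9, 0.9))  # correlated wrong answers
```

This mirrors the paper's finding qualitatively: decorrelating errors only helps if it can be done without lowering `p`, and the reported result is that high-temperature sampling already supplies the decorrelation while diverse prompts mostly lowered per-attempt accuracy.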