Do We Need Frontier Models to Verify Mathematical Proofs?
arXiv cs.AI / 4/6/2026
Key Points
- The paper evaluates how well frontier and open-source LLMs verify natural-language mathematical proofs, using verifier accuracy and self-consistency across repeated judgments as the key metrics (a sketch of both metrics follows this list).
- Results show smaller open-source models are close to frontier models in accuracy (within ~10%) but are markedly less consistent across repeated judgments (up to ~25% worse).
- Verifier accuracy is highly sensitive to prompt choice for all models, indicating that “verification” reliability depends not only on model capability but also on elicitation strategy (the second sketch below shows one way to measure this spread).
- The authors find that smaller models possess frontier-level verification capability, but generic judging prompts fail to elicit it reliably.
- An LLM-guided prompt search produces an ensemble of specialized prompts that improves smaller models’ accuracy by up to 9.1% and self-consistency by up to 15.9%, enabling models like Qwen3.5-35B to match frontier models (e.g., Gemini 3.1 Pro) on proof verification; the final sketch below outlines the search-and-ensemble idea.
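
To make the headline metrics concrete, here is a minimal sketch of how verifier accuracy and self-consistency could be computed from repeated judgments. The `judge_fn` callable is a hypothetical stand-in for a single LLM verification call; the paper's exact sampling protocol may differ.

```python
from collections import Counter
from typing import Callable, Sequence

def verifier_metrics(
    judge_fn: Callable[[str], bool],  # hypothetical: one LLM call, True if the proof is judged valid
    proofs: Sequence[str],
    labels: Sequence[bool],           # ground-truth validity of each proof
    n_samples: int = 8,               # independent judgments per proof
) -> tuple[float, float]:
    """Return (verifier accuracy, self-consistency).

    Accuracy: fraction of proofs whose majority verdict matches the label.
    Self-consistency: mean fraction of a proof's verdicts that agree with
    that proof's own majority verdict (1.0 = always the same answer).
    """
    correct, consistency = 0, 0.0
    for proof, label in zip(proofs, labels):
        verdicts = [judge_fn(proof) for _ in range(n_samples)]
        majority, count = Counter(verdicts).most_common(1)[0]
        correct += int(majority == label)
        consistency += count / n_samples
    n = len(proofs)
    return correct / n, consistency / n
```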
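The prompt-sensitivity finding can be quantified the same way: score the same model with several candidate prompts and compare the spread of accuracies. Here `judge_with_prompt` is again a hypothetical (prompt, proof) → verdict call, not an API from the paper.

```python
from typing import Callable, Sequence

def prompt_sensitivity(
    judge_with_prompt: Callable[[str, str], bool],  # hypothetical: (prompt, proof) -> verdict
    prompts: Sequence[str],
    proofs: Sequence[str],
    labels: Sequence[bool],
) -> float:
    """Gap between the best and worst prompt's verifier accuracy."""
    def accuracy(prompt: str) -> float:
        hits = sum(
            judge_with_prompt(prompt, proof) == label
            for proof, label in zip(proofs, labels)
        )
        return hits / len(proofs)

    accs = [accuracy(p) for p in prompts]
    return max(accs) - min(accs)  # large gap = highly prompt-sensitive
```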
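Finally, a hedged sketch of the search-and-ensemble idea: a proposer LLM suggests new verification prompts given the current best ones, candidates are scored on a labeled validation set, and the top performers are combined by majority vote at inference time. `propose_prompts` and the seed prompt are illustrative assumptions, not the paper's actual search procedure.

```python
from typing import Callable, Sequence

def search_prompt_ensemble(
    propose_prompts: Callable[[list[str]], list[str]],  # hypothetical: proposer LLM refines the current best prompts
    judge_with_prompt: Callable[[str, str], bool],      # hypothetical: (prompt, proof) -> the small model's verdict
    val_proofs: Sequence[str],
    val_labels: Sequence[bool],
    rounds: int = 5,
    ensemble_size: int = 3,
) -> list[str]:
    """Greedy LLM-guided search returning the top-scoring verification prompts."""
    def score(prompt: str) -> float:
        hits = sum(
            judge_with_prompt(prompt, proof) == label
            for proof, label in zip(val_proofs, val_labels)
        )
        return hits / len(val_proofs)

    seed = "You are a careful grader. Decide whether the proof below is correct."  # illustrative seed
    pool: dict[str, float] = {seed: score(seed)}
    best = [seed]
    for _ in range(rounds):
        for candidate in propose_prompts(best):
            if candidate not in pool:  # score each new candidate exactly once
                pool[candidate] = score(candidate)
        best = sorted(pool, key=pool.get, reverse=True)[:ensemble_size]
    return best

def ensemble_verdict(
    judge_with_prompt: Callable[[str, str], bool],
    prompts: Sequence[str],
    proof: str,
) -> bool:
    """Majority vote of the specialized prompt ensemble on a single proof."""
    votes = [judge_with_prompt(p, proof) for p in prompts]
    return 2 * sum(votes) > len(votes)
```

An odd `ensemble_size` avoids tied votes, and scoring each candidate only once keeps the validation cost linear in the number of proposed prompts.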