Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning
arXiv cs.CL / 4/22/2026
Key Points
- The paper reports a systematic prompt-engineering study of LLMs on formal mathematical reasoning tasks from the SAIR Equational Theories Stage 1 competition: deciding whether one equational law over magmas implies another (see the sketch after this list).
- Across 40+ prompt variants (0–4,878 bytes) tested over four evaluation splits and three models (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B), the authors find a “single-prompt ceiling” where performance saturates.
- For gpt-oss-120b, accuracy plateaus in an empirical saturation region of roughly 60–79%, compared to a 59.75% no-cheatsheet baseline, despite extensive prompt engineering.
- The study attributes the ceiling to three factors: the undecidability of the implication problem limits what a prompt can encode for the TRUE case; overly complex rule systems can severely hurt weaker models; and prompt ordering can interact with attention in non-monotonic ways.
- The best prompt submission (AN45c, 2,252 bytes) reaches 79.25% accuracy on hard3 with strong TRUE recall (95.9%) and provides a +19.5 percentage-point gain over the baseline, with all prompts and evaluation materials released on GitHub.
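
To make the task concrete: an implication check asks whether every magma (a set with one binary operation) satisfying one equational law must also satisfy another. The Python sketch below is hypothetical, not the paper's harness, and the two example laws are illustrative rather than drawn from the competition. It refutes an implication by brute-force search for a finite counterexample:

```python
from itertools import product

def find_counterexample(n, law_a, law_b):
    """Search every binary operation table on {0, ..., n-1} for a magma
    that satisfies law_a on all triples but violates law_b on some triple.
    Finding one refutes the implication law_a => law_b."""
    elems = range(n)
    triples = list(product(elems, repeat=3))
    # There are n ** (n * n) candidate tables, so keep n very small.
    for flat in product(elems, repeat=n * n):
        op = lambda x, y: flat[x * n + y]
        if all(law_a(op, *t) for t in triples) and \
           not all(law_b(op, *t) for t in triples):
            return flat  # flattened operation table of a counterexample
    return None

# Illustrative laws (not from the paper), each quantified over (x, y, z):
commutative = lambda op, x, y, z: op(x, y) == op(y, x)
associative = lambda op, x, y, z: op(op(x, y), z) == op(x, op(y, z))

# Commutativity does not imply associativity; a 2-element witness exists.
print(find_counterexample(2, commutative, associative))
```

The asymmetry in this search mirrors the paper's first ceiling factor: a finite counterexample settles the FALSE case, but certifying TRUE requires a proof over all magmas, and the general implication problem is undecidable, so no fixed cheatsheet prompt can encode a complete decision procedure.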
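For the metrics quoted in the last key point: accuracy is measured over all items, while TRUE recall is the fraction of gold-TRUE items the model also labels TRUE. A minimal illustration with invented toy numbers, not the paper's data:

```python
def accuracy(preds, labels):
    """Fraction of items where prediction matches the gold label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def true_recall(preds, labels):
    """Of the items whose gold label is TRUE, the fraction predicted TRUE."""
    gold_true = [i for i, y in enumerate(labels) if y]
    return sum(preds[i] for i in gold_true) / len(gold_true)

labels = [True, True, True, False, False]
preds  = [True, True, False, False, True]
print(accuracy(preds, labels), true_recall(preds, labels))  # 0.6 0.666...
```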