Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning

arXiv cs.CL · April 22, 2026


Key Points

  • The paper reports a systematic prompt-engineering study for LLMs on formal mathematical reasoning tasks from the SAIR Equational Theories Stage 1 competition, involving implication checks over magmas.
  • Across 40+ prompt variants (0–4,878 bytes) tested over four evaluation splits and three models (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B), the authors find a “single-prompt ceiling” where performance saturates.
  • For gpt-oss-120b, accuracy plateaus in an empirical saturation region of roughly 60–79%, compared to a 59.75% no-cheatsheet baseline, despite extensive prompt engineering.
  • The study attributes the ceiling to three factors: the undecidability of the TRUE case limits what any finite prompt can encode; overly complex rule systems can severely hurt weaker models; and prompt ordering interacts with model attention in fragile, non-monotonic ways.
  • The best prompt submission (AN45c, 2,252 bytes) reaches 79.25% accuracy on hard3 with strong TRUE recall (95.9%) and provides a +19.5 percentage-point gain over the baseline, with all prompts and evaluation materials released on GitHub.
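The quoted 95% confidence interval on the headline 79.25% figure (n=400) can be sanity-checked with a standard binomial interval. The summary does not say which method the authors used; the sketch below assumes the Wilson score interval, a common choice for proportions, which happens to reproduce the quoted bounds:

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 79.25% accuracy on n=400 items corresponds to 317 correct answers.
lo, hi = wilson_ci(317, 400)
print(f"[{lo:.1%}, {hi:.1%}]")  # → [75.0%, 82.9%]
```

The agreement with the reported [75.0%, 82.9%] is consistent with a Wilson (or similar) interval, though the paper itself would be authoritative on the exact method.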

Abstract

We present a systematic empirical study of prompt engineering for formal mathematical reasoning in the context of the SAIR Equational Theories Stage 1 competition. The task requires deciding whether one equational law implies another over all magmas -- a problem that is undecidable in general but decidable for FALSE via finite model search. Over five weeks, we designed, tested, and analyzed more than 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three language models (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B). Our central finding is a single-prompt ceiling: despite substantial engineering effort, balanced hard accuracy plateaus in an empirical saturation region of approximately 60--79% for gpt-oss-120b, compared to a 59.75% no-cheatsheet baseline. We identify three mechanisms underlying this ceiling: (1) the mathematical undecidability of the TRUE case limits what any finite prompt can encode; (2) complex rule systems decrease performance on weaker models (Llama 3.3 70B collapses to 0% TRUE recall with prompts exceeding 2KB); and (3) prompt ordering effects interact with model attention in fragile, non-monotonic ways. Our best submission (AN45c, 2,252 bytes) achieves 79.25% accuracy on hard3 (n=400; 95% CI: [75.0%, 82.9%]), with TRUE recall of 95.9% and FALSE recall of 63.4%, representing a +19.5 percentage-point improvement over the no-cheatsheet baseline (59.75%). We release all prompt variants, evaluation scripts, and results at https://github.com/israelcazares/sair-prompt-engineering
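The abstract's point that the problem is "undecidable in general but decidable for FALSE via finite model search" can be illustrated with a toy example: to refute an implication between equational laws, it suffices to exhibit one finite magma (a set with a binary operation) that satisfies the premise but violates the conclusion. The sketch below is not the paper's released code; the laws (commutativity, associativity) and function names are chosen purely for illustration:

```python
# Brute-force finite model search: enumerate all multiplication tables
# of small magmas and look for one that satisfies the premise law but
# violates the conclusion law, thereby refuting the implication.
from itertools import product

def is_commutative(op, n):
    return all(op[x][y] == op[y][x] for x in range(n) for y in range(n))

def is_associative(op, n):
    return all(op[op[x][y]][z] == op[x][op[y][z]]
               for x in range(n) for y in range(n) for z in range(n))

def find_counterexample(max_size=3):
    """Search magmas of up to max_size elements for one that is
    commutative but not associative, refuting the implication
    'commutativity => associativity'."""
    for n in range(1, max_size + 1):
        # Each length-n*n tuple encodes one n x n multiplication table.
        for table in product(range(n), repeat=n * n):
            op = [list(table[i * n:(i + 1) * n]) for i in range(n)]
            if is_commutative(op, n) and not is_associative(op, n):
                return n, op
    return None

result = find_counterexample()
print(result)  # a 2-element commutative, non-associative magma exists
```

Such a search terminates whenever a finite counterexample exists, which is why FALSE verdicts are decidable; no analogous finite certificate exists for TRUE, which is the undecidability barrier the paper argues no finite prompt can overcome.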