Brevity Constraints Reverse Performance Hierarchies in Language Models

arXiv cs.AI / 4/2/2026


Key Points

  • Standard benchmark evaluations show a counterintuitive result: on 7.7% of problems, larger language models underperform smaller ones by 28.4 percentage points, despite having 10–100x more parameters.
  • The study attributes this to spontaneous, scale-dependent verbosity that increases errors via overelaboration, rather than to inherent limitations of large models.
  • Causal interventions indicate the issue is correctable through prompt design: adding brevity constraints improves large-model accuracy by 26 percentage points and reduces performance gaps by up to two-thirds.
  • Under brevity constraints, performance hierarchies reverse on mathematical reasoning and scientific knowledge benchmarks, giving large models a 7.7–15.9 percentage point advantage over smaller models.
  • The authors find inverse scaling is continuous across the full parameter range (0.5B–405B) and emphasize deployment impact: scale-aware prompt adaptation can improve accuracy while lowering computational costs.
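The scale-aware prompt adaptation described above can be sketched as a simple routing rule: append a brevity constraint only when the target model is large. This is a minimal illustrative sketch; the 7B threshold, the function name, and the exact constraint wording are assumptions for demonstration, not the authors' protocol.

```python
def adapt_prompt(question: str, model_params_b: float, threshold_b: float = 7.0) -> str:
    """Return the prompt to send to a model, adding a brevity constraint
    for models at or above `threshold_b` billion parameters.

    The threshold and wording here are hypothetical; the paper reports only
    that constraining large models to brief responses improves accuracy.
    """
    if model_params_b >= threshold_b:
        # Large model: suppress scale-dependent verbosity with an explicit constraint.
        return f"{question}\n\nAnswer concisely in one or two sentences."
    # Small model: leave the prompt unchanged.
    return question

# Small models get the plain prompt; large models get the brevity constraint.
print(adapt_prompt("What is the boiling point of water at sea level?", model_params_b=0.5))
print(adapt_prompt("What is the boiling point of water at sea level?", model_params_b=405.0))
```

Because the constraint is added at prompt time, this kind of adaptation requires no model changes and, per the paper's deployment point, shorter responses also reduce inference cost.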

Abstract

Standard evaluation protocols reveal a counterintuitive phenomenon: on 7.7% of benchmark problems spanning five datasets, larger language models underperform smaller ones by 28.4 percentage points despite 10-100x more parameters. Through systematic evaluation of 31 models (0.5B-405B parameters) across 1,485 problems, we identify the mechanism as spontaneous scale-dependent verbosity that introduces errors through overelaboration. Causal intervention experiments demonstrate this reflects correctable prompt design rather than fundamental capability limitations. Constraining large models to produce brief responses improves accuracy by 26 percentage points and reduces performance gaps by up to two-thirds. Most critically, brevity constraints completely reverse performance hierarchies on mathematical reasoning and scientific knowledge benchmarks, with large models achieving 7.7-15.9 percentage point advantages over small models -- direct inversions of the original gaps. These reversals prove large models possess superior latent capabilities that universal prompting masks. We validate findings through three independent contamination tests and demonstrate inverse scaling operates continuously across the full parameter spectrum, with dataset-specific optimal scales ranging from 0.5B to 3.0B parameters. Our results establish that maximizing large model performance requires scale-aware prompt engineering rather than universal evaluation protocols, with immediate implications for deployment: prompt adaptation simultaneously improves accuracy and reduces computational costs.