[R] GPT-5.4-mini regressed 22pp on vanilla prompting vs GPT-5-mini. Nobody noticed because benchmarks don't test this. Recursive Language Models solved it.

Reddit r/MachineLearning / 3/29/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The post reports that GPT-5.4-mini shows a large regression in “vanilla” prompting accuracy, dropping from 69.5% to 47.2% across 12 tasks, and that standard benchmarks may miss this behavior.
  • It claims the recursive language models (RLM) approach fixes the issue by forcing the model to compute via structured steps (e.g., Python-based querying) rather than making bare guesses with minimal reasoning.
  • The author compares three variants: vanilla prompting, the official RLM implementation, and their “minRLM” implementation, with the latter regaining much of the lost accuracy.
  • The approach is described as more efficient (5.1× fewer tokens and 3.2× lower cost than the official RLM) and purportedly compatible with every model.
  • A related example is AIME 2025, where vanilla prompting reportedly scores 0% while the REPL/RLM-style setup scores 80%, with lower latency as well.

GPT-5.4-mini produces shorter, terser outputs by default. Vanilla accuracy dropped from 69.5% to 47.2% across 12 tasks (1,800 evals). The official RLM implementation dropped too (69.7% to 50.2%). Our implementation - where the model writes Python to query the data instead of attending to the full context and pattern-matching on the task - went from 72.7% to 69.5%. The architecture absorbed what the model couldn't.

Also: on AIME 2025, vanilla scores 0% versus 80% with our setup. Same pattern as GPT-5.2: the model outputs a bare guess with no reasoning, while the REPL forces it to compute via code - reducing latency while increasing accuracy.
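For anyone wondering what "the REPL forces it to compute via code" means mechanically, here's a minimal sketch of the loop (not the actual minRLM code - `call_model` is a hypothetical stand-in for a real LLM API call, and the hardcoded snippet simulates what a model might return):

```python
# RLM-style loop sketch: instead of stuffing the full context into the
# prompt and hoping the model attends to the right parts, ask it to emit
# a small Python snippet that queries the data, then execute that snippet.

def call_model(prompt: str) -> str:
    # Hypothetical stand-in: a real model would return code here.
    # We hardcode a snippet that counts lines mentioning "error".
    return "result = sum(1 for line in data.splitlines() if 'error' in line)"

def rlm_query(question: str, data: str) -> object:
    """Ask the model for code, run it against `data`, return `result`."""
    code = call_model(
        f"Write Python that answers: {question}\n"
        f"The variable `data` holds the document."
    )
    scope = {"data": data}
    exec(code, scope)  # the REPL step: the answer is computed, not guessed
    return scope["result"]

doc = "ok\nerror: disk full\nok\nerror: timeout\n"
print(rlm_query("How many error lines?", doc))  # → 2
```

The point is that the model's output is a program over the data, so a terse model still produces a correct answer as long as the short snippet it writes is correct.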

It uses 5.1x fewer tokens than the official RLM and is 3.2x cheaper. It works with every model.

https://github.com/avilum/minrlm

submitted by /u/cov_id19