[R] GPT-5.4-mini regressed 22pp on vanilla prompting vs GPT-5-mini. Nobody noticed because benchmarks don't test this. Recursive Language Models solved it.

Reddit r/MachineLearning / 3/29/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The post reports that GPT-5.4-mini shows a large regression in “vanilla” prompting accuracy, dropping from 69.5% to 47.2% across 12 tasks, and that standard benchmarks may miss this behavior.
  • It claims the recursive language models (RLM) approach fixes the issue by forcing the model to compute via structured steps (e.g., Python-based querying) rather than making bare guesses with minimal reasoning.
  • The author compares three variants: vanilla prompting, the official RLM implementation, and their “minRLM” implementation, with the latter regaining much of the lost accuracy.
  • The approach is described as more efficient (5.1× fewer tokens and 3.2× lower cost than the official RLM) and purportedly compatible with every model.
  • A related example is AIME 2025, where vanilla prompting reportedly scores 0% while the REPL/RLM-style setup scores 80%, with lower latency as well.

GPT-5.4-mini produces shorter, terser outputs by default. Vanilla accuracy dropped from 69.5% to 47.2% across 12 tasks (1,800 evals). The official RLM implementation dropped too (69.7% to 50.2%). Our implementation - where the model writes Python to query the data instead of attending to the full context and pattern-matching on the task - went from 72.7% to 69.5%. The architecture absorbed what the model couldn't.

Also: on AIME 2025, vanilla scores 0% versus 80% with our setup. Same pattern as GPT-5.2: the model outputs a bare guess with no reasoning, while the REPL forces it to compute via code - reducing latency while increasing accuracy.
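For anyone wondering what "the REPL forces it to compute via code" means mechanically, here's a minimal sketch of the loop (not the actual minRLM code - `call_model` is a hypothetical stand-in for a real LLM API call, and the hardcoded snippet simulates what a model might return):

```python
# RLM-style loop sketch: instead of stuffing the full context into the
# prompt and hoping the model attends to the right parts, ask it to emit
# a small Python snippet that queries the data, then execute that snippet.

def call_model(prompt: str) -> str:
    # Hypothetical stand-in: a real model would return code here.
    # We hardcode a snippet that counts lines mentioning "error".
    return "result = sum(1 for line in data.splitlines() if 'error' in line)"

def rlm_query(question: str, data: str) -> object:
    """Ask the model for code, run it against `data`, return `result`."""
    code = call_model(
        f"Write Python that answers: {question}\n"
        f"The variable `data` holds the document."
    )
    scope = {"data": data}
    exec(code, scope)  # the REPL step: the answer is computed, not guessed
    return scope["result"]

doc = "ok\nerror: disk full\nok\nerror: timeout\n"
print(rlm_query("How many error lines?", doc))  # → 2
```

The point is that the model's output is a program over the data, so a terse model still produces a correct answer as long as the short snippet it writes is correct.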

It uses 5.1x fewer tokens than the official RLM and is 3.2x cheaper. It works with every model.

https://github.com/avilum/minrlm

submitted by /u/cov_id19