GPT-5.4-mini produces shorter, terser outputs by default. Vanilla accuracy dropped from 69.5% to 47.2% across 12 tasks (1,800 evals). The official RLM implementation dropped too (69.7% to 50.2%). Our implementation - where the model writes Python to query the data instead of attending to all of it, combined with task pattern matching and entropy - only went from 72.7% to 69.5%. The architecture absorbed what the model couldn't.
Also: on AIME 2025 it scores 80% vs 0% for vanilla, the same pattern as with GPT-5.2. Vanilla outputs a bare guess with no reasoning; the REPL forces the model to compute the answer via code, reducing latency while increasing accuracy.
It uses 5.1x fewer tokens than the official RLM and is 3.2x cheaper. It works with every model.
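The query-instead-of-attend idea can be sketched minimally: the model never sees the raw data in its context, only a schema, and the harness executes whatever Python it emits against that data. Everything below is a hypothetical illustration, not the post's actual implementation; `run_model_code`, the record layout, and the hard-coded model output are all stand-ins (a real system would obtain the code from an LLM call and sandbox the `exec`).

```python
import io
import contextlib

def run_model_code(code: str, namespace: dict) -> str:
    """Execute model-written Python against the data, capturing stdout.

    NOTE: bare exec() is NOT a sandbox; a real harness would isolate this.
    """
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, namespace)
    return buf.getvalue().strip()

# Hypothetical dataset: too large to paste into the prompt in a real setting.
records = [
    {"task": "aime", "correct": True},
    {"task": "aime", "correct": False},
    {"task": "gsm", "correct": True},
]

# Stand-in for code the model would generate after seeing only the schema.
model_code = "print(sum(r['correct'] for r in records if r['task'] == 'aime'))"

answer = run_model_code(model_code, {"records": records})
print(answer)  # -> 1
```

The point of the pattern: the context holds only the short generated program and its short output, so token usage stays flat no matter how large `records` grows.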



