The idea came from noticing how hard it is to tell what's actually going on when a model "solves" a coding problem. Is it reasoning through the problem, or is it pattern matching against the enormous amount of Python and JavaScript it saw during training? The scary answer is that on standard benchmarks you genuinely cannot tell.

To separate the two we used esoteric programming languages: Brainfuck, Befunge-98, Whitespace, Unlambda, Shakespeare. Same algorithmic problems as HumanEval across the same difficulty range, just in languages with almost zero training data. No rational pretraining pipeline would bother including Whitespace because there's no deployment value and it would probably hurt performance on mainstream tasks. There's nothing to game here.

We tested GPT-5.2, O4-mini, Gemini 3 Pro, Qwen3-235B, and Kimi K2 with five prompting strategies including self-scaffolding, coder-critic pairs, and a ReAct pipeline. The best single result was 11.2% on Befunge-98 with self-scaffolding, and Medium/Hard/Extra-Hard stayed at 0% across literally everything: every model, every language, every strategy. Few-shot gave +0.8 percentage points on average, which is statistically indistinguishable from noise. Agentic systems (Claude Code, Codex) did 2-3x better than non-agentic approaches, but mostly from sharper feedback loops and context management rather than anything that looks like actual reasoning transfer.

The error breakdown is what I find most interesting. On Brainfuck, where there's some online presence, models produce valid syntax but fail on logic. On Whitespace, where there's almost nothing, models can't even produce valid programs. The gap between some pretraining and basically none is really visible in the failure modes.

This community spends a lot of time debating benchmark numbers, and I think the honest takeaway from this work is that we need more evaluations where high scores are actually hard to fake. Not harder problems in Python, but evaluations where the economic incentive to game simply doesn't exist, where the only route to good performance is the model genuinely learning to generalize. EsoLang-Bench is our attempt at that template, but we'd love to see others build on the idea, whether through new languages, new problem types, or entirely different OOD domains.

Website: https://esolang-bench.vercel.app/
Paper: https://arxiv.org/abs/2603.09678
We made a coding benchmark that's actually hard to fake. Best result across GPT-5.2, O4-mini, Gemini, Qwen, Kimi with every prompting trick we could think of: 11%.
Reddit r/LocalLLaMA / 3/16/2026
Key Points
- The article presents EsoLang-Bench, a benchmark using esoteric languages to distinguish genuine reasoning from pattern matching in AI models.
- It reports results across GPT-5.2, O4-mini, Gemini 3 Pro, Qwen3-235B, and Kimi K2: the best single score was 11.2% on Befunge-98 using self-scaffolding, and Medium, Hard, and Extra-Hard problems stayed at 0% for every model, language, and strategy.
- It notes that agentic systems (Claude Code, Codex) do 2-3x better than non-agentic approaches, mostly because of sharper feedback loops and context management rather than true reasoning transfer.
- It argues for more evaluations where high scores are hard to fake and presents EsoLang-Bench as a template to minimize gaming incentives while testing genuine generalization.
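
The error breakdown in the post (valid syntax but wrong logic on Brainfuck, invalid programs on Whitespace) implies a grader that separates failure modes rather than reporting a single pass/fail bit. What follows is a minimal sketch of that idea for Brainfuck only, not the benchmark's actual harness. The names `Outcome`, `run_brainfuck`, and `grade`, the step budget, and the tape length are all assumptions made for illustration. Since Brainfuck treats every non-command character as a comment, the only syntax error this sketch detects is unbalanced brackets.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical failure categories, loosely following the post's breakdown."""
    PASS = "pass"
    SYNTAX_ERROR = "syntax_error"    # program is not valid Brainfuck
    RUNTIME_ERROR = "runtime_error"  # step budget exhausted or pointer off the tape
    LOGIC_ERROR = "logic_error"      # runs cleanly but prints the wrong answer


def match_brackets(code):
    """Return a jump table for [ and ], or None if brackets are unbalanced."""
    stack, jumps = [], {}
    for i, ch in enumerate(code):
        if ch == "[":
            stack.append(i)
        elif ch == "]":
            if not stack:
                return None
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    return jumps if not stack else None


def run_brainfuck(code, stdin="", max_steps=100_000, tape_len=30_000):
    """Run a program and return (output, error); error is None on a clean exit."""
    jumps = match_brackets(code)
    if jumps is None:
        return "", Outcome.SYNTAX_ERROR
    tape = [0] * tape_len
    ptr = pc = steps = read_pos = 0
    out = []
    while pc < len(code):
        steps += 1
        if steps > max_steps:  # assumed budget, guards against infinite loops
            return "".join(out), Outcome.RUNTIME_ERROR
        ch = code[pc]
        if ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            # Assumed convention: reading past the end of input stores 0.
            tape[ptr] = ord(stdin[read_pos]) % 256 if read_pos < len(stdin) else 0
            read_pos += 1
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # jump back to the matching [
        if not 0 <= ptr < tape_len:
            return "".join(out), Outcome.RUNTIME_ERROR
        pc += 1
    return "".join(out), None


def grade(code, stdin, expected):
    """Classify one model submission against one test case."""
    output, error = run_brainfuck(code, stdin)
    if error is not None:
        return error
    return Outcome.PASS if output == expected else Outcome.LOGIC_ERROR
```

For example, `grade(",.", "A", "A")` returns `Outcome.PASS`, while `grade("[", "", "")` returns `Outcome.SYNTAX_ERROR`. A language like Whitespace would need its own validity check, because a program there can be malformed in many more ways than bracket imbalance.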