In the past year you may have encountered the following prompt:
If you give this prompt to an LLM right now, you will probably still receive “The mother” as an answer, even though the text explicitly states that the surgeon is the boy’s father. This is most likely because the prompt is an alteration of a very common “riddle”, to which the answer is, in fact, the mother:
Working on this failure mode, I initially decided to create a small dataset of altered riddles that could make LLMs answer incorrectly. That was last year, and I shelved it after the initial release, but I recently decided to pick it up again and turn the original dataset idea into an actual benchmark!

So, this is Altered Riddles: a benchmark in which LLMs have to answer altered versions of common riddles, and in which they are penalised for giving an answer that was correct for the original riddle but is definitely wrong for the altered one.

Because of compute and money constraints I have not been able to test many models yet (all proprietary models are missing), but if the project gains enough traction I may invest more time in refining everything and more money in testing pricey models. I am open to suggestions and discussions, so feel free to comment here or to contact me! You can find the benchmark, with more details and a more complete model analysis, here: [link] [comments]
[Benchmark] Altered Riddles: Can LLMs ignore what they've memorised?
Reddit r/LocalLLaMA / 4/6/2026
Key Points
- The article introduces “Altered Riddles,” a new LLM benchmark that tests whether models can ignore an answer pattern learned from a common riddle when the prompt is subtly altered.
- It highlights a failure mode where LLMs may return the original riddle’s solution (e.g., “The mother”) even when the altered text explicitly changes the relationship.
- The benchmark penalizes responses that would be correct for the original riddle but are definitively wrong for the altered version.
- Due to compute and budget constraints, the author has tested only a limited set of models so far, notably omitting all proprietary models, and invites community suggestions.
- The benchmark materials are published via a Hugging Face dataset with a leaderboard, plus a dedicated benchmark page and GitHub repository for further details and analysis.
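The penalised scoring described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual implementation: the function name, the substring-matching check, and the exact score values (+1 / −1 / 0) are all my assumptions.

```python
def score_answer(model_answer: str, altered_answer: str, original_answer: str) -> int:
    """Score a model's reply to an altered riddle.

    +1 if the reply contains the altered riddle's correct answer,
    -1 if it instead falls back on the memorised answer to the
       original riddle (the failure mode the benchmark targets),
     0 otherwise (wrong, but not a memorisation failure).
    """
    reply = model_answer.strip().lower()
    if altered_answer.lower() in reply:
        return 1
    if original_answer.lower() in reply:
        return -1
    return 0

# Example with the surgeon-riddle variant from the post:
print(score_answer("The father", "the father", "the mother"))            # 1
print(score_answer("The mother, of course.", "the father", "the mother"))  # -1
```

The key design point is the asymmetry: a reply that reproduces the original riddle's answer scores worse than an unrelated wrong answer, so a model cannot do well by pattern-matching on memorised riddles.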