In the past year you may have encountered the following prompt:
If you give this prompt to an LLM right now, you will probably still receive “The mother” as an answer, even though the text explicitly states that the surgeon is the boy’s father. This is most likely because the prompt is an alteration of a very common “riddle”, to which the answer is, in fact, the mother:
Working on this failure mode, I initially decided to create a small dataset of altered riddles that could make LLMs answer incorrectly. That was last year, and I shelved it after the initial release, but I recently decided to pick it up again and turn the original dataset idea into an actual benchmark!

So, this is Altered Riddles: a benchmark in which LLMs have to answer altered versions of common riddles, and in which they are penalised for giving an answer that was correct for the original riddle but is definitely wrong for the altered one.

Because of compute and money constraints I have not been able to test many models yet (all proprietary models are missing), but if the project gains enough traction I may invest more time in refining everything and more money in testing pricey models. I am open to suggestions and discussions, so feel free to comment here or to contact me! You can find the benchmark, with more details and a more complete model analysis, here: [link] [comments]
[Benchmark] Altered Riddles: Can LLMs ignore what they've memorised?
Reddit r/LocalLLaMA / 4/6/2026
Key Points
- The article introduces “Altered Riddles,” a new LLM benchmark that tests whether models can ignore an answer pattern learned from a common riddle when the prompt is subtly altered.
- It highlights a failure mode where LLMs may return the original riddle’s solution (e.g., “The mother”) even when the altered text explicitly changes the relationship.
- The benchmark penalizes responses that would be correct for the original riddle but are definitively wrong for the altered version.
- Due to compute and budget constraints, the author has tested only a limited set of models so far, notably omitting all proprietary models, and invites community suggestions.
- The benchmark materials are published via a Hugging Face dataset with a leaderboard, plus a dedicated benchmark page and GitHub repository for further details and analysis.
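The penalised scoring described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual implementation: the function name, the substring-matching check, and the exact score values (+1 / −1 / 0) are all my assumptions.

```python
def score_answer(model_answer: str, altered_answer: str, original_answer: str) -> int:
    """Score a model's reply to an altered riddle.

    +1 if the reply contains the altered riddle's correct answer,
    -1 if it instead falls back on the memorised answer to the
       original riddle (the failure mode the benchmark targets),
     0 otherwise (wrong, but not a memorisation failure).
    """
    reply = model_answer.strip().lower()
    if altered_answer.lower() in reply:
        return 1
    if original_answer.lower() in reply:
        return -1
    return 0

# Example with the surgeon-riddle variant from the post:
print(score_answer("The father", "the father", "the mother"))            # 1
print(score_answer("The mother, of course.", "the father", "the mother"))  # -1
```

The key design point is the asymmetry: a reply that reproduces the original riddle's answer scores worse than an unrelated wrong answer, so a model cannot do well by pattern-matching on memorised riddles.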