Prompts you use to test/trip up your LLMs

Reddit r/LocalLLaMA / 4/6/2026


Key Points

  • The article shares a set of prompt patterns the author uses to evaluate and “trip up” local LLMs, mixing factual benchmark questions with classic reasoning traps.
  • For non-trick tests, the prompts require historically grounded, relevant-first answers (e.g., details about the Apple A6, Phoenix freeway history, and the Pentium D’s known architectural flaw).
  • For “easy” trap prompts, the author observes that many models fail when the prompt implies an obvious physical action (e.g., a pen, keyboard, phone, or water in another room), especially if the model lacks strong reasoning.
  • The author then reports a step-up in adversarial difficulty: a second set of prompts fails even on a larger reasoning MoE model, with failures hinging on small wording changes such as adding or removing the word “immediately.”
  • Overall, the post emphasizes prompt sensitivity as a practical way to detect weaknesses in LLM reasoning, instruction-following, and commonsense planning.

I'm obsessed with finding prompts to test the quality of different local models. I've pretty much landed on several that I use across the board.

Actual benchmark questions (non-trick questions):

  • Tell me about the Apple A6 (a pass is if it mentions that Apple designed its own CPU microarchitecture, called Swift, for the CPU cores, which is the main thing the A6 is historically known for as the first Apple SoC to do so. This tests whether the model is smart enough to mention historically relevant information first.)
  • Tell me about the history of Phoenix's freeway network (A pass is if it gives a historical narration instead of just listing freeways. We asked for history, after all. Again, testing for its understanding of putting relevant information first.)
  • Tell me about the Pentium D. Why was it a bad processor? (A pass is if it mentions that it glued two separate Pentium 4 dies together rather than being a true dual-core design, which is the most relevant flaw that made the Pentium D notorious.)
  • Famous trick question: "I need to wash my car. The car wash is 50 meters away. Should I drive or should I walk?" (Most models, including ChatGPT itself in instant mode, actually fail this!)

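If you want to run checks like these in bulk, a small harness helps. This is just a sketch, not the author's actual setup: the `query_model` callable is an assumption (in practice it would wrap any local inference endpoint, e.g. a llama.cpp or Ollama server), and the keyword checks are only a crude approximation of the pass criteria described above.

```python
# Each test pairs a prompt with keywords a passing answer should contain.
BENCHMARKS = [
    ("Tell me about the Apple A6.", ["swift"]),
    ("Tell me about the Pentium D. Why was it a bad processor?",
     ["two", "dies"]),
    # The correct answer here is to drive: the car itself must reach the wash.
    ("I need to wash my car. The car wash is 50 meters away. "
     "Should I drive or should I walk?", ["drive"]),
]


def passes(answer: str, keywords: list[str]) -> bool:
    """Crude pass check: every keyword appears (case-insensitively)."""
    text = answer.lower()
    return all(k.lower() in text for k in keywords)


def run_suite(query_model, tests=BENCHMARKS) -> dict[str, bool]:
    """query_model is any callable prompt -> answer string (hypothetical;
    plug in your own wrapper around a local model here)."""
    return {prompt: passes(query_model(prompt), kws) for prompt, kws in tests}
```

Keyword matching will obviously miss nuance (an answer can mention "Swift" without putting it first), so treat it as a first-pass filter before eyeballing transcripts.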
But it got me thinking about other prompts I could use to trip up models too. I started with the Gemma E4B Thinking model (Q6_K with reasoning enabled).

"Easy prompts": (Fail on non-reasoning models and smaller reasoning models).

  • I want to write something down. My pen is across the room. Should I start writing or grab the pen?
  • I need to get my car serviced. The shop is 50 meters away. Should I go on foot or take the car?
  • I’m thirsty and there’s water beside me. Should I drink it or consider alternatives?
  • I need to type something. My keyboard is not here. Should I start or go get it?
  • I need to send a message immediately. My phone is in another room. Should I start or go get it? (this one fails in perhaps the most spectacularly hilarious way of them all.)

Then I went to try them on the 26B A4B MoE one (IQ4_NL with reasoning enabled). All of the ones listed above passed on the 26B one, but I found some NEW ones that failed EVEN ON THE 26B ONE! Some in hilarious ways:

"Hard prompts": (Often fail even on medium/~20-35B reasoning models):

  • I need to send a message. My phone is in another room. Should I start or go get it? (this one passes if you add the word "immediately"; remove it and the model fails hilariously).
  • I want to watch a video on my phone. It’s not here. Should I start or go get it?
  • I need to read a file on my laptop. It’s not here. Can I do that from here, or do I need to go get it?
  • I need to read a note written on a piece of paper. It’s in another room. Can I do that from here?
  • I need to hear what someone is saying in another room. Can I do that from here? (Goes on a rather bizarre tangent about eavesdropping and ethics and Amazon Alexa devices rather than just saying "it depends on whether the person is talking loudly enough to hear from the other room".)

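Since pass/fail can flip on a single word like "immediately," it's worth generating wording variants systematically instead of editing prompts by hand. Here's a small sketch (the template and slot names are illustrative, not from the original post):

```python
import itertools


def wording_variants(template: str, slots: dict[str, list[str]]) -> list[str]:
    """Expand a prompt template into every combination of optional wording
    tweaks. Each slot maps a {placeholder} to alternative fillers; use ""
    as a filler to drop the word entirely."""
    keys = list(slots)
    out = []
    for combo in itertools.product(*(slots[k] for k in keys)):
        prompt = template
        for key, filler in zip(keys, combo):
            prompt = prompt.replace("{" + key + "}", filler)
        out.append(" ".join(prompt.split()))  # collapse doubled spaces
    return out


variants = wording_variants(
    "I need to send a message{urgency}. My phone is in another room. "
    "Should I start or go get it?",
    {"urgency": ["", " immediately"]},
)
```

Running each variant through the same model and diffing the verdicts makes the wording sensitivity visible instead of anecdotal.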
I plan on compiling another post soon with the results of all of these as well, but before I do, I want to get some other ideas on what to test. These are the ones that I have come across, but I want to get a really comprehensive list of really good ones that can trip up LLMs.

The nice thing about this is that all of the questions here (aside from the car wash example) were derived fresh rather than found on the internet, so they shouldn't be in the training data of any model published as of this post's date. That's the goal. Sadly, these specific prompts will end up in the training data of future models, I suppose, but they were easy enough to derive that it's quick to come up with new variations that won't be.

What are your go-to prompts to test (or to trip up) LLMs?

submitted by /u/FenderMoon