I ran the car wash test 360 times (12 models × 6 conversation versions × 5 samples each) and scored each model on whether it caught that the trip has to be made by car (any "it depends" answer counted as negative).
Yes, both the "overweight" and the "tell her/him" parts are worded slightly offensively. And most models focused on that instead of on getting the car washed. Most models were convinced it doesn't make sense to drive 50 meters and focused on engine wear or the positive aspects of walking. Some considered having to carry heavy items (I don't know any car wash where I have to bring the buckets of water myself...), a lack of sidewalks, or time constraints.
Metric Insights:
I excluded Bonsai 8B, Nemotron Nano IQ4, Gemma 4 E2B and Gemma 4 E4B from the graphs because they all scored 0, and Nemotron Nano Q8 because it scored 0.07 (2 out of 30).
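The evaluation described above (12 models × 6 conversation versions × 5 samples, with "it depends" counted as a failure) can be sketched roughly as follows. The model and version names, the `run_model` stub, and the keyword-based `is_positive` classifier are hypothetical placeholders, not the author's actual harness:

```python
MODELS = ["model_a", "model_b"]   # placeholders; the post tested 12 models
VERSIONS = ["v1", "v2"]           # placeholders; the post used 6 conversation variants
SAMPLES = 5                       # samples per (model, version) pair

def run_model(model, version):
    """Hypothetical stub: send the conversation variant to the model, return its reply."""
    raise NotImplementedError

def is_positive(reply):
    """Score 1 only if the reply clearly says to take the car.
    Per the post's rule, any hedged 'it depends' answer counts as negative."""
    text = reply.lower()
    if "it depends" in text:
        return 0
    return int("drive" in text or "take the car" in text)

def score_models(replies):
    """replies: dict mapping (model, version, sample_index) -> reply text.
    Returns each model's pass rate over all versions and samples."""
    scores = {}
    for model in MODELS:
        results = [is_positive(replies[(model, v, i)])
                   for v in VERSIONS for i in range(SAMPLES)]
        scores[model] = sum(results) / len(results)
    return scores
```

With the real 12 × 6 × 5 grid this yields the 360 samples and per-model pass rates reported in the post (e.g. a model with 2 passes out of 30 scores 0.07).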
360 Car Wash Samples, 12 Models, 6 Versions: If your wife is overweight, she has to walk
Reddit r/LocalLLaMA / 4/11/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The article reports a "car wash" prompt test in which a model's ability to recognize that a 50 m trip to the car wash requires taking the car was evaluated across 360 runs using 12 models and 6 conversation variants.
- The results show that many models over-focus on the offensive wording about a partner’s weight (“overweight”) and respond with relationship/behavior guidance rather than directly advising whether to drive or walk.
- When the partner’s needs and autonomy are framed differently (e.g., offering dinner or asking for help), some models shift toward negotiation and reciprocity instead of issuing commands.
- When the prompt explicitly mentions “overweight,” models tend to steer toward moral/relational framing and compliance (e.g., “respect,” “don’t mention appearance”), sometimes recommending walking through autonomy-preserving language.
- Overall, the post suggests that prompt phrasing strongly influences whether LLMs focus on practical logistics versus social/ethical interpretation, and that “it depends” was treated as a negative outcome.