AI Navigate

Prompt Complexity Dilutes Structured Reasoning: A Follow-Up Study on the Car Wash Problem

arXiv cs.AI / 3/17/2026


Key Points

  • The study tests STAR reasoning inside a 60+ line production prompt on Claude Sonnet 4.6 and reports STAR reaches 100% in isolation but drops to 0-30% in complex prompts.
  • The authors attribute the drop to directives that force a conclusion-first output, reversing the intended reasoning order that gives STAR its effectiveness.
  • In one instance, the model output "Short answer: Walk." before applying STAR; the STAR reasoning that followed correctly identified the constraint, illustrating that the model can reason correctly but is steered toward the wrong answer by the prompt.
  • Cross-model comparisons show STAR-only performance improving from 85% to 100% with model upgrades, suggesting that newer models amplify structured reasoning in isolation even without prompt changes.
  • The results imply that structured reasoning frameworks may not transfer from isolated testing to real-world, multi-instruction prompts, making the reasoning-then-conclusion order a first-class design variable.
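The failure mode in the key points above, a conclusion committed to before the structured reasoning appears, is mechanically detectable. Below is a minimal, hypothetical sketch of such a detector; the "Short answer:" phrasing and the STAR section labels are assumptions based on the example quoted in this summary, not the paper's actual grading rubric.

```python
# Hypothetical detector for the conclusion-first failure mode: a response
# that states an answer ("Short answer: ...") before any of its STAR
# (Situation, Task, Action, Result) sections. Markers are assumptions.
STAR_SECTIONS = ["Situation", "Task", "Action", "Result"]

def conclusion_first(response: str) -> bool:
    """Return True if a 'Short answer:' line precedes all STAR sections."""
    answer_pos = response.find("Short answer:")
    if answer_pos == -1:
        return False  # no early conclusion present
    section_positions = [p for p in (response.find(s) for s in STAR_SECTIONS)
                         if p != -1]
    if not section_positions:
        return False  # no STAR reasoning to compare against
    return answer_pos < min(section_positions)
```

A logging pass with a check like this over production transcripts would surface how often the prompt's style directives override the intended reason-then-conclude order.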

Abstract

In a previous study [Jo, 2026], STAR reasoning (Situation, Task, Action, Result) raised car wash problem accuracy from 0% to 85% on Claude Sonnet 4.5, and to 100% with additional prompt layers. This follow-up asks: does STAR maintain its effectiveness in a production system prompt? We tested STAR inside InterviewMate's 60+ line production prompt, which had evolved through iterative additions of style guidelines, format instructions, and profile features. Three conditions, 20 trials each, on Claude Sonnet 4.6: (A) production prompt with Anthropic profile, (B) production prompt with default profile, (C) original STAR-only prompt. C scored 100% (verified at n=100). A and B scored 0% and 30%. Prompt complexity dilutes structured reasoning. STAR achieves 100% in isolation but degrades to 0-30% when surrounded by competing instructions. The mechanism: directives like "Lead with specifics" force conclusion-first output, reversing the reason-then-conclude order that makes STAR effective. In one case, the model output "Short answer: Walk." then executed STAR reasoning that correctly identified the constraint -- proving the model could reason correctly but had already committed to the wrong answer. Cross-model comparison shows STAR-only improved from 85% (Sonnet 4.5) to 100% (Sonnet 4.6) without prompt changes, suggesting model upgrades amplify structured reasoning in isolation. These results imply structured reasoning frameworks should not be assumed to transfer from isolated testing to complex prompt environments. The order in which a model reasons and concludes is a first-class design variable.