Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

arXiv cs.AI / April 17, 2026


Key Points

  • The study finds that, in compound AI systems, prompt optimization is often statistically no better than random guessing, with many runs performing worse than zero-shot baselines.
  • Across 72 optimization runs on Claude Haiku and additional tests on Amazon Nova Lite, a large fraction of results fall below zero-shot, indicating frequent optimization failure.
  • The researchers test two common assumptions behind tools such as TextGrad and DSPy—that individual prompts matter, and that agent prompts interact and therefore require joint optimization—and find no meaningful interaction (agent coupling) between prompts.
  • Optimization tends to help only for tasks with exploitable output structure—cases where the model can produce the needed format but does not naturally default to it.
  • They propose a practical diagnostic approach: an $80 ANOVA pre-test to check for agent coupling, plus a short headroom test to predict whether optimization is likely to be worthwhile.

Abstract

Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku (6 methods × 4 tasks × 3 repeats), 49% score below zero-shot; on Amazon Nova Lite, the failure rate is even higher. Yet on one task, all six methods improve over zero-shot by up to +6.8 points. What distinguishes success from failure? We investigate with 18,000 grid evaluations and 144 optimization runs, testing two assumptions behind end-to-end optimization tools like TextGrad and DSPy: (A) individual prompts are worth optimizing, and (B) agent prompts interact, requiring joint optimization. Interaction effects are never significant (p > 0.52, all F < 1.0), and optimization helps only when the task has exploitable output structure -- a format the model can produce but does not default to. We provide a two-stage diagnostic: an $80 ANOVA pre-test for agent coupling, and a 10-minute headroom test that predicts whether optimization is worthwhile -- turning a coin flip into an informed decision.
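The agent-coupling check the abstract describes boils down to a standard two-way ANOVA: vary the prompt for each agent independently on a grid, run replicates, and test whether the interaction term is significant. Below is a minimal sketch of that interaction F-test, hand-rolled with NumPy and SciPy; the grid shape, score values, and function name are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from scipy.stats import f as f_dist


def interaction_f_test(scores):
    """Two-way ANOVA interaction F-test (illustrative sketch).

    scores: array of shape (a, b, r) -- a prompt variants for agent 1,
    b prompt variants for agent 2, r replicate runs per grid cell.
    Returns (F, p) for the agent-1 x agent-2 interaction term.
    """
    a, b, r = scores.shape
    grand = scores.mean()
    cell = scores.mean(axis=2)       # per-cell means, shape (a, b)
    row = scores.mean(axis=(1, 2))   # agent-1 main-effect means
    col = scores.mean(axis=(0, 2))   # agent-2 main-effect means

    # Interaction SS: what remains of the cell means after removing
    # the grand mean and both additive main effects.
    ss_inter = r * ((cell - row[:, None] - col[None, :] + grand) ** 2).sum()
    # Within-cell (error) SS from replicate-to-replicate variation.
    ss_err = ((scores - cell[:, :, None]) ** 2).sum()

    df_inter = (a - 1) * (b - 1)
    df_err = a * b * (r - 1)
    F = (ss_inter / df_inter) / (ss_err / df_err)
    p = f_dist.sf(F, df_inter, df_err)
    return F, p


# Toy data with purely additive prompt effects, so the interaction
# term should carry little signal (large p, F near 1).
rng = np.random.default_rng(0)
row_eff = rng.normal(0, 1, size=(3, 1, 1))
col_eff = rng.normal(0, 1, size=(1, 3, 1))
scores = 70 + row_eff + col_eff + rng.normal(0, 2, size=(3, 3, 5))
F, p = interaction_f_test(scores)
print(f"F = {F:.2f}, p = {p:.3f}")
```

A large p-value on real grid data would mirror the paper's finding (p > 0.52, all F < 1.0): the two agents' prompts can be tuned independently, and joint optimization buys nothing.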