Stop Tweaking Prompts: Build a Feedback Loop Instead

Dev.to / 3/31/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The article argues that iterative prompt “tweaking” is essentially random walk optimization that often leads to contradictions and only marginal improvements.
  • It explains why prompt tweaking fails—lack of a baseline, undefined success criteria, and the stochastic nature of model outputs.
  • It proposes a quick five-minute “feedback loop” workflow: define acceptance criteria for what counts as “good,” then test the prompt on three representative inputs.
  • It shows how to score outputs against the criteria to pinpoint exactly what broke (e.g., JSON validity, bullet-count constraints, or edge cases like positive feedback).
  • It advises fixing the prompt by addressing the specific failure modes identified by the scored tests rather than making ad hoc changes.

Here's a pattern I see constantly: a developer writes a prompt, gets mediocre output, tweaks a word, runs it again, tweaks another word, runs it again. Thirty minutes later, the prompt is a mess of contradictions and the output is marginally better.

This is prompt tweaking. It feels productive. It isn't.

The alternative is a feedback loop — and it takes five minutes to set up.

What's Wrong With Tweaking

Tweaking is random walk optimization. You change one thing, observe the result, and decide if it's "better" based on gut feel. Problems:

  1. No baseline. You can't tell if version 12 is better than version 3 because you didn't save version 3's output.
  2. No criteria. "Better" is undefined. Is shorter better? More detailed? More accurate? You're optimizing for a moving target.
  3. No reproduction. Models are stochastic. The same prompt can give different results on different runs. One good output doesn't mean the prompt is good.

The Feedback Loop (5-Minute Setup)

Step 1: Define "Good"

Before you touch the prompt, write down what a good output looks like. Be specific:

## Acceptance Criteria
- Output is valid JSON
- Contains exactly 3 bullet points
- Each bullet is under 20 words
- No marketing language ("revolutionary", "game-changing")
- Captures the main complaint, not a summary of everything

This takes 60 seconds and saves you from chasing your tail.
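Most of these criteria can be encoded directly as code. Here's a minimal Python sketch, assuming the prompt asks for a JSON list of `{"point": ...}` objects (the shape used later in Step 4); the "captures the main complaint" criterion can't be checked mechanically, so it still needs a human or LLM judge:

```python
import json

# Illustrative word list -- extend with whatever phrasing you want to ban.
MARKETING_WORDS = {"revolutionary", "game-changing"}

def check_output(text: str) -> dict:
    """Score one model output against the acceptance criteria.
    Returns criterion name -> pass/fail. Assumes the output should be
    a JSON list of {"point": "..."} objects."""
    results = {"valid_json": False, "three_bullets": False,
               "under_20_words": False, "no_marketing": False}
    try:
        data = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return results  # not valid JSON: everything downstream fails too
    results["valid_json"] = True
    bullets = ([item.get("point", "") for item in data if isinstance(item, dict)]
               if isinstance(data, list) else [])
    results["three_bullets"] = len(bullets) == 3
    results["under_20_words"] = bool(bullets) and all(len(b.split()) < 20 for b in bullets)
    results["no_marketing"] = not any(w in text.lower() for w in MARKETING_WORDS)
    return results
```

The point isn't the specific checks — it's that each criterion becomes a yes/no answer instead of a gut feeling.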

Step 2: Create 3 Test Inputs

Pick three real inputs that represent your actual use case. Not toy examples — real data:

inputs/
  input-1.txt   # Short feedback, one complaint
  input-2.txt   # Long feedback, multiple issues
  input-3.txt   # Edge case: positive feedback (no complaints)

Step 3: Run All Three, Score Against Criteria

Run your prompt against all three inputs. For each output, check it against your acceptance criteria:

input-1: ✅ JSON ✅ 3 bullets ✅ under 20 words ✅ no marketing ✅ main complaint
input-2: ✅ JSON ❌ 4 bullets ✅ under 20 words ✅ no marketing ✅ main complaint
input-3: ✅ JSON ✅ 3 bullets ✅ under 20 words ❌ "amazing" ❌ no complaint to find

Now you know exactly what's broken: bullet count enforcement and the positive-feedback edge case.
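The run-and-score step is easy to script. A hedged sketch: `run_prompt` is a placeholder for your actual model call, and the two criteria shown are illustrative — swap in your full list from Step 1:

```python
import json
from pathlib import Path

def _is_valid_json(out: str) -> bool:
    try:
        json.loads(out)
        return True
    except json.JSONDecodeError:
        return False

# Criterion name -> predicate over the raw output text.
# Two illustrative checks; extend with the rest of your criteria.
CRITERIA = {
    "valid_json": _is_valid_json,
    "3_bullets": lambda out: out.count('"point"') == 3,  # rough heuristic
}

def score_all(run_prompt, input_dir="inputs"):
    """run_prompt: callable taking input text, returning model output text.
    Prints a ✅/❌ scorecard per input and returns the full report."""
    report = {}
    for path in sorted(Path(input_dir).glob("input-*.txt")):
        out = run_prompt(path.read_text())
        results = {name: check(out) for name, check in CRITERIA.items()}
        report[path.stem] = results
        marks = " ".join(("✅" if ok else "❌") + " " + name
                         for name, ok in results.items())
        print(f"{path.stem}: {marks}")
    return report
```

Returning the report as a dict (not just printing it) means you can diff it against the previous run later.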

Step 4: Fix What Failed

Change the prompt to address the specific failures. Not "make it better" — fix the bullet count issue:

Return EXACTLY 3 bullet points. Not 2, not 4.
If the feedback is positive with no complaints, return:
[{"point": "No actionable complaints identified"}]

Step 5: Re-Run, Re-Score

Run all three inputs again. Check the criteria. If everything passes, you're done. If something new breaks, fix that.

Why This Works

The feedback loop replaces intuition with information. Instead of "hmm, that looks better," you get "2 out of 3 inputs pass all criteria."

You also build an eval set as a side effect. Next time the model updates or you change the prompt, run the same three inputs and see if anything regressed. You just got regression testing for free.

The Time Math

| Approach | Time Spent | Confidence |
| --- | --- | --- |
| Tweaking for 30 min | 30 min | Low ("it seems better?") |
| Feedback loop | 10 min setup + 5 min per iteration | High (pass/fail per criterion) |

The feedback loop is faster and gives you reusable test infrastructure.

Practical Tips

  • Start with 3 inputs, not 30. You can always add more later. Three is enough to catch most issues.
  • Write criteria before the prompt. It forces you to think about what you actually want.
  • Save every prompt version. Just `prompt-v1.md`, `prompt-v2.md`. You'll want to diff them later.
  • Automate the loop when it matters. If this prompt runs in production, turn your test inputs into a script that runs on CI.
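As a sketch of that last tip: a minimal CI gate (hypothetical script name `ci_check.py`; `run_prompt` is a stand-in for your real model call) that exits nonzero when any input fails a criterion, so a bad prompt change fails the build:

```python
#!/usr/bin/env python3
"""ci_check.py -- fail the build if any test input fails a criterion (sketch)."""
import json
import sys
from pathlib import Path

def run_prompt(text: str) -> str:
    """Swap in your real model/API call here."""
    raise NotImplementedError

def passes(out: str) -> bool:
    """All-in-one check: valid JSON list of 3 bullets, each under 20 words."""
    try:
        data = json.loads(out)
    except json.JSONDecodeError:
        return False
    return (isinstance(data, list) and len(data) == 3
            and all(isinstance(item, dict)
                    and len(item.get("point", "").split()) < 20
                    for item in data))

def main() -> int:
    failures = [p.name for p in sorted(Path("inputs").glob("input-*.txt"))
                if not passes(run_prompt(p.read_text()))]
    for name in failures:
        print(f"FAIL: {name}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```

Wire this into your CI pipeline and the three test inputs from Step 2 become a permanent regression suite.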

The One-Liner

If you're spending more than 5 minutes tweaking a prompt by hand, you don't have a prompt problem — you have a process problem. Build the loop.

What's your prompt testing setup? I'm curious whether people run evals or mostly go by feel.