Why Evaluation Is Needed
Conventional software can usually be verified with pass/fail tests, but LLM output is probabilistic and its quality varies along a continuum. Running an LLM in production without an evaluation framework is like flying a plane without instruments.
3 Kinds of Evaluation
1. Golden Set (Manually Defined)
Create 50-500 input/answer pairs of the form "for this input, this is the ideal answer," and score every new version against them.
- Have the task owners create it (don't leave it to engineers)
- Include edge cases and common mistakes
- Update quarterly
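As a minimal sketch of scoring a new version against a golden set: the `GoldenExample` type, the `fake_model` stand-in, and the normalized exact-match comparison below are all assumptions made for illustration. A real golden set would call the actual model and would likely need a looser similarity measure than exact match.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class GoldenExample:
    """One hand-written pair: an input and its ideal answer (hypothetical type)."""
    prompt: str
    ideal: str


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences do not count as failures.
    return " ".join(text.lower().split())


def score_golden_set(examples: Sequence[GoldenExample],
                     generate: Callable[[str], str]) -> float:
    """Return the fraction of examples whose output matches the ideal answer."""
    if not examples:
        raise ValueError("golden set is empty")
    hits = sum(normalize(generate(ex.prompt)) == normalize(ex.ideal)
               for ex in examples)
    return hits / len(examples)


def fake_model(prompt: str) -> str:
    # Stand-in for the model version under test (assumption, not a real API).
    canned = {"2+2?": "4", "Capital of France?": "paris"}
    return canned.get(prompt, "I don't know")


golden = [
    GoldenExample("2+2?", "4"),
    GoldenExample("Capital of France?", "Paris"),
    GoldenExample("Largest planet?", "Jupiter"),
]

pass_rate = score_golden_set(golden, fake_model)
print(f"golden-set pass rate: {pass_rate:.2f}")
```

Tracking this one number per release makes regressions visible: if the pass rate drops after a prompt or model change, the failing examples show where to look.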
2. LLM-as-a-Judge
Have another strong LLM (GPT-5, Claude Opus) score whether an answer is good or bad. This makes large-scale evaluation possible.
- State the scoring rubric explicitly (e.g., "accuracy 0-5, clarity 0-5")
- Counter judge bias: swap the answer positions in A/B comparisons and take a vote across multiple models
- Periodically verify that the judge's scores agree with human ratings
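The position-swap and multi-model voting measures can be sketched as follows. The judges here are plain Python callables standing in for LLM calls; `position_biased_judge` and `length_judge` are hypothetical toys chosen only to make the behavior visible, not realistic judges.

```python
from collections import Counter
from typing import Callable, Sequence

# A judge takes (question, first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def compare_with_swap(judge: Judge, question: str,
                      answer_a: str, answer_b: str) -> str:
    """Ask the judge twice with the positions swapped; accept only a consistent winner."""
    a_wins_first = judge(question, answer_a, answer_b) == "first"
    a_wins_second = judge(question, answer_b, answer_a) == "second"
    if a_wins_first and a_wins_second:
        return "A"
    if not a_wins_first and not a_wins_second:
        return "B"
    return "inconsistent"  # the verdict flipped with position, so discard it


def majority_verdict(judges: Sequence[Judge], question: str,
                     answer_a: str, answer_b: str) -> str:
    """Vote across several judges; require a strict majority for a winner."""
    verdicts = [compare_with_swap(j, question, answer_a, answer_b) for j in judges]
    winner, count = Counter(verdicts).most_common(1)[0]
    return winner if count > len(judges) / 2 else "inconsistent"


def position_biased_judge(question: str, first: str, second: str) -> str:
    # Toy judge that always prefers whatever is shown first (pure position bias).
    return "first"


def length_judge(question: str, first: str, second: str) -> str:
    # Toy judge that prefers the longer answer regardless of position.
    return "first" if len(first) >= len(second) else "second"


question = "Explain what a golden set is."
long_answer = "A curated list of inputs paired with ideal answers."
short_answer = "A list."

biased_result = compare_with_swap(position_biased_judge, question, long_answer, short_answer)
stable_result = compare_with_swap(length_judge, question, long_answer, short_answer)
voted_result = majority_verdict(
    [length_judge, length_judge, position_biased_judge],
    question, long_answer, short_answer,
)
print(biased_result, stable_result, voted_result)
```

The swap exposes the position-biased judge because its preference follows the slot rather than the answer, while the vote lets two consistent judges outweigh one unreliable one.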
3. User Feedback
Collect 👍 / 👎, star ratings, and detailed comments in production, and aggregate them with a tool such as LangSmith or Helicone.
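To illustrate the kind of aggregation such tools perform, here is a small self-contained sketch that computes the 👍 rate per model version. It does not use the LangSmith or Helicone APIs; the `Feedback` record and its fields are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Feedback:
    """One thumbs-up/thumbs-down event from production (hypothetical schema)."""
    model_version: str
    thumbs_up: bool


def approval_by_version(events: Iterable[Feedback]) -> Dict[str, float]:
    """Return the share of thumbs-up events for each model version."""
    counts = defaultdict(lambda: [0, 0])  # version -> [thumbs_up_count, total_count]
    for event in events:
        counts[event.model_version][0] += int(event.thumbs_up)
        counts[event.model_version][1] += 1
    return {version: ups / total for version, (ups, total) in counts.items()}


events = [
    Feedback("v1", True),
    Feedback("v1", False),
    Feedback("v2", True),
    Feedback("v2", True),
    Feedback("v2", False),
    Feedback("v2", True),
]

rates = approval_by_version(events)
for version, rate in sorted(rates.items()):
    print(f"{version}: {rate:.0%} thumbs up")
```

With only a handful of events per version the rates are noisy, so in practice the comparison is worth trusting only once each version has accumulated enough feedback.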



