Evaluating and Testing LLM Apps: eval / regression / Golden Set

AI Navigate Original / 4/27/2026

Key Points

  • LLMs are probabilistic; without an evaluation framework you are flying blind
  • Three kinds of evaluation: Golden Set, LLM-as-a-Judge, and user feedback; choose metrics per use case
  • Tools: LangSmith, Langfuse, Promptfoo, OpenAI Evals, Ragas
  • Embed regression tests in CI; monitor cost and quality in production

Why Evaluation Is Needed

Conventional software can be verified with pass/fail tests, but LLM output is probabilistic and answer quality varies along a continuum rather than being simply right or wrong. Running in production without an evaluation framework is like flying a plane without instruments.

3 Kinds of Evaluation

1. Golden Set (Manually Defined)

Create 50-500 pairs of the form "for this input, this answer is ideal." Re-score against the set every time a new version ships.

  • Written by task owners (don't leave it to engineers)
  • Include edge cases and common mistakes
  • Update quarterly
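
As a rough illustration of scoring against a Golden Set, the sketch below compares each generated answer with its ideal answer and reports a pass rate that a CI job could gate on. Everything in it is an assumption made for illustration: the names `GoldenCase`, `run_golden_set`, and `check_regression` are hypothetical, the 0.8 pass threshold and 0.02 tolerance are arbitrary, and the lexical similarity from Python's `difflib` is only a stand-in for whatever metric suits the task.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Sequence


@dataclass
class GoldenCase:
    """One hand-written pair: an input and the answer considered ideal."""
    prompt: str
    ideal: str


def similarity(answer: str, ideal: str) -> float:
    """Crude lexical similarity in [0, 1]; a placeholder for a real metric."""
    return SequenceMatcher(None, answer.lower().strip(), ideal.lower().strip()).ratio()


def run_golden_set(
    generate: Callable[[str], str],
    cases: Sequence[GoldenCase],
    pass_threshold: float = 0.8,  # assumed cutoff, tune per task
) -> dict:
    """Score every case and summarize the run."""
    if not cases:
        raise ValueError("golden set is empty")
    scores = [similarity(generate(c.prompt), c.ideal) for c in cases]
    passed = sum(1 for s in scores if s >= pass_threshold)
    return {
        "pass_rate": passed / len(cases),
        "mean_score": sum(scores) / len(scores),
    }


def check_regression(report: dict, baseline_pass_rate: float, tolerance: float = 0.02) -> bool:
    """True if the new version is no worse than the baseline, within tolerance."""
    return report["pass_rate"] >= baseline_pass_rate - tolerance


if __name__ == "__main__":
    # Hypothetical golden cases and a fake app standing in for the real LLM call.
    golden = [
        GoldenCase("What is the refund window?", "30 days"),
        GoldenCase("Which plan includes SSO?", "Enterprise only"),
    ]
    fake_app = {c.prompt: c.ideal for c in golden}.get
    report = run_golden_set(fake_app, golden)
    print(report, "ok" if check_regression(report, baseline_pass_rate=0.95) else "REGRESSION")
```

In a CI job, a failing `check_regression` would typically be turned into a non-zero exit code so the new version cannot ship unnoticed.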

2. LLM-as-a-Judge

Have a separate, strong LLM (GPT-5, Claude Opus) grade each answer as good or bad. This makes evaluation at scale practical.

  • State the scoring rubric explicitly ("accuracy 0-5, clarity 0-5")
  • Mitigate bias: swap answer positions in A/B comparisons and take a vote across multiple judge models
  • Periodically check that judge scores agree with human ratings
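
The bias measures above can be sketched without committing to any vendor API. In the minimal example below, a "judge" is any callable that sees a question and two answers and says which position it prefers; the actual LLM call is deliberately left out. The names `debiased_compare`, `panel_vote`, and `agreement_rate`, and the "first"/"second"/"tie" return convention, are assumptions made for this sketch rather than part of any tool named in this article.

```python
from collections import Counter
from typing import Callable, Sequence

# A judge takes (question, first_answer, second_answer) and returns
# "first", "second", or "tie". In practice it would wrap an LLM call.
Judge = Callable[[str, str, str], str]


def debiased_compare(judge: Judge, question: str, a: str, b: str) -> str:
    """Judge both orderings; accept a winner only if both orderings agree."""
    forward = judge(question, a, b)    # A is shown first
    backward = judge(question, b, a)   # B is shown first
    forward_winner = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_winner = {"first": "B", "second": "A"}.get(backward, "tie")
    return forward_winner if forward_winner == backward_winner else "tie"


def panel_vote(judges: Sequence[Judge], question: str, a: str, b: str) -> str:
    """Plurality vote across several judges; a split at the top counts as a tie."""
    votes = Counter(debiased_compare(j, question, a, b) for j in judges)
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items where the judge's verdict matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(1 for j, h in zip(judge_labels, human_labels) if j == h)
    return matches / len(human_labels)
```

A judge that always prefers whichever answer it sees first ends up returning "tie" from `debiased_compare`, which is the point of the swap: position preference cancels out instead of being counted as a win.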

3. User Feedback

Collect 👍 / 👎 reactions, star ratings, and detailed comments from production users. Aggregate them with LangSmith or Helicone.
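
To make the aggregation step concrete, here is a minimal sketch in plain Python that groups thumbs-up/down events by prompt version and computes an approval rate for each. It does not use the LangSmith or Helicone APIs; the `Feedback` record, its fields, and the `summarize` function are hypothetical, and a real pipeline would read these events from whichever tool stores them.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Feedback:
    """One user reaction recorded in production (hypothetical schema)."""
    prompt_version: str
    thumbs_up: bool
    comment: str = ""


def summarize(events: Iterable[Feedback]) -> dict:
    """Return {version: {"count": n, "up_rate": share of thumbs-up}}."""
    totals = defaultdict(lambda: {"up": 0, "count": 0})
    for event in events:
        bucket = totals[event.prompt_version]
        bucket["count"] += 1
        if event.thumbs_up:
            bucket["up"] += 1
    return {
        version: {"count": t["count"], "up_rate": t["up"] / t["count"]}
        for version, t in totals.items()
    }


if __name__ == "__main__":
    # Made-up events, purely for demonstration.
    sample = [
        Feedback("v1", True),
        Feedback("v1", False, "answer was outdated"),
        Feedback("v2", True),
    ]
    for version, stats in sorted(summarize(sample).items()):
        print(version, stats)
```

Reporting the count alongside the rate matters: an approval rate computed from a handful of reactions is too noisy to compare versions on.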

Evaluation Metrics
