How do you eval LLM output that isn't code?

Dev.to / 5/29/2026

💬 OpinionIdeas & Deep AnalysisTools & Practical UsageModels & Research

共有:

Key Points

The article explains that evaluating non-code LLM outputs (e.g., pitches, show notes, ledes) cannot rely on simple pass/fail assertions like code-based checks.
It proposes a two-stage evaluation pipeline: first, cheap binary assertions to catch mechanical failures (required sections, format rules, refusal behavior), then a second, graded judgment stage for quality.
In the graded stage, the model scores each output on 1–5 across seven dimensions, including an “Editorial Naturalness” dimension that enforces whether the text sounds like a human who understands the specific media context.
Editorial Naturalness is evaluated using observable signals (lexical choice, structure patterns, tone, and genre conventions), and outputs below a hard threshold are rejected regardless of other scores.
The approach aims to prevent the system from drifting into fluent but “machine-sounding” prose that would still score well on averaged metrics, which are shown to hide the key failure mode.

Code has a luxury: it either runs or it doesn't. You write an assertion, you run it, you get a green check or a red one. Most LLM eval frameworks lean on exactly this — assert output contains X, assert valid JSON, assert no error.

Editorial output has no such luxury. A pitch treatment, a show-notes draft, a lede — there's no test that returns true for "a working producer would send this." So how do you evaluate it without a human reading every output, every time?

I had to answer this concretely, because I maintain a library of ~400 Claude skills for media work, and "trust me, it's good" is not a quality bar. Here's the approach.

Two stages: binary first, judgment second

Stage one is the cheap filter — binary assertions. Even for prose, a lot of failure is mechanical and testable: did it produce the required sections? Is the lede one sentence? Did it refuse to fabricate a quote when given no source? These catch the obvious breaks fast, run blind across many inputs, and cost nothing. The library runs thousands of these. The interesting result: the few "failures" were skills correctly refusing to invent content on deliberately thin inputs — the desired behaviour, not a bug.

Stage two is the part that matters for prose — graded judgment. A model scores each output 1–5 across seven dimensions: coherence, relevance, accuracy, completeness, usefulness, format-fit, and one more that does the real work.

The dimension that does the work: Editorial Naturalness

Six of the seven dimensions are standard. The seventh is a hard floor: does this read like a person who knows the medium, or like a model?

This is scored against observable tells, not vibes:

Lexical — the AI vocabulary (delve, leverage, robust, seamless, tapestry).
Structural — the false pivot ("not just X, but Y"), throat-clearing openers, rule-of-three on every line.
Tonal — manufactured enthusiasm, hedging stacks, the apology spiral.
Genre — does it honour the conventions of the format it claims to serve?

A skill can score 5/5 on the other six and still fail. If Editorial Naturalness is below the floor, it doesn't ship. That single constraint is what stops the library drifting into competent-sounding slop.

Why a hard floor and not an average

Averages hide the thing you care about. A draft that's accurate, complete, well-structured — and unmistakably machine-written — would pass an averaged score comfortably. For media work that draft is useless: the audience clocks it in a sentence. The floor forces the failure to surface instead of being averaged away.

The honest limitation

A model grading prose is generous — it tends to like fluent text, including fluent AI text. So the scores are treated as a filter, not a verdict: they catch the clear failures and rank candidates, but the bar for "stable" is deliberately set high (≥ 4.0 with the naturalness floor), and the rubric is anchored on observable tells rather than taste, so two runs roughly agree. It's not perfect. It's a lot better than shipping on feel.

The whole framework — dimensions, thresholds, the banned-phrase list — is open source. If you're evaluating non-code LLM output, take it apart and tell me where it's too soft.

→ github.com/ur-grue/autopunk-media-skills

Black Hat USA

AI Business

Gemini Flash Gets Pricey, AI Act Delays, Agents Drive Online Traffic

The Batch

Which one has the most chance of open-sourcing old 2020-2024 AI models? OpenAI, Google or Antrophic? Why? Tell also a model that would open source (ya select only one old model)

Reddit r/LocalLLaMA

No Template Fits? Generate Your Own Awesome DESIGN.md with .NET and Ollama

Dev.to

Mask-CNN: Localizing Parts and Selecting Descriptors for Fine-Grained ImageRecognition