Why Not Just Use LLM-as-Judge?
Many teams default to using another LLM to evaluate their agent. It's easy. No labels needed.
But it has a critical flaw: you're trusting an LLM to judge an LLM. If both models share the same biases (and they often do, since they're trained on similar data), the judge will confidently approve wrong answers.
Conformal prediction avoids this entirely. It uses your ground truth labels — answers you know are correct — and builds a mathematical guarantee from them. No model judges another model. Math does.
How to Know If You Can Trust Your AI Agent (A Simple Framework)
You built an AI agent. It answers questions. It sounds confident. But how do you actually know if it's reliable?
Not gut feeling. Not vibes. A mathematical guarantee.
This article explains a 5-step framework — built on a technique called Conformal Prediction — that gives you a provable reliability score for any LLM agent. No PhD required.
The Core Problem: One Answer Tells You Nothing
Imagine you ask your trading agent: "What does RSI above 70 mean?"
It says: "Overbought condition."
Is that trustworthy? You have no idea. It could be lucky. It could always say that. You just don't know from a single answer.
This is the fundamental problem. LLMs are not calculators. They are probabilistic. Ask the same question 10 times and you might get 10 different answers. So the framework starts by exploiting that property — not hiding from it.
Step 1: Self-Consistency Sampling — Ask Many Times, Not Once
Instead of asking your agent a question once, ask it 10 times (with temperature > 0 so answers vary) and count which answers appear most often.
Example:
You ask: "What does RSI above 70 mean?" — 10 times.
| Answer | Count | Rank |
|---|---|---|
| Overbought condition | 7 | 1 |
| Bullish momentum | 2 | 2 |
| Buy signal | 1 | 3 |
Now you know something real: the agent is 70% consistent on this answer. That's meaningful signal.
What consistency tells you:
8–10/10 same answer → Agent knows this well
5–7/10 same answer → Agent is somewhat uncertain
1–4/10 same answer → Agent is basically guessing
⚙️ Technical note: Always set temperature above 0 (0.7 recommended). Temperature 0 returns the same answer every time, which defeats the purpose entirely.
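Here's a minimal sketch of the sampling step. `ask_agent` is a placeholder for your own agent call (e.g. an LLM API invoked with temperature 0.7); the canned-answer stand-in below exists only to make the example runnable.

```python
from collections import Counter
import itertools

def self_consistency(ask_agent, question, n_samples=10):
    """Sample the agent n_samples times and rank answers by frequency.

    `ask_agent` is a placeholder for your own agent call.
    Returns a list of (answer, count) pairs, most frequent first,
    so rank = list index + 1.
    """
    answers = [ask_agent(question) for _ in range(n_samples)]
    return Counter(answers).most_common()

# Toy stand-in for a real agent: cycles through 10 canned answers
fake = itertools.cycle(
    ["Overbought condition"] * 7 + ["Bullish momentum"] * 2 + ["Buy signal"]
)
ranking = self_consistency(lambda q: next(fake), "What does RSI above 70 mean?")
print(ranking[0])  # ('Overbought condition', 7)
```

In production you'd replace the lambda with a real model call; everything downstream only needs the `(answer, count)` ranking.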
Step 2: Nonconformity Scores — Turn Rankings Into Numbers
Rankings like "Rank 1, Rank 2, Rank 3" are labels. Math needs actual numbers. Nonconformity scores are the conversion.
The rule is dead simple:
Nonconformity Score = Rank of the CORRECT answer
If the correct answer appeared most often (Rank 1) → Score = 1 (great)
If the correct answer was second most frequent (Rank 2) → Score = 2 (okay)
If it was buried at Rank 4 → Score = 4 (bad)
Example:
You ask: "What is a death cross?" The correct answer is "Bearish signal."
| Answer | Count | Rank |
|---|---|---|
| Bullish reversal | 6 | 1 |
| Bearish signal | 3 | 2 |
| Neutral pattern | 1 | 3 |
The correct answer landed at Rank 2 → Nonconformity Score = 2
The agent ranked a wrong answer higher than the right one. That's a problem — and now you have a number that captures it.
What this step produces: You run this across 50 calibration questions (questions where you know the correct answer). You get a list of 50 scores, like: [1, 2, 1, 4, 1, 3, 2, 1, ...]
This list is the raw material for Step 3.
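The scoring rule is a one-liner over the ranking from Step 1. One detail the article leaves open is what to do when the correct answer never appears in the samples; the worst-case fallback below (one past the number of candidates) is an assumption, not part of the original rule.

```python
def nonconformity_score(ranking, correct_answer):
    """Return the 1-based rank of the correct answer in the frequency ranking.

    `ranking` is a list of (answer, count) pairs sorted by count,
    i.e. the output of the self-consistency step. If the correct
    answer never appeared, fall back to a worst-case rank
    (an assumption: one past the number of candidates).
    """
    for rank, (answer, _count) in enumerate(ranking, start=1):
        if answer == correct_answer:
            return rank
    return len(ranking) + 1

ranking = [("Bullish reversal", 6), ("Bearish signal", 3), ("Neutral pattern", 1)]
print(nonconformity_score(ranking, "Bearish signal"))  # 2
```

Running this over all 50 calibration questions produces the score list that Step 3 consumes.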
Step 3: Calibration — Find Your Reliability Threshold
You now have 50 numbers. Calibration answers: what score cutoff gives you 95% reliability?
How it works:
- Sort all 50 scores from lowest to highest
- Take the score at position ⌈(50 + 1) × 0.95⌉ = 49 of 50 — the "+1" is the small conformal correction that makes the 95% guarantee hold with finite data
- That value is your threshold
Say the value at position 49 is 3. Your threshold is 3.
What the threshold does:
For any new question the agent answers, you calculate its score. If the score is ≤ 3, you include that answer in your "prediction set." If it's > 3, you exclude it.
| Answer candidate | Score | Decision |
|---|---|---|
| "Overbought" | 1 | ✅ Include |
| "Bullish momentum" | 2 | ✅ Include |
| "Buy signal" | 3 | ✅ Include |
| "Random guess" | 4 | ❌ Exclude |
Result: Prediction Set = {Overbought, Bullish momentum, Buy signal}
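The threshold and the filtering step can be sketched together. The quantile below uses the standard split-conformal convention, the ⌈(n+1)(1−α)⌉-th smallest calibration score; the 50 calibration scores are a toy list chosen to make the threshold come out to 3.

```python
import math

def conformal_threshold(scores, alpha=0.05):
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score.

    This is the standard split-conformal quantile; with n = 50 and
    alpha = 0.05 it picks position ceil(51 * 0.95) = 49 of 50.
    """
    n = len(scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)  # guard for tiny n
    return sorted(scores)[k - 1]

def prediction_set(ranking, threshold):
    """Keep every candidate whose rank (= its score) is <= threshold."""
    return [answer for rank, (answer, _count) in enumerate(ranking, start=1)
            if rank <= threshold]

# 50 toy calibration scores: 25 ones, 13 twos, 11 threes, 1 four
scores = [1] * 25 + [2] * 13 + [3] * 11 + [4]
t = conformal_threshold(scores)
print(t)  # 3

ranking = [("Overbought", 7), ("Bullish momentum", 2),
           ("Buy signal", 1), ("Random guess", 0)]
print(prediction_set(ranking, t))  # ['Overbought', 'Bullish momentum', 'Buy signal']
```

Note that the threshold is computed once, from calibration data, and then reused unchanged for every new question.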
What Is a Prediction Set — And Why Does It Exist?
This is the part most explanations skip. Let's fix that.
A prediction set is not a final answer. It's a set of candidate answers that the framework guarantees contains the correct answer at your target reliability level (e.g., 95% of the time).
Why not just give one answer?
Because one answer hides uncertainty. A prediction set shows it.
- Small set (1–2 answers): The agent is confident. It knows this.
- Large set (4–5 answers): The agent is uncertain. Tread carefully.
- Huge set (6+ answers): The agent is lost. Don't trust it on this topic.
Think of it like a doctor saying "it's definitely pneumonia" vs. "it could be pneumonia, bronchitis, or a severe cold — we need more tests." The second answer is more honest and more useful.
Do the answers in the prediction set need to be "known" in advance?
Yes — the framework works in a multiple-choice or ranked-answer setting where you sample the agent repeatedly and rank the candidates by frequency. You're not inventing new options; you're filtering the agent's own generated answers through the threshold.
Step 4: Coverage Guarantee — Does It Actually Work?
Now you test the threshold on 50 new questions the agent hasn't been calibrated on.
For each question, you build a prediction set and check: is the correct answer inside it?
Example:
Q: "What does MACD crossover indicate?"
Correct answer: "Bullish momentum"
Prediction set: {Bullish momentum, Trend reversal, Buy signal}
Correct answer inside? ✅ YES
You do this for all 50 test questions. Say 47 out of 50 had the correct answer inside the prediction set.
Coverage = 47 / 50 = 94%
This 94% is not just a test score. It is a mathematical guarantee.
Because of conformal prediction's properties, this coverage holds in expectation for any future question drawn from the same distribution as your calibration data — not just your test set.
🔑 And here's the surprising part: with just 50 calibration examples, the coverage guarantee is off by at most 1/(50+1) ≈ 1.96%. You don't need thousands of labeled examples. You need 50.
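Measuring coverage is just a loop over held-out questions. The item format below (a frequency ranking paired with the known correct answer) is an assumption carried over from the earlier steps.

```python
def coverage(test_items, threshold):
    """Fraction of test questions whose prediction set contains the truth.

    Each test item is (ranking, correct_answer), where `ranking` is a
    list of (answer, count) pairs sorted by frequency.
    """
    hits = 0
    for ranking, correct in test_items:
        pred_set = [a for rank, (a, _c) in enumerate(ranking, start=1)
                    if rank <= threshold]
        hits += correct in pred_set
    return hits / len(test_items)

# Two toy test questions, threshold = 3: one hit, one miss
items = [
    ([("Bullish momentum", 5), ("Trend reversal", 3), ("Buy signal", 2)],
     "Bullish momentum"),
    ([("Bearish", 6), ("Neutral", 3), ("Bullish", 1)],
     "Sideways chop"),  # correct answer the agent never produced
]
print(coverage(items, threshold=3))  # 0.5
```

With 50 real test questions and 47 hits, this function would return 0.94 — the article's 94%.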
Two Types of Coverage — Don't Confuse Them
There's a distinction worth making explicit, because it trips people up.
Coverage on the calibration set is what you observed while tuning the threshold. You used it to set the threshold — so it's not a true test of generalization.
Coverage on the test set is what you measure on new, unseen questions. This is the real guarantee. This is the number that tells you whether your agent can be trusted in production.
Always report the test set coverage. The calibration coverage is just scaffolding.
Step 5: Comparing Agents
Run the exact same 50 calibration + 50 test questions on every agent you want to compare. Same threshold target. Then rank them:
| Agent | Coverage | Avg Set Size | Trustworthy? |
|---|---|---|---|
| Agent A | 94% | 1.2 | ✅ Yes |
| Agent B | 91% | 1.8 | ✅ Yes |
| Agent C | 87% | 2.4 | ⚠️ Borderline |
| Agent D | 76% | 3.1 | ❌ No |
| Agent E | 64% | 4.2 | ❌ No |
Two numbers matter:
- Coverage — Does the agent's prediction set actually contain the right answer 95% of the time?
- Average set size — When it IS right, how many options does it need? Smaller = more confident and precise.
Agent A is the winner: it hits 94% coverage with an average set size of just 1.2 — meaning it's almost always giving you a single correct answer.
The result: instead of guessing which agent to trust, you know — with a provable guarantee — which one deserves to be in production.
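The comparison itself is a simple sort on those two numbers. The `slack` used to flag "trustworthy" agents below is an illustrative choice (coverage within 5 points of the 95% target), not part of the conformal guarantee.

```python
def rank_agents(results, target=0.95, slack=0.05):
    """Sort agents by coverage (descending), tie-break on smaller set size.

    `results` maps agent name -> (coverage, avg_set_size). The `slack`
    threshold for the trustworthy flag is an illustrative choice.
    """
    ranked = sorted(results.items(), key=lambda kv: (-kv[1][0], kv[1][1]))
    return [(name, cov, size, cov >= target - slack)
            for name, (cov, size) in ranked]

results = {
    "Agent A": (0.94, 1.2),
    "Agent D": (0.76, 3.1),
    "Agent B": (0.91, 1.8),
}
for name, cov, size, ok in rank_agents(results):
    print(f"{name}: coverage={cov:.0%}, avg set size={size}, trustworthy={ok}")
```

The key design choice: coverage is the gate, set size is the tie-breaker. An agent with high coverage but huge sets is honest about its uncertainty, but not very useful.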
Built on Conformal Prediction — a distribution-free, finite-sample statistical framework.




