This tutorial was written for @llmassert/playwright v0.6.0.
You've built a chatbot. Your Playwright tests pass. But your users are reporting hallucinated answers — confident responses that sound right but are completely fabricated.
The problem? Your tests check that the chatbot responds, not that it responds correctly. A toContain assertion can't tell the difference between a grounded answer and a hallucination. You need assertions that actually understand the output.
@llmassert/playwright adds five LLM-powered matchers to Playwright's expect() — checking for hallucinations, PII, tone, format, and semantic accuracy. Same test framework, same workflow, new superpowers.
In this tutorial, you'll go from zero to five working LLM assertions in about 10 minutes. No new framework to learn — if you know Playwright, you already know 90% of what you need.
One thing to know first: what "inconclusive" means
LLMAssert uses an LLM (GPT-5.4-mini by default) as a judge to evaluate your outputs. But LLM APIs can be slow or temporarily unavailable.
When the judge can't return a score, the result is inconclusive — and the test passes. This is by design: a provider outage should never block your CI pipeline.
Your test runs
│
▼
Judge evaluates output
│
├── Score ≥ threshold → PASS ✓
├── Score < threshold → FAIL ✗
└── Judge unavailable → INCONCLUSIVE (passes) ≈
Every matcher returns { pass: boolean, score: number | null, reasoning: string }. The score ranges from 0.0 to 1.0 — or null if inconclusive. You get a numeric quality signal, not just pass/fail.
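The decision flow above can be sketched in a few lines. This is illustrative only — the function and type names here are not the library's actual internals:

```typescript
// Illustrative sketch of the pass/fail/inconclusive decision described
// above — not LLMAssert's real implementation.
type Verdict = "pass" | "fail" | "inconclusive";

function verdict(score: number | null, threshold = 0.7): Verdict {
  if (score === null) return "inconclusive"; // judge unavailable → test still passes
  return score >= threshold ? "pass" : "fail";
}
```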
Running these examples costs less than a penny in API calls (GPT-5.4-mini pricing).
Setup (2 minutes)
Install the package:
pnpm add -D @llmassert/playwright
# or: npm install -D @llmassert/playwright
Create a .env file in your project root with your OpenAI API key:
OPENAI_API_KEY=your_openai_api_key_here
Make sure .env is in your .gitignore — Playwright projects usually have this already, but double-check before committing.
That's it. You're ready to write your first LLM assertion.
One import to change
Import test and expect from @llmassert/playwright instead of @playwright/test. This gives you the five LLM matchers plus the worker-scoped judge fixture. Your playwright.config.ts stays the same. The package ships both ESM and CJS — require() works too.
// Before
import { test, expect } from "@playwright/test";
// After
import { test, expect } from "@llmassert/playwright";
Catching a hallucination
Here's a typical Playwright test that checks a chatbot response:
import { test, expect } from "@playwright/test";
test("chatbot answers FAQ correctly", async () => {
const response = "Our return window is 90 days from purchase.";
// This passes! But the response is wrong...
expect(response).toContain("return");
});
The test passes because the word "return" appears in the response. But the actual return policy is 30 days. Your chatbot just hallucinated, and your test didn't catch it.
Now with LLMAssert:
import { test, expect } from "@llmassert/playwright";
test("chatbot answers FAQ correctly", async () => {
const response = "Our return window is 90 days from purchase.";
const faqDocs = "Returns accepted within 30 days. No restocking fee.";
// This fails! The judge detects the 90/30 day discrepancy.
await expect(response).toBeGroundedIn(faqDocs);
});
Notice the await — LLMAssert matchers are async because they call a judge model. Standard Playwright matchers like toContain are synchronous and don't need await.
The toBeGroundedIn matcher sends both the response and your source context to the judge model, which checks every claim against the evidence. The "90 days" claim contradicts the "30 days" in the source docs — the test fails with a score and a plain-English explanation of what went wrong.
This is what makes LLM assertions different from regex or toContain: the judge understands meaning, not just string matching. It catches paraphrased hallucinations, subtle contradictions, and fabricated details that would sail through traditional assertions.
The five matchers
toBeGroundedIn — catch hallucinations
Every claim in the output must be supported by the context you provide. Great for FAQ bots, RAG pipelines, and any system that should answer from source documents.
test("support answer is grounded in knowledge base", async () => {
const response = "We offer a 30-day money-back guarantee on all plans.";
const knowledgeBase = "All plans include a 30-day money-back guarantee. No questions asked.";
await expect(response).toBeGroundedIn(knowledgeBase);
});
toBeFreeOfPII — detect personal information
Scans for names, emails, phone numbers, addresses, and more. A score of 1.0 means the text is clean; 0.0 means PII was definitely found.
test("support response does not leak customer PII", async () => {
const response = "Your order #12345 has been shipped and should arrive Friday.";
await expect(response).toBeFreeOfPII();
});
// Verify PII IS present (e.g., in a profile summary)
test("profile includes user details", async () => {
const summary = "Account holder: Jane Smith, jane@example.com";
await expect(summary).not.toBeFreeOfPII();
});
toMatchTone — enforce brand voice
Validates that text matches a natural-language tone descriptor. Use it to ensure your bot stays on-brand even when users are frustrated.
test("support replies stay professional under pressure", async () => {
const response = "I understand your frustration. Let me look into this right away and find a solution for you.";
await expect(response).toMatchTone("empathetic and solution-oriented");
});
toBeFormatCompliant — check output structure
Validates that text conforms to a described format. The schema parameter is a natural-language description, not a JSON Schema.
test("product description follows template", async () => {
// Template literal keeps the multi-line description a single string
const description = `Introducing the CloudWidget Pro.

- 99.9% uptime
- Auto-scaling
- 24/7 support

Start your free trial today.`;
await expect(description).toBeFormatCompliant(
"Three paragraphs: overview, key features as bullet list, call to action"
);
});
toSemanticMatch — verify meaning preservation
Compares semantic similarity between two texts. Great for testing translations, summaries, or rephrased content.
test("summary preserves key meaning", async () => {
const original = "The quarterly revenue increased by 15% driven by strong demand in the enterprise segment.";
const summary = "Revenue grew 15% this quarter, led by enterprise sales.";
await expect(summary).toSemanticMatch(original);
});
Tuning thresholds
Every matcher uses a threshold (default: 0.7) to determine pass/fail. Override it inline:
// Strict grounding for medical content
await expect(response).toBeGroundedIn(context, { threshold: 0.95 });
// Relaxed matching for creative paraphrasing
await expect(summary).toSemanticMatch(reference, { threshold: 0.6 });
Why not just write a custom eval script?
You could call the OpenAI API directly from your tests and parse the response yourself. But you'd need to handle:
- Fallback logic when the API is down (so your CI doesn't break)
- Timeout handling that doesn't block your entire test suite
- Score normalization across different prompt types
- Result collection for tracking scores over time
- Rate limiting to avoid burning through your API quota in parallel test runs
LLMAssert handles all of this out of the box, behind the same expect() interface you already use.
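To give a flavor of just one of those bullets, here's roughly what score normalization alone can involve. This is a hand-rolled sketch, not what LLMAssert actually does internally:

```typescript
// Judge models reply in inconsistent formats ("8/10", "0.8", "Score: 85/100").
// Hand-rolled normalization to a 0–1 scale — one bullet's worth of plumbing.
function normalizeScore(raw: string): number | null {
  const m = raw.match(/(\d+(?:\.\d+)?)\s*(?:\/\s*(\d+(?:\.\d+)?))?/);
  if (!m) return null; // unparseable → treat as inconclusive
  const value = parseFloat(m[1]);
  // No explicit denominator: assume a 0–1 score if small, else a percentage.
  const scale = m[2] ? parseFloat(m[2]) : value <= 1 ? 1 : 100;
  const score = value / scale;
  return score >= 0 && score <= 1 ? score : null; // reject out-of-range values
}
```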
Tracking results across runs
The assertions work standalone — no account needed. But if you want to track scores over time, spot regressions, and share results with your team, add the optional dashboard reporter.
Add it to your playwright.config.ts:
import { defineConfig } from "@playwright/test";
export default defineConfig({
reporter: [
["list"],
[
"@llmassert/playwright/reporter",
{
projectSlug: "my-project",
apiKey: process.env.LLMASSERT_API_KEY,
},
],
],
});
The reporter batches evaluation results and sends them to the LLMAssert dashboard after each test run. If the dashboard is unreachable, your tests still pass — the reporter defaults to onError: 'warn'.
Omit the apiKey to run in local-only mode with no network calls.
Understanding your API keys
The tutorial uses up to three environment variables. They serve different purposes:
| Variable | Source | Purpose | If leaked |
|---|---|---|---|
| OPENAI_API_KEY | OpenAI dashboard | Powers the primary judge (GPT-5.4-mini). Required unless using Anthropic only. | Spend on your OpenAI account |
| ANTHROPIC_API_KEY | Anthropic console | Powers the fallback judge (Claude Haiku). Optional. | Spend on your Anthropic account |
| LLMASSERT_API_KEY | LLMAssert dashboard | Sends results to the dashboard. Optional. | Test data written to one project |
At least one of OPENAI_API_KEY or ANTHROPIC_API_KEY must be set. If neither is present, all assertions return inconclusive.
Adding the fallback judge
For resilience, you can add Claude Haiku as a fallback. If the primary model fails, the fallback takes over before marking results inconclusive.
pnpm add -D @anthropic-ai/sdk
# Add to your .env
ANTHROPIC_API_KEY=your_anthropic_api_key_here
The fallback activates automatically — no code changes needed.
OPENAI_API_KEY set? ──yes──▶ GPT-5.4-mini
│
success? ──yes──▶ return score
│
no
▼
ANTHROPIC_API_KEY set? ──yes──▶ Claude Haiku
│
success? ──yes──▶ return score
│
no
▼
inconclusive (test passes)
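The same flow as a sketch — illustrative names only, and synchronous for brevity where the real judges are async API calls:

```typescript
// Mirror of the fallback diagram above: try the primary judge, then the
// fallback, then give up and report inconclusive. Not the library's code.
function judgeWithFallback(
  primary: () => number,
  fallback?: () => number,
): number | null {
  try {
    return primary(); // GPT-5.4-mini in the real flow
  } catch {
    if (!fallback) return null; // no ANTHROPIC_API_KEY → inconclusive
    try {
      return fallback(); // Claude Haiku takes over
    } catch {
      return null; // both judges failed → inconclusive (test passes)
    }
  }
}
```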
What to test next
You've seen how five matchers can catch issues that traditional assertions miss. Here are some ideas for your own test suite:
- RAG pipelines: Use toBeGroundedIn with your retrieved documents as context. This is the single highest-value assertion for any retrieval-augmented generation system.
- Customer-facing bots: Combine toBeFreeOfPII + toMatchTone for safety and brand compliance. Two matchers, one test, two failure modes caught.
- Content generation: Use toBeFormatCompliant to enforce structured templates. Especially useful for outputs that feed downstream systems expecting specific formats.
- Multilingual features: Use toSemanticMatch to validate translations and summaries. A back-translation pattern (translate, then translate back, then compare to original) works surprisingly well as a quality signal.
- Regression testing: Run the same assertions across prompt versions to see if score distributions shift. The dashboard reporter makes this visual.
All five matchers support .not negation — useful when you want to assert that creative output is not grounded in a template, or that a response does contain specific user details.
The package is MIT-licensed and free to use. Check out the documentation, browse the source on GitHub, or install it now:
pnpm add -D @llmassert/playwright
Built by the LLMAssert team. Star us on GitHub if this was helpful!