Automated Self-Testing as a Quality Gate: Evidence-Driven Release Management for LLM Applications
arXiv cs.AI · March 18, 2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- The paper introduces an automated self-testing framework that enforces a quality gate with three release decisions (PROMOTE, HOLD, ROLLBACK) across five empirically grounded dimensions (task success rate, research context preservation, P95 latency, safety pass rate, and evidence coverage), supporting evidence-based release management for LLM applications.
- It demonstrates the approach with a longitudinal case study of an internally deployed multi-agent conversational AI system with marketing capabilities, spanning 38 evaluation runs across 20+ internal releases.
- Results show that the gate identified two rollback-grade builds in early runs and supported stable quality evolution over a four-week staging lifecycle. Evidence coverage emerged as the primary discriminator of severe regressions, and gate runtime scaled predictably with suite size.
- A human calibration study (n=60, two evaluators, with LLM-as-judge cross-validation) reveals complementary multi-modal coverage between the judge and the gate: the gate uncovers latency and routing issues that are not visible in response text, while the judge surfaces content-quality failures. This complementarity validates the multi-dimensional gate design. Supplementary pseudocode and calibration artifacts are provided for replication.
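The gating logic summarized above can be sketched in a few lines. The five dimension names and the three decisions come from the paper's summary; the threshold values, metric keys, and function structure below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a multi-dimensional quality gate. The five
# dimensions and the PROMOTE/HOLD/ROLLBACK decisions follow the paper's
# summary; thresholds and names here are invented for illustration.

PROMOTE, HOLD, ROLLBACK = "PROMOTE", "HOLD", "ROLLBACK"

# (metric, hold_threshold, rollback_threshold, higher_is_better)
GATES = [
    ("task_success_rate",    0.90, 0.75, True),
    ("context_preservation", 0.85, 0.70, True),
    ("p95_latency_ms",       2000, 5000, False),
    ("safety_pass_rate",     0.99, 0.95, True),
    ("evidence_coverage",    0.80, 0.60, True),
]

def gate_decision(metrics: dict) -> str:
    """Return PROMOTE, HOLD, or ROLLBACK for a candidate release."""
    decision = PROMOTE
    for name, hold_t, rollback_t, higher_better in GATES:
        value = metrics[name]
        if higher_better:
            if value < rollback_t:
                return ROLLBACK  # any severe regression fails the build outright
            if value < hold_t:
                decision = HOLD  # borderline metric: hold for human review
        else:  # lower is better (e.g. latency)
            if value > rollback_t:
                return ROLLBACK
            if value > hold_t:
                decision = HOLD
    return decision

print(gate_decision({
    "task_success_rate": 0.93,
    "context_preservation": 0.88,
    "p95_latency_ms": 1500,
    "safety_pass_rate": 0.995,
    "evidence_coverage": 0.84,
}))  # → PROMOTE (every metric clears its hold threshold)
```

The two-threshold shape (hold vs. rollback per dimension) is one simple way to realize the paper's three-valued decision; any single severe regression short-circuits to ROLLBACK, which matches the reported role of evidence coverage as the dominant severe-regression signal.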