Automated Self-Testing as a Quality Gate: Evidence-Driven Release Management for LLM Applications
arXiv cs.AI / 3/18/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- The paper introduces an automated self-testing framework that enforces quality gates (PROMOTE/HOLD/ROLLBACK) across five empirically grounded dimensions (task success rate, research-context preservation, P95 latency, safety pass rate, and evidence coverage), supporting evidence-based release decisions for LLM applications.
- It demonstrates the approach with a longitudinal case study of an internally deployed multi-agent conversational AI system with marketing capabilities, spanning 38 evaluation runs across 20+ internal releases.
- Results show that the gate identified two rollback-grade builds in early runs, supported stable quality evolution over a four-week staging lifecycle, and indicated that evidence coverage is the strongest discriminator of severe regressions, with runtime scaling predictably with suite size.
- A human calibration study (n=60, two evaluators, with LLM-as-judge cross-validation) shows that the judge and the gate provide complementary coverage: the gate uncovers latency and routing issues not visible in the response text, while the judge surfaces content-quality failures, validating the multi-dimensional gate design. Supplementary pseudocode and calibration artifacts are provided for replication.
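The three-way gate described above can be sketched as a simple decision function: each dimension has a promote bar and a rollback floor, any single severe regression forces ROLLBACK, all bars met yields PROMOTE, and anything in between is held for investigation. This is a minimal sketch, not the paper's implementation; all threshold values and names (`RunMetrics`, `gate_decision`) are hypothetical.

```python
from dataclasses import dataclass

PROMOTE, HOLD, ROLLBACK = "PROMOTE", "HOLD", "ROLLBACK"

@dataclass
class RunMetrics:
    """One evaluation run over the five gate dimensions (names are illustrative)."""
    task_success_rate: float     # fraction of tasks completed correctly
    context_preservation: float  # research-context preservation score
    p95_latency_ms: float        # lower is better
    safety_pass_rate: float
    evidence_coverage: float     # the paper's primary severe-regression signal

# Hypothetical thresholds; the paper's calibrated values are not reproduced here.
PROMOTE_MIN = RunMetrics(0.90, 0.85, 2000.0, 0.98, 0.80)
ROLLBACK_MAX = RunMetrics(0.70, 0.60, 5000.0, 0.90, 0.50)

def gate_decision(m: RunMetrics) -> str:
    """Map a run's metrics to PROMOTE / HOLD / ROLLBACK."""
    def fails(value: float, bound: float, higher_is_better: bool) -> bool:
        return value < bound if higher_is_better else value > bound

    # (observed value, promote bar, rollback floor, higher-is-better)
    checks = [
        (m.task_success_rate, PROMOTE_MIN.task_success_rate, ROLLBACK_MAX.task_success_rate, True),
        (m.context_preservation, PROMOTE_MIN.context_preservation, ROLLBACK_MAX.context_preservation, True),
        (m.p95_latency_ms, PROMOTE_MIN.p95_latency_ms, ROLLBACK_MAX.p95_latency_ms, False),
        (m.safety_pass_rate, PROMOTE_MIN.safety_pass_rate, ROLLBACK_MAX.safety_pass_rate, True),
        (m.evidence_coverage, PROMOTE_MIN.evidence_coverage, ROLLBACK_MAX.evidence_coverage, True),
    ]
    if any(fails(v, rb, hib) for v, _, rb, hib in checks):
        return ROLLBACK  # any single severe regression rolls the build back
    if all(not fails(v, pm, hib) for v, pm, _, hib in checks):
        return PROMOTE   # every dimension clears the promote bar
    return HOLD          # in between: hold the release for investigation
```

A run that clears every bar promotes; dropping evidence coverage below the rollback floor alone is enough to trigger ROLLBACK, mirroring the paper's finding that this dimension discriminates severe regressions.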
