Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents

arXiv cs.AI / 4/28/2026


Key Points

  • The paper introduces PSA-Eval, a failure-centered framework for runtime evaluation of deployed trilingual public-space agents, arguing that analysis should focus on failures rather than only input-output scores.
  • PSA-Eval extends a conventional Question→Answer→Score pipeline into an evaluation workflow that tracks Question→Batch→Run→Score→Failure Case→Repair→Regression Batch, enabling failures to be traced, reviewed, repaired, and regression-tested (see the data-model sketch after this list).
  • It uses trilingual equivalent inputs as controlled probes to detect group-level cross-language policy drift in real deployments.
  • A pilot study on a deployed trilingual digital front-desk system (81 samples across 27 question groups) found a high average score (23.15/24) but also measurable cross-language score drift, with a maximum drift of 9 points.
  • The results suggest that failure-centered runtime evaluation can reveal structured deployment issues that may be obscured by aggregate scoring metrics.
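
Read as a data flow, the extended workflow in the second bullet maps naturally onto a handful of record types. The following is a minimal sketch under assumed names (none of the classes, fields, or thresholds below come from the paper) of how runs, failure cases, and regression batches could be kept traceable:

```python
from dataclasses import dataclass, field

# Illustrative data model for the PSA-Eval workflow stages
# (Question -> Batch -> Run -> Score -> Failure Case -> Repair -> Regression Batch).
# All class and field names here are assumptions for this sketch, not the paper's schema.

@dataclass
class Question:
    group_id: str   # trilingual equivalent question group, e.g. "G07"
    language: str   # one of the three deployment languages
    text: str

@dataclass
class Run:
    question: Question
    answer: str     # response captured from the deployed agent at runtime
    score: float    # rubric score, e.g. out of 24

@dataclass
class FailureCase:
    run: Run
    diagnosis: str  # reviewer's note on what went wrong
    repaired: bool = False

@dataclass
class Batch:
    runs: list[Run] = field(default_factory=list)

    def failures(self, threshold: float) -> list[FailureCase]:
        """Promote low-scoring runs to traceable failure cases."""
        return [FailureCase(r, "score below threshold")
                for r in self.runs if r.score < threshold]

@dataclass
class RegressionBatch(Batch):
    """Re-runs the questions behind repaired failure cases to check the fixes hold."""
    source_failures: list[FailureCase] = field(default_factory=list)
```

Keeping a regression batch linked back to the failure cases it re-tests is what makes a repair "regression-testable" in the paper's sense, rather than a one-off patch.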

Abstract

This paper presents PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. The central claim is that, when the evaluation object shifts from a static input-output mapping to a runtime system, the basic unit of analysis should shift from score to failure. PSA-Eval extends the conventional chain Question → Answer → Score → End into Question → Batch → Run → Score → Failure Case → Repair → Regression Batch, making failures traceable, reviewable, repairable, and regression-testable. The framework uses trilingual equivalent inputs as controlled probes for observing group-level cross-language policy drift. We conduct a pilot study on a real trilingual digital front-desk system deployed in the lobby of an international financial institution. The pilot uses a simplified single-foundation-model setting (M_A = M_B), so the observed drift should not be interpreted as an A/B foundation-model difference. The study contains 81 samples organized into 27 trilingual equivalent question groups. Although the system achieves an average score of 23.15/24, 14 groups show non-zero cross-language score drift, 5 groups show drift of at least 3 points, and the maximum drift reaches 9 points. These results provide initial evidence that failure-centered runtime evaluation can expose structured deployment signals hidden by aggregate scoring.
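
The drift figures at the end of the abstract (14 groups with non-zero drift, 5 groups at 3 points or more, a 9-point maximum) can be recomputed from per-run scores if drift is taken as the gap between the highest and lowest score within a trilingual group. That definition, the input layout, and the 3-point threshold in the sketch below are assumptions for illustration, not details confirmed by the paper:

```python
# Minimal sketch of group-level cross-language drift statistics, assuming
# drift(group) = max(score) - min(score) over the three language variants.
# The data layout and the 3-point threshold are assumptions, not taken from the paper.

def drift_report(scores_by_group: dict[str, dict[str, float]],
                 threshold: float = 3.0) -> dict[str, float]:
    """scores_by_group maps group_id -> {language: score}."""
    drifts = {g: max(s.values()) - min(s.values())
              for g, s in scores_by_group.items()}
    return {
        "groups": len(drifts),
        "nonzero_drift": sum(d > 0 for d in drifts.values()),
        "drift_at_least_threshold": sum(d >= threshold for d in drifts.values()),
        "max_drift": max(drifts.values(), default=0.0),
    }

# Hypothetical example: a group scored 24 / 23 / 15 across its three language
# variants contributes a 9-point drift, matching the maximum reported in the study.
print(drift_report({"G01": {"lang_a": 24, "lang_b": 23, "lang_c": 15}}))
```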