Criterion Validity of LLM-as-Judge for Business Outcomes in Conversational Commerce

arXiv cs.AI / 4/2/2026


Key Points

  • The paper tests whether LLM-as-Judge dialogue rubric scores for conversational commerce have criterion validity by correlating a 7-dimension, rubric-based evaluation with verified downstream conversion on a Chinese matchmaking platform.
  • It finds dimension-level heterogeneity: Need Elicitation and Pacing Strategy are significantly associated with conversion (after Bonferroni correction), while Contextual Memory shows no detectable association.
  • The study shows an “equal-weighted composite dilution” effect, where a uniform composite score underperforms the strongest dimensions, and conversion-informed reweighting partially corrects the problem.
  • Logistic regression controlling for conversation length confirms that the Pacing Strategy association is not explained by length confounding (OR=3.18, p=0.006).
  • A prior pilot that mixed human and AI conversations produced a misleading evaluation–outcome paradox; the authors trace it to an agent-type confound and probe the underlying mechanism through a Trust-Funnel behavioral analysis.

Abstract

Multi-dimensional rubric-based dialogue evaluation is widely used to assess conversational AI, yet its criterion validity -- whether quality scores are associated with the downstream outcomes they are meant to serve -- remains largely untested. We address this gap through a two-phase study on a major Chinese matchmaking platform, testing a 7-dimension evaluation rubric (implemented via LLM-as-Judge) against verified business conversion. Our findings concern rubric design and weighting, not LLM scoring accuracy: any judge using the same rubric would face the same structural issue. The core finding is dimension-level heterogeneity: in Phase 2 (n=60 human conversations, stratified sample, verified labels), Need Elicitation (D1: rho=0.368, p=0.004) and Pacing Strategy (D3: rho=0.354, p=0.006) are significantly associated with conversion after Bonferroni correction, while Contextual Memory (D5: rho=0.018, n.s.) shows no detectable association. This heterogeneity causes the equal-weighted composite (rho=0.272) to underperform its best dimensions -- a composite dilution effect that conversion-informed reweighting partially corrects (rho=0.351). Logistic regression controlling for conversation length confirms D3's association strengthens (OR=3.18, p=0.006), ruling out a length confound. An initial pilot (n=14) mixing human and AI conversations had produced a misleading "evaluation-outcome paradox," which Phase 2 revealed as an agent-type confound artifact. Behavioral analysis of 130 conversations through a Trust-Funnel framework identifies a candidate mechanism: AI agents execute sales behaviors without building user trust. We operationalize these findings in a three-layer evaluation architecture and advocate criterion validity testing as standard practice in applied dialogue evaluation.
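The length-confound check in the abstract (logistic regression of conversion on the D3 score while controlling for conversation length, reporting an odds ratio for D3) can be sketched as follows. This is a minimal plain-gradient-ascent fit on synthetic data; the OR=3.18 figure comes from the paper's own data and is not reproduced here, and all variable names are illustrative assumptions.

```python
import math
import random

def fit_logistic(X, y, lr=0.1, steps=5000):
    """Logistic regression by batch gradient ascent on the log-likelihood.

    X: list of feature rows (no intercept column); y: list of 0/1 labels.
    Returns weights [intercept, w_1, ..., w_p].
    """
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            pred = 1 / (1 + math.exp(-z))
            err = yi - pred          # gradient of the Bernoulli log-likelihood
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w

if __name__ == "__main__":
    random.seed(1)
    n = 60
    conv = [1 if random.random() < 0.4 else 0 for _ in range(n)]
    # Length correlates with both the D3 score and conversion (the suspected confound)
    length = [20 + 10 * c + random.gauss(0, 5) for c in conv]
    d3 = [3 + 0.5 * c + 0.02 * l + random.gauss(0, 0.5) for c, l in zip(conv, length)]
    X = [[d, l] for d, l in zip(d3, length)]    # D3 score + length as covariates
    w = fit_logistic(X, conv)
    print("OR for D3, length-adjusted:", round(math.exp(w[1]), 2))
```

Exponentiating the fitted D3 coefficient gives its odds ratio with length held fixed; if the D3-conversion association were purely a length artifact, that adjusted OR would collapse toward 1, whereas the paper finds it strengthens.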
