Neural Models and Language Model Prompting for the Multidimensional Evaluation of Open-Ended Conversations

arXiv cs.CL / 3/30/2026


Key Points

  • The paper addresses how to evaluate generative AI dialogue systems by predicting dialogue-level, dimension-specific scores within DSTC-12 (Track 1).
  • It tests two evaluation approaches under a small-model constraint (<13B parameters): prompting language models to act as evaluators, and training encoder-based classification/regression models.
  • Results indicate that LM prompting yields only modest correlation with human judgments but still ranks second on the test set, trailing only the baseline.
  • The smaller regression and classification models achieve strong correlations on the validation set for some dimensions, though performance drops on the test set.
  • The authors attribute part of the test-set degradation to distribution shifts in annotation score ranges across dimensions compared with train/validation data.
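Both approaches above are judged by how well their predicted dialogue-level scores correlate with human annotations. As an illustrative sketch (this helper is not from the paper; it is a self-contained, pure-Python implementation of the standard Spearman rank correlation often used in such challenges):

```python
def rank(values):
    """Assign 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(pred, human):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rp, rh = rank(pred), rank(human)
    n = len(rp)
    mp, mh = sum(rp) / n, sum(rh) / n
    cov = sum((a - mp) * (b - mh) for a, b in zip(rp, rh))
    sp = sum((a - mp) ** 2 for a in rp) ** 0.5
    sh = sum((b - mh) ** 2 for b in rh) ** 0.5
    return cov / (sp * sh)
```

A perfect evaluator would score 1.0 against the human annotations; the paper's point is that even "modest" values of this statistic can still place a system near the top of the leaderboard.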

Abstract

The growing number of generative AI-based dialogue systems has made their evaluation a crucial challenge. This paper presents our contribution to this important problem through the Dialogue System Technology Challenge (DSTC-12, Track 1), where we developed models to predict dialogue-level, dimension-specific scores. Given the constraint of using relatively small models (i.e., fewer than 13 billion parameters), our work follows two main strategies: employing Language Models (LMs) as evaluators through prompting, and training encoder-based classification and regression models. Our results show that while LM prompting achieves only modest correlations with human judgments, it still ranks second on the test set, outperformed only by the baseline. The regression and classification models, with significantly fewer parameters, demonstrate high correlation for some dimensions on the validation set. Although their performance decreases on the test set, it is important to note that for some dimensions the test-set annotations cover significantly different score ranges than the train and validation sets.
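The abstract's first strategy, using an LM as an evaluator through prompting, amounts to asking the model for a dimension-specific score on each dialogue. The paper's actual prompts are not reproduced here; the function below is a hypothetical template that merely illustrates the shape of such a query:

```python
def build_eval_prompt(dialogue: str, dimension: str,
                      low: int = 1, high: int = 5) -> str:
    """Hypothetical LM-as-evaluator prompt: ask for a single integer
    rating of a whole dialogue on one evaluation dimension.
    (Illustrative only; not the prompt used in the paper.)"""
    return (
        "You are an expert dialogue evaluator.\n"
        f"Rate the following conversation on '{dimension}' "
        f"using an integer scale from {low} (worst) to {high} (best).\n\n"
        f"Conversation:\n{dialogue}\n\n"
        "Answer with a single integer and nothing else."
    )
```

The returned string would be sent to the (sub-13B) LM, and the integer in its reply parsed as the predicted dimension score for that dialogue.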