Predicting States of Understanding in Explanatory Interactions Using Cognitive Load-Related Linguistic Cues

arXiv cs.CL / 23 March 2026


Key Points

  • The paper investigates how cognitive-load-related linguistic cues—surprisal, syntactic complexity, and listener gaze variation—relate to a listener's moment-by-moment understanding in explanatory dialogue.
  • It analyzes the MUNDEX corpus with self-annotated listener states (Understanding, Partial Understanding, Non-Understanding, Misunderstanding) via retrospective video recall.
  • A classification study using two off-the-shelf classifiers and a fine-tuned German BERT-based multimodal classifier demonstrates that four-state understanding can be predicted, with improvements when combining linguistic cues with textual features.
  • The results indicate that each cue contributes differently to the listener's state and that integrating multiple cues yields better predictive performance, suggesting potential for real-time adaptation in educational or conversational systems.

Abstract

We investigate how verbal and nonverbal linguistic features, exhibited by speakers and listeners in dialogue, can contribute to predicting the listener's state of understanding in explanatory interactions on a moment-by-moment basis. Specifically, we examine three linguistic cues related to cognitive load and hypothesised to correlate with listener understanding: the information value (operationalised as surprisal) and syntactic complexity of the speaker's utterances, and the variation in the listener's interactive gaze behaviour. We draw on the MUNDEX corpus of face-to-face dialogic board game explanations, in which listener states ('Understanding', 'Partial Understanding', 'Non-Understanding', and 'Misunderstanding') were self-annotated by the listeners using a retrospective video-recall method. Statistical analyses show that the individual cues vary with the listener's level of understanding. A subsequent classification experiment, involving two off-the-shelf classifiers and a fine-tuned German BERT-based multimodal classifier, demonstrates that predicting these four states of understanding is generally possible and that performance improves when the three linguistic cues are considered alongside textual features.
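To illustrate the surprisal cue mentioned in the abstract: surprisal operationalises a word's information value as the negative log-probability of that word given its preceding context, so less predictable words carry higher values. The sketch below computes per-word surprisal from a toy bigram model; the probabilities and sentence are invented for illustration and are not the paper's model, corpus, or pipeline (the authors would typically obtain such probabilities from a language model).

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: -log2 P(word | context)."""
    return -math.log2(prob)

# Toy bigram probabilities (hypothetical values, for illustration only).
bigram_p = {
    ("the", "game"): 0.20,
    ("game", "board"): 0.05,
    ("board", "shows"): 0.10,
}

tokens = ["the", "game", "board", "shows"]
# Surprisal of each word given the immediately preceding word.
per_word = [surprisal(bigram_p[(w1, w2)]) for w1, w2 in zip(tokens, tokens[1:])]
mean_surprisal = sum(per_word) / len(per_word)

print([round(s, 2) for s in per_word])  # → [2.32, 4.32, 3.32]
print(round(mean_surprisal, 2))         # → 3.32
```

A low-probability continuation ("board" at P=0.05) yields the highest surprisal, which is the intuition behind using the cue as a proxy for the listener's processing load.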