On the Proper Treatment of Units in Surprisal Theory

arXiv cs.CL / 5/1/2026



Key Points

Surprisal theory connects human language processing effort to how predictable upcoming linguistic units are, but the concept of a “unit” is often treated too vaguely in empirical studies.
The paper highlights a mismatch: experiments typically segment stimuli into linguistic units (e.g., words), while pretrained language models distribute probability over a fixed token alphabet that may not correspond to those units.
It argues that many surprisal-based predictors rely on ad hoc procedures that conflate two distinct decisions: what the unit of analysis is, and which regions of interest the predictions are evaluated over.
The authors propose a unified framework that separates these choices and supports surprisal reasoning over arbitrary unit inventories, treating tokenization as an implementation detail.
Overall, the work calls for making unit-definition and evaluation-region choices explicit in surprisal-based analyses rather than implicitly assuming them.
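
The mismatch described in the key points can be illustrated with a minimal Python sketch. It is not the authors' framework: it shows the common practice of summing token surprisals within a word, with the two choices (unit segmentation and region of interest) kept as separate, explicit steps. The token probabilities, the leading-space word-boundary convention, and all function names are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Tuple

# Hypothetical toy data: (token, conditional probability given the preceding tokens).
# A real analysis would obtain these probabilities from a pretrained language model.
TOKENS: List[Tuple[str, float]] = [
    ("The", 0.20),
    (" un", 0.05),
    ("believ", 0.50),
    ("able", 0.90),
    (" story", 0.10),
]


def token_surprisals(tokens: List[Tuple[str, float]]) -> List[float]:
    """Surprisal in bits for each token: -log2 p(token | preceding tokens)."""
    return [-math.log2(p) for _, p in tokens]


def word_spans(tokens: List[Tuple[str, float]]) -> List[List[int]]:
    """Choice 1 (unit of analysis): group token indices into words.

    Assumed convention: a token that starts with a space opens a new word.
    """
    spans: List[List[int]] = []
    for i, (tok, _) in enumerate(tokens):
        if i == 0 or tok.startswith(" "):
            spans.append([i])
        else:
            spans[-1].append(i)
    return spans


def unit_surprisals(tokens: List[Tuple[str, float]], spans: List[List[int]]) -> List[float]:
    """Sum token surprisals within each unit (chain rule over the unit's tokens)."""
    s = token_surprisals(tokens)
    return [sum(s[i] for i in span) for span in spans]


def region_surprisal(unit_s: List[float], region: List[int]) -> float:
    """Choice 2 (region of interest): total surprisal over the selected unit indices."""
    return sum(unit_s[i] for i in region)


if __name__ == "__main__":
    spans = word_spans(TOKENS)
    per_word = unit_surprisals(TOKENS, spans)
    for span, s in zip(spans, per_word):
        word = "".join(TOKENS[i][0] for i in span).strip()
        print(f"{word}: {s:.3f} bits")
    print(f"region [1, 2]: {region_surprisal(per_word, [1, 2]):.3f} bits")
```

Swapping in a different `word_spans` function (for example, one that segments into morphemes or characters) changes the unit inventory without touching the region-of-interest step, which is the kind of separation the key points describe.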

Abstract

Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental stimuli are segmented into linguistically motivated units (e.g., words), while pretrained language models assign probability mass to a fixed token alphabet that typically does not align with those units. As a result, surprisal-based predictors depend implicitly on ad hoc procedures that conflate two distinct modeling choices: the definition of the unit of analysis and the choice of regions of interest over which predictions are evaluated. In this paper, we disentangle these choices and give a unified framework for reasoning about surprisal over arbitrary unit inventories. We argue that surprisal-based analyses should make these choices explicit and treat tokenization as an implementation detail rather than a scientific primitive.
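
For reference, the standard definition of surprisal and its chain-rule decomposition over tokens can be written as follows. The notation is an assumption made for this sketch and is not taken from the paper: u_t is the t-th unit, x_1 ... x_k are the tokens that spell it out, and c is the preceding context.

```latex
% Surprisal of the t-th unit given the preceding units
s(u_t) = -\log p(u_t \mid u_{<t})

% Chain rule: surprisal of a token string x_1 \dots x_k following context c
-\log p(x_1 \dots x_k \mid c) = \sum_{i=1}^{k} -\log p(x_i \mid c, x_{<i})
```

The second line gives the surprisal of the token string as a continuation of the context. Equating it with the surprisal of the unit itself requires a further assumption about how token strings map to units, for example where the unit boundary is marked, which is one of the choices the abstract asks analyses to state explicitly.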