On the Proper Treatment of Units in Surprisal Theory
arXiv cs.CL / 5/1/2026
Key Points
- Surprisal theory links human language-processing effort to the predictability of upcoming linguistic units: a unit u with probability p(u | context) carries surprisal −log p(u | context), and less predictable units are predicted to require more processing effort. Empirical studies, however, often leave the notion of a “unit” underspecified.
- The paper highlights a mismatch: experiments typically segment stimuli into linguistic units (e.g., words), while pretrained language models assign probability over a fixed subword-token alphabet that need not align with those units (the first sketch after this list makes the regrouping concrete).
- It argues that many surprisal-based predictors rely on ad hoc procedures that conflate two distinct decisions: what the unit of analysis is, and which portions of the input are evaluated as regions of interest.
- The authors propose a unified framework that separates these choices and supports surprisal reasoning over arbitrary unit inventories, treating tokenization as an implementation detail (the second sketch below illustrates the separation).
- Overall, the work calls for making unit-definition and evaluation-region choices explicit in surprisal-based analyses rather than leaving them implicit.
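
To make the token/unit mismatch concrete, here is a minimal sketch of word-level surprisal computed from subword-token surprisals. It is illustrative, not the paper's implementation: the model choice (gpt2), the whitespace-word unit inventory, and the example sentence are all assumptions. The regrouping step is licensed by the chain rule: the surprisal of a word equals the sum of the surprisals of the subword tokens that spell it out.

```python
# Illustrative sketch (not the paper's code): word surprisal as the sum
# of subword-token surprisals under a Hugging Face causal LM. The model
# ("gpt2"), the sentence, and whitespace words as units are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "The reviewers debated the proper treatment of units"
enc = tokenizer(text, return_tensors="pt")
ids = enc["input_ids"][0]

with torch.no_grad():
    logits = model(**enc).logits[0]  # (seq_len, vocab_size)
log_probs = torch.log_softmax(logits, dim=-1)

# Logits at position i predict token i + 1, so the surprisal of token i
# (for i >= 1) is -log p(t_i | t_<i). The first token gets no surprisal
# because GPT-2 prepends no BOS token by default.
token_surprisal = -log_probs[:-1].gather(1, ids[1:, None]).squeeze(1)

# Regroup token surprisals into word surprisals. word_ids() (available
# on fast tokenizers) maps each token to the pre-tokenized word it came
# from; with no punctuation in the sentence, these align with text.split().
words = text.split()
word_surprisal = [0.0] * len(words)
for pos, wid in enumerate(enc.word_ids()):
    if pos == 0:
        continue  # no left context, hence no surprisal, for the first token
    word_surprisal[wid] += token_surprisal[pos - 1].item()

# The first word's value is incomplete for the same reason, so skip it.
for word, s in list(zip(words, word_surprisal))[1:]:
    print(f"{word:>10s}  {s:6.2f} nats")
```

The printed values are in nats; dividing by ln 2 converts them to bits.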
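
Given per-unit surprisals, selecting a region of interest becomes an independent, explicit step rather than a by-product of tokenization. The continuation below is hypothetical and reuses the `words` and `word_surprisal` variables from the sketch above; picking the final word as the critical region is an arbitrary choice for illustration.

```python
# Hypothetical continuation of the sketch above: the unit inventory
# (whitespace words) was fixed earlier; the region of interest is a
# separate, explicit choice. Selecting the final word here is arbitrary.
region_of_interest = [len(words) - 1]
predictor = sum(word_surprisal[i] for i in region_of_interest)
print(f"Predictor over region {region_of_interest}: {predictor:.2f} nats")
```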