Measuring Temporal Linguistic Emergence in Diffusion Language Models

arXiv cs.CL · April 28, 2026


Key Points

  • The paper studies diffusion language models by leveraging their explicit denoising trajectory to measure when different information types become detectable during generation.
  • Using multiple 32-step runs of LLaDA-8B-Base on masked WikiText-103, the authors derive temporal metrics including token commitment, linear recoverability of POS/coarse semantics/token identity, confidence/entropy dynamics, and sensitivity to re-masking mid-trajectory.
  • Results are consistent across random seeds: content-related categories stabilize earlier than function-heavy categories, and coarse linguistic labels remain more linearly recoverable than exact lexical identity under the probe setup.
  • The work finds that uncertainty dynamics relate to eventual correctness (tokens that will resolve incorrectly show higher uncertainty), and that sensitivity to perturbation peaks in the middle of the trajectory, largely due to local effects at the perturbed positions themselves.
  • Overall, the authors argue that “denoising time” is a meaningful analysis dimension: coarse labels are recovered earlier and more robustly than lexical identity, and intermediate states are the most sensitive to interventions in their experimental setting.
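The step-wise probing idea behind these findings can be illustrated with a small sketch. This is not the authors' code: the hidden states below are synthetic stand-ins for saved trajectory states, with a hypothetical "signal grows over denoising time" structure, and the probe is a simple least-squares one-vs-rest linear classifier rather than whatever probe the paper trains. It only shows the shape of the measurement: fit a linear probe on states at each denoising step and track held-out accuracy over steps.

```python
import numpy as np

def probe_accuracy(H, y, H_test, y_test):
    """Fit a one-vs-rest linear probe by least squares; return test accuracy.

    H: (n, d) hidden states; y: (n,) integer class labels.
    """
    n, d = H.shape
    k = int(y.max()) + 1
    Y = np.eye(k)[y]                         # one-hot targets
    Hb = np.hstack([H, np.ones((n, 1))])     # bias column -> affine probe
    W, *_ = np.linalg.lstsq(Hb, Y, rcond=None)
    Hb_test = np.hstack([H_test, np.ones((len(H_test), 1))])
    preds = (Hb_test @ W).argmax(axis=1)
    return float((preds == y_test).mean())

rng = np.random.default_rng(0)
d, n_train, n_test, steps = 16, 400, 100, 8
# Hypothetical setup: 4 label classes whose directions become more
# linearly separable at later denoising steps (signal grows with t).
centers = rng.normal(size=(4, d))
y_tr = rng.integers(0, 4, n_train)
y_te = rng.integers(0, 4, n_test)
accs = []
for t in range(steps):
    signal = t / (steps - 1)                 # 0 = pure noise, 1 = full signal
    H_tr = signal * centers[y_tr] + rng.normal(size=(n_train, d))
    H_te = signal * centers[y_te] + rng.normal(size=(n_test, d))
    accs.append(probe_accuracy(H_tr, y_tr, H_te, y_te))
print([round(a, 2) for a in accs])
```

Plotting such per-step accuracy curves separately for POS labels, coarse semantic categories, and exact token identity is what lets one say that coarse labels become linearly recoverable earlier (and plateau higher) than lexical identity.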

Abstract

Diffusion language models expose an explicit denoising trajectory, making it possible to ask when different kinds of information become measurable during generation. We study three independent 32-step runs of LLaDA-8B-Base on masked WikiText-103 text, each with 1,000 probe-training sequences and 200 held-out evaluation sequences. From saved trajectories, we derive four temporal measurements: token commitment; linear recoverability of part-of-speech (POS), coarse semantic category, and token identity; confidence and entropy dynamics; and sensitivity under mid-trajectory re-masking. Across seeds, the same ordering recurs: content categories stabilize earlier than function-heavy categories, POS and coarse semantic labels remain substantially more linearly recoverable than exact lexical identity under our probe setup, uncertainty remains higher for tokens that ultimately resolve incorrectly even though late confidence becomes less calibrated, and perturbation sensitivity peaks in the middle of the trajectory. A direct/collateral decomposition shows that this peak is overwhelmingly local to the perturbed positions themselves. In this LLaDA+WikiText setting, denoising time is therefore a useful analysis axis: under our measurements, coarse labels are recovered earlier and more robustly than lexical identity, trajectory-level uncertainty tracks eventual correctness, and mid-trajectory states are the most intervention-sensitive.
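The "uncertainty tracks eventual correctness" measurement can also be sketched. Again this is an illustrative toy, not the paper's pipeline: the per-step logits below are synthetic, with a hypothetical subset of "hard" positions whose logits sharpen more slowly over denoising, and entropy of the per-position softmax serves as the uncertainty signal.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (nats) of the softmax distribution per position."""
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

rng = np.random.default_rng(1)
vocab, positions, steps = 50, 20, 32
# Hypothetical trajectory: target logits sharpen as denoising proceeds,
# but "hard" positions (a stand-in for eventually-incorrect tokens)
# sharpen more slowly than easy ones.
hard = np.zeros(positions, dtype=bool)
hard[:6] = True
target = rng.integers(0, vocab, positions)
entropies = np.zeros((steps, positions))
for t in range(steps):
    scale = (t + 1) / steps * np.where(hard, 3.0, 8.0)
    logits = rng.normal(size=(positions, vocab))
    logits[np.arange(positions), target] += scale
    entropies[t] = token_entropy(logits)
# Compare mean late-trajectory entropy for hard vs. easy positions.
late = entropies[-8:].mean(axis=0)
print(float(late[hard].mean()), float(late[~hard].mean()))
```

In this toy, the slowly-sharpening positions retain visibly higher late-trajectory entropy, which is the qualitative pattern the paper reports for tokens that ultimately resolve incorrectly.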
