Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
arXiv cs.LG / 4/16/2026
Key Points
- The paper argues that diffusion language models’ non-determinism is underestimated when evaluated using only dataset-level, fixed-configuration metrics, because aggregating scores over the whole dataset masks input-level instability.
- It proposes and performs a fine-grained evaluation that measures sample-level prediction differences across both model factors (e.g., guidance scale, diffusion steps, Monte Carlo sampling) and system factors (e.g., batch size, hardware, numerical precision); a minimal sketch of the sample-level view appears after this list.
- The results show that non-determinism in DLMs is pervasive and structured, with code generation much more sensitive to evaluation-factor choices than question answering.
- To attribute observed non-determinism to its sources, the authors introduce Factor Variance Attribution (FVA), which decomposes the observed variance across different evaluation-factor settings; a simplified decomposition sketch follows below.
- Overall, the study concludes that reliable non-determinism assessment for diffusion LMs requires factor-aware, fine-grained evaluation rather than relying on aggregate dataset-level scores.
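To make the dataset-level vs. sample-level contrast concrete, here is a minimal sketch (not code from the paper; the function names and data layout are illustrative) of why two runs can report identical aggregate accuracy while disagreeing on many individual inputs:

```python
# Minimal sketch of sample-level (per-input) instability measurement,
# contrasted with comparing only dataset-level accuracy across runs.

def dataset_level_gap(runs: list[list[bool]]) -> float:
    """Spread of aggregate accuracy across runs: the coarse, dataset-level view."""
    accs = [sum(run) / len(run) for run in runs]
    return max(accs) - min(accs)

def sample_level_flip_rate(runs: list[list[bool]]) -> float:
    """Fraction of inputs whose correctness differs across runs.

    Aggregate accuracy can be stable while many individual inputs flip;
    this per-input statistic exposes that masked instability.
    """
    n_inputs = len(runs[0])
    flips = sum(
        1 for i in range(n_inputs)
        if len({run[i] for run in runs}) > 1
    )
    return flips / n_inputs

# Both runs score 50% accuracy (dataset-level gap = 0.0),
# yet every single input flips between them (flip rate = 1.0).
runs = [
    [True, False, True, False],
    [False, True, False, True],
]
print(dataset_level_gap(runs))       # 0.0
print(sample_level_flip_rate(runs))  # 1.0
```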
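The summary does not reproduce the paper's exact FVA formulation; as a hedged illustration only, a simple between-group variance share per factor (in the spirit of one-way ANOVA) captures the idea of attributing observed variance to an evaluation factor:

```python
# Hedged sketch of variance attribution to one evaluation factor, in the
# spirit of the paper's Factor Variance Attribution (FVA). This assumed
# formulation is a plain between-group variance share, not the paper's.
from statistics import mean, pvariance

def factor_variance_share(scores_by_setting: dict[str, list[float]]) -> float:
    """Fraction of total score variance explained by varying one factor.

    scores_by_setting maps each setting of a factor (e.g., a batch size or
    precision choice) to the per-sample scores observed under that setting.
    """
    all_scores = [s for scores in scores_by_setting.values() for s in scores]
    total_var = pvariance(all_scores)
    if total_var == 0:
        return 0.0
    grand_mean = mean(all_scores)
    # Between-setting variance: how far each setting's mean score sits from
    # the grand mean, weighted by the number of samples per setting.
    between = sum(
        len(scores) * (mean(scores) - grand_mean) ** 2
        for scores in scores_by_setting.values()
    ) / len(all_scores)
    return between / total_var

# Hypothetical example: scores under two numerical-precision settings.
print(factor_variance_share({
    "fp16": [0.90, 0.80, 0.85],
    "fp32": [0.60, 0.55, 0.65],
}))  # ~0.90: most variance attributable to the precision factor
```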