dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models
arXiv cs.AI / 3/20/2026
Key Points
- dTRPO introduces trajectory reduction techniques to cut the cost of trajectory probability calculation in diffusion LLM policy optimization, enabling scalable offline training.
- It proves that, under reference policy regularization, the probability ratio of newly unmasked tokens is an unbiased estimate of the ratio for intermediate diffusion states, and that the full trajectory probability can be estimated with a single forward pass over a re-masked final state.
- By integrating these results into a policy optimization objective (see the sketch after this list), dTRPO achieves gains on 7B dLLMs across STEM tasks (up to 9.6%), coding tasks (up to 4.3%), and instruction-following tasks (up to 3.0%).
- It also improves training efficiency, since trajectories can be evaluated offline with a single forward pass, and improves generation efficiency by producing high-quality outputs.
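
To make the mechanism above concrete, here is a minimal PyTorch sketch of how a single-forward-pass trajectory-probability estimate and a clipped, reference-regularized surrogate could fit together. This is an illustration under assumptions, not the paper's implementation: the model interface (`model(tokens) -> logits`), the function names, the tensor shapes, and the hyperparameters are invented for this example, and the KL term is a crude proxy.

```python
# Hypothetical sketch of a single-forward-pass trajectory-probability estimate
# feeding a clipped, reference-regularized policy-optimization loss.
import torch
import torch.nn.functional as F


def sequence_logprob(model, tokens, mask, mask_token_id):
    """Re-mask the chosen positions of the final state and score the original
    tokens there with one forward pass of `model`.

    tokens: (B, T) long tensor of generated final-state token ids
    mask:   (B, T) bool tensor, True at positions whose probability we estimate
    """
    remasked = tokens.masked_fill(mask, mask_token_id)   # re-masked final state
    logits = model(remasked)                             # (B, T, V), single pass (assumed interface)
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return (tok_logp * mask).sum(dim=-1)                 # sum log-probs over masked positions


def dtrpo_style_loss(policy, reference, old_logp, tokens, mask, advantages,
                     mask_token_id, clip_eps=0.2, kl_coef=0.05):
    """Clipped surrogate with reference-policy regularization, built on the
    single-forward-pass estimate above (names and constants are illustrative)."""
    new_logp = sequence_logprob(policy, tokens, mask, mask_token_id)
    with torch.no_grad():
        ref_logp = sequence_logprob(reference, tokens, mask, mask_token_id)

    ratio = torch.exp(new_logp - old_logp)               # importance ratio vs. old policy
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    kl_penalty = new_logp - ref_logp                     # crude proxy for KL to the reference policy
    return -(surrogate - kl_coef * kl_penalty).mean()
```

The point the sketch tries to convey is the trajectory reduction itself: every masked position is scored in one forward pass over the re-masked final state, so the loss never has to roll out or re-score the intermediate diffusion steps, which is what makes offline evaluation cheap.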