Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving
arXiv cs.RO / 4/14/2026
Key Points
- Multi-ORFT is introduced as a stable online reinforcement fine-tuning framework for multi-agent diffusion-based cooperative driving planners, targeting better closed-loop reliability.
- The method starts from scene-conditioned diffusion pre-training, using inter-agent self-attention, cross-attention to scene tokens, and AdaLN-Zero scene conditioning to improve the scene consistency and road adherence of the generated joint trajectories (a block-level sketch follows this list).
- For online post-training, Multi-ORFT defines a two-level MDP that exposes step-wise reverse-kernel likelihoods and pairs dense trajectory-level rewards with variance-gated group-relative policy optimization (VG-GRPO) to stabilize learning in reactive environments (see the second sketch below).
- On the WOMD (Waymo Open Motion Dataset) closed-loop benchmark, Multi-ORFT lowers the collision rate (2.04% → 1.89%) and the off-road rate (1.68% → 1.36%) while increasing average speed (8.36 → 8.61 m/s), outperforming several strong open-source diffusion planning baselines on key safety and efficiency metrics.
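
The summary names the architectural ingredients but not the authors' code. As a rough illustration only, here is a minimal PyTorch sketch of a denoiser block that combines inter-agent self-attention, cross-attention to scene tokens, and AdaLN-Zero conditioning; every name here (`MultiAgentBlock`, `d_model`, the tensor shapes) is an assumption for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MultiAgentBlock(nn.Module):
    """Hypothetical transformer block in the spirit of the summarized denoiser:
    inter-agent self-attention, cross-attention to scene tokens, and
    AdaLN-Zero conditioning (zero-initialized per-branch modulation)."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
        self.norm3 = nn.LayerNorm(d_model, elementwise_affine=False)
        # AdaLN-Zero: regress shift/scale/gate for each of the three branches
        # from the conditioning vector; zero-init makes every block an
        # identity map at the start of training.
        self.ada = nn.Linear(d_model, 9 * d_model)
        nn.init.zeros_(self.ada.weight)
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, scene, cond):
        # x:     (B, N_agents, d_model) joint agent trajectory tokens
        # scene: (B, N_scene, d_model)  encoded map/scene tokens
        # cond:  (B, d_model)           diffusion-step + scene summary embedding
        s1, b1, g1, s2, b2, g2, s3, b3, g3 = (
            self.ada(cond).unsqueeze(1).chunk(9, dim=-1)
        )
        h = self.norm1(x) * (1 + s1) + b1                 # modulated pre-norm
        x = x + g1 * self.self_attn(h, h, h)[0]           # inter-agent self-attention
        h = self.norm2(x) * (1 + s2) + b2
        x = x + g2 * self.cross_attn(h, scene, scene)[0]  # agent-to-scene cross-attention
        h = self.norm3(x) * (1 + s3) + b3
        return x + g3 * self.mlp(h)                       # gated MLP branch
```

The AdaLN-Zero convention (zero-initialized modulation, so every residual branch starts as an identity map) is what keeps deep conditioned diffusion transformers stable early in training; the paper presumably applies it per denoiser block in this spirit.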

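Likewise, a hedged sketch of the two fine-tuning ingredients the summary names: GRPO-style group-standardized advantages with a variance gate (the `var_threshold` knob is an assumption, not taken from the paper), and the Gaussian reverse-kernel log-likelihood that a diffusion policy exposes at each denoising step of the two-level MDP.

```python
import torch

def vg_grpo_advantages(rewards: torch.Tensor,
                       var_threshold: float = 1e-4,
                       eps: float = 1e-6) -> torch.Tensor:
    """Hypothetical variance-gated group-relative advantages.

    rewards: (n_scenes, group_size) trajectory-level returns for a group of
    rollouts sampled per scene. Each group is standardized as in GRPO, and
    groups whose reward variance falls below the gate are masked out so
    near-constant groups add no gradient noise."""
    mean = rewards.mean(dim=1, keepdim=True)
    var = rewards.var(dim=1, keepdim=True, unbiased=False)
    adv = (rewards - mean) / (var.sqrt() + eps)   # group-relative baseline
    gate = (var > var_threshold).float()          # variance gating
    return adv * gate

def reverse_kernel_logprob(x_prev: torch.Tensor,
                           mu: torch.Tensor,
                           sigma: torch.Tensor) -> torch.Tensor:
    """Log-likelihood of one denoising step under the Gaussian reverse
    kernel p_theta(x_{t-1} | x_t) = N(mu_theta(x_t, t), sigma_t^2 I);
    summing these step-wise terms gives the policy log-prob used by the
    two-level MDP formulation."""
    dist = torch.distributions.Normal(mu, sigma)
    return dist.log_prob(x_prev).sum(dim=tuple(range(1, x_prev.dim())))
```

Under this reading, each scene yields a group of rollouts; the gated, group-relative advantage weights the sum of step-wise reverse-kernel log-likelihoods in the policy-gradient update, and groups with near-constant rewards are masked out rather than amplified by a vanishing standard deviation. How Multi-ORFT combines these terms exactly is specified in the paper, not here.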

