Reward-Aware Trajectory Shaping for Few-step Visual Generation

arXiv cs.CV / 4/17/2026


Key Points

  • The paper addresses how to generate high-fidelity visuals with extremely few sampling steps, arguing that standard distillation methods cap student performance by forcing imitation of a stronger teacher.
  • It proposes Reward-Aware Trajectory Shaping (RATS), which aligns teacher and student latent denoising trajectories at key stages using horizon matching.
  • RATS introduces a reward-aware gate that dynamically modulates teacher guidance based on relative reward performance, tightening guidance when the teacher scores higher and relaxing it once the student catches up (a toy sketch of such a gate follows this list).
  • By combining trajectory distillation, reward-aware gating, and preference alignment, RATS aims to transfer preference-relevant knowledge from high-step generators without adding test-time compute.
  • Experiments reportedly show RATS improves the efficiency–quality trade-off for few-step visual generation, substantially reducing the quality gap between few-step students and stronger multi-step generators.
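
The summary above describes the gate only qualitatively, and no code accompanies this write-up. As a rough illustration, here is a minimal Python (PyTorch) sketch of one way such a gate could be parameterized, assuming it is a per-sample scalar weight derived from the teacher/student reward margin; the function name `reward_aware_gate` and the sigmoid-with-temperature form are illustrative guesses, not the authors' implementation.

```python
import torch

def reward_aware_gate(r_teacher: torch.Tensor,
                      r_student: torch.Tensor,
                      temperature: float = 0.1) -> torch.Tensor:
    """Hypothetical gate: maps the teacher-student reward margin to a
    weight in (0, 1) that scales the trajectory-matching loss."""
    margin = r_teacher - r_student  # positive when the teacher scores higher
    # The sigmoid saturates toward 1 (tight teacher guidance) for large
    # positive margins and decays toward 0 (relaxed guidance) once the
    # student matches or surpasses the teacher.
    return torch.sigmoid(margin / temperature).detach()  # no grad through the gate
```

Under this parameterization, a temperature near zero recovers a hard switch (imitate the teacher only while it scores higher), while larger values interpolate smoothly between tight and relaxed guidance.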

Abstract

Achieving high-fidelity generation in extremely few sampling steps has long been a central goal of generative modeling. Existing approaches largely rely on distillation-based frameworks to compress the original multi-step denoising process into a few-step generator. However, such methods inherently constrain the student to imitate a stronger multi-step teacher, imposing the teacher as an upper bound on student performance. We argue that introducing **preference alignment awareness** enables the student to optimize toward reward-preferred generation quality, potentially surpassing the teacher instead of being restricted to rigid teacher imitation. To this end, we propose **Reward-Aware Trajectory Shaping (RATS)**, a lightweight framework for preference-aligned few-step generation. Specifically, teacher and student latent trajectories are aligned at key denoising stages through horizon matching, while a **reward-aware gate** is introduced to adaptively regulate teacher guidance based on the relative reward performance of teacher and student. Trajectory shaping is strengthened when the teacher achieves higher rewards, and relaxed when the student matches or surpasses the teacher, thereby enabling continued reward-driven improvement. By seamlessly integrating trajectory distillation, reward-aware gating, and preference alignment, RATS effectively transfers preference-relevant knowledge from high-step generators without incurring additional test-time computational overhead. Experimental results demonstrate that RATS substantially improves the efficiency–quality trade-off in few-step visual generation, significantly narrowing the gap between few-step students and stronger multi-step generators.
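
To make the abstract's pipeline concrete, here is a hedged PyTorch-style sketch of how one training step might combine horizon matching, the reward-aware gate, and a preference (reward-maximization) term. Everything here is an assumption for exposition: the `rollout`/`generate` methods, the `reward_model` callable, the horizon set `HORIZONS`, and the weights `LAMBDA_REWARD` and `GATE_TEMP` are hypothetical names, not a published API.

```python
import torch
import torch.nn.functional as F

LAMBDA_REWARD = 1.0           # assumed weight on the preference-alignment term
HORIZONS = (0.25, 0.5, 0.75)  # assumed key denoising stages for horizon matching
GATE_TEMP = 0.1               # assumed gate temperature

def rats_step(student, teacher, reward_model, noise, prompt_emb):
    """One hypothetical RATS training step: reward-gated trajectory
    matching at a few denoising horizons plus a reward term."""
    # Horizon matching: align student and teacher latents at key stages.
    traj_loss = 0.0
    for t in HORIZONS:
        z_s = student.rollout(noise, prompt_emb, horizon=t)  # illustrative API
        with torch.no_grad():
            z_t = teacher.rollout(noise, prompt_emb, horizon=t)
        traj_loss = traj_loss + F.mse_loss(z_s, z_t)

    # Score final decoded samples from both models.
    x_s = student.generate(noise, prompt_emb)
    r_s = reward_model(x_s)  # differentiable w.r.t. the student
    with torch.no_grad():
        x_t = teacher.generate(noise, prompt_emb)
        r_t = reward_model(x_t)
        # Reward-aware gate (same form as the sketch above): tight imitation
        # while the teacher scores higher, relaxed once the student catches up.
        gate = torch.sigmoid((r_t - r_s) / GATE_TEMP)

    reward_loss = -r_s.mean()  # preference alignment: push student rewards up
    return gate.mean() * traj_loss + LAMBDA_REWARD * reward_loss
```

Because the gate is computed without gradients, it acts purely as a schedule on how strongly the student imitates the teacher during training; none of this machinery runs at sampling time, consistent with the abstract's claim of no additional test-time compute.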