Not all tokens contribute equally to diffusion learning

arXiv cs.CV / 4/9/2026


Key Points

  • The paper finds that conditional diffusion models for text-to-video can ignore semantically important tokens during inference, especially under classifier-free guidance, resulting in biased or incomplete generations.
  • It attributes the problem to two drivers: distributional bias from the long-tailed token-frequency distribution in the training data, and spatial misalignment in cross-attention, where informative tokens are overshadowed by less meaningful ones.
  • To fix this, the authors propose DARE (Distribution-Aware Rectification and Spatial Ensemble), which combines Distribution-Rectified Classifier-Free Guidance (DR-CFG) to debias token contributions with Spatial Representation Alignment (SRA) to reweight and align cross-attention according to token importance.
  • Experiments across multiple benchmark datasets show DARE improves both generation fidelity and semantic alignment, outperforming existing methods.
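The key points above mention classifier-free guidance and debiasing token contributions skewed by long-tailed token frequencies. The sketch below shows the standard classifier-free guidance update together with a generic inverse-frequency token-reweighting scheme; the `alpha` exponent and the `frequency_debias_weights` helper are illustrative assumptions, not the paper's actual DR-CFG formulation, which is not detailed in this summary.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale=7.5):
    """Standard classifier-free guidance: extrapolate from the
    unconditional noise prediction toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def frequency_debias_weights(token_counts, alpha=0.5):
    """Hypothetical long-tail debiasing: down-weight frequent tokens so
    rare but informative tokens contribute more to the conditioning.
    (Illustrative only; DR-CFG's exact rule is not given here.)"""
    counts = np.asarray(token_counts, dtype=float)
    w = counts ** (-alpha)         # inverse-frequency weighting
    return w / w.sum() * len(w)    # normalize to mean 1
```

With `scale=1.0` the guidance reduces to the plain conditional prediction, which is a quick sanity check on the formula.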

Abstract

With the rapid development of conditional diffusion models, significant progress has been made in text-to-video generation. However, we observe that these models often neglect semantically important tokens during inference, leading to biased or incomplete generations under classifier-free guidance. We attribute this issue to two key factors: distributional bias caused by the long-tailed token frequency in training data, and spatial misalignment in cross-attention where semantically important tokens are overshadowed by less informative ones. To address these issues, we propose Distribution-Aware Rectification and Spatial Ensemble (DARE), a unified framework that improves semantic guidance in diffusion models from the perspectives of distributional debiasing and spatial consistency. First, we introduce Distribution-Rectified Classifier-Free Guidance (DR-CFG), which regularizes the training process by dynamically suppressing dominant tokens with low semantic density, encouraging the model to better capture underrepresented semantic cues and learn a more balanced conditional distribution. This design mitigates the risk of the model distribution overfitting to tokens with low semantic density. Second, we propose Spatial Representation Alignment (SRA), which adaptively reweights cross-attention maps according to token importance and enforces representation consistency, enabling semantically important tokens to exert stronger spatial guidance during generation. This mechanism effectively prevents low semantic-density tokens from dominating the attention allocation, thereby avoiding the dilution of the spatial and distributional guidance provided by high semantic-density tokens. Extensive experiments on multiple benchmark datasets demonstrate that DARE consistently improves generation fidelity and semantic alignment, achieving significant gains over existing approaches.
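The abstract's SRA mechanism reweights cross-attention maps by token importance so that semantically dense tokens exert stronger spatial guidance. A minimal way to express that idea is to bias the attention logits with a per-token importance score before the softmax, as sketched below. The `importance` vector and the log-bias form are assumptions for illustration; the paper's SRA additionally enforces representation consistency, which is omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def importance_reweighted_attention(Q, K, V, importance):
    """Scaled dot-product cross-attention whose logits are biased by a
    per-token importance score, so important text tokens receive more
    attention mass from spatial queries. (Illustrative sketch, not the
    paper's exact SRA.)"""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)          # (n_query, n_token)
    logits = logits + np.log(importance)   # bias toward important tokens
    attn = softmax(logits, axis=-1)        # rows sum to 1
    return attn @ V, attn
```

When the raw logits are uniform, the attention a query assigns to each token becomes proportional to that token's importance score, which is exactly the "stronger spatial guidance for important tokens" behavior the abstract describes.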