YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal

arXiv cs.CV / 5/1/2026


Key Points

  • The paper proposes YOSE, an efficient fine-tuning framework for DiT-based video object removal that targets high inference latency in mask-based editing.
  • YOSE uses Batch Variable-length Indexing (BVI) to adaptively select only the essential spatiotemporal tokens indicated by the mask, enabling variable-length token processing per sample.
  • It also introduces a Diffusion Process Simulator (DiffSim) that approximates how unmasked regions affect DiT self-attention, preserving semantic consistency for masked areas.
  • Experiments show mask-aware acceleration, where inference time scales roughly linearly with the size of the masked region, achieving up to a 2.5× speedup in 70% of cases while maintaining visual quality comparable to the baseline.
  • The authors provide an open-source implementation via the linked GitHub repository.
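The core idea behind BVI in the key points above is that only the tokens covered by the mask need to flow through the DiT. The sketch below illustrates this selection step in plain NumPy under stated assumptions: the function name, shapes, and the use of simple boolean indexing are illustrative, not the paper's actual differentiable, batched operator.

```python
import numpy as np

def select_essential_tokens(tokens, mask):
    """Hypothetical sketch of mask-based token selection in the spirit of BVI.

    tokens: (num_tokens, dim) array of spatiotemporal token features
    mask:   (num_tokens,) boolean array marking the region to be edited

    Returns only the masked ("essential") tokens, so downstream compute
    scales with the mask size rather than the full token grid.
    """
    return tokens[mask]

rng = np.random.default_rng(0)
tokens_a = rng.standard_normal((8, 4))
tokens_b = rng.standard_normal((8, 4))
# Two samples with different mask sizes yield variable-length token sets.
mask_a = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)  # 2 essential tokens
mask_b = np.array([1, 1, 1, 1, 1, 0, 0, 0], dtype=bool)  # 5 essential tokens
print(select_essential_tokens(tokens_a, mask_a).shape)  # (8, 4) -> (2, 4)
print(select_essential_tokens(tokens_b, mask_b).shape)  # (8, 4) -> (5, 4)
```

In this toy setting the per-sample output length differs (2 vs. 5 tokens), which is the "variable-length token processing across samples" the bullet describes; the real BVI must additionally keep this selection differentiable and batched.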

Abstract

Recent advances in Diffusion Transformer (DiT)-based video generation technologies have shown impressive results for video object removal. However, these methods still suffer from substantial inference latency. For instance, although MiniMax Remover achieves state-of-the-art visual quality, it operates at only around 10 FPS, primarily due to dense computations over the entire spatiotemporal token space, even when only a small masked region actually requires processing. In this paper, we present YOSE (You Only Select Essential Tokens), an efficient fine-tuning framework. YOSE introduces two key components: Batch Variable-length Indexing (BVI) and the Diffusion Process Simulator (DiffSim) module. BVI is a differentiable dynamic indexing operator that adaptively selects essential tokens based on mask information, enabling variable-length token processing across samples. DiffSim provides a diffusion process approximation mechanism for unmasked tokens, which simulates the influence of unmasked regions within DiT self-attention to maintain semantic consistency for masked tokens. With these designs, YOSE achieves mask-aware acceleration, where the inference time scales approximately linearly with the masked regions, in contrast to full-token diffusion methods whose computation remains constant regardless of the mask size. Extensive experiments demonstrate that YOSE achieves up to a 2.5× speedup in 70% of cases while maintaining visual quality comparable to the baseline. Code is available at: https://github.com/Wucy0519/YOSE-CVPR26.
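The abstract says DiffSim "simulates the influence of unmasked regions within DiT self-attention" so that masked tokens stay semantically consistent with their surroundings. One way to picture this, as a minimal sketch and not the paper's actual mechanism, is single-head attention where masked-token queries attend over the masked tokens plus a cheap cached approximation of the unmasked tokens, rather than recomputing the full unmasked branch. All names, shapes, and the shared key/value simplification below are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention_with_cache(q_masked, kv_masked, kv_unmasked_cache):
    """Toy single-head attention: masked-token queries attend over masked
    tokens concatenated with a cached approximation of the unmasked region.
    Keys and values are shared here purely for brevity."""
    kv = np.concatenate([kv_masked, kv_unmasked_cache], axis=0)
    scores = q_masked @ kv.T / np.sqrt(q_masked.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ kv

rng = np.random.default_rng(0)
q = rng.standard_normal((3, 4))         # queries for 3 masked tokens
kv_masked = rng.standard_normal((3, 4))  # masked-token features
kv_cache = rng.standard_normal((5, 4))   # approximated unmasked context
out = masked_attention_with_cache(q, kv_masked, kv_cache)
print(out.shape)  # (3, 4): one updated feature per masked token
```

The point of the sketch is the cost structure: only the masked queries are processed per step, while the unmasked context enters as a fixed approximation, which is consistent with the claim that inference time scales with the masked region rather than the full frame.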