YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal
arXiv cs.CV / 5/1/2026
Key Points
- The paper proposes YOSE, an efficient fine-tuning framework for DiT-based video object removal that targets high inference latency in mask-based editing.
- YOSE uses Batch Variable-length Indexing (BVI) to adaptively select only the essential spatiotemporal tokens indicated by the mask, enabling variable-length token processing per sample.
- It also introduces a Diffusion Process Simulator (DiffSim) that approximates how unmasked regions affect DiT self-attention, preserving semantic consistency for masked areas.
- Experiments show mask-aware acceleration in which inference time scales roughly linearly with the size of the masked region, achieving up to a 2.5× speedup in 70% of cases while maintaining comparable visual quality.
- The authors provide an open-source implementation via the linked GitHub repository.
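The core idea behind Batch Variable-length Indexing, as summarized above, is to process only the tokens covered by each sample's mask, so samples with smaller masks contribute fewer tokens and cost less compute. A minimal sketch of this selection step follows; the function name, array shapes, and flat-packing layout are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def select_essential_tokens(tokens, masks):
    """Hedged sketch of mask-based token selection (illustrative, not YOSE's API).

    tokens: (B, N, D) float array of per-sample spatiotemporal tokens.
    masks:  (B, N) boolean array; True marks an "essential" (masked) token.

    Returns a flat (sum_i k_i, D) array packing each sample's selected
    tokens back-to-back, plus the per-sample lengths needed to index
    variable-length sequences within one batch.
    """
    selected = [t[m] for t, m in zip(tokens, masks)]   # boolean indexing per sample
    lengths = np.array([s.shape[0] for s in selected]) # variable length per sample
    return np.concatenate(selected, axis=0), lengths

# Toy example: 2 samples, 6 tokens each, 4-dim features.
tokens = np.arange(2 * 6 * 4, dtype=np.float32).reshape(2, 6, 4)
masks = np.array([[1, 0, 1, 0, 0, 0],
                  [1, 1, 1, 1, 0, 1]], dtype=bool)
flat, lengths = select_essential_tokens(tokens, masks)
print(lengths.tolist())  # [2, 5] -- cost scales with each sample's mask size
print(flat.shape)        # (7, 4)
```

Because only the selected tokens are fed to the transformer, attention cost drops with mask size, which matches the roughly linear scaling the summary describes.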