DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
arXiv cs.CV · April 16, 2026
Key Points
- The paper introduces RTR-DiT, a streaming video stylization framework that uses a Diffusion Transformer to improve stability and consistency on long videos.
- It fine-tunes a bidirectional teacher model for both text-guided and reference-guided stylization, then compresses it into a few-step autoregressive model using Self Forcing and Distribution Matching Distillation.
- A reference-preserving KV cache update strategy is proposed to maintain consistency across long sequences and enable real-time switching between text prompts and reference images.
- Experiments report that RTR-DiT outperforms prior diffusion-based stylization approaches on both quantitative metrics and visual quality, while supporting real-time interactive applications.
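The reference-preserving KV cache idea can be sketched as a rolling attention cache that pins the reference entries while evicting stale frame entries. The class and method names below are assumptions for illustration, not the paper's implementation:

```python
from collections import deque

class ReferencePreservingKVCache:
    """Illustrative sketch (assumed structure, not RTR-DiT's actual code):
    pin the reference image's KV entries so they are never evicted, and
    keep a FIFO rolling window of per-frame KV entries."""

    def __init__(self, max_frames):
        self.max_frames = max_frames
        self.reference_kv = None      # pinned slot; survives eviction
        self.frame_kvs = deque()      # rolling window of per-frame KV pairs

    def set_reference(self, kv):
        # Swapping the style reference replaces only the pinned slot,
        # so cached frame context persists and generation is not reset.
        self.reference_kv = kv

    def append_frame(self, kv):
        if len(self.frame_kvs) == self.max_frames:
            self.frame_kvs.popleft()  # evict the oldest frame's entries
        self.frame_kvs.append(kv)

    def context(self):
        # Attention context = pinned reference + recent frames, in order.
        ref = [self.reference_kv] if self.reference_kv is not None else []
        return ref + list(self.frame_kvs)
```

Under this sketch, switching reference images mid-stream only rewrites the pinned slot, which is one plausible reading of how real-time switching between prompts and references could stay consistent.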