Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion
arXiv cs.CV / 4/3/2026
Key Points
- Control-DINO proposes using self-supervised feature embeddings (e.g., DINO) as a more general conditioning signal for pretrained image-to-video diffusion models, rather than relying only on task-specific perceptual, geometric, or semantic signals (a minimal conditioning sketch follows this list).
- The approach introduces a lightweight architecture and training strategy aimed at decoupling appearance information (style/lighting) from other preserved scene features, improving controllability for tasks like stylization and relighting.
- The paper argues that although DINO features are highly effective for reconstruction, their entangled nature can restrict generative ability, and it addresses this limitation via targeted conditioning design.
- Experiments indicate that reduced spatial resolution in the conditioning features can be offset by higher feature dimensionality, helping to maintain or improve controllability in generative rendering from explicit spatial inputs (see the packing sketch after this list).
- Results are positioned as enabling more robust video domain transfer and video-from-3D generation, expanding the practical controllable use of feature-conditioned video diffusion.
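To make the conditioning idea in the first point concrete, here is a minimal sketch assuming a PyTorch setup: a frozen DINOv2 backbone extracts per-patch features from a conditioning image, and a hypothetical lightweight adapter projects them into the context space of a pretrained video diffusion model. The `FeatureAdapter` name and its layer sizes are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Frozen DINOv2 backbone from torch.hub, used purely as a feature extractor.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval().requires_grad_(False)

class FeatureAdapter(nn.Module):
    """Hypothetical lightweight adapter: projects DINO patch tokens into the
    conditioning (cross-attention context) space of a pretrained video
    diffusion backbone. Dimensions are illustrative."""
    def __init__(self, dino_dim: int = 384, cond_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(dino_dim),
            nn.Linear(dino_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # DINOv2 hub models expose forward_features(), which returns a dict
        # including per-patch tokens under "x_norm_patchtokens".
        with torch.no_grad():
            feats = backbone.forward_features(image)["x_norm_patchtokens"]  # (B, N, 384)
        return self.proj(feats)  # (B, N, cond_dim) context tokens

adapter = FeatureAdapter()
img = torch.randn(1, 3, 224, 224)  # H and W must be multiples of the 14-px patch size
ctx = adapter(img)                 # (1, 256, 1024): 16x16 patches, projected
```

Keeping the backbone frozen and training only the small adapter is one common way to condition a pretrained generator on new signals without disturbing its learned prior.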
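The resolution/dimensionality trade-off in the fourth point can be illustrated with a space-to-depth packing, shown below. This is an assumption about one standard way to trade grid resolution for channel width, not necessarily the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

# A 32x32 grid of 384-d patch features is packed into a 16x16 grid of
# 1536-d features: spatial resolution halves in each dimension, but each
# token carries 4x the channels, so the total information is preserved.
feats = torch.randn(1, 384, 32, 32)          # (B, C, H, W) feature grid
packed = F.pixel_unshuffle(feats, downscale_factor=2)
assert packed.shape == (1, 384 * 4, 16, 16)  # (1, 1536, 16, 16)
```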