UniVid: Pyramid Diffusion Model for High Quality Video Generation
arXiv cs.CV / 3/17/2026
Key Points
- UniVid is a unified video generation model that supports text-to-video (T2V), image-to-video (I2V), and joint text-and-image-to-video ((T+I)2V) generation, using a text prompt, a reference image, or both as controls.
- It scales up a pre-trained text-to-image diffusion backbone to video by adding temporal-pyramid cross-frame attention modules and convolutions, so that the generated frames are temporally coherent.
- It introduces a dual-stream cross-attention mechanism whose attention scores can be re-weighted at inference time to interpolate between single-modality control (text only or image only) and two-modality control (text and image together).
- The authors report that, in their experiments, UniVid achieves superior temporal coherence across the T2V, I2V, and (T+I)2V tasks.
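
To make the re-weighting idea in the key points more concrete, here is a minimal NumPy sketch of a dual-stream cross-attention layer. It is an illustration under stated assumptions, not the paper's implementation: the function names, the single scalar `w_text`, and the choice to scale each stream's normalized attention weights (which is equivalent to blending the two stream outputs) are all hypothetical, since this summary does not specify the exact re-weighting scheme.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(q, k, v):
    """Scaled dot-product cross-attention.

    q: (n_query, d) latent-frame queries
    k, v: (n_tokens, d) keys and values from one conditioning stream
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def dual_stream_cross_attention(q, text_kv, image_kv, w_text=0.5):
    """Hypothetical dual-stream layer (an assumption, not the paper's code).

    Each stream attends separately; the normalized attention weights of
    the text stream are scaled by w_text and those of the image stream
    by (1 - w_text). w_text=1.0 reduces to text-only control and
    w_text=0.0 to image-only control.
    """
    if not 0.0 <= w_text <= 1.0:
        raise ValueError("w_text must lie in [0, 1]")
    out_text = cross_attention(q, *text_kv)
    out_image = cross_attention(q, *image_kv)
    return w_text * out_text + (1.0 - w_text) * out_image


# Toy inputs: 4 query tokens, 5 text tokens, 3 image tokens, width 8.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
text_kv = (rng.normal(size=(5, 8)), rng.normal(size=(5, 8)))
image_kv = (rng.normal(size=(3, 8)), rng.normal(size=(3, 8)))

blended = dual_stream_cross_attention(q, text_kv, image_kv, w_text=0.5)
```

Sweeping `w_text` from 0 to 1 at inference moves the output continuously from image-only to text-only conditioning without retraining, which is the kind of interpolation the key point describes.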