VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization
arXiv cs.CV / 4/15/2026
Key Points
- The paper introduces VideoFlexTok, a video tokenization method that uses a variable-length, coarse-to-fine sequence of tokens rather than a fixed spatiotemporal 3D token grid.
- Early (coarse) tokens are designed to capture more abstract information like semantics and motion, while later (fine) tokens progressively add fine-grained detail.
- Using a generative flow decoder, the approach can reconstruct realistic videos from any chosen token count, enabling compute-adaptive fidelity.
- Experiments on class- and text-to-video generation indicate improved training efficiency, including comparable quality with a significantly smaller model (1.1B vs 5.2B parameters).
- The method supports longer video generation under limited compute: training on 10-second, 81-frame clips requires only 672 tokens, far fewer than comparable 3D-grid tokenizers need for clips of the same length.
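The core interface described above — an ordered, coarse-to-fine token sequence that can be truncated to any budget before decoding — can be sketched in a few lines. This is a toy illustration of the *interface only*, not the paper's architecture: the projection, token count (672), and clip shape are stand-in assumptions, and the real system uses a learned encoder and a generative flow decoder rather than the placeholder mapping here.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_coarse_to_fine(video: np.ndarray, max_tokens: int = 672) -> np.ndarray:
    """Toy stand-in for a coarse-to-fine video tokenizer.

    Projects the flattened clip onto `max_tokens` directions, yielding a 1D
    token sequence where, by convention, earlier tokens carry coarser
    information. A real tokenizer learns this ordering end-to-end.
    """
    flat = video.reshape(-1).astype(np.float64)
    # hypothetical fixed random projection; illustrative only
    proj = rng.standard_normal((max_tokens, flat.size)) / np.sqrt(flat.size)
    return proj @ flat

def decode_prefix(tokens: np.ndarray, k: int, shape: tuple) -> np.ndarray:
    """Reconstruct from only the first k tokens (compute-adaptive fidelity).

    In the paper a generative flow decoder fills in plausible detail from a
    token prefix; here we just zero out the dropped tokens and tile them back
    to the clip shape to demonstrate the variable-length decoding interface.
    """
    truncated = np.zeros_like(tokens)
    truncated[:k] = tokens[:k]
    return np.resize(truncated, shape)  # placeholder inverse mapping

video = rng.standard_normal((81, 8, 8, 3))          # toy 81-frame clip
tokens = encode_coarse_to_fine(video)               # ordered token sequence
coarse = decode_prefix(tokens, k=64, shape=video.shape)   # cheap, coarse
fine = decode_prefix(tokens, k=672, shape=video.shape)    # full token budget
```

The point of the sketch is the asymmetry: encoding always produces the full ordered sequence, while the decoder accepts any prefix length `k`, trading compute for fidelity at inference time.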