THFM: A Unified Video Foundation Model for 4D Human Perception and Beyond
arXiv cs.CV / 3/30/2026
Key Points
- THFM is introduced as a unified video foundation model that performs both dense human perception tasks (depth, normals, segmentation, dense pose) and sparse tasks (2D/3D keypoints) using one architecture.
- The model is built by adapting a pretrained text-to-video diffusion model into a single-forward-pass perception system, with learnable tokens added to support sparse prediction outputs.
- THFM switches among perception tasks through text-prompt modulation, enabling a prompt-driven "one model, many tasks" setup (see the sketch after this list).
- Despite being trained only on synthetic video data, THFM achieves results on par with or better than specialized state-of-the-art models across multiple benchmarks.
- The paper reports emergent generalization, such as training on single-human scenes and then transferring to multi-human scenes and new categories (e.g., anthropomorphic characters and animals).
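
To make the "one model, many tasks" idea concrete, here is a minimal PyTorch sketch of how prompt-driven task switching and learnable query tokens for sparse predictions can share a single forward pass. This is not the authors' architecture: the module names, dimensions, task list, and heads are illustrative assumptions, and a simple per-task embedding stands in for real text-prompt conditioning of a video diffusion backbone.

```python
# Minimal sketch (not THFM's actual code): prompt-modulated multi-task model with
# learnable query tokens appended for sparse outputs, handled in one forward pass.
import torch
import torch.nn as nn

class PromptedPerceptionModel(nn.Module):
    def __init__(self, dim=256, num_keypoints=17,
                 tasks=("depth", "normals", "segmentation", "keypoints")):
        super().__init__()
        self.tasks = {t: i for i, t in enumerate(tasks)}
        # Stand-in for a pretrained video backbone (in the paper, an adapted
        # text-to-video diffusion model); here a small transformer encoder.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # "Text-prompt modulation" stand-in: a learned per-task embedding added to tokens.
        self.task_embed = nn.Embedding(len(tasks), dim)
        # Learnable tokens appended to support sparse predictions (e.g. 2D keypoints).
        self.kpt_queries = nn.Parameter(torch.randn(num_keypoints, dim) * 0.02)
        self.dense_head = nn.Linear(dim, 3)   # e.g. per-token depth/normal channels
        self.sparse_head = nn.Linear(dim, 2)  # (x, y) per keypoint query

    def forward(self, video_tokens, task):
        # video_tokens: (B, N, dim) patch/latent tokens from the video encoder.
        B, N, _ = video_tokens.shape
        t = torch.full((B,), self.tasks[task], dtype=torch.long,
                       device=video_tokens.device)
        x = video_tokens + self.task_embed(t).unsqueeze(1)    # prompt modulation
        q = self.kpt_queries.unsqueeze(0).expand(B, -1, -1)   # sparse query tokens
        h = self.backbone(torch.cat([x, q], dim=1))           # single forward pass
        dense = self.dense_head(h[:, :N])                     # per-token dense map
        sparse = self.sparse_head(h[:, N:])                   # per-query keypoints
        return sparse if task == "keypoints" else dense


# Usage: the same weights serve a dense and a sparse task, switched by the prompt.
model = PromptedPerceptionModel()
tokens = torch.randn(2, 196, 256)            # 2 clips, 196 tokens each
depth_map = model(tokens, task="depth")      # shape (2, 196, 3)
keypoints = model(tokens, task="keypoints")  # shape (2, 17, 2)
```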