Sapiens2
arXiv cs.CV / 4/24/2026
📰 News · Developer Stack & Infrastructure · Models & Research
Key Points
- Sapiens2 is a new family of high-resolution transformer models designed for human-centric vision, aiming for better generalization and high-fidelity outputs across many downstream tasks.
- The model scales from 0.4B to 5B parameters, supports native 1K resolution, and includes hierarchical variants that can run at 4K using windowed attention and 2K output-resolution pretraining.
- Training improvements include a unified pretraining approach that combines masked image reconstruction with self-distilled contrastive objectives, which the authors report works better across a wider range of task types.
- Sapiens2 improves data quality and annotations by pretraining on a curated set of 1B high-quality human images, and it uses architectural advances to enable longer training schedules with improved stability.
- The authors report new state-of-the-art results with notable gains over the previous generation, including pose estimation (+4 mAP), body-part segmentation (+24.3 mIoU), and surface-normal estimation (45.6% lower angular error), plus extensions to new tasks such as pointmap and albedo estimation.
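The 4K capability rests on windowed attention, which restricts self-attention to local token windows so compute scales linearly with image area instead of quadratically. A back-of-the-envelope sketch of that saving (patch and window sizes here are illustrative assumptions, not figures from the Sapiens2 paper):

```python
# Rough cost comparison: global vs. windowed self-attention at 4K.
# Patch size (16 px) and window size (16 tokens) are assumed for illustration.

def attention_scores(image_px, patch_px=16, window_tokens=None):
    """Pairwise attention scores per layer for a square image.

    image_px: image side length in pixels
    patch_px: side length of one patch/token in pixels
    window_tokens: window side length in tokens; None means global attention
    """
    tokens_per_side = image_px // patch_px
    n_tokens = tokens_per_side ** 2
    if window_tokens is None:
        return n_tokens ** 2                        # global: quadratic in tokens
    per_window = window_tokens ** 2                 # tokens inside one window
    n_windows = (tokens_per_side // window_tokens) ** 2
    return n_windows * per_window ** 2              # windowed: linear in windows

global_4k = attention_scores(4096)                      # 256x256 token grid
windowed_4k = attention_scores(4096, window_tokens=16)  # 16x16-token windows
print(global_4k // windowed_4k)  # → 256: windowed is 256x cheaper here
```

Under these assumed sizes, a 4096-pixel image yields a 256×256 token grid, and limiting attention to 16×16-token windows cuts the per-layer score count by a factor of 256.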