Sapiens2

arXiv cs.CV / 4/24/2026


Key Points

  • Sapiens2 is a new family of high-resolution transformer models designed for human-centric vision, aiming for better generalization and high-fidelity outputs across many downstream tasks.
  • The model scales from 0.4B to 5B parameters, supports native 1K resolution, and includes hierarchical variants that can run at 4K using windowed attention and 2K output-resolution pretraining.
  • Training improvements include a unified pretraining approach that combines masked image reconstruction with self-distilled contrastive objectives, which the authors report works better across a wider range of task types.
  • Sapiens2 improves data quality and annotations by pretraining on a curated set of 1B high-quality human images, and it uses architectural advances to enable longer training schedules with improved stability.
  • The authors report new state-of-the-art benchmark results, with notable gains over the previous generation: pose (+4 mAP), body-part segmentation (+24.3 mIoU), and normal estimation (45.6% lower angular error). The model also extends to new tasks such as pointmap and albedo estimation.
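The unified pretraining objective mentioned above — masked image reconstruction combined with a self-distilled contrastive term — can be sketched as a sum of two losses. The sketch below is an illustrative simplification, not the paper's actual formulation: the function names, temperatures, and toy tensors are all hypothetical, with the reconstruction term MAE-style (MSE on masked patches only) and the self-distillation term DINO-style (cross-entropy between a sharpened EMA-teacher distribution and the student distribution).

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_reconstruction_loss(pred, target, mask):
    """MSE over masked patches only (MAE-style reconstruction term)."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)   # (num_patches,)
    return (per_patch * mask).sum() / mask.sum()

def self_distillation_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    """Cross-entropy from a sharpened teacher to the student (DINO-style term)."""
    def softmax(x, tau):
        z = x / tau
        z = z - z.max(axis=-1, keepdims=True)          # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p_teacher = softmax(teacher_logits, tau_t)          # lower temperature -> sharper
    log_p_student = np.log(softmax(student_logits, tau_s) + 1e-9)
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

# Toy data: 16 patches of dim 8 with ~75% masked; 4-way prototype logits for 2 views.
num_patches, dim, num_protos = 16, 8, 4
pred = rng.normal(size=(num_patches, dim))
target = rng.normal(size=(num_patches, dim))
mask = (rng.random(num_patches) < 0.75).astype(float)
student_logits = rng.normal(size=(2, num_protos))       # student logits per view
teacher_logits = rng.normal(size=(2, num_protos))       # EMA-teacher logits per view

total_loss = (masked_reconstruction_loss(pred, target, mask)
              + self_distillation_loss(student_logits, teacher_logits))
print(float(total_loss))
```

The intuition is that the reconstruction term pushes features to retain low-level detail for dense prediction, while the distillation term pushes them toward view-invariant semantics for zero-shot and few-label settings; summing them trains one backbone for both.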

Abstract

We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited for a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state-of-the-art and improves over the first generation on pose (+4 mAP), body-part segmentation (+24.3 mIoU), normal estimation (45.6% lower angular error) and extends to new tasks such as pointmap and albedo estimation. Code: https://github.com/facebookresearch/sapiens2
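The windowed attention used by the 4K variants can be sketched as self-attention restricted to non-overlapping groups of tokens, so cost grows linearly in token count rather than quadratically. This is a minimal single-head sketch with identity q/k/v projections, not the model's actual attention layer; the function name and window size are assumptions for illustration.

```python
import numpy as np

def windowed_self_attention(x, window):
    """Self-attention computed independently within non-overlapping windows.

    x: (num_tokens, dim) token matrix; num_tokens must divide evenly by `window`.
    Uses identity q/k/v projections to keep the sketch minimal.
    """
    n, d = x.shape
    assert n % window == 0, "token count must be divisible by window size"
    out = np.empty_like(x)
    for start in range(0, n, window):
        w = x[start:start + window]                   # tokens in this window only
        scores = (w @ w.T) / np.sqrt(d)               # (window, window) scores
        scores = scores - scores.max(axis=-1, keepdims=True)
        attn = np.exp(scores)
        attn = attn / attn.sum(axis=-1, keepdims=True)  # softmax within the window
        out[start:start + window] = attn @ w          # no cross-window attention
    return out

rng = np.random.default_rng(1)
tokens = rng.normal(size=(16, 8))                     # e.g. one row of image patches
y = windowed_self_attention(tokens, window=4)
print(y.shape)                                        # (16, 8)
```

Because each token only attends within its window, a 4K image's much larger patch grid stays tractable; models typically recover global context across windows via other mechanisms (e.g. shifting or interleaved global layers), which this sketch omits.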