VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

arXiv cs.CV / April 16, 2026


Key Points

  • The paper introduces VGGT-Segmentor (VGGT-S), a geometry-enhanced framework for instance-level cross-view segmentation across egocentric and exocentric images.
  • It argues that existing geometry-aware methods like VGGT can suffer from pixel-level projection drift that degrades dense prediction, motivating a union segmentation head for pixel-accurate masks.
  • VGGT-S uses a three-stage Union Segmentation Head (mask prompt fusion, point-guided prediction, iterative mask refinement) to convert robust cross-view feature alignment into precise segmentation outputs.
  • It proposes a single-image self-supervised training approach that avoids paired annotations while maintaining strong generalization performance.
  • On the Ego-Exo4D benchmark, VGGT-S reports new state-of-the-art results of 67.7% (Ego→Exo) and 68.0% (Exo→Ego) average IoU, with correspondence-free pretraining outperforming many fully supervised baselines.

Abstract

Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego-to-Exo and Exo-to-Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach.
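The average IoU metric used in the reported results measures, per instance, the overlap between predicted and ground-truth masks. A minimal version of the standard mask-IoU computation (the empty-mask convention here is an assumption; benchmarks differ on that edge case):

```python
import numpy as np

def mask_iou(pred, gt):
    # Intersection-over-Union between two binary segmentation masks
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Convention (assumed): two empty masks count as a perfect match
    return intersection / union if union > 0 else 1.0

pred = np.array([[1, 1], [0, 0]])
gt   = np.array([[1, 0], [0, 0]])
print(mask_iou(pred, gt))  # 0.5
```

The benchmark figure is this score averaged over all evaluated instances in a given direction (Ego-to-Exo or Exo-to-Ego).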