dinov3.seg: Open-Vocabulary Semantic Segmentation with DINOv3

arXiv cs.AI / 3/23/2026

💬 Opinion · Models & Research

Key Points

  • The paper introduces dinov3.seg, a dedicated framework for Open-Vocabulary Semantic Segmentation (OVSS) built on the DINOv3 backbone to handle open-set text-defined categories.
  • It jointly aligns text embeddings with both the global CLS token and local patch-level visual features, enabling strong semantic discrimination alongside fine-grained spatial locality.
  • The approach performs early refinement of visual representations prior to image-text interaction and late refinement of the resulting image-text correlation features to improve dense predictions in cluttered scenes.
  • A high-resolution local-global inference strategy based on sliding-window aggregation preserves spatial detail while maintaining global context, and experiments on five OVSS benchmarks show consistent gains over state-of-the-art methods.
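The joint global-local alignment in the second point can be illustrated with a minimal sketch: cosine similarity between class text embeddings and both the [CLS] token and the patch tokens, fused into per-pixel class logits. All shapes, the random features, and the simple additive fusion are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes (not from the paper): a 14x14 patch grid with
# 768-dim features and 3 text-defined classes.
H, W, D, C = 14, 14, 768, 3
patch_feats = rng.standard_normal((H * W, D))   # local patch tokens
cls_feat = rng.standard_normal(D)               # global [CLS] token
text_embs = rng.standard_normal((C, D))         # one embedding per class name

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Local correlation: cosine similarity of every patch with every class.
patch_sim = l2norm(patch_feats) @ l2norm(text_embs).T   # (H*W, C)
# Global correlation: cosine similarity of the [CLS] token with each class.
cls_sim = l2norm(cls_feat) @ l2norm(text_embs).T        # (C,)

# Additive fusion is a placeholder for the paper's joint alignment;
# per-pixel argmax yields the segmentation map.
logits = patch_sim + cls_sim[None, :]                   # (H*W, C)
seg_map = logits.argmax(axis=1).reshape(H, W)           # (H, W)
print(seg_map.shape)  # (14, 14)
```

The point of the sketch is the two similarity terms: the patch term carries spatial locality, while the [CLS] term injects image-level semantic discrimination into every pixel's score.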

Abstract

Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of text-defined categories, demanding reliable generalization to unseen classes at inference. Although modern vision-language models (VLMs) support strong open-vocabulary recognition, their representations learned through global contrastive objectives remain suboptimal for dense prediction, prompting many OVSS methods to depend on limited adaptation or refinement of image-text similarity maps. This, in turn, restricts spatial precision and robustness in complex, cluttered scenes. We introduce dinov3.seg, extending dinov3.txt into a dedicated framework for OVSS. Our contributions are four-fold. First, we design a task-specific architecture tailored to this backbone, systematically adapting established design principles from prior open-vocabulary segmentation work. Second, we jointly leverage text embeddings aligned with both the global [CLS] token and local patch-level visual features from a ViT-based encoder, effectively combining semantic discrimination with fine-grained spatial locality. Third, unlike prior approaches that rely primarily on post hoc similarity refinement, we perform early refinement of visual representations prior to image-text interaction, followed by late refinement of the resulting image-text correlation features, enabling more accurate and robust dense predictions in cluttered scenes. Finally, we propose a high-resolution local-global inference strategy based on sliding-window aggregation, which preserves spatial detail while maintaining global context. We conduct extensive experiments on five widely adopted OVSS benchmarks to evaluate our approach. The results demonstrate its effectiveness and robustness, consistently outperforming current state-of-the-art methods.
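The sliding-window half of the inference strategy can be sketched as follows: overlapping high-resolution crops are scored independently and the overlapping scores are averaged, so every pixel is predicted at full detail. The window size, stride, toy scorer, and plain averaging are all illustrative assumptions; the paper's actual strategy additionally fuses these local predictions with global context, which this sketch omits.

```python
import numpy as np

def sliding_window_average(score_fn, image, win=64, stride=32, n_classes=3):
    """Aggregate per-window class scores over overlapping crops.

    score_fn maps a (win, win) crop to (n_classes, win, win) scores;
    overlapping predictions are averaged so each pixel is scored at
    full resolution by at least one window. Assumes image >= win.
    """
    H, W = image.shape[:2]
    acc = np.zeros((n_classes, H, W))
    cnt = np.zeros((H, W))
    ys = list(range(0, H - win + 1, stride))
    xs = list(range(0, W - win + 1, stride))
    # Ensure the last window reaches the image border.
    if ys[-1] + win < H:
        ys.append(H - win)
    if xs[-1] + win < W:
        xs.append(W - win)
    for y in ys:
        for x in xs:
            crop = image[y:y + win, x:x + win]
            acc[:, y:y + win, x:x + win] += score_fn(crop)
            cnt[y:y + win, x:x + win] += 1
    return acc / cnt  # average over overlapping windows

# Toy random scorer standing in for the real per-crop model.
rng = np.random.default_rng(0)
scores = sliding_window_average(
    lambda c: rng.standard_normal((3, *c.shape)),
    rng.standard_normal((100, 100)))
print(scores.shape)  # (3, 100, 100)
```

Averaging overlapping windows smooths seams between crops, which is why stride is typically set smaller than the window size.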