FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions

arXiv cs.CV / March 19, 2026


Key Points

  • FineViT introduces a vision encoder aimed at fine-grained perception by replacing coarse web data with dense recaptions to reduce information loss in traditional CLIP-based encoders.
  • The model is trained from scratch at high native resolution on billions of global recaptioned image-text pairs to build a rich semantic foundation before improving local perception via alignment with large language models.
  • A curated FineCap-450M dataset with over 450 million high-quality local captions is used to enhance local detail through LLM alignment.
  • Experimental results show state-of-the-art zero-shot recognition and long-context retrieval, with FineViT outperforming multimodal encoders like SigLIP2 and Qwen-ViT when integrated into MLLMs.
  • The work proposes FineViT as a new baseline for fine-grained visual perception in multimodal systems, potentially impacting downstream AI perception tasks and model design.

Abstract

While Multimodal Large Language Models (MLLMs) have experienced rapid advancements, their visual encoders frequently remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks due to the loss of visual details caused by low-resolution pretraining and the reliance on noisy, coarse web-crawled image-text pairs. To overcome these limitations, we introduce FineViT, a novel vision encoder specifically designed to unlock fine-grained perception. By replacing coarse web data with dense recaptions, we systematically mitigate information loss through a progressive training paradigm: first, the encoder is trained from scratch at a high native resolution on billions of globally recaptioned image-text pairs, establishing a robust, detail-rich semantic foundation. Subsequently, we further enhance its local perception through LLM alignment, utilizing our curated FineCap-450M dataset, which comprises over 450 million high-quality local captions. Extensive experiments validate the effectiveness of the progressive strategy. FineViT achieves state-of-the-art zero-shot recognition and retrieval performance, especially in long-context retrieval, and consistently outperforms multimodal visual encoders such as SigLIP2 and Qwen-ViT when integrated into MLLMs. We hope FineViT can serve as a powerful new baseline for fine-grained visual perception.
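The paper does not publish its training code, but the first stage it describes, contrastive pretraining on recaptioned image-text pairs, is in the CLIP family. As a rough illustration only (a minimal sketch, not FineViT's actual objective or hyperparameters), here is the standard symmetric contrastive loss over a batch of paired image and text embeddings, where matching pairs sit on the diagonal of the similarity matrix:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (B, D) arrays; row i of each is a matching pair.
    temperature: illustrative value; FineViT's setting is not specified here.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B); diagonal = matching pairs
    labels = np.arange(logits.shape[0])

    def xent(l):
        # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With identical image and text embeddings the loss approaches its minimum, while mismatched pairings are penalized; scaling this objective to billions of dense recaptions at high native resolution is the part that distinguishes the paper's first stage.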