
CycleCap: Improving VLMs Captioning Performance via Self-Supervised Cycle Consistency Fine-Tuning

arXiv cs.CV / 3/20/2026

📰 News · Ideas & Deep Analysis · Models & Research

Key Points

  • CycleCap introduces a self-supervised fine-tuning scheme that uses cycle consistency between a visual-language model and a text-to-image model to improve image captioning and reduce hallucinations.
  • The approach employs Group Relative Policy Optimization (GRPO) with a reward based on the similarity between the original and reconstructed images, computed on the fly during training.
  • It eliminates the need for curated image-text datasets by training on raw images alone, using cycle consistency as a self-supervised signal that steers captions toward the visual content.
  • Across four VLMs ranging from 1B to 7B parameters, CycleCap achieves consistent improvements on captioning and hallucination benchmarks, outperforming state-of-the-art methods that rely on supervised cycle-consistency training.

Abstract

Visual-Language Models (VLMs) have achieved remarkable progress in image captioning, visual question answering, and visual reasoning. Yet they remain prone to vision-language misalignment, often producing overly generic or hallucinated descriptions. Existing approaches address this via instruction tuning, which requires costly, large-scale annotated datasets, or via complex test-time frameworks for caption refinement. In this work, we revisit image-text alignment through the lens of cycle consistency: given an image and a caption generated by an image-to-text model, the backward mapping through a text-to-image model should reconstruct an image that closely matches the original. In our setup, a VLM serves as the image-to-text component, while a pre-trained text-to-image model closes the loop by reconstructing the image from the generated caption. Building on this, we introduce CycleCap, a fine-tuning scheme to improve image captioning using Group Relative Policy Optimization (GRPO) with a reward based on the similarity between the original and reconstructed images, computed on-the-fly. Unlike previous work that uses cycle consistency loss for preference dataset construction, our method leverages cycle consistency directly as a self-supervised training signal. This enables the use of raw images alone, eliminating the need for curated image-text datasets, while steering the VLM to produce more accurate and grounded text descriptions. Applied to four VLMs ranging from 1B to 7B parameters, CycleCap yields consistent improvements across captioning and hallucination benchmarks, surpassing state-of-the-art methods that rely on supervised cycle consistency training.
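The reward loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the similarity metric, so cosine similarity between image embeddings is an assumption here, as is the normalization by the population standard deviation of the group. The callables `text_to_image` and `embed`, and the toy vectors standing in for images, are hypothetical placeholders for a real text-to-image model and image encoder.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cycle_rewards(
    image: object,
    captions: List[str],
    text_to_image: Callable[[str], object],  # hypothetical text-to-image model
    embed: Callable[[object], Vector],       # hypothetical image encoder
) -> List[float]:
    """Score each caption by how closely its reconstruction matches the image."""
    target = embed(image)
    return [cosine_similarity(target, embed(text_to_image(c))) for c in captions]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of sampled captions (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


# Toy example: "images" are 2-D vectors and the stub generator is a lookup table.
toy_generator = {
    "a red ball on grass": [1.0, 0.0],  # faithful caption
    "an object outdoors": [1.0, 1.0],   # generic caption
    "a blue car at night": [0.0, 1.0],  # hallucinated caption
}
original_image = [1.0, 0.0]
captions = list(toy_generator)

rewards = cycle_rewards(
    original_image,
    captions,
    text_to_image=lambda caption: toy_generator[caption],
    embed=lambda img: img,  # identity encoder for the toy vectors
)
advantages = group_relative_advantages(rewards)

for caption, reward, advantage in zip(captions, rewards, advantages):
    print(f"{caption!r}: reward={reward:.3f}, advantage={advantage:+.3f}")
```

In this toy group the faithful caption receives the highest advantage and the hallucinated one the lowest, which is the direction in which a policy-gradient update would shift the captioner. No ground-truth caption is used at any point; only the raw image enters the reward.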