GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning

arXiv cs.CV / 5/6/2026


Key Points

  • The paper introduces GRPO-TTA, extending Group Relative Policy Optimization (GRPO) to the test-time adaptation setting for vision-language models.
  • It reformulates class-specific prompt prediction as a group-wise reinforcement learning problem by building output groups from the top-K CLIP similarity candidates, allowing optimization without ground-truth labels (a minimal sketch follows this list).
  • The method uses rewards tailored to test-time adaptation, including alignment and dispersion rewards, to steer tuning of the visual encoder.
  • Experiments on multiple benchmarks show GRPO-TTA outperforms prior test-time adaptation approaches, with especially large gains under natural distribution shifts.
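
To make the group construction concrete, the sketch below shows one way the top-K candidates of a CLIP similarity distribution could stand in for a GRPO output group. This is an illustrative reading of the key point above, not the authors' code: the function name `build_output_group`, the temperature, and the default K are assumptions.

```python
import torch
import torch.nn.functional as F

def build_output_group(image_feat: torch.Tensor,
                       text_feats: torch.Tensor,
                       k: int = 8,
                       temperature: float = 0.01):
    """Form an output group from the top-K CLIP class candidates.

    image_feat: (D,)   L2-normalized CLIP embedding of the test image.
    text_feats: (C, D) L2-normalized class-prompt embeddings.
    Returns the indices of the K candidate classes and their probabilities
    renormalized within the group, so the group can play the role of the
    sampled responses in GRPO -- no ground-truth label is required.
    """
    sims = image_feat @ text_feats.t()             # cosine similarities, (C,)
    probs = F.softmax(sims / temperature, dim=-1)  # CLIP-style class distribution
    top_probs, top_idx = probs.topk(k)             # top-K candidates form the group
    group_probs = top_probs / top_probs.sum()      # renormalize within the group
    return top_idx, group_probs
```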

Abstract

Group Relative Policy Optimization (GRPO) has recently shown strong performance in post-training large language models and vision-language models. This raises the question of whether GRPO can also substantially improve test-time adaptation (TTA) of vision-language models. In this paper, we propose Group Relative Policy Optimization for Test-Time Adaptation (GRPO-TTA), which adapts GRPO to the TTA setting by reformulating class-specific prompt prediction as a group-wise policy optimization problem. Specifically, we construct output groups by sampling the top-K class candidates from CLIP similarity distributions, enabling probability-driven optimization without access to ground-truth labels. Moreover, we design reward functions tailored to test-time adaptation, including alignment and dispersion rewards, to guide effective tuning of the visual encoder. Extensive experiments across diverse benchmarks demonstrate that GRPO-TTA consistently outperforms existing TTA methods, with notably larger gains under natural distribution shifts.
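
The abstract names alignment and dispersion rewards without giving their formulas, so the following is only a plausible instantiation of such TTA-style rewards combined with group-relative (GRPO) advantages. The reward definitions, the `beta` weight, and the surrogate loss are assumptions for illustration, not the paper's exact objective.

```python
import torch

def grpo_tta_loss(group_probs: torch.Tensor,
                  cand_text_feats: torch.Tensor,
                  image_feat: torch.Tensor,
                  beta: float = 1.0) -> torch.Tensor:
    """Toy group-relative objective in the spirit of GRPO-TTA.

    group_probs:     (K,)   candidate probabilities, renormalized in-group.
    cand_text_feats: (K, D) normalized prompt embeddings of the candidates.
    image_feat:      (D,)   normalized image embedding.
    """
    # Alignment reward: cosine similarity of each candidate prompt to the image.
    align = cand_text_feats @ image_feat                 # (K,)

    # Dispersion reward: candidates that differ from the rest of the group
    # score higher (one minus mean similarity to the other candidates).
    pairwise = cand_text_feats @ cand_text_feats.t()     # (K, K)
    k = pairwise.shape[0]
    mean_sim_to_others = (pairwise.sum(dim=-1) - pairwise.diagonal()) / (k - 1)
    disperse = 1.0 - mean_sim_to_others                  # (K,)

    rewards = align + beta * disperse

    # GRPO-style advantages: z-score the rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Policy-gradient surrogate: raise the log-probability of high-advantage
    # candidates; advantages are detached, as in standard policy gradients.
    return -(group_probs.log() * adv.detach()).sum()
```

In this reading, gradients reach only the model parameters behind `group_probs` (the image embedding, hence the visual encoder), which matches the abstract's claim that the rewards guide visual encoder tuning.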