TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning

arXiv cs.CV / 4/2/2026


Key Points

  • The paper proposes TTA-Vid, a generalized test-time adaptation method for video reasoning that adapts a pretrained model to incoming videos without needing explicit labels or ground-truth annotations.
  • TTA-Vid performs step-by-step reasoning at inference time over multiple frame subsets and uses a batch-aware, frequency-based reward computed across subsets as pseudo ground truth to update the model.
  • The authors report that models adapted using only a single batch—or even a single sample during the adaptation procedure—can generalize across an entire dataset and also transfer to other datasets at test time.
  • To improve efficiency and effectiveness, the method includes a multi-armed bandit strategy to adaptively select more informative frames using the same reward formulation.
  • Experiments across multiple video reasoning tasks show consistent gains and indicate that TTA-Vid can outperform existing state-of-the-art approaches that rely on large-scale supervised training.
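
The frequency-based reward described above can be sketched as a simple majority vote: each frame subset yields an answer, the most frequent answer across subsets is treated as pseudo ground truth, and answers are rewarded by agreement with it. The following is a minimal illustration of that idea only; the function name and string-valued answers are assumptions for the sketch, not details from the paper.

```python
from collections import Counter

def frequency_reward(answers):
    """Given one answer per frame subset, reward each answer by whether
    it matches the most frequent answer across all subsets, which serves
    as the pseudo ground truth."""
    counts = Counter(answers)
    majority, _ = counts.most_common(1)[0]
    # Reward 1.0 for agreeing with the majority answer, 0.0 otherwise.
    return [1.0 if a == majority else 0.0 for a in answers]

# Four frame subsets produced these answers; "cat" is the majority,
# so the dissenting subset receives zero reward.
rewards = frequency_reward(["cat", "dog", "cat", "cat"])
```

In a test-time RL setup, such rewards would then drive a policy-gradient-style update of the model, with no labels involved.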

Abstract

Recent video reasoning models have shown strong results on temporal and multimodal understanding, yet they depend on large-scale supervised data and multi-stage training pipelines, making them costly to train and difficult to adapt to new domains. In this work, we leverage the paradigm of test-time reinforcement learning on video-language data to adapt a pretrained model to incoming video samples at test time without explicit labels. The proposed test-time adaptation approach for video, TTA-Vid, combines two components that operate jointly: (1) a test-time adaptation procedure that performs step-by-step reasoning at inference time over multiple frame subsets and uses a batch-aware, frequency-based reward computed across these subsets as pseudo ground truth to update the model; and (2) a multi-armed bandit strategy for adaptive frame selection that learns to prioritize informative frames, guided by the same reward formulation. We show that a model adapted on a single batch, or even a single sample, from a dataset is able to generalize at test time to the whole dataset and even across datasets. Because the adaptation occurs entirely at test time, our method requires no ground-truth annotations or dedicated training splits. Our evaluation shows that TTA-Vid yields consistent improvements across various video reasoning tasks and is able to outperform current state-of-the-art methods trained on large-scale data. This highlights the potential of test-time reinforcement learning for temporal multimodal understanding.
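
The multi-armed bandit component can be pictured as treating each candidate frame index as an arm, with the shared frequency-based reward crediting the frames that were included in a subset. Below is a minimal sketch using the standard UCB1 rule; the class name, the subset-level credit assignment, and the exploration constant are assumptions for illustration, not the paper's exact formulation.

```python
import math

class UCBFrameSelector:
    """UCB1 bandit over candidate frame indices: frames whose inclusion
    tends to produce majority-agreeing answers accumulate higher value
    estimates and get selected more often."""

    def __init__(self, num_frames, c=1.0):
        self.counts = [0] * num_frames   # times each frame was selected
        self.values = [0.0] * num_frames # running mean reward per frame
        self.c = c                       # exploration strength
        self.t = 0                       # total selection rounds

    def select(self, k):
        """Return the k frame indices with the highest UCB score."""
        self.t += 1

        def score(i):
            if self.counts[i] == 0:
                return float("inf")  # try every frame at least once
            bonus = self.c * math.sqrt(math.log(self.t) / self.counts[i])
            return self.values[i] + bonus

        return sorted(range(len(self.counts)), key=score, reverse=True)[:k]

    def update(self, frames, reward):
        """Credit the subset's shared reward to every frame it contained."""
        for i in frames:
            self.counts[i] += 1
            self.values[i] += (reward - self.values[i]) / self.counts[i]

# Usage: sample a 4-frame subset out of 8 candidates, run reasoning on it,
# then feed back the frequency-based reward for that subset.
selector = UCBFrameSelector(num_frames=8)
subset = selector.select(4)
selector.update(subset, reward=1.0)
```

Crediting the whole subset with one scalar reward is a deliberately crude assignment; it suffices to show how the same reward signal can steer both model updates and frame selection.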