AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

arXiv cs.CV · April 21, 2026


Key Points

  • The paper introduces AVRT, a framework for transferring text-based reasoning capabilities to audio-visual reasoning, addressing the scarcity of high-quality multimodal reasoning data.
  • AVRT generates separate audio and vision reasoning traces using single-modality teacher models, then merges them with an LLM “merger” to form high-quality multimodal reasoning traces.
  • It adapts target models to audio-visual reasoning using a two-stage training pipeline: an SFT “cold start” on the generated traces followed by reinforcement learning on larger-scale data.
  • Experiments on seven audio-visual/audio benchmarks show that AVRT-trained 3B and 7B models reach state-of-the-art performance among similarly sized models, and improvements also transfer to related single-modality reasoning tasks.
  • The authors position AVRT as a new training pipeline for multimodal reasoning models, potentially enabling better reasoning over combined sensory inputs like audio and video.
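The data-generation step described in the key points above can be sketched as a small toy pipeline. This is purely illustrative: the function names, the string-based "traces", and the trivial merge logic are hypothetical placeholders standing in for the actual single-modality teacher models and the LLM merger, none of which are specified at this level of detail in the summary.

```python
# Toy sketch of AVRT-style trace generation (all names/logic are
# illustrative placeholders, not the authors' implementation).

def vision_teacher(frames_description: str) -> str:
    # Stand-in for a vision-specialized reasoning model that emits
    # a reasoning trace over the video frames.
    return f"[vision] The frames show {frames_description}."

def audio_teacher(audio_description: str) -> str:
    # Stand-in for an audio-specialized reasoning model that emits
    # a reasoning trace over the audio track.
    return f"[audio] The sound suggests {audio_description}."

def merge_traces(vision_trace: str, audio_trace: str) -> str:
    # Stand-in for the LLM "merger" that fuses the two single-modality
    # traces into one coherent audio-visual reasoning trace.
    return (
        f"{vision_trace}\n{audio_trace}\n"
        "[merged] Combining the visual and audio cues into one answer."
    )

def build_avrt_example(frames_description: str, audio_description: str) -> str:
    """Generate one synthetic audio-visual reasoning trace for SFT."""
    return merge_traces(
        vision_teacher(frames_description),
        audio_teacher(audio_description),
    )

if __name__ == "__main__":
    print(build_avrt_example("a dog running toward the camera", "barking"))
```

In the actual framework, the merged traces produced this way would form the SFT "cold start" dataset, after which the model is further trained with reinforcement learning on larger-scale data.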

Abstract

Recent advances in reasoning models have shown remarkable progress in text-based domains, but transferring those capabilities to multimodal settings, e.g., reasoning over audio-visual data, remains a challenge, in part because high-quality reasoning data is scarce for the targeted multimodal combinations. To address this problem, we introduce AVRT, a novel framework that generates high-quality audio-visual reasoning traces from single-modality teacher models. We generate independent vision and audio reasoning traces via models specialized to reason over their respective modalities, then merge the resulting traces with an LLM merger model. The resulting multimodal traces are used in a supervised fine-tuning (SFT) cold start that first adapts the target model to audio-visual reasoning traces, before a second reinforcement learning stage trains it on larger-scale data. Evaluated on seven audio-visual and audio benchmarks, including OmniBench and DailyOmni for audio-visual reasoning and MMAR for audio-only reasoning, our 3B and 7B parameter models achieve state-of-the-art results among models of comparable size, showing that cross-modal training also transfers to single-modality tasks and establishing a new training pipeline for multimodal reasoning models.