StereoFoley: Object-Aware Stereo Audio Generation from Video

Apple Machine Learning Journal / 4/28/2026


Key Points

  • StereoFoley is a video-to-audio generation framework designed to produce semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz.
  • The work targets a key limitation of recent video-to-audio generative models, which often output mono audio or lack object-aware stereo imaging due to insufficient professionally mixed, spatially accurate datasets.
  • The authors train a base stereo audio generation model from video, reporting state-of-the-art performance in semantic accuracy and audio-video synchronization.
  • The approach extends beyond basic generation toward object-aware stereo behavior, aiming to deliver more realistic spatial audio tied to scene elements.
  • The paper is positioned as research for ICASSP (April 2026 publication) and is disseminated via an arXiv preprint for broader access by the research community.

We present StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video-to-audio models achieve strong semantic and temporal fidelity, they largely remain limited to mono or fail to deliver object-aware stereo imaging, constrained by the lack of professionally mixed, spatially accurate video-to-audio datasets. First, we develop and train a base model that generates stereo audio from video, achieving state-of-the-art in both semantic accuracy and synchronization. Next…
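To build intuition for what "object-aware stereo imaging" means at the signal level, the toy sketch below places a mono sound in the stereo field according to an object's horizontal position in the frame, using the standard constant-power pan law at the paper's 48 kHz sample rate. This is purely an illustration of position-driven panning; it is not StereoFoley's method, which generates stereo audio directly with a learned model, and all names here (`pan_constant_power`, the position convention) are invented for this sketch.

```python
import numpy as np

SR = 48_000  # sample rate reported for StereoFoley's output

def pan_constant_power(mono: np.ndarray, x: float) -> np.ndarray:
    """Place a mono signal in the stereo field.

    x is the object's normalized horizontal position in the video frame:
    0.0 = far left, 0.5 = center, 1.0 = far right.
    The constant-power pan law (left = cos, right = sin) keeps the
    summed channel power equal to the mono power at every position,
    so perceived loudness stays roughly even as the source moves.
    """
    theta = x * np.pi / 2
    left = np.cos(theta) * mono
    right = np.sin(theta) * mono
    return np.stack([left, right], axis=0)  # shape (2, n_samples)

# 0.5 s test tone "attached" to an object a third of the way across the frame
t = np.arange(int(0.5 * SR)) / SR
tone = 0.3 * np.sin(2 * np.pi * 440.0 * t)
stereo = pan_constant_power(tone, x=1 / 3)
```

An object tracked across the frame would simply drive `x` per video frame, which is the kind of spatial cue a learned object-aware model must produce implicitly.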

Continue reading this article on the original site.
