FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

arXiv cs.CV / 4/8/2026

Key Points

The paper introduces FoleyDesigner, a framework for generating immersive stereo Foley for film clips by aligning sound events precisely in space and time.
It uses a multi-agent approach that combines latent diffusion models trained on spatio-temporal cues from video frames with LLM-driven hybrid mechanisms that mimic professional film post-production workflows.
To overcome dataset limitations, the authors release FilmStereo, a new professional stereo audio dataset with spatial metadata, precise timestamps, and semantic annotations across eight common Foley categories.
The system supports interactive user control and outputs audio compatible with professional mixing pipelines, including 5.1-channel Dolby Atmos workflows aligned with ITU-R BS.775 standards.
Experiments reported in the paper show improved spatio-temporal alignment over existing baselines while maintaining practical integration with film production requirements.
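
To make "aligning sound events in space and time" concrete, here is a minimal sketch of placing one mono Foley event into a stereo buffer at a given onset with a given left/right position. This is not FoleyDesigner's method (the paper uses latent diffusion models); it is ordinary constant-power panning, and the function names `constant_power_gains` and `place_event` and the pan range of -1 to 1 are assumptions made for illustration.

```python
import math


def constant_power_gains(pan: float) -> tuple[float, float]:
    """Return (left, right) gains for pan in [-1.0, 1.0]; -1 is hard left."""
    if not -1.0 <= pan <= 1.0:
        raise ValueError("pan must lie in [-1.0, 1.0]")
    # Map pan from [-1, 1] to an angle in [0, pi/2] so that
    # left**2 + right**2 == 1 at every position (constant power).
    theta = (pan + 1.0) * math.pi / 4.0
    return math.cos(theta), math.sin(theta)


def place_event(mono, onset_s, pan, sample_rate, total_samples):
    """Mix a mono event into a silent stereo buffer starting at onset_s seconds.

    Hypothetical helper: returns two lists (left, right) of length total_samples.
    Samples that would fall outside the buffer are dropped.
    """
    left = [0.0] * total_samples
    right = [0.0] * total_samples
    g_left, g_right = constant_power_gains(pan)
    start = round(onset_s * sample_rate)  # temporal alignment: seconds -> sample index
    for i, sample in enumerate(mono):
        j = start + i
        if 0 <= j < total_samples:
            left[j] += g_left * sample    # spatial alignment: per-channel gain
            right[j] += g_right * sample
    return left, right
```

For example, `place_event(clip, onset_s=1.5, pan=0.4, sample_rate=48000, total_samples=48000 * 5)` would start `clip` 1.5 seconds into a 5-second buffer, slightly right of center. A learned system would also have to model distance, motion, and reverberation, which this sketch ignores.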

Abstract

Foley art plays a pivotal role in enhancing immersive auditory experiences in film, yet manual creation of spatio-temporally aligned audio remains labor-intensive. We propose FoleyDesigner, a novel framework inspired by professional Foley workflows, integrating film clip analysis, spatio-temporally controllable Foley generation, and professional audio mixing capabilities. FoleyDesigner employs a multi-agent architecture for precise spatio-temporal analysis. It achieves spatio-temporal alignment through latent diffusion models trained on spatio-temporal cues extracted from video frames, combined with large language model (LLM)-driven hybrid mechanisms that emulate post-production practices in the film industry. To address the lack of high-quality stereo audio datasets in film, we introduce FilmStereo, the first professional stereo audio dataset containing spatial metadata, precise timestamps, and semantic annotations for eight common Foley categories. For applications, the framework supports interactive user control while maintaining seamless integration with professional pipelines, including 5.1-channel Dolby Atmos systems compliant with ITU-R BS.775 standards, thereby offering extensive creative flexibility. Extensive experiments demonstrate that our method achieves superior spatio-temporal alignment compared to existing baselines, with seamless compatibility with professional film production standards. The project page is available at https://gekiii996.github.io/FoleyDesigner/.
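
The abstract says FilmStereo pairs each sound with spatial metadata, precise timestamps, and a semantic annotation. The dataset's actual schema is not given here, so the record below is only a guess at what one entry could look like. Every field name, the azimuth convention, and the `"example-category"` label are hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoleyAnnotation:
    """Hypothetical record for one annotated Foley event (not the real schema)."""

    category: str       # Foley category label; the eight real names are not listed here
    description: str    # free-text semantic annotation
    onset_s: float      # event start time in seconds
    offset_s: float     # event end time in seconds
    azimuth_deg: float  # assumed spatial metadata: -90 = hard left, +90 = hard right

    def __post_init__(self):
        # Reject records whose timing or position is internally inconsistent.
        if self.onset_s < 0 or self.offset_s <= self.onset_s:
            raise ValueError("need 0 <= onset_s < offset_s")
        if not -90.0 <= self.azimuth_deg <= 90.0:
            raise ValueError("azimuth_deg must lie in [-90, 90]")

    @property
    def duration_s(self) -> float:
        return self.offset_s - self.onset_s


def events_active_at(annotations, t_s):
    """Return the annotations whose [onset, offset) interval contains time t_s."""
    return [a for a in annotations if a.onset_s <= t_s < a.offset_s]
```

A record like this would let a tool ask which events overlap a given video frame, for example `events_active_at(records, t_s=frame_index / fps)`, before deciding where each one sits in the stereo field.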