MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video

arXiv cs.CV / 5/4/2026


Key Points

  • The paper proposes MMAudio-LABEL, a framework that generates audio from silent video while simultaneously predicting frame-aligned sound event labels (type and timing).
  • It argues that a two-step “generate-then-detect” (post-hoc) pipeline is limited due to error accumulation, and instead introduces an event-aware joint learning approach.
  • MMAudio-LABEL uses a foundational audio generation model as its backbone and jointly produces audio together with sound event predictions aligned to video frames (a minimal sketch of this joint-head idea follows this list).
  • On the Greatest Hits dataset, the method substantially improves onset-detection accuracy (from 46.7% to 75.0%) and 17-class material-classification accuracy (from 40.6% to 61.0%) over baselines.
  • The authors conclude that jointly learning audio synthesis and event prediction yields audio that is not only high-quality but also more interpretable and practical for applications like sound production.
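
The joint-head idea in the key points above can be pictured as a small event decoder that reads the generator's frame-aligned latent features and predicts, per video frame, an onset probability and one of 17 material classes. The sketch below is a hypothetical illustration, not the paper's code: the module names, tensor shapes, backbone interface, and loss weights are all assumptions.

```python
# Hypothetical sketch of "generate + label" joint learning: an event head on top of
# a generator's frame-aligned latents. Shapes and weights are illustrative only.
import torch
import torch.nn as nn


class EventHead(nn.Module):
    def __init__(self, latent_dim: int = 512, num_materials: int = 17):
        super().__init__()
        self.onset = nn.Linear(latent_dim, 1)                  # per-frame onset logit
        self.material = nn.Linear(latent_dim, num_materials)   # per-frame class logits

    def forward(self, latents: torch.Tensor):
        # latents: (batch, num_frames, latent_dim), aligned to video frames
        return self.onset(latents).squeeze(-1), self.material(latents)


def joint_loss(gen_loss, onset_logits, onset_targets, mat_logits, mat_targets,
               w_onset: float = 1.0, w_mat: float = 1.0):
    # onset_targets: float 0/1 tensor (batch, num_frames); mat_targets: long (batch, num_frames)
    onset_loss = nn.functional.binary_cross_entropy_with_logits(onset_logits, onset_targets)
    mat_loss = nn.functional.cross_entropy(mat_logits.flatten(0, 1), mat_targets.flatten())
    return gen_loss + w_onset * onset_loss + w_mat * mat_loss
```

In this sketch the event losses are simply added to the generation objective; the article does not specify the paper's exact loss formulation, so the weighting here is purely an assumption.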

Abstract

Recent advances in multimodal generation have enabled high-quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels detailing the type and timing of sounds. One straightforward approach involves applying a standard sound event detection model to the generated audio. However, this post-hoc pipeline is inherently limited, as it is prone to error accumulation. To address this limitation, we propose MMAudio-LABEL (LAtent-Based Event Labeling), an event-aware audio generation framework that uses a foundational audio generation model as its backbone and jointly generates audio and frame-aligned sound event predictions from silent videos. We evaluate our method on the Greatest Hits dataset for onset detection and 17-class material classification. Our approach improves onset-detection accuracy from 46.7% to 75.0% and material-classification accuracy from 40.6% to 61.0% over baselines. These results suggest that jointly learning audio generation and event prediction enables more interpretable and practical video-to-audio synthesis.
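For context on the reported accuracies, frame-level onset and material scores could be computed roughly as below. This is an assumption about the metric: the article does not spell out tolerance windows or averaging, and the paper's exact definitions may differ.

```python
# Rough, assumed frame-level evaluation: per-frame onset accuracy and 17-class
# material accuracy. Exact metric definitions in the paper may differ.
import numpy as np


def onset_accuracy(pred_prob: np.ndarray, target: np.ndarray, thr: float = 0.5) -> float:
    # pred_prob, target: (num_frames,); target holds 0/1 per frame
    return float(((pred_prob >= thr).astype(int) == target).mean())


def material_accuracy(pred_logits: np.ndarray, target: np.ndarray) -> float:
    # pred_logits: (num_frames, 17); target: (num_frames,) integer class ids
    return float((pred_logits.argmax(axis=-1) == target).mean())


# Purely illustrative usage with dummy data:
# probs = np.random.rand(240)
# onsets = (np.random.rand(240) > 0.9).astype(int)
# print(onset_accuracy(probs, onsets))
```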