MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video

arXiv cs.CV / 5/4/2026


Key Points

  • The paper proposes MMAudio-LABEL, a framework that generates audio from silent video while simultaneously predicting frame-aligned sound event labels (type and timing).
  • It argues that a two-step “generate-then-detect” (post-hoc) pipeline is limited due to error accumulation, and instead introduces an event-aware joint learning approach.
  • MMAudio-LABEL uses a foundational audio generation model as its backbone and jointly produces audio together with sound event predictions aligned to video frames (a minimal sketch of this joint-head idea follows this list).
  • On the Greatest Hits dataset, the method substantially improves onset-detection accuracy (from 46.7% to 75.0%) and 17-class material-classification accuracy (from 40.6% to 61.0%) over baselines.
  • The authors conclude that jointly learning audio synthesis and event prediction yields audio that is not only high-quality but also more interpretable and practical for applications like sound production.
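
The joint-head idea in the key points above can be pictured as a small event decoder that reads the generator's frame-aligned latent features and predicts, per video frame, an onset probability and one of 17 material classes. The sketch below is a hypothetical illustration, not the paper's code: the module names, tensor shapes, backbone interface, and loss weights are all assumptions.

```python
# Hypothetical sketch of "generate + label" joint learning: an event head on top of
# a generator's frame-aligned latents. Shapes and weights are illustrative only.
import torch
import torch.nn as nn


class EventHead(nn.Module):
    def __init__(self, latent_dim: int = 512, num_materials: int = 17):
        super().__init__()
        self.onset = nn.Linear(latent_dim, 1)                  # per-frame onset logit
        self.material = nn.Linear(latent_dim, num_materials)   # per-frame class logits

    def forward(self, latents: torch.Tensor):
        # latents: (batch, num_frames, latent_dim), aligned to video frames
        return self.onset(latents).squeeze(-1), self.material(latents)


def joint_loss(gen_loss, onset_logits, onset_targets, mat_logits, mat_targets,
               w_onset: float = 1.0, w_mat: float = 1.0):
    # onset_targets: float 0/1 tensor (batch, num_frames); mat_targets: long (batch, num_frames)
    onset_loss = nn.functional.binary_cross_entropy_with_logits(onset_logits, onset_targets)
    mat_loss = nn.functional.cross_entropy(mat_logits.flatten(0, 1), mat_targets.flatten())
    return gen_loss + w_onset * onset_loss + w_mat * mat_loss
```

In this sketch the event losses are simply added to the generation objective; the article does not specify the paper's exact loss formulation, so the weighting here is purely an assumption.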

Abstract

Recent advances in multimodal generation have enabled high-quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels detailing the type and timing of sounds. One straightforward approach involves applying a standard sound event detection model to the generated audio. However, this post-hoc pipeline is inherently limited, as it is prone to error accumulation. To address this limitation, we propose MMAudio-LABEL (LAtent-Based Event Labeling), an event-aware audio generation framework that uses a foundational audio generation model as its backbone and jointly generates audio and frame-aligned sound event predictions from silent videos. We evaluate our method on the Greatest Hits dataset for onset detection and 17-class material classification. Our approach improves onset-detection accuracy from 46.7% to 75.0% and material-classification accuracy from 40.6% to 61.0% over baselines. These results suggest that jointly learning audio generation and event prediction enables more interpretable and practical video-to-audio synthesis.
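For context on the reported accuracies, frame-level onset and material scores could be computed roughly as below. This is an assumption about the metric: the article does not spell out tolerance windows or averaging, and the paper's exact definitions may differ.

```python
# Rough, assumed frame-level evaluation: per-frame onset accuracy and 17-class
# material accuracy. Exact metric definitions in the paper may differ.
import numpy as np


def onset_accuracy(pred_prob: np.ndarray, target: np.ndarray, thr: float = 0.5) -> float:
    # pred_prob, target: (num_frames,); target holds 0/1 per frame
    return float(((pred_prob >= thr).astype(int) == target).mean())


def material_accuracy(pred_logits: np.ndarray, target: np.ndarray) -> float:
    # pred_logits: (num_frames, 17); target: (num_frames,) integer class ids
    return float((pred_logits.argmax(axis=-1) == target).mean())


# Purely illustrative usage with dummy data:
# probs = np.random.rand(240)
# onsets = (np.random.rand(240) > 0.9).astype(int)
# print(onset_accuracy(probs, onsets))
```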