[P] Deezer showed CNN detection fails on compressed audio; here's a dual-engine approach that survives MP3

Reddit r/MachineLearning / 3/27/2026


Key Points

  • A CNN detector trained on mel-spectrogram artifacts for spotting AI-generated music works on WAV audio but fails after common codecs like MP3/AAC compress the signal, removing cues the model relies on.
  • The proposed workaround uses a dual-engine hybrid: a source-separation model (Demucs) splits tracks into stems, reconstructs them, and then checks how closely the reconstruction matches the original.
  • The approach leverages a key behavioral difference: human recordings produce stem “bleed” from recording conditions, so reconstruction diverges more, while independently synthesized AI stems reconstruct more similarly.
  • Results reported include ~1.1% human false positives and 80%+ AI detection, with performance claimed to hold across multiple codecs (MP3, AAC, OGG) because the method avoids reliance on fragile compression-sensitive spectral artifacts.
  • The system mitigates compute by running the expensive separation/reconstruction only when the CNN is uncertain, but detection varies by AI generator and the separation step can be non-deterministic on borderline cases.

I've been working on detecting AI-generated music and ran into the same wall that Deezer's team documented in their paper: CNN-based detection on mel-spectrograms breaks when audio is compressed to MP3.

The problem: A ResNet18 trained on mel-spectrograms works well on WAV files, but real-world music is distributed as MP3/AAC. Compression destroys the subtle spectral artifacts the CNN relies on.

What actually worked: Instead of trying to make the CNN more robust, I added a second engine based on source separation (Demucs). The idea is simple:

  1. Separate a track into 4 stems (vocals, drums, bass, other)
  2. Re-mix them back together
  3. Measure the difference between original and reconstructed audio
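The remix and comparison steps can be sketched in pure Python. This assumes the stems have already been separated (e.g. by Demucs) and decoded into plain per-sample lists; the function names, the normalized-RMS metric, and the list representation are my own illustration, not taken from the post:

```python
import math

def remix(stems):
    """Step 2: sum per-stem signals back into a single mix.

    stems is a list of equal-length sample sequences, one per stem
    (vocals, drums, bass, other).
    """
    return [sum(samples) for samples in zip(*stems)]

def reconstruction_error(original, reconstructed):
    """Step 3: normalized RMS difference between the original track
    and the remixed reconstruction.

    Higher values suggest stem bleed (human recording conditions);
    values near zero suggest independently synthesized stems.
    """
    assert len(original) == len(reconstructed)
    num = sum((a - b) ** 2 for a, b in zip(original, reconstructed))
    den = sum(a ** 2 for a in original) or 1e-12  # guard silent input
    return math.sqrt(num / den)
```

In practice the comparison would run on the codec-decoded waveform, which is why the metric sidesteps compression-sensitive spectral artifacts: it measures separation behavior, not spectral fine structure.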

For human-recorded music, stems bleed into each other during recording (room acoustics, mic crosstalk, etc.), so separation + reconstruction produces noticeable differences. For AI music, each stem is synthesized independently, so separation and reconstruction yield nearly identical results.

Results:

  • Human false positive rate: ~1.1%
  • AI detection rate: 80%+
  • Works regardless of audio codec (MP3, AAC, OGG)

The CNN handles the easy cases (high-confidence predictions), and the reconstruction engine only kicks in when CNN is uncertain. This saves compute since source separation is expensive.
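The confidence gate described above might look like the following sketch. The thresholds and function signatures are hypothetical placeholders (the post doesn't give them); `cnn_predict` stands in for the ResNet18 returning P(AI), and `recon_score` for the separation/reconstruction metric:

```python
def classify(track, cnn_predict, recon_score,
             lo=0.2, hi=0.8, recon_threshold=0.05):
    """Two-stage gate: trust the CNN when it is confident, and only
    run the expensive separation + reconstruction check in the
    uncertain band. All thresholds here are illustrative.
    """
    p_ai = cnn_predict(track)
    if p_ai >= hi:
        return "ai"
    if p_ai <= lo:
        return "human"
    # Uncertain band: low reconstruction error means the stems
    # remix back almost perfectly, which points to AI generation.
    return "ai" if recon_score(track) < recon_threshold else "human"
```

Since most tracks fall outside the uncertain band, the costly Demucs pass only runs on a small fraction of inputs, which is where the compute savings come from.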

Limitations:

  • Detection rate varies across different AI generators
  • Demucs is non-deterministic, so borderline cases can flip between runs
  • Only tested on music, not speech or sound effects

Curious if anyone has explored similar hybrid approaches, or has ideas for making the reconstruction analysis more robust.

submitted by /u/Leather_Lobster_2558