[P] Deezer showed CNN detection fails on compressed audio; here's a dual-engine approach that survives MP3

Reddit r/MachineLearning / 3/27/2026


Key Points

  • A CNN detector trained on mel-spectrogram artifacts for spotting AI-generated music works on WAV audio but fails after common codecs like MP3/AAC compress the signal, removing cues the model relies on.
  • The proposed workaround uses a dual-engine hybrid: a source-separation model (Demucs) splits tracks into stems, reconstructs them, and then checks how closely the reconstruction matches the original.
  • The approach leverages a key behavioral difference: human recordings produce stem “bleed” from recording conditions, so reconstruction diverges more, while independently synthesized AI stems reconstruct more similarly.
  • Results reported include ~1.1% human false positives and 80%+ AI detection, with performance claimed to hold across multiple codecs (MP3, AAC, OGG) because the method avoids reliance on fragile compression-sensitive spectral artifacts.
  • The system mitigates compute by running the expensive separation/reconstruction only when the CNN is uncertain, but detection varies by AI generator and the separation step can be non-deterministic on borderline cases.

I've been working on detecting AI-generated music and ran into the same wall that Deezer's team documented in their paper: CNN-based detection on mel-spectrograms breaks when audio is compressed to MP3.

The problem: A ResNet18 trained on mel-spectrograms works well on WAV files, but real-world music is distributed as MP3/AAC. Compression destroys the subtle spectral artifacts the CNN relies on.

What actually worked: Instead of trying to make the CNN more robust, I added a second engine based on source separation (Demucs). The idea is simple:

  1. Separate a track into 4 stems (vocals, drums, bass, other)
  2. Re-mix them back together
  3. Measure the difference between original and reconstructed audio
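The remix and comparison steps can be sketched in pure Python. This assumes the stems have already been separated (e.g. by Demucs) and decoded into plain per-sample lists; the function names, the normalized-RMS metric, and the list representation are my own illustration, not taken from the post:

```python
import math

def remix(stems):
    """Step 2: sum per-stem signals back into a single mix.

    stems is a list of equal-length sample sequences, one per stem
    (vocals, drums, bass, other).
    """
    return [sum(samples) for samples in zip(*stems)]

def reconstruction_error(original, reconstructed):
    """Step 3: normalized RMS difference between the original track
    and the remixed reconstruction.

    Higher values suggest stem bleed (human recording conditions);
    values near zero suggest independently synthesized stems.
    """
    assert len(original) == len(reconstructed)
    num = sum((a - b) ** 2 for a, b in zip(original, reconstructed))
    den = sum(a ** 2 for a in original) or 1e-12  # guard silent input
    return math.sqrt(num / den)
```

In practice the comparison would run on the codec-decoded waveform, which is why the metric sidesteps compression-sensitive spectral artifacts: it measures separation behavior, not spectral fine structure.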

For human-recorded music, stems bleed into each other during recording (room acoustics, mic crosstalk, etc.), so separation + reconstruction produces noticeable differences. For AI music, each stem is synthesized independently, so separation and reconstruction yield nearly identical results.

Results:

  • Human false positive rate: ~1.1%
  • AI detection rate: 80%+
  • Works regardless of audio codec (MP3, AAC, OGG)

The CNN handles the easy cases (high-confidence predictions), and the reconstruction engine only kicks in when CNN is uncertain. This saves compute since source separation is expensive.
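The confidence gate described above might look like the following sketch. The thresholds and function signatures are hypothetical placeholders (the post doesn't give them); `cnn_predict` stands in for the ResNet18 returning P(AI), and `recon_score` for the separation/reconstruction metric:

```python
def classify(track, cnn_predict, recon_score,
             lo=0.2, hi=0.8, recon_threshold=0.05):
    """Two-stage gate: trust the CNN when it is confident, and only
    run the expensive separation + reconstruction check in the
    uncertain band. All thresholds here are illustrative.
    """
    p_ai = cnn_predict(track)
    if p_ai >= hi:
        return "ai"
    if p_ai <= lo:
        return "human"
    # Uncertain band: low reconstruction error means the stems
    # remix back almost perfectly, which points to AI generation.
    return "ai" if recon_score(track) < recon_threshold else "human"
```

Since most tracks fall outside the uncertain band, the costly Demucs pass only runs on a small fraction of inputs, which is where the compute savings come from.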

Limitations:

  • Detection rate varies across different AI generators
  • Demucs is non-deterministic, so borderline cases can flip between runs
  • Only tested on music, not speech or sound effects

Curious if anyone has explored similar hybrid approaches, or has ideas for making the reconstruction analysis more robust.

submitted by /u/Leather_Lobster_2558