Revisiting Content-Based Music Recommendation: Efficient Feature Aggregation from Large-Scale Music Models

arXiv cs.AI / 4/25/2026


Key Points

  • The paper revisits music recommendation by arguing that conventional collaborative-filtering approaches underuse audio content, hurting performance in cold-start cases.
  • It introduces TASTE, a new dataset and benchmarking framework that pairs raw audio with textual metadata to better support multimodal music recommendation research.
  • Using large-scale self-supervised music encoders, the authors show that learned audio representations substantially improve recommendation outcomes across tasks such as candidate recall and click-through-rate (CTR) prediction.
  • They propose MuQ-token, a method for efficiently aggregating multi-layer audio features, which outperforms other feature integration techniques across multiple experimental settings.
  • The work positions its multimodal benchmark and code release as a reusable foundation for future content-based and multimodal recommender-system research.
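The paper does not spell out MuQ-token's mechanics in this summary, but a common baseline it is compared against — combining the hidden states of a multi-layer self-supervised encoder — can be sketched as a softmax-weighted sum over layers. The function name, shapes, and uniform initialization below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def aggregate_layers(layer_feats: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Collapse per-layer encoder features of shape (L, T, D) into a single
    (T, D) representation via a softmax-normalized weighted sum over layers.
    This is a standard baseline for pooling multi-layer self-supervised
    audio features, not the paper's MuQ-token method itself."""
    w = np.exp(weights - weights.max())
    w = w / w.sum()                          # softmax over the layer axis
    return np.tensordot(w, layer_feats, axes=(0, 0))

# Toy example: 12 encoder layers, 50 time frames, 768-dim features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((12, 50, 768))
agg = aggregate_layers(feats, np.zeros(12))  # zero logits -> uniform layer mean
print(agg.shape)                             # (50, 768)
```

In practice the layer weights would be learned jointly with the downstream recommender; the zero-initialized weights here reduce the sum to a plain layer average.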

Abstract

Music Recommendation Systems (MRSs) are a cornerstone of modern streaming platforms. Existing recommendation models, spanning both recall and ranking stages, predominantly rely on collaborative filtering, which fails to exploit the intrinsic characteristics of audio and consequently leads to suboptimal performance, particularly in cold-start scenarios. Moreover, existing music recommendation datasets often lack rich multimodal information, such as raw audio signals and descriptive textual metadata. Current recommender system evaluation frameworks also remain inadequate, as they neither fully leverage multimodal information nor support a diverse range of algorithms, especially multimodal methods. To address these limitations, we propose TASTE, a comprehensive dataset and benchmarking framework designed to highlight the role of multimodal information in music recommendation. Our dataset integrates both audio and textual modalities. By leveraging recent large-scale self-supervised music encoders, we demonstrate the substantial value of the extracted audio representations across recommendation tasks, including candidate recall and CTR. In addition, we introduce the MuQ-token method, which enables more efficient integration of multi-layer audio features. This method consistently outperforms other feature integration techniques across various settings. Overall, our results not only validate the effectiveness of content-driven approaches but also provide a highly effective and reusable multimodal foundation for future research. Code is available at https://github.com/zreach/TASTE
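The candidate-recall task mentioned in the abstract is typically served by nearest-neighbor search over track embeddings: audio representations from the encoder are compared to a user or seed-track embedding, and the most similar catalog items become candidates. The sketch below shows that retrieval step with cosine similarity; the function name and toy data are illustrative, not from the TASTE codebase:

```python
import numpy as np

def recall_candidates(query: np.ndarray, catalog: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k catalog tracks whose embeddings are most
    similar to the query embedding under cosine similarity."""
    q = query / np.linalg.norm(query)
    c = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    scores = c @ q                    # cosine similarity per catalog track
    return np.argsort(-scores)[:k]   # top-k, highest similarity first

# Toy example: a 1000-track catalog of 128-dim audio embeddings.
rng = np.random.default_rng(1)
catalog = rng.standard_normal((1000, 128))
query = catalog[42] + 0.01 * rng.standard_normal(128)  # near-duplicate of track 42
print(recall_candidates(query, catalog, k=3))
```

Because retrieval needs only the embeddings, a new track with zero interaction history can still be recalled — which is precisely the cold-start advantage of content-based features the abstract argues for.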