GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

arXiv cs.AI / 5/4/2026


Key Points

  • GaMMA is a new large multimodal model aimed at comprehensive, end-to-end music understanding, achieved by jointly modeling musical audio signals and language.
  • The model builds on LLaVA’s streamlined encoder–decoder design for cross-modal learning, and uses mixture-of-experts audio encoders to handle both time-series and non-time-series music tasks under one parameter set.
  • GaMMA is trained using large-scale curated datasets and a progressive pipeline covering pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL).
  • The paper introduces MusicBench, a large human-curated benchmark with 3,739 multiple-choice questions to evaluate temporal and global (non-temporal) music understanding.
  • Experiments report new state-of-the-art results in the music domain, including 79.1% on MuchoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global.

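The mixture-of-experts idea in the second bullet can be illustrated with a minimal sketch. This is not GaMMA's actual architecture (the paper's encoder details, dimensions, and gating scheme are not given here); it only shows the general pattern of a learned gate routing one audio feature through multiple specialist encoders and fusing their outputs under a single parameter set. All names and sizes below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical setup: two audio "experts" -- one imagined as tuned for
# time-series tasks (e.g. beat tracking), one for global tasks
# (e.g. genre). Each maps an input feature vector to a shared embedding.
D_IN, D_EMB = 64, 32
experts = [rng.standard_normal((D_IN, D_EMB)) * 0.1 for _ in range(2)]
gate_w = rng.standard_normal((D_IN, 2)) * 0.1  # stand-in gating weights

def moe_encode(audio_feat):
    """Route one audio feature vector through the expert mixture."""
    weights = softmax(audio_feat @ gate_w)              # (2,) gate scores
    outs = np.stack([audio_feat @ W for W in experts])  # (2, D_EMB)
    return weights @ outs                               # weighted fusion

audio_feat = rng.standard_normal(D_IN)
emb = moe_encode(audio_feat)
print(emb.shape)  # (32,)
```

Because the gate is soft, both experts contribute to every embedding; a single downstream decoder can then consume the fused representation regardless of whether the question is temporal or global.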
Abstract

In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA effectively unifies both time-series and non-time-series music understanding tasks within one set of parameters. Our approach combines carefully curated datasets at scale with a progressive training pipeline, effectively pushing the boundaries of music understanding via pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To comprehensively assess both temporal and non-temporal capabilities of music LMMs, we introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice questions covering diverse aspects of musical understanding. Extensive experiments demonstrate that GaMMA establishes new SoTA in the music domain, achieving 79.1% accuracy on MuchoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, consistently outperforming previous methods.
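The reported scores are plain multiple-choice accuracy. As a quick illustration of how such a benchmark number is computed (the predictions and answer key below are made up, not MusicBench data):

```python
# Hypothetical model outputs and answer key for four MCQ items.
preds = ["A", "C", "B", "B"]
golds = ["A", "C", "D", "B"]

# Accuracy = fraction of questions where the chosen option matches the key.
accuracy = sum(p == g for p, g in zip(preds, golds)) / len(golds)
print(f"{accuracy:.1%}")  # 75.0%
```

On MusicBench this computation would run over all 3,739 curated questions, split into the Temporal and Global subsets reported above.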