Listen and Chant Before You Read: The Ladder of Beauty in LM Pre-Training

arXiv cs.CL, April 24, 2026


Key Points

  • The paper reports that pre-training a Transformer on music before language substantially speeds up language acquisition, using piano performances from the MAESTRO dataset.
  • It proposes a “music → poetry → prose” developmental pipeline and finds a 17.5% perplexity improvement over random initialization, with music and poetry improving complementary model components (internal computation and embeddings, respectively).
  • Convergence tests indicate the gains persist beyond an initial head start, showing a sustained 5.5% validation gap at the plateau with faster convergence across multiple runs.
  • The study shows that real music reaches the transfer ceiling of synthetic patterns using about one-third the data, and scaling experiments suggest an optimal pre-training data volume that depends on model capacity.
  • The authors conclude that structured human creative outputs can be an efficient pre-training substrate for small language models, while noting that stronger evidence at modern pre-training scales will require much larger experiments.
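The perplexity numbers above are relative reductions, which map directly onto differences in mean cross-entropy loss via `ppl = exp(loss)`. A minimal sketch of that arithmetic, using illustrative loss values that are not taken from the paper:

```python
import math

def perplexity(mean_ce_loss):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_ce_loss)

def relative_ppl_improvement(loss_baseline, loss_pipeline):
    """Fractional perplexity reduction of the pipeline vs. the baseline.

    Because ppl = exp(loss), this equals 1 - exp(loss_pipeline - loss_baseline),
    i.e. it depends only on the loss gap, not on the absolute loss level.
    """
    return 1.0 - perplexity(loss_pipeline) / perplexity(loss_baseline)

# Illustrative (not from the paper): a loss gap of about 0.192 nats
# corresponds to the reported 17.5% perplexity improvement.
gap = -math.log(1 - 0.175)
print(relative_ppl_improvement(4.0, 4.0 - gap))
```

Note that the same loss gap yields the same percentage improvement at any absolute loss, which is why a persistent gap at the plateau (the 5.5% figure) is meaningful on its own.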

Abstract

We show that pre-training a Transformer on music before language significantly accelerates language acquisition. Using piano performances (MAESTRO dataset), a developmental pipeline -- music → poetry → prose -- yields a 17.5% perplexity improvement over random initialization (p < 0.001, 5 seeds), with music and poetry improving orthogonal model components (internal computation and embeddings, respectively). Convergence tests confirm that this is not a transient head start: at d = 64, multi-seed validation (5 seeds) shows a persistent 5.5% gap at plateau (p = 0.017), with the pipeline converging faster and to a lower loss in every run. Real music matches the transfer ceiling of synthetic patterns with one-third the data, and scaling experiments reveal that optimal pre-training data volume shifts with model capacity (-3% → +3% → +6% advantage of larger datasets from d = 16 to d = 64). Across the scales we study (d ∈ {16, 32, 64}, up to ~400K parameters), these results suggest a capacity-dependent data curation principle and indicate that structured human creative outputs can provide an efficient pre-training substrate for small language models; stronger conclusions at modern pre-training scale will require substantially larger experiments.
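The pipeline's defining property is that each stage resumes from the previous stage's weights rather than a fresh random initialization. A hedged, dependency-free sketch of that staging loop (the stage names follow the abstract; `train_stage` and the data handles are hypothetical stand-ins, not the authors' code):

```python
# Sketch of a sequential pre-training curriculum: each stage continues
# from the checkpoint produced by the stage before it.
STAGES = ["music", "poetry", "prose"]

def train_stage(state, data, steps):
    # Stub: a real implementation would run `steps` optimizer updates
    # on `data`; here we only record that the stage was applied.
    return state + [f"trained_on_{data}_for_{steps}_steps"]

def train_curriculum(init_state, datasets, steps_per_stage):
    """Run the music → poetry → prose stages in order, threading the
    model state through so later stages inherit earlier structure."""
    state, log = init_state, []
    for stage in STAGES:
        state = train_stage(state, datasets[stage], steps_per_stage)
        log.append(stage)
    return state, log
```

The baseline in the paper's comparison corresponds to skipping the first two stages and training on prose from a random initialization; the curriculum's claimed benefit is the structure inherited through the threaded state.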