AI Navigate

PRISM: Demystifying Retention and Interaction in Mid-Training

arXiv cs.LG / 3/19/2026


Key Points

  • PRISM is an empirical study of mid-training design choices for large language models, conducting controlled experiments across seven base models from four families, two architectures, and scales from 3B to 24B parameters.
  • Mid-training on roughly 27B high‑quality tokens yields consistent gains on math (+15 to +40 points), code (+5 to +12), and science (+6 to +13) benchmarks while preserving general performance.
  • When RL is applied through the full PRISM pipeline, macro-average reasoning scores rise from under 12 to 29–42, whereas applying RL directly to base models is much less effective; data composition during mid-training—especially including science data—drives these gains.
  • Mechanistically, mid-training densely reconfigures over 90% of model weights, while RL makes sparse refinements to only about 5% of parameters, preserves mid-training's representational geometry (CKA > 0.998), and succeeds only on mid-trained models, underscoring the value of retention-aware mid-training for reliable reasoning enhancement.
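The geometry-preservation claim above is measured with Centered Kernel Alignment (CKA). As an illustration (not the paper's actual analysis code), a minimal linear-CKA sketch over two models' activation matrices, assuming activations are collected as NumPy arrays of shape (samples, features):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation
    matrices of shape (n_samples, n_features). Returns a similarity
    in [0, 1]; values near 1 (e.g. the > 0.998 reported for RL vs.
    mid-trained checkpoints) mean near-identical representational
    geometry up to rotation and scaling."""
    # Center each feature dimension
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

Because linear CKA is invariant to orthogonal transformations of the feature space, it compares the shape of two representations rather than their raw coordinates, which is why it is a natural tool for asking whether RL reshapes what mid-training built.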

Abstract

We present PRISM, a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we show that mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks while preserving general performance. The full PRISM-to-RL pipeline improves the macro-average across six reasoning benchmarks from under 12 to 29-42 (a 3-4x improvement), whereas RL applied directly to most of the base models remains substantially less effective, with AIME scores near zero. Data composition matters most at mid-training, not RL: including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix produces less than 2 point differences. Mechanistically, mid-training densely restructures over 90% of model weights, while RL makes sparse, front-loaded refinements to approximately 5% of parameters. Representation analysis (CKA) confirms that RL consistently preserves mid-training's representational geometry (over 0.998 CKA) across architectures. Crucially, RL applies near-identical weight changes regardless of starting point, yet only succeeds on mid-trained models, consistent with mid-training placing the model in a configuration from which RL can effectively improve performance. Our results demonstrate that retention-aware mid-training is highly effective for reliable reasoning enhancement and provide practical guidance for designing robust mid-training pipelines.
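The "90% vs. 5%" contrast between mid-training and RL is a statement about how many parameters move between checkpoints. A simple diagnostic of that kind can be sketched as below; this is an illustrative proxy, assuming checkpoints are available as name-to-array dicts, and the threshold `atol` is a hypothetical choice, not the paper's:

```python
import numpy as np

def fraction_changed(before, after, atol=1e-6):
    """Fraction of parameters that differ between two checkpoints
    (dicts mapping parameter name -> np.ndarray) by more than `atol`.
    Dense restructuring (as reported for mid-training) yields values
    near 1; sparse refinement (as reported for RL) yields values
    near 0."""
    changed = 0
    total = 0
    for name, w0 in before.items():
        w1 = after[name]
        # Count entries whose absolute change exceeds the tolerance
        changed += int(np.sum(np.abs(w1 - w0) > atol))
        total += w0.size
    return changed / total
```

Comparing this quantity for base-to-mid-trained versus mid-trained-to-RL checkpoint pairs is one way to see, at a glance, which stage does the heavy reconfiguration and which stage only touches a small slice of the network.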