PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

arXiv cs.CL / 5/1/2026


Key Points

  • The paper argues that the common LMM alignment pipeline (SFT on curated demos followed by RLVR) suffers from distributional drift that can degrade the model's original capabilities and pull it away from the supervision distribution; the problem is especially acute in multimodal reasoning, where perception errors and reasoning errors drift in distinct ways.
  • It proposes PRISM, a three-stage pipeline that inserts an explicit distribution-alignment step between SFT and RLVR using on-policy distillation framed as a black-box, response-level adversarial game with a Mixture-of-Experts discriminator.
  • PRISM provides disentangled corrective signals for perception and reasoning without requiring access to teacher logits, making the alignment step black-box, in contrast to white-box, logit-level distillation approaches (a minimal sketch follows this list).
  • The authors augment training with 113K additional high-fidelity demonstrations generated by Gemini 3 Flash (dense visual grounding and step-by-step reasoning) on top of 1.26M public demos to improve alignment quality.
  • Experiments on Qwen3-VL show PRISM yields consistent downstream RLVR gains across multiple RL algorithms and benchmarks, boosting average accuracy by +4.4 points (4B) and +6.0 points (8B) over the SFT-to-RLVR baseline; the code, data, and checkpoints are released publicly.
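
The sketch below illustrates the alignment stage as the summary describes it: the policy rolls out on-policy responses, a two-expert (perception/reasoning) MoE discriminator scores whole responses against curated demonstrations, and the policy is updated with a REINFORCE-style response-level reward. This is a minimal interpretation under stated assumptions, not the paper's implementation; `policy.sample`, `embed_fn`, and all module names are illustrative placeholders.

```python
import torch
import torch.nn as nn

class MoEDiscriminator(nn.Module):
    """Response-level discriminator with dedicated perception and reasoning
    experts whose scores are combined by a learned gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.perception_expert = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.reasoning_expert = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.gate = nn.Linear(dim, 2)

    def forward(self, resp_emb: torch.Tensor) -> torch.Tensor:
        # resp_emb: (batch, dim), one pooled embedding per full response
        weights = torch.softmax(self.gate(resp_emb), dim=-1)           # (batch, 2)
        scores = torch.cat([self.perception_expert(resp_emb),
                            self.reasoning_expert(resp_emb)], dim=-1)  # (batch, 2)
        return (weights * scores).sum(dim=-1)                          # one logit per response

def alignment_step(policy, disc, disc_opt, policy_opt,
                   prompts, demo_embs, embed_fn):
    """One adversarial alignment update. The stage is black-box: only
    sampled responses are consumed, never teacher logits."""
    # 1) Roll out on-policy responses and their sequence log-probs.
    responses, logprobs = policy.sample(prompts)   # assumed policy interface
    resp_embs = embed_fn(responses)                # assumed pooled-embedding helper

    # 2) Discriminator step: curated demos are "real", policy samples "fake".
    bce = nn.BCEWithLogitsLoss()
    d_loss = (bce(disc(demo_embs), torch.ones(demo_embs.size(0))) +
              bce(disc(resp_embs.detach()), torch.zeros(resp_embs.size(0))))
    disc_opt.zero_grad(); d_loss.backward(); disc_opt.step()

    # 3) Policy step: the discriminator's score acts as a response-level
    #    reward, pushed through the sequence log-prob (REINFORCE-style).
    reward = torch.sigmoid(disc(resp_embs)).detach()  # in [0, 1], per response
    p_loss = -(reward * logprobs).mean()
    policy_opt.zero_grad(); p_loss.backward(); policy_opt.step()
```

The gate is what lets the two experts specialize: a perception-heavy failure and a reasoning-heavy failure can receive different corrective pressure even though the reward is delivered at the whole-response level.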

Abstract

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher-fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step-by-step reasoning on the hardest unsolved problems. Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, improving average accuracy by +4.4 and +6.0 points over the SFT-to-RLVR baseline on 4B and 8B, respectively. Our code, data, and model checkpoints are publicly available at https://github.com/XIAO4579/PRISM.
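
For readers unfamiliar with the RL algorithms named in the abstract, the following is a minimal sketch of the reward signal RLVR methods such as GRPO consume: each prompt gets a group of sampled responses, each response gets a binary verifiable reward, and advantages are the group-normalized rewards, with no learned value network. The answer checker here is a hypothetical stand-in, not the paper's code.

```python
import torch

def verifiable_reward(response: str, gold: str) -> float:
    # Binary verifiable reward: 1.0 if the response ends with the gold
    # answer, else 0.0. A real checker would parse the final answer out
    # of the response; this string check is only a stand-in.
    return 1.0 if response.strip().endswith(gold.strip()) else 0.0

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # GRPO: normalize each sampled response's reward by its group's mean
    # and standard deviation.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Eight sampled responses to one prompt, three of them correct:
rewards = torch.tensor([1., 0., 0., 1., 1., 0., 0., 0.])
print(grpo_advantages(rewards))  # correct samples get positive advantage
```

DAPO and GSPO change how these advantages are clipped and aggregated during the policy update, but they consume the same verifiable, group-relative signal, which is why PRISM's pre-alignment can benefit all three.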