PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning
arXiv cs.CL / 5/1/2026
📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- The paper argues that the common LMM alignment pipeline (SFT on curated demos followed by RLVR) suffers from distributional drift: SFT can degrade the model's original capabilities and leave its on-policy outputs mismatched with the supervision distribution, a problem that is especially acute in multimodal reasoning, where perception errors and reasoning errors drift differently.
- It proposes PRISM, a three-stage pipeline that inserts an explicit distribution-alignment step between SFT and RLVR, implemented as on-policy distillation framed as a black-box, response-level adversarial game with a Mixture-of-Experts (MoE) discriminator (see the sketch after this list).
- The MoE discriminator provides disentangled corrective signals for perception and reasoning without requiring access to teacher logits, keeping the alignment step black-box, in contrast to logit-based teacher distillation.
- The authors augment training with 113K additional high-fidelity demonstrations generated by Gemini 3 Flash (dense visual grounding and step-by-step reasoning) on top of 1.26M public demos to improve alignment quality.
- Experiments on Qwen3-VL show PRISM yields consistent downstream RLVR gains across multiple RL algorithms and benchmarks, boosting average accuracy by +4.4 points (4B) and +6.0 points (8B) over the SFT-to-RLVR baseline; code, data, and checkpoints are publicly released.
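
The bullets above name the mechanism, but this digest does not reproduce the paper's formulation, so the following is only a minimal PyTorch sketch of what a black-box, response-level adversarial game with a Mixture-of-Experts discriminator could look like. Every name in it is a hypothetical stand-in (`MoEDiscriminator`, `RESP_DIM`, and the random tensors playing teacher demonstrations and on-policy student rollouts); the point is structural: the discriminator sees only whole-response encodings, never teacher logits, and its score acts as a response-level reward.

```python
# Hypothetical toy sketch -- module names, dimensions, and data are all stand-ins,
# not the paper's actual architecture or losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

RESP_DIM = 64  # assumed fixed-size encoding of a whole response (response-level, black-box)


class MoEDiscriminator(nn.Module):
    """Gated mixture of a 'perception' expert and a 'reasoning' expert.

    Each expert scores how teacher-like a response encoding looks; the gate's
    soft routing is what would let corrective signals stay disentangled.
    """

    def __init__(self, dim: int = RESP_DIM):
        super().__init__()
        self.perception_expert = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.reasoning_expert = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.gate = nn.Linear(dim, 2)

    def forward(self, resp: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(resp), dim=-1)           # (B, 2) routing weights
        scores = torch.cat([self.perception_expert(resp),
                            self.reasoning_expert(resp)], dim=-1)  # (B, 2) expert logits
        return (weights * scores).sum(dim=-1)                      # (B,) mixed logit


disc = MoEDiscriminator()
disc_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)

# Random tensors stand in for encoded teacher demos and on-policy student samples.
teacher_resp = torch.randn(32, RESP_DIM) + 1.0
student_resp = torch.randn(32, RESP_DIM)

# (1) Discriminator step: tell teacher demonstrations apart from student rollouts.
logits_t, logits_s = disc(teacher_resp), disc(student_resp)
disc_loss = (F.binary_cross_entropy_with_logits(logits_t, torch.ones_like(logits_t))
             + F.binary_cross_entropy_with_logits(logits_s, torch.zeros_like(logits_s)))
disc_opt.zero_grad()
disc_loss.backward()
disc_opt.step()

# (2) Student step: the discriminator score is a response-level reward, so only
# sampled responses are needed from either model -- never teacher logits.
with torch.no_grad():
    reward = torch.sigmoid(disc(student_resp))  # higher = more teacher-like
print(f"disc_loss={disc_loss.item():.3f}  mean_reward={reward.mean().item():.3f}")
```

In a PRISM-style pipeline this loop would sit between SFT and RLVR, with the sigmoid reward feeding a policy-gradient update on the student LMM; the paper's actual encoder, losses, and expert routing may differ.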
Related Articles

- Black Hat USA (AI Business)
- Why Autonomous Coding Agents Keep Failing — And What Actually Works (Dev.to)
- Text-to-image is easy. Chaining LLMs to generate, critique, and iterate on images autonomously is a routing nightmare. AgentSwarms now supports Image generation playground and creative media workflows! (Reddit r/artificial)
- Announcing the NVIDIA Nemotron 3 Super Build Contest (Dev.to)
- 75% of Sites Blocking AI Bots Still Get Cited. Here Is Why Blocking Does Not Work. (Dev.to)