SaFeR-Steer: Evolving Multi-Turn MLLMs via Synthetic Bootstrapping and Feedback Dynamics

arXiv cs.LG / 4/21/2026


Key Points

  • The paper highlights a training–deployment mismatch in multimodal long-context multi-turn LLM use, where unsafe intent can escalate through evolving visual-text context and where safety can decay over long dialogues.
  • It proposes SaFeR-Steer, a progressive multi-turn alignment framework that uses staged synthetic bootstrapping plus tutor-in-the-loop GRPO to train a single “student” model under adaptive, on-policy attacks.
  • The method introduces TCSR, which propagates late-turn safety failures back to earlier turns using trajectory-based minimum/average safety metrics, aiming to prevent escalation patterns.
  • The authors release the STEER dataset (with STEER-SFT, STEER-RL, and STEER-Bench) covering 2–10 turns, and report sizable improvements when starting from Qwen2.5-VL models on both single-turn and multi-turn safety/helpfulness benchmarks.
  • Source code is provided, indicating the approach and dataset are intended for replication and further research.
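The TCSR idea above (propagating late-turn safety failures back to earlier turns via trajectory minimum/average safety) can be illustrated with a small sketch. The paper's exact formula is not reproduced in this summary, so the function below, its name `tcsr_rewards`, and the blending weight `alpha` are all illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of TCSR-style credit propagation (assumption: a
# convex blend of suffix-minimum and suffix-average safety; the paper's
# actual reward may differ). Each turn's reward depends on the safety of
# the rest of the trajectory, so a failure at turn t lowers the reward of
# every turn at or before t.

def tcsr_rewards(turn_safety, alpha=0.5):
    """turn_safety: per-turn safety scores in [0, 1], one per dialogue turn.
    Returns per-turn rewards blending each suffix's minimum and average
    safety, so late failures are propagated back to earlier turns."""
    rewards = []
    for t in range(len(turn_safety)):
        suffix = turn_safety[t:]                  # this turn onward
        traj_min = min(suffix)                    # worst remaining outcome
        traj_avg = sum(suffix) / len(suffix)      # average remaining safety
        rewards.append(alpha * traj_min + (1 - alpha) * traj_avg)
    return rewards

# A dialogue that stays safe until a late-turn failure: every earlier
# turn's reward is pulled below 1.0 by the trajectory minimum.
print(tcsr_rewards([1.0, 1.0, 1.0, 0.0]))
```

Under this sketch, pure per-turn scoring would give the first three turns full credit; the trajectory terms instead penalize the whole escalation path, which is the anti-escalation behavior the key point describes.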

Abstract

MLLMs are increasingly deployed in multi-turn settings, where attackers can escalate unsafe intent through the evolving visual-text history and exploit long-context safety decay. Yet safety alignment is still dominated by single-turn data and fixed-template dialogues, leaving a mismatch between training and deployment. To bridge this gap, we propose SaFeR-Steer, a progressive multi-turn alignment framework that combines staged synthetic bootstrapping with tutor-in-the-loop GRPO to train a single student under adaptive, on-policy attacks. We also introduce TCSR, which uses trajectory minimum/average safety to propagate late-turn failures to earlier turns.

I. Dataset. We release STEER, a multi-turn multimodal safety dataset with STEER-SFT (12,934), STEER-RL (2,000), and STEER-Bench (3,227) dialogues spanning 2–10 turns.

II. Experiment. Starting from Qwen2.5-VL-3B/7B, SaFeR-Steer substantially improves Safety/Helpfulness on both single-turn (48.30/45.86 -> 81.84/70.77 for 3B; 56.21/60.32 -> 87.89/77.40 for 7B) and multi-turn benchmarks (12.55/27.13 -> 55.58/70.27 for 3B; 24.66/46.48 -> 64.89/72.35 for 7B), shifting failures to later turns and yielding robustness beyond scaling alone.

Code is available at https://github.com/Ed-Bg/SaFeR-Steer
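The GRPO component mentioned in the abstract centers on group-relative advantages: for each prompt, a group of responses is sampled, scored, and normalized within the group, with no learned value critic. The sketch below shows only that generic normalization step; the paper's tutor-in-the-loop variant, its sampling setup, and the function name `group_relative_advantages` are not specified in this summary and are assumptions here:

```python
# Minimal sketch of the group-relative advantage computation at the core
# of GRPO-style training (generic form; not the paper's exact objective).
# Rewards for a group of responses to one prompt are standardized within
# the group, so advantages are relative to the group's own mean.

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: scalar scores for a group of sampled responses to one
    prompt. Returns mean-zero, std-normalized advantages."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Responses scoring above the group mean get positive advantage.
advs = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
```

In a tutor-in-the-loop setup such as the one the abstract describes, the per-response rewards fed into this step would come from the tutor's safety/helpfulness judgments rather than a fixed reward model.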