Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling

arXiv cs.CV / 4/7/2026


Key Points

  • The paper addresses instability in unsupervised self-evolution for multimodal LLMs, arguing that majority-vote pseudo-labeling can reinforce intrinsic model biases rather than true correctness.
  • It proposes CSRS, which includes a Retracing Re-inference Mechanism (RRM) that re-infers from anchor points of existing traces to better explore long-tail reasoning trajectories.
  • CSRS introduces Softened Frequency Reward (SFR), using continuous, frequency-calibrated reward signals instead of binary feedback to reduce degradation during post-training.
  • To prevent over-reliance on superficial multimodal cues, the method incorporates Visual Semantic Perturbation (VSP) to steer the model toward mathematical/logical reasoning.
  • Experiments report significantly improved reasoning performance for Qwen2.5-VL-7B on benchmarks like MathVision and state-of-the-art results on geometric tasks, with code provided on GitHub.
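The contrast between binary majority voting and the softened, frequency-calibrated reward described above can be illustrated with a minimal sketch. This is a hypothetical formulation for intuition only: the paper's exact SFR definition, normalization, and temperature handling are not given here, and the `temperature` parameter is an assumption.

```python
from collections import Counter

def majority_vote_reward(answers):
    """Binary baseline: reward 1 for samples matching the most
    frequent answer, 0 for everything else."""
    top, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == top else 0.0 for a in answers]

def softened_frequency_reward(answers, temperature=1.0):
    """Hypothetical softened reward: each sampled answer gets a
    continuous reward in (0, 1] based on how often it appears in
    the sampled set, instead of an all-or-nothing vote."""
    counts = Counter(answers)
    n = len(answers)
    # Empirical frequency of each distinct answer among the samples.
    freqs = {a: c / n for a, c in counts.items()}
    # Temperature shapes the reward: smaller values sharpen the gap
    # between frequent and rare answers; 1.0 uses raw frequencies.
    rewards = {a: f ** (1.0 / temperature) for a, f in freqs.items()}
    return [rewards[a] for a in answers]
```

Under this sketch, a minority answer still receives a small positive signal proportional to its frequency, rather than being zeroed out, which is the property the paper credits with reducing degradation during post-training.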

Abstract

In the unsupervised self-evolution of Multimodal Large Language Models, the quality of feedback signals during post-training is pivotal for stable and effective learning. However, existing self-evolution methods predominantly rely on majority voting to select the most frequent output as the pseudo-golden answer, which may stem from the model's intrinsic biases rather than guarantee the objective correctness of the reasoning paths. To counteract this degradation, we propose \textbf{C}ontinuous \textbf{S}oftened \textbf{R}etracing re\textbf{S}ampling (\textbf{CSRS}) for MLLM self-evolution. Specifically, we introduce a Retracing Re-inference Mechanism (\textbf{RRM}), in which the model re-infers from anchor points to expand the exploration of long-tail reasoning paths. Simultaneously, we propose the Softened Frequency Reward (\textbf{SFR}), which replaces binary rewards with continuous signals, calibrating rewards by each answer's frequency across the sampled reasoning sets. Furthermore, combined with Visual Semantic Perturbation (\textbf{VSP}), CSRS ensures the model prioritizes mathematical logic over superficial visual cues. Experimental results demonstrate that CSRS significantly enhances the reasoning performance of Qwen2.5-VL-7B on benchmarks such as MathVision, achieving state-of-the-art (SOTA) results in unsupervised self-evolution on geometric tasks. Our code is available at https://github.com/yyy195/CSRS.
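The retracing re-inference idea in the abstract can be sketched at a high level: instead of always sampling full reasoning paths from scratch, re-generate from intermediate anchor points of an existing trace. This is a speculative illustration, not the paper's implementation; `generate_continuation` is an assumed stand-in for the model's decoder, and anchor selection here is uniform random purely for demonstration.

```python
import random

def retrace_resample(steps, n_retraces, generate_continuation):
    """Hypothetical sketch of retracing re-inference (RRM-style):
    keep a prefix of an existing reasoning trace up to a random
    anchor step, then re-infer the remainder, so sampling is not
    dominated by the model's most likely complete paths.

    steps: list of reasoning-step strings from one existing trace.
    generate_continuation(prefix): assumed callable that returns a
    list of new steps continuing the given prefix.
    """
    traces = []
    for _ in range(n_retraces):
        # Pick a random anchor strictly inside the trace.
        anchor = random.randrange(1, len(steps))
        prefix = steps[:anchor]
        # Re-infer from the anchor to explore alternative continuations.
        traces.append(prefix + generate_continuation(prefix))
    return traces
```

Re-inferred traces produced this way would then be scored (e.g. with a frequency-calibrated reward over their final answers) to form the post-training signal.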