TED: Training-Free Experience Distillation for Multimodal Reasoning

arXiv cs.LG / 3/31/2026


Key Points

  • TED introduces a training-free, context-based knowledge distillation method for multimodal reasoning that transfers teacher “reasoning experiences” into the student’s prompt rather than updating model parameters.
  • For each input, the student samples multiple reasoning trajectories while the teacher independently generates its own solution; the teacher then compares the student trajectories against its own reasoning and the ground-truth answer to extract effective reasoning patterns.
  • TED maintains and continuously refines an experience buffer, using an experience compression mechanism that prevents unbounded growth and reduces noise by selectively merging, rewriting, or removing low-utility experiences.
  • Experiments on multimodal reasoning benchmarks (MathVision and VisualPuzzles) show consistent performance gains, including improvements for Qwen3-VL-8B from 0.627 to 0.702 (MathVision) and 0.517 to 0.561 (VisualPuzzles) using only 100 training samples.
  • The results indicate that meaningful knowledge transfer is possible in low-data, no-parameter-update settings, achieving performance competitive with parameter-based distillation while cutting training cost by more than 5x.
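The per-input loop described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: `student` and `teacher` are placeholder callables standing in for model APIs, and the prompt formats are assumptions.

```python
def ted_step(student, teacher, question, answer, buffer, n_samples=4):
    """One TED distillation step: the experience buffer is updated,
    never the student's parameters."""
    # 1. Student samples multiple reasoning trajectories, conditioned
    #    on the current experience buffer injected into its prompt.
    context = "\n".join(buffer)
    trajectories = [
        student(f"Experiences:\n{context}\n\nQ: {question}")
        for _ in range(n_samples)
    ]

    # 2. Teacher independently produces its own solution.
    teacher_solution = teacher(f"Q: {question}")

    # 3. Teacher compares the student trajectories with its own
    #    reasoning and the ground-truth answer, distilling a
    #    generalized, reusable experience.
    critique = (
        "Student attempts:\n" + "\n---\n".join(trajectories)
        + f"\n\nReference solution:\n{teacher_solution}"
        + f"\nGround truth: {answer}\n"
        "Extract one reusable reasoning experience."
    )
    experience = teacher(critique)

    # 4. The experience, not a gradient, is what gets "distilled".
    buffer.append(experience)
    return buffer
```

For a multimodal model the two callables would also take the image; it is omitted here to keep the sketch minimal.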

Abstract

Knowledge distillation is typically realized by transferring a teacher model's knowledge into a student's parameters through supervised or reinforcement-based optimization. While effective, such approaches require repeated parameter updates and large-scale training data, limiting their applicability in resource-constrained environments. In this work, we propose TED, a training-free, context-based distillation framework that shifts the update target of distillation from model parameters to an in-context experience injected into the student's prompt. For each input, the student generates multiple reasoning trajectories, while a teacher independently produces its own solution. The teacher then compares the student trajectories with its reasoning and the ground-truth answer, extracting generalized experiences that capture effective reasoning patterns. These experiences are continuously refined and updated over time. A key challenge of context-based distillation is unbounded experience growth and noise accumulation. TED addresses this with an experience compression mechanism that tracks usage statistics and selectively merges, rewrites, or removes low-utility experiences. Experiments on multimodal reasoning benchmarks MathVision and VisualPuzzles show that TED consistently improves performance. On MathVision, TED raises the performance of Qwen3-VL-8B from 0.627 to 0.702, and on VisualPuzzles from 0.517 to 0.561 with just 100 training samples. Under this low-data, no-update setting, TED achieves performance competitive with fully trained parameter-based distillation while reducing training cost by over 5x, demonstrating that meaningful knowledge transfer can be achieved through contextual experience.
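The experience compression mechanism can be illustrated with a short sketch. The utility scoring, thresholds, and merge policy below are assumptions for exposition; the paper specifies only that usage statistics drive selective merging, rewriting, and removal to bound buffer size and noise.

```python
def compress(buffer, merge_fn, max_size=32, min_utility=0.2):
    """buffer: list of dicts {"text": str, "uses": int, "wins": int}.
    merge_fn(a, b) -> str stands in for a teacher call that rewrites
    two experiences into one."""
    # Utility here = empirical success rate while the experience was
    # in context (an assumed proxy for the paper's usage statistics).
    def utility(e):
        return e["wins"] / e["uses"] if e["uses"] > 0 else 1.0  # keep untried entries

    # Removal: drop experiences that demonstrably rarely help.
    kept = [e for e in buffer if utility(e) >= min_utility]

    # Merge/rewrite: while over budget, fold the two lowest-utility
    # survivors into a single rewritten experience.
    kept.sort(key=utility, reverse=True)
    while len(kept) > max_size:
        a, b = kept.pop(), kept.pop()
        kept.append({
            "text": merge_fn(a["text"], b["text"]),
            "uses": a["uses"] + b["uses"],
            "wins": a["wins"] + b["wins"],
        })
        kept.sort(key=utility, reverse=True)
    return kept
```

Bounding the buffer this way is what keeps context-based distillation viable over many inputs: the prompt stays within the context window while accumulated noise is pruned rather than compounded.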