MAVFusion: Efficient Infrared and Visible Video Fusion via Motion-Aware Sparse Interaction

arXiv cs.CV / 4/3/2026


Key Points

  • MAVFusion is an end-to-end framework for infrared/visible video fusion that handles frame-to-frame motion, which earlier methods designed for static image fusion struggle with.
  • It uses optical flow to detect dynamic regions and then applies computationally expensive cross-modal attention only to these sparse motion-related areas, improving efficiency while preserving salient transitions.
  • For static background regions, the method switches to a lightweight weak-interaction module to maintain structural and appearance consistency across time.
  • The approach decouples dynamic and static processing to improve both temporal consistency and fine-grained fusion details, while substantially accelerating inference.
  • Experiments report state-of-the-art results on multiple infrared/visible video benchmarks and a throughput of 14.16 FPS at 640×480 resolution, with code to be released on GitHub.
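The routing described above can be sketched as follows. This is a hypothetical, simplified illustration (not the authors' implementation): the paper uses optical flow and learned cross-modal attention, whereas this sketch substitutes frame differencing for flow and a toy saliency weighting for attention. All function names and thresholds are assumptions.

```python
import numpy as np

def _cross_modal_weight(ir_px, vis_px):
    # Toy saliency weighting: favor the modality with higher local
    # intensity (placeholder for the paper's learned cross-modal attention).
    return ir_px / (ir_px + vis_px + 1e-6)

def fuse_frames(ir, vis, prev_ir, motion_thresh=1.0):
    # Approximate motion detection by frame differencing; the paper
    # identifies dynamic regions with optical flow instead.
    motion = np.abs(ir.astype(np.float32) - prev_ir.astype(np.float32))
    dynamic = motion > motion_thresh  # sparse mask of moving pixels

    fused = np.empty_like(ir, dtype=np.float32)

    # Static background: lightweight weak interaction (a fixed blend
    # stands in for the paper's weak-interaction module).
    fused[~dynamic] = 0.5 * ir[~dynamic] + 0.5 * vis[~dynamic]

    # Dynamic regions: the expensive interaction is applied only to the
    # sparse set of moving pixels, which is the source of the speedup.
    w = _cross_modal_weight(ir[dynamic], vis[dynamic])
    fused[dynamic] = w * ir[dynamic] + (1.0 - w) * vis[dynamic]
    return fused
```

The key design point is the decoupling: the per-pixel mask routes each location to exactly one of the two branches, so the heavy computation scales with the amount of motion rather than with frame size.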

Abstract

Infrared and visible video fusion combines the object saliency from infrared images with the texture details from visible images to produce semantically rich fusion results. However, most existing methods are designed for static image fusion and cannot effectively handle frame-to-frame motion in videos. Current video fusion methods improve temporal consistency by introducing interactions across frames, but they often require high computational cost. To mitigate these challenges, we propose MAVFusion, an end-to-end video fusion framework featuring a motion-aware sparse interaction mechanism that enhances efficiency while maintaining superior fusion quality. Specifically, we leverage optical flow to identify dynamic regions in multi-modal sequences, adaptively allocating computationally intensive cross-modal attention to these sparse areas to capture salient transitions and facilitate inter-modal information exchange. For static background regions, a lightweight weak interaction module is employed to maintain structural and appearance integrity. By decoupling the processing of dynamic and static regions, MAVFusion simultaneously preserves temporal consistency and fine-grained details while significantly accelerating inference. Extensive experiments demonstrate that MAVFusion achieves state-of-the-art performance on multiple infrared and visible video benchmarks, achieving a speed of 14.16 FPS at 640×480 resolution. The source code will be available at https://github.com/ixilai/MAVFusion.