MAVFusion: Efficient Infrared and Visible Video Fusion via Motion-Aware Sparse Interaction

arXiv cs.CV / 4/3/2026



Key Points

MAVFusion is an end-to-end framework for infrared/visible video fusion that addresses the difficulty of handling frame-to-frame motion, which earlier static image fusion methods struggle with.
It uses optical flow to detect dynamic regions and then applies computationally expensive cross-modal attention only to these sparse motion-related areas, improving efficiency while preserving salient transitions.
For static background regions, the method switches to a lightweight weak-interaction module to maintain structural and appearance consistency across time.
The approach decouples dynamic and static processing to improve both temporal consistency and fine-grained fusion details, while substantially accelerating inference.
Experiments report state-of-the-art results on multiple infrared/visible video benchmarks and a throughput of 14.16 FPS at 640×480 resolution, with code to be released on GitHub.
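The motion-detection step in the points above can be illustrated with a small sketch. It assumes a dense optical flow field is already available as a grid of (dx, dy) displacements and marks a patch as dynamic when its mean flow magnitude exceeds a threshold. The function name `dynamic_patch_mask`, the patch size, and the threshold are hypothetical choices for illustration; the summary does not describe how MAVFusion derives its dynamic regions from the flow.

```python
import math

def dynamic_patch_mask(flow, patch=2, threshold=1.0):
    """Mark a patch as dynamic when its mean flow magnitude exceeds threshold.

    flow: H x W grid of (dx, dy) displacement vectors in pixels.
    Returns an (H // patch) x (W // patch) grid of booleans.
    Patch size and threshold are illustrative assumptions.
    """
    rows, cols = len(flow) // patch, len(flow[0]) // patch
    mask = []
    for pr in range(rows):
        mask_row = []
        for pc in range(cols):
            total = 0.0
            # Sum the flow magnitudes of every pixel inside this patch.
            for r in range(pr * patch, (pr + 1) * patch):
                for c in range(pc * patch, (pc + 1) * patch):
                    dx, dy = flow[r][c]
                    total += math.hypot(dx, dy)
            mask_row.append(total / (patch * patch) > threshold)
        mask.append(mask_row)
    return mask

# Toy 4x4 flow field: only the top-left 2x2 patch moves.
flow = [[(0.0, 0.0)] * 4 for _ in range(4)]
for r in range(2):
    for c in range(2):
        flow[r][c] = (3.0, 4.0)  # magnitude of 5 pixels

mask = dynamic_patch_mask(flow)
# Fraction of patches that would be sent to the expensive attention path.
dynamic_ratio = sum(map(sum, mask)) / (len(mask) * len(mask[0]))
```

In this toy case one of four patches is flagged, so only a quarter of the frame would need the costly cross-modal attention. The efficiency argument in the summary rests on this ratio being small in typical footage.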

Abstract

Infrared and visible video fusion combines the object saliency from infrared images with the texture details from visible images to produce semantically rich fusion results. However, most existing methods are designed for static image fusion and cannot effectively handle frame-to-frame motion in videos. Current video fusion methods improve temporal consistency by introducing interactions across frames, but they often require high computational cost. To mitigate these challenges, we propose MAVFusion, an end-to-end video fusion framework featuring a motion-aware sparse interaction mechanism that enhances efficiency while maintaining superior fusion quality. Specifically, we leverage optical flow to identify dynamic regions in multi-modal sequences, adaptively allocating computationally intensive cross-modal attention to these sparse areas to capture salient transitions and facilitate inter-modal information exchange. For static background regions, a lightweight weak interaction module is employed to maintain structural and appearance integrity. By decoupling the processing of dynamic and static regions, MAVFusion simultaneously preserves temporal consistency and fine-grained details while significantly accelerating inference. Extensive experiments demonstrate that MAVFusion achieves state-of-the-art performance on multiple infrared and visible video benchmarks, achieving a speed of 14.16 FPS at 640×480 resolution. The source code will be available at https://github.com/ixilai/MAVFusion.
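The decoupled processing described in the abstract can be sketched as a per-token routing rule: tokens flagged as dynamic go through cross-modal attention, and the remaining tokens take a cheap path. This is a minimal sketch under stated assumptions, not the authors' implementation. The single-head attention, the residual addition, and the element-wise mean used as a stand-in for the "weak interaction module" are all illustrative choices, and the names `cross_attend` and `fuse_tokens` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention for one query token."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def fuse_tokens(ir, vis, is_dynamic):
    """Route each token: attention if dynamic, plain averaging if static."""
    # Only the visible tokens in dynamic regions serve as keys and values.
    vis_dyn = [vis[i] for i, flag in enumerate(is_dynamic) if flag]
    fused = []
    for i, flag in enumerate(is_dynamic):
        if flag:
            # Expensive path: the infrared token queries the dynamic visible tokens.
            attended = cross_attend(ir[i], vis_dyn, vis_dyn)
            fused.append([a + b for a, b in zip(ir[i], attended)])
        else:
            # Cheap path: an element-wise mean stands in for weak interaction.
            fused.append([(a + b) / 2 for a, b in zip(ir[i], vis[i])])
    return fused

# Three 2-dimensional tokens per modality; only the first is dynamic.
ir = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
vis = [[0.0, 2.0], [4.0, 0.0], [0.0, 0.0]]
is_dynamic = [True, False, False]
fused = fuse_tokens(ir, vis, is_dynamic)
```

Because attention cost grows with the square of the number of participating tokens, restricting it to the dynamic subset is where the savings would come from in a scheme like this; the static path stays linear in the token count.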


