FTPFusion: Frequency-Aware Infrared and Visible Video Fusion with Temporal Perturbation

arXiv cs.CV / 4/3/2026


Key Points

  • FTPFusion is a frequency-aware method for fusing infrared and visible videos that aims to improve both spatial detail and temporal stability, which are often in tension in existing approaches.
  • The model splits features into high-frequency and low-frequency components: the high-frequency branch applies sparse cross-modal spatio-temporal interaction to capture motion cues and complementary details, while the low-frequency branch uses a temporal perturbation strategy for robustness to flicker, jitter, and local misalignment.
  • FTPFusion introduces an offset-aware temporal consistency constraint to explicitly stabilize cross-frame representations when temporal disturbances occur.
  • Experiments on multiple public benchmarks show FTPFusion outperforming state-of-the-art fusion methods on metrics covering spatial fidelity and temporal consistency.
  • The authors state that the source code will be released on GitHub, enabling replication and downstream research use.
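The frequency split in the first two points can be illustrated with a minimal sketch. The abstract does not specify the decomposition operator, so the box-filter low-pass used here (and the function name `frequency_split`) is an assumption chosen only to show the general idea: a smoothed low-frequency component plus a residual high-frequency component that carries edges and fine detail.

```python
import numpy as np

def frequency_split(feat, k=3):
    """Split a 2-D feature map into low- and high-frequency parts.

    Illustrative only: a simple k x k box filter stands in for
    whatever low-pass operator FTPFusion actually uses.
    """
    pad = k // 2
    padded = np.pad(feat, pad, mode="edge")  # edge-pad so output size matches input
    H, W = feat.shape
    low = np.empty_like(feat, dtype=float)
    for i in range(H):
        for j in range(W):
            # local mean = low-pass response at (i, j)
            low[i, j] = padded[i:i + k, j:j + k].mean()
    high = feat - low  # residual keeps edges / fine detail
    return low, high
```

By construction the two components sum back to the original feature map, so the decomposition is lossless; in the paper the two branches would then be processed separately and fused.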

Abstract

Infrared and visible video fusion plays a critical role in intelligent surveillance and low-light monitoring. However, maintaining temporal stability while preserving spatial detail remains a fundamental challenge. Existing methods either focus on frame-wise enhancement with limited temporal modeling or rely on heavy spatio-temporal aggregation that often sacrifices high-frequency details. In this paper, we propose FTPFusion, a frequency-aware infrared and visible video fusion method based on temporal perturbation and sparse cross-modal interaction. Specifically, FTPFusion decomposes the feature representations into high-frequency and low-frequency components for collaborative modeling. The high-frequency branch performs sparse cross-modal spatio-temporal interaction to capture motion-related context and complementary details. The low-frequency branch introduces a temporal perturbation strategy to enhance robustness against complex video variations, such as flickering, jitter, and local misalignment. Furthermore, we design an offset-aware temporal consistency constraint to explicitly stabilize cross-frame representations under temporal disturbances. Extensive experiments on multiple public benchmarks demonstrate that FTPFusion consistently outperforms state-of-the-art methods across multiple metrics in both spatial fidelity and temporal consistency. The source code will be available at https://github.com/ixilai/FTPFusion.
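One plausible form of the offset-aware temporal consistency constraint mentioned in the abstract is an L1 penalty between consecutive fused frames after compensating the estimated motion offset. The sketch below assumes integer pixel offsets and the function name `temporal_consistency_loss`; the paper's actual constraint is not detailed in the abstract, so this is an illustration of the general mechanism, not the authors' implementation.

```python
import numpy as np

def temporal_consistency_loss(f_t, f_t1, offset):
    """Mean L1 difference between frame t+1 and frame t after
    shifting frame t by an (integer) motion offset (dy, dx).

    Hypothetical form of an offset-aware consistency term: if the
    offset correctly aligns the frames, the loss is near zero, so
    minimizing it stabilizes cross-frame representations.
    """
    dy, dx = offset
    # np.roll wraps around the borders; a real implementation would
    # mask or crop the wrapped region instead.
    shifted = np.roll(f_t, shift=(dy, dx), axis=(0, 1))
    return float(np.abs(f_t1 - shifted).mean())
```

With a perfectly compensated offset the loss vanishes, while uncompensated jitter or misalignment leaves a residual that the training objective can penalize.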