Frames2Residual: Spatiotemporal Decoupling for Self-Supervised Video Denoising
arXiv cs.CV / 3/12/2026
Key Points
- The paper introduces Frames2Residual (F2R), a self-supervised video denoising framework that decouples spatiotemporal training into two stages: blind temporal consistency modeling and non-blind spatial texture recovery.
- Stage 1 uses a frame-wise blind temporal estimator to learn inter-frame consistency and produce a temporally stable anchor without relying on center-pixel masking.
- Stage 2 employs a non-blind spatial refiner that uses the temporal anchor to safely reintroduce the center frame and recover high-frequency spatial residuals while preserving temporal stability.
- Experiments show that F2R outperforms existing self-supervised methods on both sRGB and raw video benchmarks, supporting the effectiveness of spatiotemporal decoupling for video denoising.
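The two stages above can be caricatured in a few lines. The sketch below is purely illustrative and is not the paper's method: F2R trains learned networks for both stages, whereas here the "blind temporal estimator" is just a mean over neighboring frames (which never sees the center frame, mirroring the blind property), and the "non-blind refiner" is a blend that reintroduces the center frame against the anchor. The names `temporal_anchor`, `spatial_refine`, and the blend weight `alpha` are my own, not from the paper.

```python
import numpy as np

def temporal_anchor(frames, t):
    """Stage 1 (illustrative stand-in): blind temporal estimate for frame t.
    Averages the adjacent frames while excluding frame t itself, so the
    estimate never observes the noise of the frame it predicts."""
    neighbors = [frames[i] for i in (t - 1, t + 1) if 0 <= i < len(frames)]
    return np.mean(neighbors, axis=0)

def spatial_refine(anchor, center, alpha=0.3):
    """Stage 2 (illustrative stand-in): non-blind refinement. With the
    temporally stable anchor as reference, the noisy center frame is
    partially blended back in to recover spatial detail."""
    return anchor + alpha * (center - anchor)

if __name__ == "__main__":
    # Toy demo: a static clean scene corrupted by Gaussian noise.
    rng = np.random.default_rng(0)
    clean = np.ones((3, 64, 64))
    noisy = clean + 0.1 * rng.standard_normal(clean.shape)

    anchor = temporal_anchor(noisy, t=1)          # blind temporal stage
    denoised = spatial_refine(anchor, noisy[1])   # non-blind spatial stage

    mse_in = np.mean((noisy[1] - clean[1]) ** 2)
    mse_out = np.mean((denoised - clean[1]) ** 2)
    print(f"input MSE {mse_in:.4f} -> output MSE {mse_out:.4f}")
```

Even this crude blend lowers the error on a static scene, because the anchor's noise is averaged down before the center frame is reintroduced; the paper's contribution is making both stages learned while keeping that same decoupling.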