FUN: A Focal U-Net Combining Reconstruction and Object Detection for Snapshot Spectral Imaging

arXiv cs.CV / 5/1/2026

Key Points

  • Snapshot spectral imaging captures hyperspectral images instantaneously, which makes real-time object detection feasible, but time-consuming post-capture reconstruction often undermines that advantage.
  • The paper introduces FUN (Focal U-shaped Network), an end-to-end multi-task framework that jointly performs HSI reconstruction and object detection using a shared U-shaped backbone.
  • The two tasks interact through the shared backbone: reconstruction supplies the underlying spectral information, while detection guides the learning of semantic-aware priors, which benefits both tasks.
  • To avoid expensive self-attention, the method introduces focal modulation, which modulates spatial and spectral features efficiently and reduces the quadratic computational cost associated with self-attention.
  • The authors release a new HSI object detection dataset (8712 annotated objects across 363 HSIs, an average of 24 per image) and report state-of-the-art results on both tasks with 40% fewer parameters and 30% less computation than recent alternatives, which they describe as promising for future real-time edge deployment.
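
To see why replacing self-attention matters, the rough operation counts below compare the two N x N matrix products of self-attention with a stack of depthwise convolutions. This is a back-of-the-envelope sketch only: the function names, the depthwise-convolution design, and the values of `dim`, `kernel_size`, and `levels` are illustrative assumptions, not figures or architecture details taken from the paper.

```python
def self_attention_macs(num_tokens: int, dim: int) -> int:
    """Multiply-accumulates for the two N x N matrix products in self-attention.

    Counts Q @ K^T and attention @ V only; the linear projections are ignored.
    """
    return 2 * num_tokens * num_tokens * dim


def local_modulation_macs(num_tokens: int, dim: int, kernel_size: int, levels: int) -> int:
    """Multiply-accumulates for `levels` depthwise 2D convolutions.

    Assumption: context is gathered with depthwise convolutions, so the
    cost per token is fixed and the total grows linearly with num_tokens.
    """
    return num_tokens * dim * kernel_size * kernel_size * levels


# Illustrative settings (hypothetical, not from the paper).
dim, kernel_size, levels = 64, 3, 3
for side in (32, 64, 128):
    n = side * side  # number of spatial positions treated as tokens
    attn = self_attention_macs(n, dim)
    local = local_modulation_macs(n, dim, kernel_size, levels)
    print(f"{side}x{side}: attention={attn:,} local={local:,} ratio={attn / local:.0f}x")
```

Doubling the number of tokens quadruples the attention count but only doubles the convolutional count, so the gap widens with image resolution.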

Abstract

Conventional push-broom hyperspectral imaging suffers from slow acquisition speeds, precluding real-time object detection; in contrast, snapshot spectral imaging enables instantaneous capture of hyperspectral images (HSIs), making real-time object detection feasible, yet its potential is often compromised by time-consuming post-capture reconstruction. To address this issue, we propose the Focal U-shaped Network (FUN), a novel end-to-end framework that jointly performs HSI reconstruction and object detection via multi-task learning. FUN employs a shared U-shaped backbone, where reconstruction provides underlying spectral information while detection guides the learning of semantic-aware priors, facilitating mutually beneficial task interaction. Crucially, we introduce focal modulation, an efficient alternative to self-attention that modulates spatial and spectral features while reducing quadratic computational complexity, enabling a self-attention-free architecture for joint reconstruction and detection. Furthermore, we contribute a new HSI object detection dataset with 8712 annotated objects across 363 HSIs to facilitate evaluation of the proposed method. Experiments demonstrate that FUN achieves state-of-the-art performance on both tasks, using 40% fewer parameters and 30% less computation than recent alternatives, making it promising for future real-time edge deployment. The code and datasets are available at https://github.com/ShawnDong98/FUN.
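
The idea of modulating features instead of attending over them can be illustrated with a toy one-dimensional operator. The sketch below is a heavily simplified stand-in, not the module used in FUN: it assumes scalar features, an identity query projection, fixed gate weights in place of learned ones, and box-filter averages in place of learned convolutions. The names `window_mean` and `toy_focal_modulation` are hypothetical.

```python
def window_mean(signal, index, radius):
    """Mean of signal over [index - radius, index + radius], clipped at the borders."""
    lo = max(0, index - radius)
    hi = min(len(signal), index + radius + 1)
    return sum(signal[lo:hi]) / (hi - lo)


def toy_focal_modulation(signal, radii=(1, 2, 4), gates=(0.5, 0.3, 0.2)):
    """Modulate each element by a gated sum of local context summaries.

    Simplifications (all assumptions): scalar features, identity query
    projection, fixed gates instead of learned ones, and box-filter
    context instead of learned depthwise convolutions.
    """
    if len(radii) != len(gates):
        raise ValueError("radii and gates must have the same length")
    output = []
    for i, value in enumerate(signal):
        # Aggregate context from progressively wider neighbourhoods.
        context = sum(g * window_mean(signal, i, r) for r, g in zip(radii, gates))
        # Element-wise modulation: no pairwise score between positions is computed.
        output.append(value * context)
    return output


modulated = toy_focal_modulation([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
print(modulated)
```

Each output depends only on a bounded set of neighbourhoods, so the work per position does not grow with the length of the signal, whereas self-attention scores every position against every other one.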