Real Image Denoising with Knowledge Distillation for High-Performance Mobile NPUs

arXiv cs.CV / 5/6/2026


Key Points

  • The paper proposes an NPU-aware hardware–algorithm co-design method for real-world image denoising on mobile NPUs, addressing operator incompatibility and memory-access overhead.
  • It uses knowledge distillation from a high-capacity teacher to train a lightweight “student” model (LiteDenoiseNet) optimized for tiled-memory SoC architectures.
  • By restricting the network to NPU-native primitives (e.g., 3x3 convolutions, ReLU, nearest-neighbor upsampling) and applying progressive context expansion up to 1024x1024 crops, it achieves strong benchmark PSNR/SSIM scores at full resolution.
  • Runtime results under a standardized Full HD protocol show 34.0 ms on the MediaTek Dimensity 9500 and 46.1 ms on the Qualcomm Snapdragon 8 Elite, along with an “Inference Inversion” effect: strictly NPU-compatible design makes dedicated NPU execution up to 3.88× faster than the integrated mobile GPU.
  • The 1.96M-parameter student recovers 99.8% of the teacher’s quality via high-alpha knowledge distillation (alpha = 0.9), achieving a 21.2× parameter reduction while narrowing the PSNR gap to just 0.05 dB; the model and its training statistics are released via the NN Dataset repository.
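The high-alpha distillation objective described above can be sketched as a weighted blend of a teacher-mimicking term and a supervised term. This is a minimal NumPy illustration, not the paper's implementation: the function names (`distillation_loss`, `l1`) and the use of an L1 loss are assumptions for clarity.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two image arrays."""
    return float(np.mean(np.abs(a - b)))

def distillation_loss(student_out, teacher_out, ground_truth, alpha=0.9):
    """Blend teacher-mimicking and supervised L1 terms.

    With alpha = 0.9 (the 'high-alpha' setting reported in the paper),
    the student is driven mostly toward the teacher's restored output
    and only lightly toward the clean ground truth.
    """
    return alpha * l1(student_out, teacher_out) + (1.0 - alpha) * l1(student_out, ground_truth)

# Toy example on a 4x4 single-channel "image" (values are illustrative)
rng = np.random.default_rng(0)
gt = rng.random((4, 4))        # clean target
teacher = gt + 0.01            # teacher output, nearly clean
student = gt + 0.05            # student output, slightly worse

loss = distillation_loss(student, teacher, gt)  # 0.9*0.04 + 0.1*0.05 = 0.041
```

A perfect teacher match with `alpha = 1.0` drives the loss to zero, which is why a high alpha transfers most of the teacher's behavior to the student.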

Abstract

While deep-learning-based image restoration has achieved unprecedented fidelity, deployment on mobile Neural Processing Units (NPUs) remains bottlenecked by operator incompatibility and memory-access overhead. We propose an NPU-aware hardware-algorithm co-design approach for real-world image denoising on mobile NPUs. Our approach employs a high-capacity teacher to supervise a lightweight student network specifically designed to leverage the tiled-memory architectures of modern mobile SoCs. By prioritizing NPU-native primitives -- standard 3x3 convolutions, ReLU activations, and nearest-neighbor upsampling -- and employing a progressive context expansion strategy (up to 1024x1024 crops), the model achieves 37.66 dB PSNR / 0.9278 SSIM on the validation benchmark and 37.58 dB PSNR / 0.9098 SSIM on the held-out test benchmark at full resolution (2432x3200) in the Mobile AI 2026 challenge. Following the official challenge rules, the inference runtime is measured under a standardized Full HD (1088x1920) protocol, where it runs in 34.0 ms on the MediaTek Dimensity 9500 and 46.1 ms on the Qualcomm Snapdragon 8 Elite NPU. We further reveal an "Inference Inversion" effect, where strict adherence to NPU-compatible operations enables dedicated NPU execution up to 3.88x faster than the integrated mobile GPU. The 1.96M-parameter student recovers 99.8% of the teacher's restoration quality via high-alpha knowledge distillation (alpha = 0.9), achieving a 21.2x parameter reduction while closing the PSNR gap from 1.63 dB to only 0.05 dB. These results establish hardware-aware distillation as an effective strategy for unifying high-fidelity denoising with practical deployment across diverse mobile NPU architectures. The proposed lightweight student model (LiteDenoiseNet) and its training statistics are provided in the NN Dataset, available at https://github.com/ABrain-One/NN-Dataset.
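The NPU-native primitives the abstract restricts itself to (standard 3x3 convolutions, ReLU, nearest-neighbor upsampling) can be illustrated with a toy NumPy sketch. This is not the paper's code: the function names, the 2x upsampling factor, and the naive single-channel convolution loop are illustrative assumptions.

```python
import numpy as np

def nearest_upsample(x, scale=2):
    """Nearest-neighbor upsampling: each pixel is repeated `scale` times
    along both spatial axes -- a single, memory-friendly op on mobile NPUs."""
    return x.repeat(scale, axis=0).repeat(scale, axis=1)

def conv3x3_relu(x, kernel):
    """A standard 3x3 'same' convolution (zero padding, cross-correlation)
    followed by ReLU, written naively for clarity; NPUs fuse this pattern."""
    h, w = x.shape
    padded = np.pad(x, 1)
    out = np.zeros((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return np.maximum(out, 0.0)  # ReLU

# Toy usage: upsample a 2x2 feature map, then apply an identity kernel
x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
up = nearest_upsample(x)               # 4x4; each value fills a 2x2 block
identity = np.zeros((3, 3))
identity[1, 1] = 1.0
y = conv3x3_relu(up, identity)         # identity kernel leaves values unchanged
```

Restricting the network to such primitives is what the paper credits for the "Inference Inversion" effect: every layer maps directly onto operations the NPU executes natively, avoiding fallbacks to the GPU or CPU.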