Frequency-Enhanced Dual-Subspace Networks for Few-Shot Fine-Grained Image Classification

arXiv cs.CV / 4/17/2026

Key Points

The paper introduces FEDSNet, a method for few-shot fine-grained image classification that targets texture bias and noise overfitting common in single-view metric learning approaches.
It separates low-frequency global structure from spatial features using DCT-based low-pass filtering to suppress background interference.
FEDSNet builds two independent low-rank subspaces via truncated SVD: one for spatial texture features and one for frequency-domain structural features.
An adaptive gating mechanism fuses distances from both subspaces, leveraging the frequency subspace’s stability to improve structural robustness under few-shot settings.
Experiments on CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC-Aircraft are reported to give highly competitive accuracy against existing metric learning methods, with a favorable balance between accuracy and computational cost.
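
To make the DCT-based low-pass step more concrete, here is a minimal NumPy sketch. It is an illustration only, not the paper's implementation: the function names `dct_matrix` and `dct_low_pass`, the `(C, H, W)` feature layout, and the square `keep x keep` mask are all assumptions made for this example, since the summary does not specify the filter shape or cutoff.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix; row k is the k-th frequency."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # spatial index
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)  # DC row scaling makes the matrix orthogonal
    return mat

def dct_low_pass(feat: np.ndarray, keep: int) -> np.ndarray:
    """Keep only the keep x keep lowest-frequency DCT coefficients per channel.

    feat: (C, H, W) feature map (assumed layout). Returns the same shape.
    """
    _, h, w = feat.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    coeffs = dh @ feat @ dw.T        # 2-D DCT, broadcast over channels
    mask = np.zeros((h, w))
    mask[:keep, :keep] = 1.0         # hypothetical square low-pass mask
    # The basis is orthogonal, so its transpose is the inverse transform.
    return dh.T @ (coeffs * mask) @ dw

# Example: filter a random 64-channel 7x7 feature map down to its 3x3 lowest frequencies.
features = np.random.default_rng(0).normal(size=(64, 7, 7))
low_freq = dct_low_pass(features, keep=3)
```

With `keep=1` each channel collapses to its spatial mean, and with `keep` equal to the spatial size the input is returned unchanged, so `keep` acts as the cutoff between coarse global structure and fine detail.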

Abstract

Few-shot fine-grained image classification aims to recognize subcategories with high visual similarity using only a limited number of annotated samples. Existing metric learning-based methods typically rely solely on spatial domain features. Confined to this single perspective, models inevitably suffer from inherent texture biases, entangling essential structural details with high-frequency background noise. Furthermore, lacking cross-view geometric constraints, single-view metrics tend to overfit this noise, resulting in structural instability under few-shot conditions. To address these issues, this paper proposes the Frequency-Enhanced Dual-Subspace Network (FEDSNet). Specifically, FEDSNet utilizes the Discrete Cosine Transform (DCT) and a low-pass filtering mechanism to explicitly isolate low-frequency global structural components from spatial features, thereby suppressing background interference. Truncated Singular Value Decomposition (SVD) is employed to construct independent, low-rank linear subspaces for both spatial texture and frequency structural features. An adaptive gating mechanism is designed to dynamically fuse the projection distances from these dual views. This strategy leverages the structural stability of the frequency subspace to prevent the spatial subspace from overfitting to background features. Extensive experiments on four benchmark datasets - CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC-Aircraft - demonstrate that FEDSNet exhibits excellent classification performance and robustness, achieving highly competitive results compared to existing metric learning algorithms. Complexity analysis further confirms that the proposed network achieves a favorable balance between high accuracy and computational efficiency, providing an effective new paradigm for few-shot fine-grained visual recognition.
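
The dual-subspace idea in the abstract can be sketched as follows, again as a hedged illustration rather than the authors' code. The helper names (`class_subspace`, `projection_distance`, `fused_distance`), the `(D, N)` support layout, the squared-residual distance, and the scalar sigmoid gate are assumptions; in particular, `gate_logit` stands in for whatever the paper's adaptive gating mechanism actually computes.

```python
import numpy as np

def class_subspace(support: np.ndarray, rank: int) -> np.ndarray:
    """Orthonormal basis (D, rank) spanning one class's support features.

    support: (D, N) matrix whose columns are feature vectors of the class.
    """
    u, _, _ = np.linalg.svd(support, full_matrices=False)
    return u[:, :rank]  # truncation: keep the top-`rank` left singular vectors

def projection_distance(query: np.ndarray, basis: np.ndarray) -> float:
    """Squared norm of the component of `query` outside the subspace."""
    residual = query - basis @ (basis.T @ query)
    return float(residual @ residual)

def fused_distance(d_spatial: float, d_freq: float, gate_logit: float) -> float:
    """Convex combination of the two view distances via a sigmoid gate.

    gate_logit is a hypothetical scalar; a learned gate would produce it.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    return float(g * d_spatial + (1.0 - g) * d_freq)

# Example: one class, 5 support vectors per view, 16-dimensional features.
rng = np.random.default_rng(0)
spatial_support = rng.normal(size=(16, 5))
freq_support = rng.normal(size=(16, 5))
spatial_basis = class_subspace(spatial_support, rank=3)
freq_basis = class_subspace(freq_support, rank=3)

query_spatial = rng.normal(size=16)
query_freq = rng.normal(size=16)
score = fused_distance(
    projection_distance(query_spatial, spatial_basis),
    projection_distance(query_freq, freq_basis),
    gate_logit=0.0,  # equal weighting of both views
)
```

In a few-shot episode, a query would be assigned to the class with the smallest fused distance. Because the gate yields a convex combination, the fused value always lies between the two per-view distances, which is one simple way the more stable frequency view could temper a spatial distance that has fit background noise.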