Hierarchical Vision Transformer Enhanced by Graph Convolutional Network for Image Classification

arXiv cs.CV / 4/21/2026


Key Points

  • The paper introduces GCN-HViT, a Hierarchical Vision Transformer enhanced with a Graph Convolutional Network to improve image classification performance.
  • It addresses ViT’s key limitation around patch-size selection by using a hierarchical design that combines information from small and large patches across multiple levels.
  • It improves spatial understanding by using a GCN to capture local patch connectivity and generate 2D position embeddings, compensating for ViT’s 1D positional encodings.
  • The method also targets the complementary gap between local and global modeling: GCN extracts local representations while the transformer captures global patch relationships.
  • Experiments on three real-world datasets reportedly show that GCN-HViT achieves state-of-the-art image-classification results.

Abstract

Vision Transformer (ViT) has brought new breakthroughs to image classification by introducing the self-attention mechanism, and Graph Convolutional Networks (GCN) have been successfully applied to data representation and analysis. However, key challenges limit their further development: (1) the patch size selected by ViT is crucial for accurate predictions, which raises a natural question: how to select the patch size properly, or how to comprehensively combine small patches and larger patches; (2) while spatial structure information is important in vision tasks, 1D position embeddings fail to capture the spatial structure of patches accurately; (3) a GCN can capture the local connectivity relationships between image nodes, but it lacks the ability to capture global graph structure. Conversely, the self-attention mechanism of ViT can capture global relations among image patches, but it cannot model the local structure of the image. To overcome these limitations, we propose the Hierarchical Vision Transformer Enhanced by Graph Convolutional Network (GCN-HViT) for image classification. Specifically, the Hierarchical ViT we design models patch-wise information interactions on a global scale within each level, and models hierarchical relationships between small patches and large patches across multiple levels. In addition, the proposed GCN functions as a local feature extractor that obtains a local representation of each image patch, which serves as a 2D position embedding of that patch; it also models patch-wise information interactions on a local scale within each level. Extensive experiments on three real-world datasets demonstrate that GCN-HViT achieves state-of-the-art performance.
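The GCN component described in the abstract — aggregating each patch's local neighbourhood to produce a 2D position embedding — can be sketched roughly as below. This is an illustrative NumPy reconstruction under our own assumptions (a patch grid with 4-connectivity and a single symmetrically normalised GCN layer in the style of Kipf & Welling), not the authors' implementation:

```python
import numpy as np

def grid_adjacency(h, w):
    """Adjacency matrix for an h x w patch grid with 4-connectivity.
    Each image patch is a graph node; edges link spatially adjacent patches."""
    n = h * w
    A = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if r + 1 < h:                       # vertical neighbour
                A[i, i + w] = A[i + w, i] = 1.0
            if c + 1 < w:                       # horizontal neighbour
                A[i, i + 1] = A[i + 1, i] = 1.0
    return A

def gcn_layer(X, A, W):
    """One GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} X W).
    The output for each patch mixes in its grid neighbours, so it encodes
    local 2D structure and can act as a 2D position embedding."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W, 0.0)

rng = np.random.default_rng(0)
h, w, dim = 4, 4, 8                             # hypothetical 4x4 patch grid
X = rng.standard_normal((h * w, dim))           # per-patch features
W = rng.standard_normal((dim, dim))             # learnable weight (random here)
pos_emb = gcn_layer(X, grid_adjacency(h, w), W)
print(pos_emb.shape)                            # one local embedding per patch
```

In the full model these GCN outputs would be added to (or concatenated with) the patch tokens before the transformer blocks at each hierarchy level, letting self-attention operate globally on features that already carry local 2D context.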