FractalMamba++: Scaling Vision Mamba Across Resolutions via Hilbert Fractal Geometry

arXiv cs.CV / 5/6/2026

Key Points

  • The paper addresses a key limitation of Vision Mamba: performance can degrade when 2D patch grids are serialized into a 1D recurrence, especially at inference resolutions larger than the training grid.
  • It introduces FractalMamba++, which uses Hilbert curve–based fractal serialization to better preserve spatial locality across resolutions, improving neighborhood consistency compared with raster/linear scans.
  • The model adds a Fractal Hierarchy Skip Connection (FHSC) that injects long-range state using deterministic routes derived from Hilbert recursion, reducing long-sequence information fading without runtime search or custom CUDA kernels.
  • It further incorporates Fractal-Aware 2D Rotary Position Encoding (FA-RoPE) to tie positional interactions to true 2D proximity and fractal hierarchy level rather than the serialized 1D distance.
  • Experiments on ImageNet classification, COCO detection/segmentation, ADE20K segmentation, and LEVIR-CD+ change detection show that FractalMamba++ improves over existing Mamba-based vision backbones, particularly at high input resolutions.
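
The locality argument behind Hilbert serialization can be illustrated with a small sketch. It is not the paper's implementation: it uses the textbook index-to-coordinate Hilbert mapping on a square grid whose side is a power of two, and the helper names (`hilbert_d2xy`, `hilbert_order`, `raster_order`, `mean_step`, `serialize`) are hypothetical. It compares the average 2D jump between consecutive tokens under a Hilbert scan and a raster scan.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to a cell (x, y) on a 2**order x 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the current sub-square so the curve stays continuous.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(order):
    """All grid cells in Hilbert-curve visiting order."""
    n = 1 << order
    return [hilbert_d2xy(order, d) for d in range(n * n)]


def raster_order(order):
    """All grid cells in row-major (raster) order."""
    n = 1 << order
    return [(x, y) for y in range(n) for x in range(n)]


def mean_step(path):
    """Average Manhattan distance between consecutive cells of a scan path."""
    steps = [abs(x1 - x0) + abs(y1 - y0)
             for (x0, y0), (x1, y1) in zip(path, path[1:])]
    return sum(steps) / len(steps)


def serialize(grid, path):
    """Flatten a 2D grid (list of rows) into a 1D sequence following a scan path."""
    return [grid[y][x] for x, y in path]


# Compare the two scans at several grid sizes (8x8, 16x16, 32x32).
for order in (3, 4, 5):
    print(order, mean_step(hilbert_order(order)), mean_step(raster_order(order)))
```

In this sketch every Hilbert step moves to a directly adjacent cell at every grid size, whereas the raster scan jumps across the full grid width at the end of each row. This is one simple way to read the claim about consistent neighborhood statistics across resolutions; the paper may measure locality differently.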

Abstract

Vision Mamba offers linear complexity for long visual sequences, yet its performance depends critically on how a two-dimensional patch grid is serialized into a one-dimensional state-space recurrence. Raster-style scans disrupt spatial continuity, and the mismatch between 2D locality and 1D state propagation becomes increasingly severe when the inference resolution grows beyond the training grid. This paper presents FractalMamba++, a resolution-scalable vision backbone organized around a single geometric principle: the recursive self-similar structure of the Hilbert curve determines how patches are serialized, where long-range state shortcuts are inserted, and how positional relations are encoded. First, Hilbert-curve-based Fractal Serialization preserves local 2D neighborhoods more faithfully than linear scans and provides consistent neighborhood statistics across resolutions. Second, the Fractal Hierarchy Skip Connection (FHSC) derives a compact set of deterministic state-injection routes from Hilbert recursion levels, mitigating long-sequence information fading without runtime search, hand-derived gradients, or dedicated CUDA kernels. Third, Fractal-Aware 2D Rotary Position Encoding (FA-RoPE) combines normalized 2D coordinates with a fractal hierarchy level so that feature interactions depend on actual spatial proximity and recursive structural role rather than serialized 1D distance. Extensive experiments on ImageNet-1K classification, COCO detection and instance segmentation, ADE20K semantic segmentation, and LEVIR-CD+ remote sensing change detection show that FractalMamba++ improves over existing Mamba-based vision backbones, especially under high-resolution inputs.
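
The idea of tying positional interactions to 2D proximity rather than serialized distance can be sketched with a plain axial 2D rotary encoding on normalized coordinates. This is an assumption-laden illustration, not FA-RoPE itself: the fractal hierarchy level term is omitted because the abstract does not say how it is computed, and the function names and the parameters `max_angle` and `base` are hypothetical choices.

```python
import math


def rotate_pairs(vec, angles):
    """Rotate consecutive channel pairs (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, theta in enumerate(angles):
        a, b = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [a * c - b * s, a * s + b * c]
    return out


def rope_2d(vec, x, y, width, height, max_angle=math.pi, base=10.0):
    """Axial 2D rotary encoding on normalized coordinates (illustrative only).

    The first half of the channel pairs is rotated by angles derived from
    x / width, the second half by angles derived from y / height.
    len(vec) must be divisible by 4.
    """
    n_pairs = len(vec) // 2
    half = n_pairs // 2
    u, v = x / width, y / height  # normalized coordinates in [0, 1)
    freqs = [max_angle * base ** (-i / half) for i in range(half)]
    angles = [u * f for f in freqs] + [v * f for f in freqs]
    return rotate_pairs(vec, angles)


def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.8, 0.4, 0.6, -0.3, 0.1, 0.9]

# Same 2D offset (2, 3) at two different absolute positions on an 8x8 grid.
s_a = dot(rope_2d(q, 1, 2, 8, 8), rope_2d(k, 3, 5, 8, 8))
s_b = dot(rope_2d(q, 4, 1, 8, 8), rope_2d(k, 6, 4, 8, 8))

# Same normalized positions as s_a, but on a 16x16 grid.
s_c = dot(rope_2d(q, 2, 4, 16, 16), rope_2d(k, 6, 10, 16, 16))

print(s_a, s_b, s_c)
```

Under these assumptions the three scores agree up to floating-point error: the interaction depends only on the normalized 2D offset between the two patches, not on where they fall in the serialized sequence or on the grid resolution.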