RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation

arXiv cs.CV / 4/22/2026

Key Points

  • The paper introduces RF-HiT, a Rectified Flow Hierarchical Transformer designed to improve general medical image segmentation by combining long-range context with accurate boundary delineation.
  • It targets the quadratic complexity and slow inference that commonly bottleneck transformer- and diffusion-based methods by pairing rectified flow with efficient transformer blocks in an hourglass backbone and a multi-scale hierarchical encoder, achieving linear computational complexity.
  • RF-HiT uses learnable interpolation to fuse anatomically guided conditioning features across resolutions, enabling strong multi-scale representation with low overhead.
  • The model reports efficient inference, needing as few as three discretization steps, at a compact cost (10.14 GFLOPs, 13.6M parameters).
  • On benchmarks, RF-HiT achieves 91.27% mean Dice on ACDC and 87.40% on BraTS 2021, performing on par with or better than more computationally intensive architectures, which suggests potential for real-time clinical segmentation.
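
To make the "three discretization steps" point concrete, here is a minimal sketch of few-step sampling in a rectified-flow setting: a fixed-step Euler integration of dx/dt = v(x, t) from t = 0 (noise) to t = 1 (sample). This is not the paper's implementation. The names `euler_sample` and `make_toy_velocity` are hypothetical, and the toy velocity field is an idealized straight-line flow standing in for a trained, image-conditioned network.

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps=3):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=np.float64).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # left endpoint of the current step, so t < 1 always
        x = x + dt * velocity_fn(x, t)
    return x

def make_toy_velocity(target):
    """Hypothetical stand-in for a trained model (assumption, not from the paper).

    For a perfectly straight flow, the velocity at (x, t) points at the
    endpoint and is scaled by the remaining time 1 - t.
    """
    def velocity_fn(x, t):
        return (target - x) / (1.0 - t)
    return velocity_fn

rng = np.random.default_rng(0)
target_mask = (rng.random((4, 4)) > 0.5).astype(np.float64)  # toy binary "mask"
noise = rng.standard_normal((4, 4))                          # starting point at t=0

sample = euler_sample(make_toy_velocity(target_mask), noise, num_steps=3)
```

With an ideal straight-line field, the Euler update is exact for any step count, which is the intuition behind why straighter flows tolerate very coarse discretization. A learned velocity field is only approximately straight, so in practice the step count trades accuracy against latency.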

Abstract

Accurate medical image segmentation requires both long-range contextual reasoning and precise boundary delineation, a task where existing transformer- and diffusion-based paradigms are frequently bottlenecked by quadratic computational complexity and prohibitive inference latency. We propose RF-HiT, a Rectified Flow Hierarchical Transformer that integrates an hourglass transformer backbone with a multi-scale hierarchical encoder for anatomically guided feature conditioning. Unlike prior diffusion-based approaches, RF-HiT leverages rectified flow with efficient transformer blocks to achieve linear complexity while requiring only a few discretization steps. The model further fuses conditioning features across resolutions via learnable interpolation, enabling effective multi-scale representation with minimal computational overhead. As a result, RF-HiT achieves a strong efficiency-performance trade-off, requiring only 10.14 GFLOPs, 13.6M parameters, and inference in as few as three steps. Despite its compact design, RF-HiT attains 91.27% mean Dice on ACDC and 87.40% on BraTS 2021, achieving performance comparable to or exceeding that of significantly more intensive architectures. This demonstrates its strong potential as a robust, computationally efficient foundation for real-time clinical segmentation.
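
For readers unfamiliar with the reported metric, the following is a minimal sketch of a mean Dice computation over labeled classes. It assumes integer label maps and a simple per-class average; the paper's exact evaluation protocol (per-case averaging, which classes are included, handling of empty masks) is not stated here, and the function names are illustrative.

```python
import numpy as np

def dice_score(pred, truth, eps=1e-8):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    intersection = np.logical_and(pred, truth).sum()
    # eps keeps the ratio defined when both masks are empty
    return (2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps)

def mean_dice(pred_labels, truth_labels, class_ids):
    """Average the per-class Dice over the given foreground class ids."""
    pred_labels = np.asarray(pred_labels)
    truth_labels = np.asarray(truth_labels)
    scores = [dice_score(pred_labels == c, truth_labels == c) for c in class_ids]
    return float(np.mean(scores))

# Toy label maps: 0 = background, 1 and 2 = two foreground structures
pred = np.array([[1, 1, 0],
                 [2, 2, 0]])
truth = np.array([[1, 0, 0],
                  [2, 2, 2]])

score = mean_dice(pred, truth, class_ids=[1, 2])
```

In this toy case, class 1 overlaps on one pixel out of 2 predicted and 1 true (Dice 2/3), and class 2 overlaps on two pixels out of 2 predicted and 3 true (Dice 4/5), so the mean is about 0.733.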