RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation with Reduced Data Dependence
Extending the Recurrent-Depth Transformer Architecture to Dense Prediction

arXiv cs.CV / 5/6/2026


Key Points

  • RD-ViT is a recurrent-depth Vision Transformer for semantic segmentation that reduces reliance on large datasets by replacing per-layer unique parameters with a single shared transformer block run T times (a minimal sketch of this looped block follows the list).
  • The model targets dense prediction in both 2D and 3D (including cardiac MRI), using LTI-stable state injection for convergence, Adaptive Computation Time (ACT) to allocate compute across space, and depth-wise LoRA for parameter-efficient adaptation.
  • RD-ViT optionally incorporates Mixture-of-Experts (MoE) feed-forward networks to specialize for different semantic regions, with expert utilization emerging without explicit routing supervision.
  • Experiments on the ACDC cardiac MRI benchmark show RD-ViT improves over a standard ViT under reduced data (e.g., Dice 0.774 vs 0.762 in 2D at 10% data) and achieves strong 3D performance with fewer parameters (Dice 0.812 with 3.0M parameters, about 99.4% of standard ViT performance at roughly 53% of the parameter count).
  • The paper reports further efficiency and flexibility benefits: ACT halting maps concentrate computation at cardiac boundaries, and depth extrapolation allows more inference loops than were used in training without performance degradation. Code and notebooks are released publicly.
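The looped-block idea from the first point can be made concrete with a short PyTorch sketch. This illustrates the general recurrent-depth pattern (one shared block applied T times, with the patch embedding re-injected each iteration and a contractive state update for stability); it is not the authors' released implementation, and the module name, `alpha`, and all hyperparameters are placeholder assumptions.

```python
import torch
import torch.nn as nn

class RecurrentDepthBlock(nn.Module):
    """One shared pre-norm transformer block looped T times over patch tokens (sketch)."""

    def __init__(self, dim=256, heads=8, mlp_ratio=4, T=6, alpha=0.9):
        super().__init__()
        self.T = T
        self.alpha = alpha  # assumed contraction factor (< 1) for a stable recurrence
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )
        self.inject = nn.Linear(dim, dim)  # re-injects the patch embedding every loop

    def step(self, s, e):
        # One pass of the shared block, plus input re-injection.
        q = self.norm1(s)
        h = s + self.attn(q, q, q, need_weights=False)[0]
        h = h + self.mlp(self.norm2(h)) + self.inject(e)
        # Leaky state update: a convex mix of old state and new output,
        # which keeps the iteration from diverging as the loop count grows.
        return self.alpha * s + (1.0 - self.alpha) * h

    def forward(self, e, T=None):
        # e: patch embeddings of shape (batch, tokens, dim)
        s = torch.zeros_like(e)
        for _ in range(T if T is not None else self.T):
            s = self.step(s, e)
        return s  # refined tokens, decoded into a segmentation map downstream
```

Because the same weights are reused on every loop, raising T at inference (the paper's depth extrapolation) adds computation without adding parameters.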

Abstract

Vision Transformers (ViTs) achieve state-of-the-art segmentation accuracy but require large training datasets because each layer has unique parameters that must be learned independently. We present RD-ViT, a Recurrent-Depth Vision Transformer that adapts the Recurrent-Depth Transformer (RDT) architecture to dense prediction tasks, supporting both 2D and 3D inputs. RD-ViT replaces the deep stack of unique transformer blocks with a single shared block looped T times, augmented with LTI-stable state injection for guaranteed convergence, Adaptive Computation Time (ACT) for spatial compute allocation, depth-wise LoRA adaptation, and optional Mixture-of-Experts (MoE) feed-forward networks for category-specific specialization. We evaluate on the ACDC cardiac MRI segmentation benchmark in both 2D slice-level and 3D volumetric settings; all experiments are real runs executed in Google Colab. In 2D, RD-ViT outperforms the standard ViT at 10% training data (Dice 0.774 vs 0.762) and at full data (0.882 vs 0.872). In 3D, RD-ViT with MoE achieves Dice 0.812 with 3.0M parameters, reaching 99.4% of standard ViT performance (0.817) at 53% of the parameter count. MoE expert utilization analysis reveals that different experts spontaneously specialize in different cardiac structures (RV, MYO, LV) without explicit routing supervision. ACT halting maps show higher compute allocation at cardiac boundaries, and the mean ponder time decreases from 2.6 to 1.4 iterations during training, demonstrating learned computational efficiency. Depth extrapolation enables inference with more loops than were used in training, without degradation. All code, notebooks, and results are publicly released.
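The ACT mechanism in the abstract decides per token how many loops to spend, which is what produces the boundary-focused halting maps and the falling mean ponder time. Below is a rough, self-contained PyTorch sketch of per-token ACT over the recurrent state; `step_fn`, `halt_head`, and the thresholds are hypothetical stand-ins, not the paper's released code.

```python
import torch

def act_recurrence(step_fn, e, halt_head, max_steps=8, eps=0.01):
    """Adaptive Computation Time over patch tokens (illustrative sketch).

    step_fn(s, e): one pass of the shared block (e.g. RecurrentDepthBlock.step above).
    halt_head:     layer mapping each token's state to a halting logit, e.g. nn.Linear(dim, 1).
    Returns the halting-weighted state and the mean ponder time per token.
    """
    B, N, _ = e.shape
    s = torch.zeros_like(e)
    out = torch.zeros_like(e)
    cum = torch.zeros(B, N, device=e.device)     # accumulated halting probability
    ponder = torch.zeros(B, N, device=e.device)  # iterations spent per token

    for _ in range(max_steps):
        s = step_fn(s, e)
        p = torch.sigmoid(halt_head(s)).squeeze(-1)       # (B, N) halting prob
        running = (cum < 1.0 - eps).float()               # tokens still computing
        halts_now = ((cum + p * running) >= 1.0 - eps).float() * running
        # Running tokens contribute p; tokens halting now contribute their remainder.
        w = p * running * (1.0 - halts_now) + (1.0 - cum) * halts_now
        out = out + w.unsqueeze(-1) * s
        ponder = ponder + running
        cum = cum + p * running
        if (cum >= 1.0 - eps).all():
            break

    # Approximate remainder for tokens that never crossed the halting threshold.
    out = out + (1.0 - cum).clamp(min=0.0).unsqueeze(-1) * s
    return out, ponder.mean()
```

In the standard ACT formulation, the mean ponder time is added to the task loss with a small weight, which is what pushes the average iteration count down (the paper reports 2.6 to 1.4) while leaving more loops for hard regions such as cardiac boundaries.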
