T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation

arXiv cs.CV / 4/10/2026


Key Points

  • The paper introduces **T-Gated Adapter**, a lightweight temporal adapter designed to improve **vision-language medical segmentation** by incorporating **adjacent-slice context** rather than treating 2D slices independently.
  • It injects temporal information at the **visual token level** using a temporal transformer over a fixed context window, plus a spatial refinement block and an **adaptive gating mechanism** to balance temporal vs single-slice features.
  • Training on **30 labeled FLARE22 volumes** improves abdominal organ segmentation, reaching a **mean Dice of 0.704** with a **+0.206 gain** over a baseline VLM without temporal context.
  • In **zero-shot cross-dataset** testing (BTCV, AMOS22), the approach shows consistent gains (**+0.210** and **+0.230**) and reduces the average cross-domain performance drop from **38.0% to 24.9%**.
  • Cross-modality evaluation on **AMOS22 MRI** without MRI supervision yields **mean Dice of 0.366**, outperforming a fully supervised CT-only 3D baseline (DynUNet: **0.224**), suggesting stronger generalization of CLIP-style visual semantics across modalities.
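The architecture described above (token-level temporal attention over a fixed slice window, within-slice spatial refinement, and an adaptive gate) can be sketched roughly as follows. This is a minimal illustration, not the paper's released code: the class name, dimensions, and the use of `nn.TransformerEncoderLayer` for both the temporal and spatial blocks are assumptions.

```python
import torch
import torch.nn as nn

class TGatedAdapter(nn.Module):
    """Hypothetical sketch of a T-Gated-style temporal adapter.

    Input: visual tokens for a window of adjacent slices, shape (B, T, N, D)
    (batch, slices in window, tokens per slice, token dim). Output: gated
    tokens for the center slice, shape (B, N, D).
    """

    def __init__(self, dim=768, heads=8):
        super().__init__()
        # temporal transformer: attends across the slice axis per token position
        self.temporal = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # spatial context block: refines tokens within the center slice
        self.spatial = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # adaptive gate: per-token sigmoid weight from concatenated features
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, tokens):
        B, T, N, D = tokens.shape
        center = tokens[:, T // 2]  # (B, N, D) single-slice features
        # treat each token position as a length-T sequence over slices
        t = tokens.permute(0, 2, 1, 3).reshape(B * N, T, D)
        t = self.temporal(t).reshape(B, N, T, D)[:, :, T // 2]  # (B, N, D)
        # within-slice refinement of the temporally enriched tokens
        t = self.spatial(t)
        # blend temporal vs. single-slice features with a learned gate
        g = self.gate(torch.cat([center, t], dim=-1))
        return g * t + (1 - g) * center
```

With a window of 5 slices and 64-dim tokens, `TGatedAdapter(dim=64, heads=4)` maps a `(2, 5, 10, 64)` token tensor to `(2, 10, 64)` center-slice tokens; the gate lets the model fall back to pure single-slice features where temporal context is unhelpful.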

Abstract

Medical image segmentation traditionally relies on fully supervised 3D architectures that demand large amounts of dense, voxel-level annotations from clinical experts, a prohibitively expensive process. Vision-Language Models (VLMs) offer a powerful alternative by leveraging broad visual semantic representations learned from billions of images. However, when applied independently to 2D slices of a 3D scan, these models often produce noisy and anatomically implausible segmentations that violate the inherent continuity of anatomical structures. We propose a temporal adapter that addresses this by injecting adjacent-slice context directly into the model's visual token representations. The adapter comprises a temporal transformer attending across a fixed context window at the token level, a spatial context block refining within-slice representations, and an adaptive gate balancing temporal and single-slice features. Trained on 30 labeled volumes from the FLARE22 dataset, our method achieves a mean Dice of 0.704 across 13 abdominal organs, a gain of +0.206 over the baseline VLM trained with no temporal context. Zero-shot evaluation on the BTCV and AMOS22 datasets yields consistent improvements of +0.210 and +0.230, with the average cross-domain performance drop reduced from 38.0% to 24.9%. Furthermore, in a cross-modality evaluation on AMOS22 MRI with neither model receiving any MRI supervision, our method achieves a mean Dice of 0.366, outperforming a fully supervised 3D baseline (DynUNet, 0.224) trained exclusively on CT, suggesting that CLIP's visual semantic representations generalize more gracefully across imaging modalities than convolutional features.
