T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation

arXiv cs.CV / 4/10/2026


Key Points

  • The paper introduces **T-Gated Adapter**, a lightweight temporal adapter designed to improve **vision-language medical segmentation** by incorporating **adjacent-slice context** rather than treating 2D slices independently.
  • It injects temporal information at the **visual token level** using a temporal transformer over a fixed context window, plus a spatial refinement block and an **adaptive gating mechanism** to balance temporal vs single-slice features.
  • Training on **30 labeled FLARE22 volumes** improves abdominal organ segmentation, reaching a **mean Dice of 0.704** with a **+0.206 gain** over a baseline VLM without temporal context.
  • In **zero-shot cross-dataset** testing (BTCV, AMOS22), the approach shows consistent gains (**+0.210** and **+0.230**) and reduces the average cross-domain performance drop from **38.0% to 24.9%**.
  • Cross-modality evaluation on **AMOS22 MRI** without MRI supervision yields **mean Dice of 0.366**, outperforming a fully supervised CT-only 3D baseline (DynUNet: **0.224**), suggesting stronger generalization of CLIP-style visual semantics across modalities.
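The architecture described above (token-level temporal attention over a fixed slice window, within-slice spatial refinement, and an adaptive gate) can be sketched roughly as follows. This is a minimal illustration, not the paper's released code: the class name, dimensions, and the use of `nn.TransformerEncoderLayer` for both the temporal and spatial blocks are assumptions.

```python
import torch
import torch.nn as nn

class TGatedAdapter(nn.Module):
    """Hypothetical sketch of a T-Gated-style temporal adapter.

    Input: visual tokens for a window of adjacent slices, shape (B, T, N, D)
    (batch, slices in window, tokens per slice, token dim). Output: gated
    tokens for the center slice, shape (B, N, D).
    """

    def __init__(self, dim=768, heads=8):
        super().__init__()
        # temporal transformer: attends across the slice axis per token position
        self.temporal = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # spatial context block: refines tokens within the center slice
        self.spatial = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # adaptive gate: per-token sigmoid weight from concatenated features
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, tokens):
        B, T, N, D = tokens.shape
        center = tokens[:, T // 2]  # (B, N, D) single-slice features
        # treat each token position as a length-T sequence over slices
        t = tokens.permute(0, 2, 1, 3).reshape(B * N, T, D)
        t = self.temporal(t).reshape(B, N, T, D)[:, :, T // 2]  # (B, N, D)
        # within-slice refinement of the temporally enriched tokens
        t = self.spatial(t)
        # blend temporal vs. single-slice features with a learned gate
        g = self.gate(torch.cat([center, t], dim=-1))
        return g * t + (1 - g) * center
```

With a window of 5 slices and 64-dim tokens, `TGatedAdapter(dim=64, heads=4)` maps a `(2, 5, 10, 64)` token tensor to `(2, 10, 64)` center-slice tokens; the gate lets the model fall back to pure single-slice features where temporal context is unhelpful.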

Abstract

Medical image segmentation traditionally relies on fully supervised 3D architectures that demand large amounts of dense, voxel-level annotations from clinical experts, a prohibitively expensive process. Vision-Language Models (VLMs) offer a powerful alternative by leveraging broad visual semantic representations learned from billions of images. However, when applied independently to 2D slices of a 3D scan, these models often produce noisy and anatomically implausible segmentations that violate the inherent continuity of anatomical structures. We propose a temporal adapter that addresses this by injecting adjacent-slice context directly into the model's visual token representations. The adapter comprises a temporal transformer attending across a fixed context window at the token level, a spatial context block refining within-slice representations, and an adaptive gate balancing temporal and single-slice features. Trained on 30 labeled volumes from the FLARE22 dataset, our method achieves a mean Dice of 0.704 across 13 abdominal organs, a gain of +0.206 over the baseline VLM trained with no temporal context. Zero-shot evaluation on the BTCV and AMOS22 datasets yields consistent improvements of +0.210 and +0.230, with the average cross-domain performance drop reduced from 38.0% to 24.9%. Furthermore, in a cross-modality evaluation on AMOS22 MRI with neither model receiving any MRI supervision, our method achieves a mean Dice of 0.366, outperforming a fully supervised 3D baseline (DynUNet, 0.224) trained exclusively on CT, suggesting that CLIP's visual semantic representations generalize more gracefully across imaging modalities than convolutional features.
