Exploring Prompt Alignment with Clinical Factors in Zero-Shot Segmentation VLMs for NSCLC Tumor Segmentation

arXiv cs.CV / 5/5/2026


Key Points

  • The study investigates which prompt dimensions most strongly control the spatial behavior of a zero-shot vision-language model (VoxTell) for NSCLC gross tumor volume segmentation.
  • Through sub-prompt decomposition, perturbation robustness tests, specificity ladders, and cross-case prompt swaps, the authors find anatomical location is the dominant alignment driver, with location changes often causing catastrophic segmentation failures.
  • Irrelevant prompts correctly yield zero segmentation, and increasing prompt specificity generally improves performance, with diagnosis-only prompts the notable exception to the trend.
  • Cross-case prompt swaps show patient-specific conditioning, with matched cases achieving much higher Dice scores than mismatched ones, suggesting the model encodes case-specific spatial context.
  • VoxTell’s fully zero-shot mean Dice score is statistically indistinguishable from nnUNet’s, and it outperforms the other zero-shot baselines; the paper argues that evaluation should cover prompt-dimension alignment in addition to Dice.

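The Key Points above all hinge on the Dice score as the headline metric. As a refresher, here is a minimal NumPy sketch of the Dice Similarity Coefficient on binary masks; the toy 2D masks are illustrative only (real GTV masks are 3D CT volumes), and the function name is my own.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice Similarity Coefficient between two binary masks.

    DSC = 2 * |A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (perfect).
    `eps` guards against division by zero when both masks are empty.
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + eps)

# Toy example: two 16-voxel square "tumors", offset by one voxel.
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool)
pred[3:7, 3:7] = True
print(round(dice_coefficient(pred, gt), 3))  # 2*9 / (16+16) → 0.562
```

A matched-prompt DSC of 0.906 versus 0.406 for mismatched prompts, as reported above, therefore corresponds to a very large drop in voxel-wise overlap.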
Abstract

Zero-shot vision-language models (VLMs) offer a promptable alternative to task-specific training for gross tumor volume (GTV) delineation in non-small-cell lung cancer (NSCLC), but the prompt dimensions that govern their spatial behavior remain poorly understood. We study this question by probing alignment directions in VoxTell on a held-out internal NSCLC tumor dataset through sub-prompt decomposition into diagnosis, demographic, staging, anatomical, generic, and irrelevant controls; attribute-wise perturbation robustness; specificity ladders; and cross-case prompt swaps, while benchmarking against fine-tuned and zero-shot baselines using the Dice Similarity Coefficient (DSC) with Wilcoxon signed-rank tests and Benjamini-Hochberg correction. Alignment analyses revealed that anatomical location is the dominant driver of VoxTell's spatial attention: 63.4 percent of location perturbations caused catastrophic drops, prompt specificity improved from generic to full descriptions except for diagnosis-only prompts, irrelevant prompts correctly yielded zero segmentation, and cross-case prompt swaps confirmed patient-specific conditioning (matched DSC 0.906 vs. mismatched 0.406). Histology and stage substitutions had minimal effect, indicating that the model prioritizes "where to look" over "what to look for." In this context, VoxTell, operating fully zero-shot, achieved a mean DSC of 0.613, statistically indistinguishable from nnUNet (0.690, adjusted p = 0.156) and Ahmed et al. (0.675, adjusted p = 0.679), while significantly outperforming all other zero-shot models. Together, these findings argue that segmentation VLMs should be evaluated not only by Dice, but also by the prompt dimensions to which they align.
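The abstract's significance claims rest on Wilcoxon signed-rank tests with Benjamini-Hochberg correction across the family of model comparisons. A minimal pure-Python sketch of the BH step-up adjustment (the raw p-values below are illustrative placeholders, not the paper's; in practice `scipy.stats.wilcoxon` on paired per-case DSC values would supply them):

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg step-up procedure: returns FDR-adjusted p-values.

    Sort p-values ascending, scale the rank-k value by n/k, then enforce
    monotonicity from the largest rank downward.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    prev = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        prev = min(prev, p_values[i] * n / rank)
        adjusted[i] = prev
    return adjusted

# Hypothetical raw Wilcoxon p-values for VoxTell vs. each baseline
# (illustrative only; not the paper's numbers).
raw = {"nnUNet": 0.052, "Ahmed et al.": 0.45,
       "zero-shot A": 0.001, "zero-shot B": 0.004}
adjusted = benjamini_hochberg(list(raw.values()))
for name, p in zip(raw, adjusted):
    print(f"VoxTell vs {name}: adjusted p = {p:.3f}")
```

The correction matters here because four-plus pairwise comparisons are made against the same VoxTell scores; reporting adjusted p-values (e.g. the paper's adjusted p = 0.156 vs. nnUNet) keeps the family-wise false discovery rate controlled.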