Exploring Prompt Alignment with Clinical Factors in Zero-Shot Segmentation VLMs for NSCLC Tumor Segmentation

arXiv cs.CV / 5/5/2026


Key Points

  • The study investigates which prompt dimensions most strongly control the spatial behavior of a zero-shot vision-language model (VoxTell) for NSCLC gross tumor volume segmentation.
  • Through sub-prompt decomposition, perturbation robustness tests, specificity ladders, and cross-case prompt swaps, the authors find anatomical location is the dominant alignment driver, with location changes often causing catastrophic segmentation failures.
  • Irrelevant prompts correctly yield zero segmentation, and increasing prompt specificity generally improves performance, with diagnosis-only prompts the notable exception to the trend.
  • Cross-case prompt swaps show patient-specific conditioning, with matched cases achieving much higher Dice scores than mismatched ones, suggesting the model encodes case-specific spatial context.
  • VoxTell’s fully zero-shot mean Dice score is statistically indistinguishable from nnUNet’s, and it outperforms the other zero-shot baselines; the paper argues that evaluation should cover prompt-dimension alignment in addition to Dice.

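The Key Points above all hinge on the Dice score as the headline metric. As a refresher, here is a minimal NumPy sketch of the Dice Similarity Coefficient on binary masks; the toy 2D masks are illustrative only (real GTV masks are 3D CT volumes), and the function name is my own.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice Similarity Coefficient between two binary masks.

    DSC = 2 * |A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (perfect).
    `eps` guards against division by zero when both masks are empty.
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + eps)

# Toy example: two 16-voxel square "tumors", offset by one voxel.
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool)
pred[3:7, 3:7] = True
print(round(dice_coefficient(pred, gt), 3))  # 2*9 / (16+16) → 0.562
```

A matched-prompt DSC of 0.906 versus 0.406 for mismatched prompts, as reported above, therefore corresponds to a very large drop in voxel-wise overlap.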
Abstract

Zero-shot vision-language models (VLMs) offer a promptable alternative to task-specific training for gross tumor volume (GTV) delineation in non-small-cell lung cancer (NSCLC), but the prompt dimensions that govern their spatial behavior remain poorly understood. We study this question by probing alignment directions in VoxTell on a held-out internal NSCLC tumor dataset through sub-prompt decomposition into diagnosis, demographic, staging, anatomical, generic, and irrelevant controls; attribute-wise perturbation robustness; specificity ladders; and cross-case prompt swaps, while benchmarking against fine-tuned and zero-shot baselines using the Dice Similarity Coefficient (DSC) with Wilcoxon signed-rank tests and Benjamini-Hochberg correction. Alignment analyses revealed that anatomical location is the dominant driver of VoxTell's spatial attention: 63.4 percent of location perturbations caused catastrophic drops, prompt specificity improved from generic to full descriptions except for diagnosis-only prompts, irrelevant prompts correctly yielded zero segmentation, and cross-case prompt swaps confirmed patient-specific conditioning (matched DSC 0.906 vs. mismatched 0.406). Histology and stage substitutions had minimal effect, indicating that the model prioritizes "where to look" over "what to look for." In this context, VoxTell, operating fully zero-shot, achieved a mean DSC of 0.613, statistically indistinguishable from nnUNet (0.690, adjusted p = 0.156) and Ahmed et al. (0.675, adjusted p = 0.679), while significantly outperforming all other zero-shot models. Together, these findings argue that segmentation VLMs should be evaluated not only by Dice, but also by the prompt dimensions to which they align.
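The abstract's significance claims rest on Wilcoxon signed-rank tests with Benjamini-Hochberg correction across the family of model comparisons. A minimal pure-Python sketch of the BH step-up adjustment (the raw p-values below are illustrative placeholders, not the paper's; in practice `scipy.stats.wilcoxon` on paired per-case DSC values would supply them):

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg step-up procedure: returns FDR-adjusted p-values.

    Sort p-values ascending, scale the rank-k value by n/k, then enforce
    monotonicity from the largest rank downward.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    prev = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        prev = min(prev, p_values[i] * n / rank)
        adjusted[i] = prev
    return adjusted

# Hypothetical raw Wilcoxon p-values for VoxTell vs. each baseline
# (illustrative only; not the paper's numbers).
raw = {"nnUNet": 0.052, "Ahmed et al.": 0.45,
       "zero-shot A": 0.001, "zero-shot B": 0.004}
adjusted = benjamini_hochberg(list(raw.values()))
for name, p in zip(raw, adjusted):
    print(f"VoxTell vs {name}: adjusted p = {p:.3f}")
```

The correction matters here because four-plus pairwise comparisons are made against the same VoxTell scores; reporting adjusted p-values (e.g. the paper's adjusted p = 0.156 vs. nnUNet) keeps the family-wise false discovery rate controlled.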