Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation

arXiv cs.CV / 3/12/2026

Key Points

The study benchmarks 11 promptable foundation models for bone and implant segmentation across four anatomical regions (wrist, shoulder, hip, lower leg) using non-iterative 2D and 3D prompting on private and public datasets.
Pareto-optimal models in 2D are SAM and SAM2.1, and in 3D are nnInteractive and Med-SAM2, with performance highly dependent on the model and prompting strategy.
Localization accuracy and rater consistency vary by anatomical structure, being higher for simple structures (e.g., wrist bones) and lower for complex structures (e.g., pelvis, tibia, implants).
Segmentation performance drops when using human prompts compared with ideal prompts derived from reference labels, indicating that results reported with ideal prompts can overestimate performance in a human-driven setting.
The authors provide open-source code for prompt extraction and model inference and conclude that selecting the most suitable foundation model for human-driven clinical use remains challenging due to sensitivity to prompt variations.
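To make "Pareto-optimal" concrete: a model is on the Pareto front if no other model is at least as good on every criterion and strictly better on at least one. The summary does not state which criteria the authors compared, so the sketch below assumes two generic higher-is-better scores per model. The function name `pareto_front`, the model names, and all numbers are hypothetical and are not taken from the paper or its code base.

```python
from typing import Dict, List, Tuple


def pareto_front(scores: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names of all non-dominated entries (higher is better on every axis)."""
    front = []
    for name, own in scores.items():
        dominated = any(
            # "other" dominates "own" if it is >= everywhere and > somewhere.
            all(o >= v for o, v in zip(other, own))
            and any(o > v for o, v in zip(other, own))
            for other_name, other in scores.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical scores for illustration only; NOT results from the paper.
example = {
    "model_a": (0.90, 0.60),
    "model_b": (0.85, 0.80),
    "model_c": (0.80, 0.70),  # worse than model_b on both axes
    "model_d": (0.70, 0.95),
}

print(pareto_front(example))  # ['model_a', 'model_b', 'model_d']
```

Several models can be Pareto-optimal at once, which is consistent with the study naming two models each for 2D and 3D rather than a single winner.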

Abstract

Promptable Foundation Models (FMs), initially introduced for natural image segmentation, have also revolutionized medical image segmentation. The increasing number of models, along with evaluations varying in datasets, metrics, and compared models, makes direct performance comparison between models difficult and complicates the selection of the most suitable model for specific clinical tasks. In our study, 11 promptable FMs are tested using non-iterative 2D and 3D prompting strategies on a private and a public dataset focusing on bone and implant segmentation in four anatomical regions (wrist, shoulder, hip and lower leg). The Pareto-optimal models are identified and further analyzed using human prompts collected through a dedicated observer study. Our findings are: 1) The segmentation performance varies substantially between FMs and prompting strategies; 2) The Pareto-optimal models in 2D are SAM and SAM2.1, in 3D nnInteractive and Med-SAM2; 3) Localization accuracy and rater consistency vary with anatomical structures, with higher consistency for simple structures (wrist bones) and lower consistency for complex structures (pelvis, tibia, implants); 4) The segmentation performance drops using human prompts, suggesting that performance reported on "ideal" prompts extracted from reference labels might overestimate the performance in a human-driven setting; 5) All models were sensitive to prompt variations. While two models demonstrated intra-rater robustness, it did not scale to inter-rater settings. We conclude that the selection of the most suitable FM for a human-driven setting remains challenging, with even high-performing FMs being sensitive to variations in human input prompts. Our code base for prompt extraction and model inference is available at https://github.com/CarolineMagg/segmentation-FM-benchmark/
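As a rough illustration of what an "ideal" prompt extracted from a reference label can look like, the sketch below derives a tight bounding box and a single point from a 2D binary mask. This is an assumption-laden example, not the authors' extraction procedure, which lives in the linked repository. The helper names `box_prompt` and `point_prompt`, the list-of-lists mask format, and the choice of "foreground pixel nearest the centroid" are all hypothetical.

```python
from typing import List, Tuple

Mask = List[List[int]]  # 2D binary mask: rows of 0 (background) / 1 (foreground)


def _foreground(mask: Mask) -> List[Tuple[int, int]]:
    """Collect (x, y) coordinates of all foreground pixels."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask has no foreground pixels")
    return coords


def box_prompt(mask: Mask) -> Tuple[int, int, int, int]:
    """Tight bounding box (x_min, y_min, x_max, y_max) around the foreground."""
    coords = _foreground(mask)
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def point_prompt(mask: Mask) -> Tuple[int, int]:
    """Foreground pixel closest to the centroid, so the point always lies on the object."""
    coords = _foreground(mask)
    cx = sum(x for x, _ in coords) / len(coords)
    cy = sum(y for _, y in coords) / len(coords)
    return min(coords, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)


# Toy ring-shaped label: its centroid (2, 2) is a background pixel.
ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

print(box_prompt(ring))    # (1, 1, 3, 3)
print(point_prompt(ring))  # (2, 1)
```

Prompts like these are perfectly aligned with the reference label by construction. A human rater clicks or draws without access to that label, which is one plausible reason the reported performance drops once human prompts are used.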
