Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation

arXiv cs.CV / 3/12/2026

Key Points

The study benchmarks 11 promptable foundation models for bone and implant segmentation across four anatomical regions (wrist, shoulder, hip, lower leg) using non-iterative 2D and 3D prompting on private and public datasets.
Pareto-optimal models in 2D are SAM and SAM2.1, and in 3D are nnInteractive and Med-SAM2, with performance highly dependent on the model and prompting strategy.
Localization accuracy and rater consistency vary by anatomical structure, being higher for simple structures (e.g., wrist bones) and lower for complex structures (e.g., pelvis, tibia, implants).
Segmentation performance drops when using human prompts compared with ideal prompts derived from reference labels, indicating that results reported with ideal prompts can overestimate performance in a human-driven setting.
The authors provide open-source code for prompt extraction and model inference and conclude that selecting the most suitable foundation model for human-driven clinical use remains challenging due to sensitivity to prompt variations.
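To make the notion of an "ideal" prompt derived from a reference label concrete, the following is a minimal sketch, not the authors' implementation (their extraction code lives in the linked repository). It assumes a 2D binary reference mask and two common prompt types, a tight bounding box and a single center point; the function names `bounding_box` and `center_point` are hypothetical.

```python
from typing import List, Tuple

Mask = List[List[int]]  # 2D binary mask: 1 = structure, 0 = background


def _foreground(mask: Mask) -> List[Tuple[int, int]]:
    """Collect (row, col) coordinates of all foreground pixels."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask has no foreground pixels")
    return coords


def bounding_box(mask: Mask) -> Tuple[int, int, int, int]:
    """Tight box around the structure: (row_min, col_min, row_max, col_max), inclusive."""
    coords = _foreground(mask)
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)


def center_point(mask: Mask) -> Tuple[int, int]:
    """Foreground pixel closest to the centroid, so the point always lies on the structure."""
    coords = _foreground(mask)
    centroid_r = sum(r for r, _ in coords) / len(coords)
    centroid_c = sum(c for _, c in coords) / len(coords)
    return min(coords, key=lambda p: (p[0] - centroid_r) ** 2 + (p[1] - centroid_c) ** 2)


# Toy reference mask (illustrative only, not data from the study).
reference = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(bounding_box(reference))  # (1, 1, 3, 3)
print(center_point(reference))  # (2, 1)
```

In the toy mask the rounded centroid would fall on a background pixel, which is why the sketch snaps the point to the nearest foreground pixel. A human rater has no access to the reference label, so their box or click will generally deviate from these values, which is the gap the observer study measures.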

Abstract

Promptable Foundation Models (FMs), initially introduced for natural image segmentation, have also revolutionized medical image segmentation. The increasing number of models, along with evaluations varying in datasets, metrics, and compared models, makes direct performance comparison between models difficult and complicates the selection of the most suitable model for specific clinical tasks. In our study, 11 promptable FMs are tested using non-iterative 2D and 3D prompting strategies on a private and a public dataset focusing on bone and implant segmentation in four anatomical regions (wrist, shoulder, hip and lower leg). The Pareto-optimal models are identified and further analyzed using human prompts collected through a dedicated observer study. Our findings are: 1) The segmentation performance varies substantially between FMs and prompting strategies; 2) The Pareto-optimal models in 2D are SAM and SAM2.1, and in 3D nnInteractive and Med-SAM2; 3) Localization accuracy and rater consistency vary with anatomical structures, with higher consistency for simple structures (wrist bones) and lower consistency for complex structures (pelvis, tibia, implants); 4) The segmentation performance drops using human prompts, suggesting that performance reported on "ideal" prompts extracted from reference labels might overestimate the performance in a human-driven setting; 5) All models were sensitive to prompt variations. While two models demonstrated intra-rater robustness, this robustness did not carry over to inter-rater settings. We conclude that the selection of the most suitable FM for a human-driven setting remains challenging, with even high-performing FMs being sensitive to variations in human input prompts. Our code base for prompt extraction and model inference is available: https://github.com/CarolineMagg/segmentation-FM-benchmark/
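The abstract does not state which objectives define the Pareto front, so the following is only a sketch of the general idea under an assumption: each model is described by two scores where higher is better (for example, accuracy under two different conditions). A model is Pareto-optimal if no other model is at least as good on both scores and strictly better on one. The model names and numbers are placeholders, not results from the study.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, float]  # two objectives, both "higher is better" (assumption)


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(models: Dict[str, Scores]) -> List[str]:
    """Return the names of all models that no other model dominates, sorted by name."""
    front = []
    for name, scores in models.items():
        dominated = any(
            dominates(other_scores, scores)
            for other_name, other_scores in models.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Placeholder scores for illustration only.
candidates = {
    "model_a": (0.90, 0.60),
    "model_b": (0.85, 0.80),
    "model_c": (0.80, 0.70),  # dominated by model_b on both objectives
    "model_d": (0.70, 0.90),
}
print(pareto_front(candidates))  # ['model_a', 'model_b', 'model_d']
```

The front can contain several models because each trades one objective against the other, which is consistent with the study reporting two Pareto-optimal models per setting rather than a single winner.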
