Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control

arXiv cs.CV / 4/22/2026


Key Points

  • The paper addresses the delays incurred when conventional deep learning (DL) approaches to C-arm control fail, proposing an agentic framework that uses multimodal large language models to incorporate clinician feedback and reason its way toward more accurate positioning.
  • It investigates adapting multimodal large language models for autonomous skeletal landmark localization, which is a prerequisite step for C-arm control.
  • The authors fine-tuned two MLLMs using both annotated synthetic X-ray data and real X-ray data, training the models to retrieve the closest skeletal landmarks from each image.
  • Quantitative results show the fine-tuned MLLMs perform competitively with a leading DL approach across localization tasks, while qualitative experiments demonstrate reasoning-based correction of incorrect predictions and sequential navigation of the C-arm toward a target.
  • The study releases code on GitHub, supporting further research toward agentic autonomous C-arm control systems.
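The fine-tuning task described above, pairing each X-ray with its closest skeletal landmarks, amounts to building a supervised dataset of image–label chat samples. A minimal sketch in Python, where the field names, prompt wording, and landmark list are illustrative assumptions rather than details taken from the paper:

```python
# Hypothetical sketch of building fine-tuning samples that pair an X-ray
# with its closest skeletal landmark, in a chat-style format commonly used
# for multimodal LLM fine-tuning. Names and format are assumptions.

LANDMARKS = ["L1 vertebra", "L5 vertebra", "left femoral head", "right iliac crest"]

def make_sample(image_path: str, closest_landmark: str) -> dict:
    """Return one supervised sample: the model sees the X-ray and must
    name the closest skeletal landmark."""
    if closest_landmark not in LANDMARKS:
        raise ValueError(f"unknown landmark: {closest_landmark}")
    return {
        "image": image_path,
        "messages": [
            {"role": "user",
             "content": "Which skeletal landmark is closest to the center "
                        "of this X-ray? Choose from: " + ", ".join(LANDMARKS)},
            {"role": "assistant", "content": closest_landmark},
        ],
    }

sample = make_sample("synthetic_xray_0001.png", "left femoral head")
print(sample["messages"][1]["content"])  # left femoral head
```

Casting localization as constrained text retrieval over a fixed landmark vocabulary keeps the output space small, which is what makes quantitative comparison against a conventional DL localizer straightforward.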

Abstract

Purpose: Automated C-arm positioning ensures timely treatment in patients requiring emergent interventions. When a conventional Deep Learning (DL) approach for C-arm control fails, clinicians must revert to manual operation, resulting in additional delays. Consequently, an agentic C-arm control framework based on multimodal large language models (MLLMs) is highly desirable, as it can incorporate clinician feedback and use reasoning to make adjustments toward more accurate positioning. Skeletal landmark localization is essential for C-arm control, and we investigate adapting MLLMs for autonomous landmark localization.

Methods: We used an annotated synthetic X-ray dataset and a real X-ray dataset. Each X-ray in both datasets is paired with several skeletal landmarks. We fine-tuned two MLLMs and tasked them with retrieving the closest landmarks from each X-ray. Quantitative evaluations of landmark localization were performed and compared against a leading DL approach. We further conducted qualitative experiments demonstrating: (1) how an MLLM can correct an initially incorrect prediction through reasoning, and (2) how the MLLM can sequentially navigate the C-arm toward a target location.

Results: On both datasets, fine-tuned MLLMs demonstrate competitive performance across all localization tasks when compared with the DL approach. In the qualitative experiments, the MLLMs provide evidence of reasoning and spatial awareness.

Conclusion: This study shows that fine-tuned MLLMs achieve accurate skeletal landmark localization and hold promise for agentic autonomous C-arm control. Our code is available at https://github.com/marszzibros/C-arm-localization-LLMs.git
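The sequential-navigation experiment in the abstract implies a query-and-move loop: after each C-arm adjustment, the model re-inspects the image and proposes the next move until the target is centered. A minimal sketch of that control loop, where `query_mllm` and `move_c_arm` are stand-in stubs (assumptions, not APIs from the paper) and the true offset is simulated in place of a real image:

```python
# Hypothetical sketch of sequential C-arm navigation driven by an MLLM.
# `query_mllm` and `move_c_arm` are simulated stand-ins, not the paper's code.
from typing import Tuple

def query_mllm(image: dict, target: str) -> Tuple[int, int]:
    """Stand-in for the fine-tuned MLLM: returns the estimated (dx, dy)
    offset of the target landmark from image center, in millimeters.
    Simulated here as the true offset stored in the fake image."""
    return image["offset"]

def move_c_arm(image: dict, dx: int, dy: int) -> dict:
    """Stand-in for the C-arm actuator: shifting the field of view by
    (dx, dy) reduces the remaining offset accordingly."""
    ox, oy = image["offset"]
    return {"offset": (ox - dx, oy - dy)}

def navigate(image: dict, target: str, tol: int = 5, max_steps: int = 10) -> bool:
    """Query the model, move, and repeat until the target is within
    tolerance of the image center or the step budget is exhausted."""
    for _ in range(max_steps):
        dx, dy = query_mllm(image, target)
        if abs(dx) <= tol and abs(dy) <= tol:
            return True
        image = move_c_arm(image, dx, dy)
    return False

print(navigate({"offset": (40, -25)}, "left femoral head"))  # True
```

The step budget and tolerance are safety-minded defaults of this sketch; a clinical system would additionally gate each proposed move on clinician confirmation, which is precisely the feedback channel the agentic framework is motivated by.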