An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis

arXiv cs.AI / 4/6/2026


Key Points

  • The paper addresses lumbar spinal stenosis (LSS) diagnosis from multi-view MRI, targeting the delays and inter-observer variability caused by labor-intensive manual interpretation.
  • It proposes an end-to-end explainable vision-language model that uses a Spatial Patch Cross-Attention module for text-directed, spatially precise localization of spinal anomalies.
  • It introduces an Adaptive PID-Tversky Loss that applies control-theory-inspired, dynamically adjusted penalties to better handle extreme class imbalance and under-segmented minority instances.
  • The approach combines foundational VLMs with an Automated Radiology Report Generation module to improve interpretability, translating segmentation outputs into radiologist-style clinical reports.
  • Reported results include 90.69% classification accuracy, a macro-averaged Dice score of 0.9512 for segmentation, and a CIDEr score of 92.80%, alongside claims of a new benchmark for transparent, supervised clinical AI.
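The summary does not give the exact formulation of the Adaptive PID-Tversky Loss, but the stated idea (control-theory-inspired, dynamically adjusted penalties for under-segmented minority instances) can be sketched as a Tversky loss whose false-negative penalty is steered by a discrete PID controller. All names, gains, and clamping ranges below are illustrative assumptions, not the paper's implementation:

```python
class AdaptivePIDTverskyLoss:
    """Illustrative sketch (not the paper's exact formulation): a Tversky
    loss whose false-negative penalty beta is adjusted by a discrete PID
    controller tracking the under-segmentation (false-negative) rate."""

    def __init__(self, alpha=0.3, beta=0.7, kp=0.5, ki=0.05, kd=0.1,
                 target_fn_rate=0.05, eps=1e-7):
        self.alpha, self.beta = alpha, beta        # FP / FN penalties
        self.kp, self.ki, self.kd = kp, ki, kd     # PID gains (assumed values)
        self.target = target_fn_rate               # acceptable FN rate
        self.integral = 0.0                        # accumulated error (I term)
        self.prev_error = 0.0                      # last error (for D term)
        self.eps = eps

    def __call__(self, probs, labels):
        # probs: flat list of predicted foreground probabilities in [0, 1]
        # labels: flat list of {0, 1} ground-truth labels, same length
        tp = sum(p * y for p, y in zip(probs, labels))
        fp = sum(p * (1 - y) for p, y in zip(probs, labels))
        fn = sum((1 - p) * y for p, y in zip(probs, labels))
        # Error signal: how far the FN rate exceeds the target rate.
        error = fn / (tp + fn + self.eps) - self.target
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        # PID update of beta, clamped to keep the loss well-behaved.
        self.beta = min(0.95, max(0.5, self.beta
                                  + self.kp * error
                                  + self.ki * self.integral
                                  + self.kd * derivative))
        self.alpha = 1.0 - self.beta               # common Tversky convention
        tversky = (tp + self.eps) / (tp + self.alpha * fp
                                     + self.beta * fn + self.eps)
        return 1.0 - tversky
```

Under this sketch, a batch with heavy under-segmentation pushes beta upward (penalizing false negatives more on subsequent batches), while a well-segmented batch lets it relax back toward its floor.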

Abstract

Lumbar Spinal Stenosis (LSS) diagnosis remains a critical clinical challenge: it depends heavily on labor-intensive manual interpretation of multi-view Magnetic Resonance Imaging (MRI), leading to substantial inter-observer variability and diagnostic delays. Existing vision-language models fail to address the extreme class imbalance prevalent in clinical segmentation datasets while preserving spatial accuracy, primarily because their global pooling mechanisms discard crucial anatomical hierarchies. We present an end-to-end Explainable Vision-Language Model framework designed to overcome these limitations through two principal contributions. First, a Spatial Patch Cross-Attention module enables precise, text-directed localization of spinal anomalies. Second, a novel Adaptive PID-Tversky Loss function integrates control-theory principles to dynamically adjust training penalties, specifically targeting difficult, under-segmented minority instances. By incorporating foundational VLMs alongside an Automated Radiology Report Generation module, our framework demonstrates considerable performance: a diagnostic classification accuracy of 90.69%, a macro-averaged Dice score of 0.9512 for segmentation, and a CIDEr score of 92.80%. Furthermore, the framework demonstrates explainability by converting complex segmentation predictions into radiologist-style clinical reports, thereby establishing a new benchmark for transparent, interpretable AI in clinical medical imaging that retains essential human supervision while enhancing diagnostic capability.
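The Spatial Patch Cross-Attention idea described above (text-directed localization that avoids global pooling) can be sketched as a text query attending over a grid of patch features, yielding a spatial relevance map rather than a single pooled score. The shapes, names, and single-head formulation below are assumptions for illustration, not the paper's architecture:

```python
import math

def spatial_patch_cross_attention(text_q, patch_feats):
    """Illustrative single-head sketch: a text query vector attends over an
    H x W grid of patch feature vectors, returning (1) a softmax attention
    map over patch positions and (2) a text-conditioned pooled feature.
    text_q: length-d list; patch_feats: H x W grid of length-d lists."""
    d = len(text_q)
    flat = [v for row in patch_feats for v in row]        # flatten H*W patches
    # Scaled dot-product score between the text query and each patch key.
    scores = [sum(q * k for q, k in zip(text_q, key)) / math.sqrt(d)
              for key in flat]
    m = max(scores)                                       # numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]                   # softmax over patches
    # Reshape the weights back to the patch grid: the localization map.
    W = len(patch_feats[0])
    attn_map = [weights[i * W:(i + 1) * W] for i in range(len(patch_feats))]
    # Text-conditioned feature: attention-weighted sum of patch features.
    attended = [sum(w * key[j] for w, key in zip(weights, flat))
                for j in range(d)]
    return attn_map, attended
```

Because the attention weights are kept in their H x W layout rather than collapsed by global pooling, the map directly indicates which patches the text query (e.g. a stenosis-related prompt) attends to, which is the spatial grounding the framework's reports would build on.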