Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment

arXiv cs.CL / 3/24/2026

Key Points

  • The paper presents an end-to-end multi-agent system that combines an LLM-based clinical variable extractor with CNN-based tumor segmentation to automate BT-RADS post-treatment brain tumor response classification.
  • Of 509 consecutive post-treatment glioma MRI examinations from a single high-volume center, 492 met inclusion criteria. On these, the system reached 76.0% accuracy versus 57.5% for the initial clinical assessments, a gain of 18.5 percentage points (P<.001).
  • Sensitivity was high for context-dependent BT-RADS categories (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%) and moderate for threshold-dependent categories (BT-3c 74.8%, BT-4 69.3%, BT-2 69.2%, BT-3b 57.1%).
  • For the clinically significant BT-4 category, positive predictive value was 92.9%, so a BT-4 call by the system was usually correct, although the 69.3% sensitivity means some reference-standard BT-4 examinations were assigned other categories.
  • The authors report that the multi-agent LLM approach produced higher agreement with an expert neuroradiologist reference standard than clinicians’ initial scoring, potentially supporting more standardized follow-up assessments.
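
The per-category sensitivity and positive predictive value figures in the key points can be illustrated with a minimal sketch. It assumes paired lists of reference-standard and system labels; the function name `per_category_metrics` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def per_category_metrics(reference, predicted):
    """Return {category: (sensitivity, ppv)} for two paired label lists.

    sensitivity = correct calls / examinations with that reference label
    ppv         = correct calls / examinations given that label by the system
    A value is None when its denominator is zero.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    ref_count = Counter(reference)
    pred_count = Counter(predicted)
    true_pos = Counter(r for r, p in zip(reference, predicted) if r == p)

    metrics = {}
    for category in sorted(set(ref_count) | set(pred_count)):
        tp = true_pos[category]
        sensitivity = tp / ref_count[category] if ref_count[category] else None
        ppv = tp / pred_count[category] if pred_count[category] else None
        metrics[category] = (sensitivity, ppv)
    return metrics


# Toy labels invented for illustration only (not data from the paper).
reference = ["BT-4", "BT-4", "BT-4", "BT-2", "BT-2", "BT-1a"]
predicted = ["BT-4", "BT-4", "BT-3c", "BT-2", "BT-2", "BT-1a"]
metrics = per_category_metrics(reference, predicted)

for category, (sensitivity, ppv) in metrics.items():
    print(category, sensitivity, ppv)
```

In the toy data, one of three reference BT-4 examinations is labeled BT-3c, so BT-4 sensitivity drops while its positive predictive value stays at 1.0. This is the same pattern the paper reports for BT-4: moderate sensitivity alongside high positive predictive value.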

Abstract

The Brain Tumor Reporting and Data System (BT-RADS) standardizes post-treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification. A multi-agent LLM system combined with automated CNN-based tumor segmentation was retrospectively evaluated on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT-RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%-79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%-61.8%) for initial clinical assessments (+18.5 percentage points; P<.001). Context-dependent categories showed high sensitivity (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), while threshold-dependent categories showed moderate sensitivity (BT-3c 74.8%, BT-2 69.2%, BT-4 69.3%, BT-3b 57.1%). For BT-4, positive predictive value was 92.9%. The multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard compared to initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4 detection.
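
The headline accuracy figures can be checked from the reported counts with a short sketch. The abstract does not say which confidence interval method was used; the Wilson score interval below is an assumption, and the function name `wilson_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, total, confidence=0.95):
    """Wilson score interval for a binomial proportion (assumed method)."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Counts as reported in the abstract.
system = wilson_interval(374, 492)    # automated system
clinical = wilson_interval(283, 492)  # initial clinical assessments
difference_points = (374 - 283) / 492 * 100

print(f"system:   {374 / 492:.1%}, 95% CI {system[0]:.1%} to {system[1]:.1%}")
print(f"clinical: {283 / 492:.1%}, 95% CI {clinical[0]:.1%} to {clinical[1]:.1%}")
print(f"difference: {difference_points:.1f} percentage points")
```

Under this assumption the intervals come out close to the reported 72.1%-79.6% and 53.1%-61.8%. The sketch does not reproduce the P value, because both sets of labels were scored on the same examinations and a paired test would need the per-examination results.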