TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols

arXiv cs.CL / 3/24/2026


Key Points

  • TimeTox is an LLM-based pipeline that automates extraction of “time toxicity” (cumulative healthcare contact days) from clinical trial protocol Schedule of Assessments tables.
  • The system uses Google Gemini in three stages: summary extraction from full protocol PDFs, quantifying time toxicity at six cumulative timepoints per treatment arm, and producing multi-run consensus via position-based arm matching.
  • In validation against synthetic schedules, a two-stage “structure-then-count” architecture achieved 100% clinically acceptable accuracy (within ±3 days; MAE 0.81 days), compared with 41.5% for a single-pass approach (MAE 9.0 days).
  • On 644 real-world oncology protocols, the single-pass (vanilla) pipeline was more reproducible than the two-stage pipeline across three runs, reaching 95.3% clinically acceptable accuracy (IQR ≤ 3 days) with 82.0% perfect stability (IQR = 0); the authors emphasize stability for production readiness.
  • A production pipeline run extracted time toxicity for 1,288 treatment arms across multiple disease sites, with the paper concluding that reproducibility on real-world data is the decisive factor for deployment.
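
The multi-run consensus and the IQR-based stability criteria can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of the per-timepoint median as the consensus value, and the choice to judge stability by the largest IQR across the six timepoints are all assumptions made for illustration. Only the thresholds (IQR = 0 for perfect stability, IQR ≤ 3 days for clinically acceptable) come from the summary above.

```python
from statistics import median, quantiles


def iqr(values):
    """Interquartile range (Q3 - Q1) using inclusive quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def consensus_by_position(runs):
    """Match arms across runs by list position (assumed meaning of
    'position-based arm matching') and summarize each timepoint.

    runs: list of runs; each run is a list of arms; each arm is a list of
    cumulative contact days at the six timepoints.
    """
    n_arms = min(len(run) for run in runs)  # ignore arms missing from any run
    summary = []
    for arm_idx in range(n_arms):
        # Transpose so each tuple holds one timepoint's values across runs.
        per_timepoint = list(zip(*(run[arm_idx] for run in runs)))
        summary.append({
            "consensus": [median(v) for v in per_timepoint],  # assumption: median
            "iqr": [iqr(v) for v in per_timepoint],
        })
    return summary


def stability_label(iqrs, acceptable_days=3):
    """Classify an arm by its largest IQR (assumption: worst timepoint decides)."""
    worst = max(iqrs)
    if worst == 0:
        return "perfect"
    return "acceptable" if worst <= acceptable_days else "unstable"


# Hypothetical data: three runs, two arms, six cumulative timepoints each.
runs = [
    [[3, 6, 9, 12, 15, 18], [2, 4, 6, 8, 10, 12]],
    [[3, 6, 9, 12, 15, 18], [2, 4, 6, 8, 10, 20]],
    [[3, 6, 9, 12, 15, 18], [2, 4, 6, 8, 10, 12]],
]
result = consensus_by_position(runs)
print(stability_label(result[0]["iqr"]))  # perfect
print(stability_label(result[1]["iqr"]))  # unstable
```

In this made-up example, the second arm's final timepoint differs in one run (12, 20, 12), giving an IQR of 4 days, which exceeds the 3-day threshold even though the median consensus value remains 12.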

Abstract

Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google's Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy (±3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR ≤ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.
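
The two validation metrics reported for the synthetic schedules, mean absolute error (MAE) and the share of comparisons within ±3 days, can be computed as in the following sketch. The function names and the sample values are hypothetical and are not taken from the paper; only the ±3 day tolerance comes from the abstract.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference in contact days between paired values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def acceptable_fraction(predicted, reference, tolerance_days=3):
    """Fraction of comparisons whose absolute error is within the tolerance."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    within = sum(abs(p - r) <= tolerance_days for p, r in zip(predicted, reference))
    return within / len(predicted)


# Made-up extracted vs. ground-truth cumulative contact days (four comparisons).
predicted = [3, 7, 10, 20]
reference = [3, 6, 14, 20]

print(mean_absolute_error(predicted, reference))  # 1.25
print(acceptable_fraction(predicted, reference))  # 0.75
```

Here the absolute errors are 0, 1, 4, and 0 days, so three of the four comparisons fall within the tolerance. Reporting both numbers is useful because a low MAE can hide a few large misses, while the tolerance-based rate does not show how far off those misses are.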