TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols

arXiv cs.CL / 3/24/2026

Opinion | Tools & Practical Usage | Models & Research


Key Points

TimeTox is an LLM-based pipeline that automates extraction of “time toxicity” (cumulative healthcare contact days) from clinical trial protocol Schedule of Assessments tables.
The system uses Google Gemini in three stages: summary extraction from full protocol PDFs, time toxicity quantification at six cumulative timepoints per treatment arm, and multi-run consensus via position-based arm matching.
In synthetic schedule validation, a two-stage “structure-then-count” architecture achieved 100% clinically acceptable accuracy (±3 days; MAE 0.81) compared with 41.5% for a single-pass approach (MAE 9.0).
On 644 real-world oncology protocols, the single-pass (vanilla) pipeline was the more reproducible of the two architectures across three runs, reaching 95.3% clinically acceptable accuracy (IQR ≤ 3 days) with 82.0% perfect stability (IQR = 0). The authors emphasize this stability as the basis for production readiness.
A production pipeline run extracted time toxicity for 1,288 treatment arms across multiple disease sites, with the paper concluding that reproducibility on real-world data is the decisive factor for deployment.
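
The two accuracy figures quoted above (MAE and the share of comparisons within ±3 days) can be illustrated with a minimal sketch. The function names and the sample values are assumptions made for illustration; they are not taken from the paper, and the authors' actual evaluation code may differ.

```python
def mae(predicted, reference):
    """Mean absolute error, in days, over paired comparisons."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def acceptable_rate(predicted, reference, tolerance_days=3):
    """Fraction of comparisons whose absolute error is within the tolerance."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(abs(p - r) <= tolerance_days for p, r in zip(predicted, reference))
    return hits / len(predicted)


# Made-up contact-day counts for one arm at six timepoints (illustration only).
reference = [4, 9, 15, 22, 30, 41]
predicted = [4, 10, 15, 20, 34, 41]

print(f"MAE: {mae(predicted, reference):.2f} days")
print(f"Within ±3 days: {acceptable_rate(predicted, reference):.1%}")
```

In this toy example one of the six comparisons misses by four days, so it falls outside the ±3-day tolerance even though the average error stays small. This is why the two metrics are reported together.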

Abstract

Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google's Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy (±3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR ≤ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.
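
The multi-run consensus step and the IQR-based stability measure can be sketched as follows. This is a hedged illustration, not the authors' implementation. It assumes each run returns its treatment arms in the same order, that each arm is a list of six cumulative contact-day counts, that the consensus is the per-timepoint median, and that the IQR uses linear interpolation (`method="inclusive"`). The name `consensus_by_position` and all sample numbers are hypothetical.

```python
from statistics import median, quantiles


def consensus_by_position(runs):
    """Match arms across runs by list position; return per-arm medians and IQRs."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to measure stability")
    if len({len(run) for run in runs}) != 1:
        # Position-based matching is only meaningful when arm counts agree.
        raise ValueError("runs disagree on the number of arms")
    results = []
    for arm_versions in zip(*runs):        # the same arm position in every run
        medians, iqrs = [], []
        for values in zip(*arm_versions):  # the same timepoint in every run
            q1, _, q3 = quantiles(values, n=4, method="inclusive")
            medians.append(median(values))
            iqrs.append(q3 - q1)
        results.append({"consensus": medians, "iqr": iqrs})
    return results


# Three made-up runs, two arms each, six cumulative timepoints per arm.
run1 = [[2, 5, 9, 14, 20, 27], [1, 3, 6, 10, 15, 21]]
run2 = [[2, 5, 9, 14, 20, 27], [1, 3, 6, 10, 15, 21]]
run3 = [[2, 5, 9, 16, 20, 27], [1, 3, 6, 10, 15, 21]]

result = consensus_by_position([run1, run2, run3])
for index, arm in enumerate(result):
    worst = max(arm["iqr"])
    print(f"arm {index}: consensus={arm['consensus']} max IQR={worst} "
          f"perfectly stable={worst == 0} acceptable={worst <= 3}")
```

Here the second arm is perfectly stable (IQR = 0 at every timepoint), while the first arm has one disagreeing run at the fourth timepoint and still falls within the IQR ≤ 3 days threshold. How the paper aggregates these per-timepoint values into its 95.3% and 82.0% figures is not stated in the abstract, so the per-arm maximum used here is only one plausible choice.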

