Automated Detection of Dosing Errors in Clinical Trial Narratives: A Multi-Modal Feature Engineering Approach with LightGBM

arXiv cs.CL / 4/23/2026


Key Points

  • The study introduces an automated system to detect medication dosing errors in unstructured clinical trial narratives using gradient boosting (LightGBM) with multi-modal feature engineering.
  • It builds a large, diverse feature set (3,451 features) combining traditional NLP signals (TF-IDF, character n-grams), dense semantic embeddings (all-MiniLM-L6-v2), medical domain patterns, and transformer-derived scores (BiomedBERT, DeBERTa-v3) extracted from nine text fields.
  • Evaluated on the CT-DEB benchmark with strong class imbalance (4.9% positives), the model attains 0.8725 test ROC-AUC using a 5-fold ensemble, with cross-validation showing 0.8833 ± 0.0091 AUC.
  • Ablation results show that removing sentence embeddings causes the largest drop in performance (~2.39%), and a feature-efficiency analysis indicates that selecting only the top 500–1,000 features can outperform using all features.
  • The findings emphasize feature selection as an effective form of regularization and demonstrate that sparse lexical features still add value alongside dense representations for specialized clinical text classification.

Abstract

Clinical trials require strict adherence to medication protocols, yet dosing errors remain a persistent challenge affecting patient safety and trial integrity. We present an automated system for detecting dosing errors in unstructured clinical trial narratives using gradient boosting with comprehensive multi-modal feature engineering. Our approach combines 3,451 features spanning traditional NLP (TF-IDF, character n-grams), dense semantic embeddings (all-MiniLM-L6-v2), domain-specific medical patterns, and transformer-based scores (BiomedBERT, DeBERTa-v3), used to train a LightGBM model. Features are extracted from nine complementary text fields (median 5,400 characters per sample), ensuring complete coverage across all 42,112 clinical trial narratives. On the CT-DEB benchmark dataset with severe class imbalance (4.9% positive rate), we achieve 0.8725 test ROC-AUC through 5-fold ensemble averaging (cross-validation: 0.8833 ± 0.0091 AUC). Systematic ablation studies reveal that removing sentence embeddings causes the largest performance degradation (2.39%), demonstrating their critical role despite contributing only 37.07% of total feature importance. Feature efficiency analysis demonstrates that selecting the top 500-1000 features yields optimal performance (0.886-0.887 AUC), outperforming the full 3,451-feature set (0.879 AUC) through effective noise reduction. Our findings highlight the importance of feature selection as a regularization technique and demonstrate that sparse lexical features remain complementary to dense representations for specialized clinical text classification under severe class imbalance.