Automated Detection of Dosing Errors in Clinical Trial Narratives: A Multi-Modal Feature Engineering Approach with LightGBM
arXiv cs.CL / 4/23/2026
💬 OpinionDeveloper Stack & InfrastructureModels & Research
Key Points
- The study introduces an automated system to detect medication dosing errors in unstructured clinical trial narratives using gradient boosting (LightGBM) with multi-modal feature engineering.
- It builds a large, diverse feature set (3,451 features) combining traditional NLP signals (TF-IDF, character n-grams), dense semantic embeddings (all-MiniLM-L6v2), medical domain patterns, and transformer-derived scores (BiomedBERT, DeBERTa-v3) extracted from nine text fields.
- Evaluated on the CT-DEB benchmark with strong class imbalance (4.9% positives), the model attains 0.8725 test ROC-AUC using a 5-fold ensemble, with cross-validation showing 0.8833 ± 0.0091 AUC.
- Ablation results show that removing sentence embeddings causes the largest drop in performance (~2.39%), and a feature-efficiency analysis indicates that selecting only the top 500–1,000 features can outperform using all features.
- The findings emphasize feature selection as an effective form of regularization and demonstrate that sparse lexical features still add value alongside dense representations for specialized clinical text classification.
Related Articles

Trajectory Forecasts in Unknown Environments Conditioned on Grid-Based Plans
Dev.to

Elevating Austria: Google invests in its first data center in the Alps.
Google Blog

OpenAI Just Named It Workspace Agents. We Open-Sourced Our Lark Version Six Months Ago
Dev.to

GPT Image 2 Subject-Lock Editing: A Practical Guide to input_fidelity
Dev.to

AI Tutor That Works Offline — Study Anywhere with EaseLearn AI
Dev.to