Benchmarking Single-Pose Docking, Consensus Rescoring, and Supervised ML on the LIT-PCBA Library: A Critical Evaluation of DiffDock, AutoDock-GPU, GNINA, and DiffDock-NMDN

arXiv cs.LG / 5/5/2026


Key Points

  • The study benchmarks multiple virtual screening workflows on the experimentally derived LIT-PCBA library, using 15 targets and 578,295 ligand–target pairs with confirmed actives/inactives.
  • For pose generation, the authors compare AutoDock-GPU and DiffDock, then rescore the resulting poses with GNINA and NMDN; GNINA rescoring of AutoDock-GPU poses (AutoDock-GNINA) emerges as the strongest single method, with a median EF1% of 2.14 (the EF1% metric is sketched in code after this list).
  • DiffDock-based pipelines generally underperform AutoDock-GNINA, with particularly challenging cases such as OPRK1.
  • Consensus rescoring/ranking strategies improve robustness but still do not beat the top single scorer; supervised ML re-ranking provides the largest benefit, reaching a median EF1% of 4.49 (+110% over AutoDock-GNINA). Sketches of both strategies follow the Abstract.
  • Overall, the work concludes that no single docking method is universally dominant and that validated, cost-effective classical+ML hybrid pipelines with supervised re-ranking currently deliver the most practical early enrichment for virtual screening.
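
EF1%, the metric quoted throughout, measures early enrichment: how over-represented confirmed actives are in the top 1% of the ranked library relative to a random pick. The paper does not publish its implementation, so the sketch below is just the standard definition (the function name and arguments are ours):

```python
import numpy as np

def ef_at_fraction(scores, labels, fraction=0.01):
    """Enrichment factor at the top `fraction` of the ranked library:
    the active rate among the best-scored compounds divided by the
    active rate in the whole library. EF1% uses fraction=0.01."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)    # 1 = active, 0 = inactive
    n_top = max(1, int(round(fraction * len(scores))))
    top = np.argsort(-scores)[:n_top]         # indices of best-scored compounds
    return labels[top].mean() / labels.mean()
```

Under this definition, AutoDock-GNINA's median EF1% of 2.14 means the top 1% of its ranking holds roughly twice as many actives as a random 1% slice of the library would, which puts the headline numbers in perspective.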

Abstract

Virtual screening performance depends heavily on the chosen docking and scoring methods. Recent AI-based tools such as DiffDock and NMDN have reported strong benchmark results, but their practical utility on realistic, experimentally derived datasets remains unclear. Here we perform a large-scale evaluation on the LIT-PCBA library (15 targets, 578,295 ligand–target pairs with experimentally confirmed actives and inactives). We compare AutoDock-GPU and DiffDock for pose generation, followed by rescoring with GNINA and NMDN. We further evaluate rank-based consensus strategies and supervised machine learning models trained on docking features. GNINA rescoring of AutoDock-GPU poses (AutoDock-GNINA) emerged as the strongest single method with a median EF1% of 2.14. DiffDock-based approaches underperformed relative to AutoDock-GNINA, particularly on challenging targets such as OPRK1. Carefully designed consensus ranking improved robustness but did not surpass the best single scorer. Supervised ML re-ranking delivered the largest gains, achieving a median EF1% of 4.49 (+110% over AutoDock-GNINA). Our results highlight that even the best classical+ML hybrid workflows provide only modest early enrichment on realistic benchmarks. We conclude that no single docking method dominates across targets and that rigorously validated, cost-effective combinations with supervised re-ranking currently offer the most practical value for virtual screening.
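
The abstract names two re-ranking strategies: rank-based consensus over several scorers, and supervised ML trained on docking features. The paper's exact implementations are not given here, so the following is a minimal sketch under common-sense assumptions: rank averaging as the consensus rule, and a logistic-regression classifier as a stand-in for the study's ML models, with per-scorer docking scores as features. All names are illustrative.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import LogisticRegression

def consensus_rank(score_matrix):
    """Rank-based consensus: convert each scorer's column of scores
    into ranks (1 = best) and average the ranks across scorers.
    score_matrix: (n_ligands, n_scorers), higher score = better.
    Returns mean ranks; lower = better consensus."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    per_scorer = [rankdata(-score_matrix[:, j])
                  for j in range(score_matrix.shape[1])]
    return np.mean(per_scorer, axis=0)

def supervised_rerank(X_train, y_train, X_test):
    """Supervised re-ranking: fit a classifier on docking-derived
    features of known actives/inactives, then score the test set by
    predicted probability of activity (higher = ranked earlier)."""
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(X_train, y_train)
    return clf.predict_proba(X_test)[:, 1]
```

The feature construction (e.g., AutoDock-GPU, GNINA, and NMDN scores per compound) and the per-target train/test protocol are assumptions here; whatever the study's exact setup, the reported +110% gain over AutoDock-GNINA indicates that even simple supervised combinations of docking scores can add substantial early enrichment.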