Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction

arXiv cs.LG / 4/30/2026


Key Points

  • A new arXiv benchmark study tests the “bigger models always win” idea in drug discovery across 22 molecular property/activity endpoints using held-out evaluations and structure-similarity-separated five-fold cross-validation.
  • Classical ML methods (e.g., RF on ECFP4 and ExtraTrees on RDKit descriptors) lead in 10 primary-metric tasks, while GNN approaches (e.g., GIN, Ligandformer) lead in 9 and pretrained molecular sequence models (e.g., MoLFormer, ChemBERTa2) lead in 3.
  • Rule-based SAR reasoning baselines (GPT5.5-SAR, Opus4.7-SAR) do not win on the study's prespecified primary metrics, though SAR knowledge derived from the training folds can still yield measurable but uneven improvements in SAR reasoning and interpretation.
  • The paper concludes that compact, specialized models can remain highly effective, and that greater model size or generality does not guarantee universal gains; performance is endpoint- and protocol-dependent.
  • Larger/general models may still be useful for zero-shot reasoning, SAR interpretation, and hypothesis generation, but best results depend on matching molecular representation, inductive bias, data regime, biology of the endpoint, and validation setup.
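The structure-similarity-separated cross-validation behind these comparisons can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's pipeline: it assumes molecules have already been clustered by fingerprint similarity (e.g. Butina clustering on ECFP4 Tanimoto distances, a common choice), and uses random arrays as stand-ins for fingerprints, labels, and cluster IDs.

```python
# Sketch of structure-similarity-separated five-fold cross-validation
# with an RF(fingerprint) baseline. Synthetic data throughout; cluster
# IDs stand in for similarity clusters computed from real structures.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_mols = 200
X = rng.random((n_mols, 64))                 # stand-in for ECFP4 bit vectors
y = rng.integers(0, 2, size=n_mols)          # binary activity endpoint
clusters = rng.integers(0, 40, size=n_mols)  # similarity-cluster IDs

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=clusters):
    # No cluster appears on both sides of the split, so held-out
    # molecules are structurally dissimilar to the training set.
    assert not set(clusters[train_idx]) & set(clusters[test_idx])
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
```

Splitting by cluster rather than by individual molecule is what makes the evaluation a test of generalization to unseen chemotypes, rather than of memorizing near-duplicates.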

Abstract

The rapid growth of molecular foundation models and general-purpose large language models has encouraged a scale-centric view of artificial intelligence in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and task-specific graph neural networks (GNNs). We test this assumption on 22 molecular property and activity endpoints, including public ADMET and Tox21 benchmarks and two internal anti-infective activity datasets. Across 167,056 held-out task–molecule evaluations under structure-similarity-separated five-fold cross-validation (37,756 ADMET, 77,946 Tox21, 49,266 anti-TB and 2,088 antimalarial), classical machine-learning (ML) models such as RF(ECFP4) and ExtraTrees(RDKit descriptors) win ten primary-metric tasks, GNNs such as GIN and Ligandformer win nine, and pretrained molecular sequence models such as MoLFormer and ChemBERTa2 win three. Rule-based SAR reasoning baselines, represented by GPT5.5-SAR and Opus4.7-SAR, do not win under the prespecified primary metrics, although train-fold-derived SAR knowledge provides measurable but uneven gains for SAR reasoning and interpretation. These results indicate that compact, specialized models remain highly effective for molecular property and activity prediction. The performance differences among classical ML, GNN and pretrained sequence models are often modest and endpoint-dependent, whereas larger or more general models do not provide a universal predictive advantage. Large models may still add value for zero-shot reasoning, SAR interpretation and hypothesis generation, but the results suggest that predictive performance depends on the alignment among molecular representation, inductive bias, data regime, endpoint biology and validation protocol.