When Does Context Help? A Systematic Study of Target-Conditional Molecular Property Prediction

arXiv cs.LG / 4/9/2026


Key Points

  • The paper presents the first systematic study of when and how target-conditioned context improves molecular property prediction, evaluated across 10 protein families, 4 fusion architectures, and multiple training-data regimes under both temporal and random evaluation splits.
  • It finds that the fusion mechanism matters most: the FiLM-based NestDrug architecture outperforms simpler context-incorporation methods, beating concatenation by 24.2 percentage points and additive conditioning by 8.6 pp.
  • Context can enable predictions that standard approaches cannot, particularly in data-scarce settings such as CYP3A4 (67 training compounds), where multi-task transfer with context reaches 0.686 AUC while a per-target Random Forest collapses to 0.238.
  • The authors also show that context can hurt performance under distribution mismatch, causing a 10.2 pp degradation on BACE1, and that few-shot adaptation consistently underperforms zero-shot evaluation.
  • The study exposes major benchmarking issues, including abnormally high scores from non-learning baselines (1-nearest-neighbor Tanimoto reaches 0.991 AUC on DUD-E) and leakage of 50% of actives from training data, while reporting robust temporal-split generalization to future chemical space.
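To make the fusion comparison concrete, here is a minimal sketch of the three conditioning schemes in plain Python (hypothetical toy dimensions and weights, not the paper's implementation): FiLM uses the target context to produce per-feature scale and shift parameters for the molecular embedding, additive conditioning adds a projected context vector, and concatenation simply appends the context.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def film_condition(h, c, W_gamma, W_beta):
    """FiLM fusion: context c produces per-feature scales and shifts for h."""
    gamma = matvec(W_gamma, c)  # per-feature scale from target context
    beta = matvec(W_beta, c)    # per-feature shift from target context
    return [g * h_i + b for g, h_i, b in zip(gamma, h, beta)]

def additive_condition(h, c_proj):
    """Additive baseline: add a context projection to the features."""
    return [h_i + c_i for h_i, c_i in zip(h, c_proj)]

def concat_condition(h, c):
    """Concatenation baseline: append the context to the features."""
    return h + c

# Toy example: 3-dim molecular embedding, 2-dim target context.
h = [1.0, 2.0, 3.0]
c = [0.5, -1.0]
W_gamma = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3x2: context -> scales
W_beta = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # 3x2: context -> shifts

out = film_condition(h, c, W_gamma, W_beta)
# -> [0.5, -1.5, -2.5]
```

The key structural difference: FiLM lets the target modulate every feature multiplicatively, so the same molecular embedding can be reshaped per target, whereas concatenation leaves the interaction to downstream layers.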

Abstract

We present the first systematic study of when target context helps molecular property prediction, evaluating context conditioning across 10 diverse protein families, 4 fusion architectures, data regimes spanning 67-9,409 training compounds, and both temporal and random evaluation splits. Using NestDrug, a FiLM-based architecture that conditions molecular representations on target identity, we characterize both success and failure modes with three principal findings. First, fusion architecture dominates: FiLM outperforms concatenation by 24.2 percentage points and additive conditioning by 8.6 pp; how you incorporate context matters more than whether you include it. Second, context enables otherwise impossible predictions: on data-scarce CYP3A4 (67 training compounds), multi-task transfer achieves 0.686 AUC where per-target Random Forest collapses to 0.238. Third, context can systematically hurt: distribution mismatch causes 10.2 pp degradation on BACE1; few-shot adaptation consistently underperforms zero-shot. Beyond methodology, we expose fundamental flaws in standard benchmarking: 1-nearest-neighbor Tanimoto achieves 0.991 AUC on DUD-E without any learning, and 50% of actives leak from training data, rendering absolute performance metrics meaningless. Our temporal split evaluation (train up to 2020, test 2021-2024) achieves stable 0.843 AUC with no degradation, providing the first rigorous evidence that context-conditional molecular representations generalize to future chemical space.
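The non-learning baseline that exposes the benchmark flaw above can be sketched as follows (a minimal illustration with made-up fingerprints, not the authors' code): binary fingerprints are treated as sets of on-bit indices, and each query compound is scored by its Tanimoto similarity to the nearest training compound. If actives leak between splits, this memorization-only scorer achieves near-perfect AUC.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def one_nn_score(query_fp, train):
    """Score a query by its nearest training neighbor.

    train: list of (fingerprint_set, label) pairs.
    Returns (similarity to nearest neighbor, neighbor's label).
    """
    best_fp, best_label = max(train, key=lambda t: tanimoto(query_fp, t[0]))
    return tanimoto(query_fp, best_fp), best_label

# Hypothetical fingerprints: one active, one inactive in the training set.
train = [({1, 2, 3, 4}, 1), ({7, 8, 9}, 0)]
sim, label = one_nn_score({1, 2, 3, 5}, train)
# nearest neighbor is the active {1, 2, 3, 4}: sim = 3/5 = 0.6, label = 1
```

A leaked active (identical fingerprint in both splits) would score sim = 1.0 with the correct label, which is why 1-NN Tanimoto alone reaches 0.991 AUC on DUD-E.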