In-Context Molecular Property Prediction with LLMs: A Blinding Study on Memorization and Knowledge Conflicts

arXiv cs.LG / March 30, 2026


Key Points

  • The paper studies whether LLMs genuinely perform in-context molecular property regression or mainly rely on memorization, addressing concerns about benchmark contamination.
  • It runs progressively blinded experiments that iteratively reduce the information available to the model, disentangling the contributions of pre-trained knowledge from those of in-context examples.
  • Nine LLM variants from GPT-4.1, GPT-5, and Gemini 2.5 families are evaluated on three MoleculeNet datasets (Delaney solubility, Lipophilicity, QM7 atomization energy).
  • The experiments include controlled in-context sample sizes (0-, 60-, and 1000-shot) to test how the amount of provided context affects performance and potential memorization behavior.
  • The authors propose a principled evaluation framework to assess molecular property prediction under controlled information access and to surface conflicts between pre-training and in-context learning.
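The two controls described above, representation blinding and a varying in-context shot count, can be sketched as follows. This is a minimal illustration under assumed conventions: the function names, the blinding levels (`full`, `anonymized`, `opaque`), and the prompt layout are hypothetical, not the authors' exact protocol.

```python
# Hypothetical sketch of blinded k-shot prompt construction for
# in-context molecular property regression. Blinding levels and
# prompt format are illustrative assumptions.

def blind(smiles: str, level: str) -> str:
    """Reduce the information carried by a molecule's representation."""
    if level == "full":        # no blinding: raw SMILES string
        return smiles
    if level == "anonymized":  # hide identity, keep a stable handle
        return f"MOL_{abs(hash(smiles)) % 10_000:04d}"
    if level == "opaque":      # strip all molecule-specific content
        return "MOL"
    raise ValueError(f"unknown blinding level: {level}")

def build_prompt(train, query_smiles, k, level="full"):
    """Assemble a k-shot prompt (k=0 gives a zero-shot query)."""
    shots = train[:k]
    lines = [f"{blind(s, level)} -> {y:.3f}" for s, y in shots]
    lines.append(f"{blind(query_smiles, level)} -> ?")
    return "\n".join(lines)

# Toy examples in the style of Delaney solubility (SMILES, logS).
train = [("CCO", -0.77), ("c1ccccc1", -2.13), ("CC(=O)O", 0.09)]
print(build_prompt(train, "CCN", k=2, level="full"))
```

Sweeping `k` over 0, 60, and 1000 while stepping through blinding levels reproduces the paper's two axes of controlled information access.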

Abstract

The capabilities of large language models (LLMs) have expanded beyond natural language processing to scientific prediction tasks, including molecular property prediction. However, their effectiveness in in-context learning remains ambiguous, particularly given the potential for training data contamination in widely used benchmarks. This paper investigates whether LLMs perform genuine in-context regression on molecular properties or rely primarily on memorized values. Furthermore, we analyze the interplay between pre-trained knowledge and in-context information through a series of progressively blinded experiments. We evaluate nine LLM variants across three families (GPT-4.1, GPT-5, Gemini 2.5) on three MoleculeNet datasets (Delaney solubility, Lipophilicity, QM7 atomization energy) using a systematic blinding approach that iteratively reduces available information. Complementing this, we utilize varying in-context sample sizes (0-, 60-, and 1000-shot) as an additional control for information access. This work provides a principled framework for evaluating molecular property prediction under controlled information access, addressing concerns regarding memorization and exposing conflicts between pre-trained knowledge and in-context information.
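Scoring such a setup requires turning free-text LLM replies into numbers before computing a regression error. A minimal sketch, assuming a simple first-number parsing rule and RMSE as the metric (the paper's exact parsing and metrics are not specified here):

```python
import math
import re

def parse_prediction(reply: str):
    """Pull the first numeric value out of a free-text model reply.

    Hypothetical parsing rule; does not handle scientific notation.
    """
    m = re.search(r"-?\d+(?:\.\d+)?", reply)
    return float(m.group()) if m else None

def rmse(preds, targets):
    """Root-mean-square error over replies that parsed successfully."""
    pairs = [(p, t) for p, t in zip(preds, targets) if p is not None]
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))

# Toy replies for three query molecules with known targets.
replies = ["The solubility is about -0.8 logS.", "-2.0", "no idea"]
preds = [parse_prediction(r) for r in replies]
print(rmse(preds, [-0.77, -2.13, 0.09]))
```

Comparing this error across blinding levels and shot counts is what separates genuine in-context regression from recall of memorized benchmark values: a model relying on memorization should degrade sharply once identifying information is blinded, regardless of how many in-context examples it receives.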