Can AI Scientist Agents Learn from Lab-in-the-Loop Feedback? Evidence from Iterative Perturbation Discovery

arXiv cs.LG / 3/30/2026


Key Points

  • The study addresses whether LLM-based scientific agents can truly learn from experimental feedback, testing iterative perturbation discovery in Cell Painting high-content screening via 800 independently replicated experiments.
  • An LLM agent that updates hypotheses using feedback outperforms a zero-shot baseline, yielding a +53.4% average increase in discoveries per feature (p = 0.003).
  • A random-feedback control that permutes hit/miss labels eliminates the performance gain, showing the benefit depends on the feedback signal structure rather than merely triggering prompt-based recall.
  • The results indicate a capability threshold for effective in-context learning from feedback: upgrading from Claude Sonnet 4.5 to 4.6 sharply reduces gene hallucinations and turns previously non-significant gains into a large, significant improvement (+11.0, p = 0.003).
  • Overall, the paper provides evidence that lab-in-the-loop feedback can drive genuine in-context learning for scientific experimentation, but only when model capability is sufficiently high.

Abstract

Recent work has questioned whether large language models (LLMs) can perform genuine in-context learning (ICL) for scientific experimental design, with prior studies suggesting that LLM-based agents exhibit no sensitivity to experimental feedback. We shed new light on this question by carrying out 800 independently replicated experiments on iterative perturbation discovery in Cell Painting high-content screening. We compare an LLM agent that iteratively updates its hypotheses using experimental feedback to a zero-shot baseline that relies solely on pretraining knowledge retrieval. Access to feedback yields a +53.4% increase in discoveries per feature on average (p = 0.003). To test whether this improvement arises from genuine feedback-driven learning rather than prompt-induced recall of pretraining knowledge, we introduce a random-feedback control in which hit/miss labels are permuted. Under this control, the performance gain disappears, indicating that the observed improvement depends on the structure of the feedback signal (+13.0 hits, p = 0.003). We further examine how model capability affects feedback utilization. Upgrading from Claude Sonnet 4.5 to 4.6 reduces gene hallucination rates from ~33–45% to ~3–9%, converting a non-significant ICL effect (+0.8, p = 0.32) into a large and highly significant improvement (+11.0, p = 0.003) for the best ICL strategy. These results suggest that effective in-context learning from experimental feedback emerges only once models reach a sufficient capability threshold.
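The random-feedback control described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, gene symbols, and data layout are hypothetical. The key idea is that shuffling hit/miss labels across genes preserves the overall hit rate while destroying the gene-to-outcome mapping, so any gain that survives the shuffle cannot come from the feedback's structure.

```python
import random

def make_feedback(genes, hits, permute_labels=False, seed=None):
    """Pair each proposed gene with its hit/miss label.

    With permute_labels=True, labels are shuffled across genes
    (the random-feedback control): the marginal hit rate is kept,
    but the gene-to-outcome mapping is destroyed.
    NOTE: illustrative sketch only, not the paper's actual code.
    """
    labels = list(hits)
    if permute_labels:
        rng = random.Random(seed)
        rng.shuffle(labels)
    return list(zip(genes, labels))

# Informative feedback: each label stays attached to its gene.
genes = ["TP53", "MYC", "EGFR", "KRAS"]
hits = [True, False, True, False]
print(make_feedback(genes, hits))

# Control: same number of hits, but the mapping is scrambled.
print(make_feedback(genes, hits, permute_labels=True, seed=0))
```

Under this design, an agent that genuinely learns from feedback should improve only in the first condition; matched performance across both conditions would indicate the prompt is merely triggering pretraining recall.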