Abstract
Recent work has questioned whether large language models (LLMs) can perform genuine in-context learning (ICL) for scientific experimental design, with prior studies suggesting that LLM-based agents exhibit no sensitivity to experimental feedback. We shed new light on this question by carrying out 800 independently replicated experiments on iterative perturbation discovery in Cell Painting high-content screening. We compare an LLM agent that iteratively updates its hypotheses using experimental feedback to a zero-shot baseline that relies solely on pretraining knowledge retrieval. Access to feedback yields a +53.4\% increase in discoveries per feature on average (p = 0.003). To test whether this improvement arises from genuine feedback-driven learning rather than prompt-induced recall of pretraining knowledge, we introduce a random feedback control in which hit/miss labels are permuted. Under this control the performance gain disappears: genuine feedback outperforms permuted feedback by +13.0 hits (p = 0.003), indicating that the observed improvement depends on the structure of the feedback signal. We further examine how model capability affects feedback utilization. Upgrading from Claude Sonnet 4.5 to 4.6 reduces gene hallucination rates from {\sim}33--45\% to {\sim}3--9\%, converting a non-significant ICL effect (+0.8, p = 0.32) into a large and highly significant improvement (+11.0, p = 0.003) for the best ICL strategy. These results suggest that effective in-context learning from experimental feedback emerges only once models reach a sufficient capability threshold.