Ran a controlled experiment measuring whether LLM coding agents benefit from access to research literature during automated experimentation.

Setup: Two identical runs using Karpathy's autoresearch framework. A Claude Code agent optimizing a ~7M-param GPT-2 on TinyStories. M4 Pro, 100 experiments each, same seed config. The only variable: one agent had access to an MCP server that does full-text search over 2M+ CS papers and returns synthesized methods with citations.

Results:
Gap was 3.2% and still widening at the 2-hour mark. Techniques the paper-augmented agent found:
What didn't work:
Key observation: Both agents attempted halving the batch size. Without literature access, the agent didn't adjust the learning rate, and the run diverged. With access, it retrieved the sqrt scaling rule, applied it correctly on the first attempt, then successfully halved again to 16K.

Interpretation: The agent without papers was limited to techniques already encoded in its weights, essentially the "standard ML playbook." The paper-augmented agent accessed techniques published after its training cutoff (AdaGC, Feb 2025) and surfaced techniques it may have seen during training but didn't retrieve unprompted (the sqrt scaling rule, 2022).

This was deliberately tested on TinyStories, arguably the most well-explored small-scale setting in ML, to make the comparison harder. The effect would likely be larger on less-explored problems.

Limitations: Single run per condition. The model is tiny (7M params). Some of the improvement may come from the agent spending more time reasoning about each technique rather than from the paper content itself. More controlled ablations are needed.

I built the paper search MCP server (Paper Lantern) for this experiment. Free to try: https://code.paperlantern.ai

Full writeup with methodology, all 15 paper citations, and appendices: https://www.paperlantern.ai/blog/auto-research-case-study

Would be curious to see this replicated at larger scale or on different domains.
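The sqrt scaling rule the agent retrieved says that when the batch size changes by a factor k, the learning rate should be scaled by sqrt(k). A minimal sketch of that adjustment; the base learning rate below is hypothetical (the post doesn't state the actual values), though the 32K-to-16K halving matches the run described:

```python
import math

def sqrt_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Scale the learning rate by sqrt(new_batch / base_batch).

    The sqrt batch-size scaling rule: lr is proportional to
    sqrt(batch_size), so halving the batch multiplies lr by sqrt(0.5).
    """
    return base_lr * math.sqrt(new_batch / base_batch)

# Hypothetical base lr; halving a 32K batch to 16K as in the run described
lr = sqrt_scaled_lr(base_lr=3e-4, base_batch=32768, new_batch=16384)
# lr ≈ 3e-4 * 0.707 ≈ 2.12e-4
```

Skipping this adjustment (as the no-papers agent did) leaves the learning rate too high for the smaller, noisier gradient estimates, which is consistent with the observed divergence.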
[R] Controlled experiment: giving an LLM agent access to CS papers during automated hyperparameter search improves results by 3.2%
Reddit r/MachineLearning / 3/28/2026
Key Points
- A controlled experiment compared two identical Karpathy autoresearch runs where a Claude Code agent optimized a ~7M GPT-2 on TinyStories, with the only difference being access to a paper-search MCP server over 2M+ CS papers.
- The paper-augmented agent achieved a higher best validation improvement (4.05% vs 3.67%), with an overall ~3.2% performance gap that was still widening at the 2-hour checkpoint.
- The agent with literature access tried substantially more techniques: 25 were paper-sourced, while the baseline agent stuck to standard-playbook techniques. It also correctly applied the "sqrt batch scaling" rule retrieved from the literature.
- A key mechanism was learning-rate adjustment after batch-size changes: the papers-enabled agent retrieved the correct scaling guidance and avoided divergence, while the no-papers agent diverged when it tried to halve batch size without changing the learning rate.
- Not all literature techniques helped: several proposed methods were incompatible with the model architecture and were reverted, indicating the value lies in selective retrieval and correct adaptation rather than blindly applying papers.