Sharing an open-source benchmark suite (paper-lantern-challenges).

Setup. Same coding agent (Claude Opus 4.6 as the planner, Gemini Flash 3 as the task model), same input data, same evaluation scripts across all 9 tasks: test generation (mutation score), text-to-SQL (execution accuracy), PDF extraction, contract extraction, PR review, text classification, few-shot prompt selection, LLM routing, and summarization evaluation. Independent variable: whether the agent could call a retrieval tool over CS literature before writing its solution. One pass per task, no retries, no manual filtering of outputs.

Task selection. Tasks were chosen to span the everyday-engineering surface a coding agent actually faces, not specialized ML scenarios. Selection criteria: (1) an unambiguous quantitative metric, (2) baseline performance well below ceiling, (3) standard datasets where they exist, (4) an eval reproducible on a free Gemini API key in roughly 10 minutes per task.

Eval methodology. Each task uses its task-standard quantitative metric (mutation score for test_generation, execution accuracy for text_to_sql, F1 on labeled spans for the extraction tasks, weighted F1 for classification, etc.). Full per-task scripts and dataset choices are in the repo, one directory per task.

Retrieval setup. The "with retrieval" agent has access to three tool calls (explore_approaches, deep_dive, compare_approaches), with caching to avoid repeated retrieval latency during evaluations.

Comparability. Both agents share the same task-specific user prompt; the only system-prompt difference is the retrieval agent's tool-call grammar. Predictions and per-task prompts are diffable in the repo.

Results.
The test-generation delta came from the agent discovering mutation-aware prompting (MuTAP and MUTGEN), which enumerates every AST-level mutation of the target and requires one test per mutation; the baseline wrote generic tests from pretraining priors. The contract-extraction delta came from BEAVER (section-level relevance scoring) and PAVE (post-extraction validation), both 2026 techniques that post-date the agent's training. 10 of the 15 most-cited sources across the experiments were published in 2025 or later, which is the conservative argument for why retrieval matters: the agent could not have reached these techniques from parametric memory.

Failure modes. Self-refinement hurt text-to-SQL (the agent second-guessed correct queries after reading work on SQL ambiguity). Two suggested techniques (DyT, SeeDNorm) were architecture-incompatible in the autoresearch experiment and were discarded. Retrieval surfaces better options, not guaranteed wins.

Reproducibility. Every prompt, every line of agent code, every prediction file, and every eval script is in the repo. Each task directory has a README documenting methodology and an eval script.

Repo: https://github.com/paperlantern-ai/paper-lantern-challenges
Writeup with detailed per-task discussion: https://www.paperlantern.ai/blog/coding-agent-benchmarks

Happy to share additional design choices in the comments.
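The mutation-enumeration idea behind the test-generation delta can be sketched in a few lines. This is a minimal illustration, not MuTAP or MUTGEN themselves (those cover many more mutation operators): it flips each comparison operator in the target's AST, yielding one mutant per site, and each mutant then gets its own "write a test that kills this" prompt.

```python
import ast

# Toy operator-flip table; real mutation testing uses many more operators.
_FLIP = {ast.Lt: ast.GtE, ast.GtE: ast.Lt, ast.Eq: ast.NotEq, ast.NotEq: ast.Eq}

def enumerate_mutants(source: str) -> list[str]:
    """Return one mutated copy of `source` per flippable comparison operator."""
    mutants = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Compare):
            for j, op in enumerate(node.ops):
                if type(op) in _FLIP:
                    original = node.ops[j]
                    node.ops[j] = _FLIP[type(op)]()   # apply the mutation
                    mutants.append(ast.unparse(tree))
                    node.ops[j] = original            # restore before next site
    return mutants

# One test per mutant: each prompt asks the task model to kill one mutation.
src = "def is_adult(age):\n    return age >= 18\n"
for m in enumerate_mutants(src):
    prompt = f"Write a test that passes on the original but fails on:\n{m}"
```

For `is_adult` this produces a single mutant (`age < 18`); the mutation score is then the fraction of such mutants the generated test suite kills.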
Open-source 9-task benchmark for coding-agent retrieval augmentation. Per-task deltas +0.010 to +0.320, all evals reproducible [P]
Reddit r/MachineLearning / 4/25/2026
Key Points
- The article introduces an open-source benchmark suite (“paper-lantern-challenges”) that evaluates coding agents on 9 common software tasks, comparing performance with and without retrieval-augmented technique selection.
- It reports meaningful per-task improvements (deltas ranging from +0.010 to +0.320) while keeping the evaluation conditions consistent across tasks.
- The benchmark is designed for full reproducibility: prompts, agent code paths, and prediction outputs are included in the repository, along with per-task evaluation scripts and READMEs.
- The setup uses a fixed agent/planner (Claude Opus 4.6) and task model (Gemini Flash 3) and tests specific components such as test generation, text-to-SQL, document/contract extraction, PR review, classification, prompt selection, routing, and summarization.
- Retrieval is implemented via three CS-literature-focused tool calls (explore_approaches, deep_dive, compare_approaches), with caching to reduce repeated latency during evaluations.
- The authors selected tasks based on having unambiguous metrics, non-trivial baseline difficulty, standard datasets when available, and the ability to run evaluations quickly (around 10 minutes per task) using a free Gemini API key.
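The three retrieval tools named above can be pictured with a small sketch. The tool names (`explore_approaches`, `deep_dive`, `compare_approaches`) come from the post; the literature index here is a toy stand-in, and the caching layer is just an assumption of how repeated-query latency might be avoided.

```python
import functools

@functools.lru_cache(maxsize=None)   # cache repeated queries across eval runs
def _search_literature(query: str) -> tuple[str, ...]:
    # Stand-in for the real CS-literature index; returns technique summaries.
    corpus = {
        "test generation": ("MuTAP: mutation-aware test prompting",),
        "contract extraction": ("BEAVER: section-level relevance scoring",
                                "PAVE: post-extraction validation"),
    }
    return corpus.get(query, ())

def explore_approaches(task: str) -> list[str]:
    """Broad survey: list candidate techniques for a task."""
    return list(_search_literature(task))

def deep_dive(technique: str) -> str:
    """Detailed summary of a single technique (stubbed here)."""
    return f"summary of {technique}"

def compare_approaches(a: str, b: str) -> str:
    """Side-by-side comparison of two candidates (stubbed here)."""
    return f"{a} vs {b}: tradeoffs"
```

Under this shape, the agent's typical flow is `explore_approaches` on the task, `deep_dive` on promising hits, and `compare_approaches` before committing to one technique in its solution.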