EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents
arXiv cs.CL / 4/8/2026
Key Points
- The paper introduces EpiBench, an episodic, multi-turn, multimodal benchmark for evaluating research agents on proactive literature search and sustained use of evidence across turns.
- Tasks require agents to navigate across multiple papers, extract and align evidence from figures and tables, and then use accumulated memory to answer objective questions involving cross-paper comparisons and multi-figure integration.
- The authors propose a process-level evaluation framework aimed at fine-grained testing and diagnosis of how research agents perform throughout the workflow (not just final answers).
- Experimental results show that even leading models reach only 29.23% accuracy on the hard split, pointing to significant gaps in current capabilities for multi-step, multi-evidence scientific reasoning.
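
To make the contrast between final-answer scoring and process-level scoring concrete, here is a minimal Python sketch. It is not EpiBench's actual evaluation code or data schema, which this summary does not describe. The `Turn` record, the stage names ("search", "extract", "answer"), and the exact-match rule are all assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    """One step of a hypothetical episode (assumed schema, not EpiBench's)."""
    stage: str      # assumed stage label, e.g. "search", "extract", "answer"
    expected: str   # reference output for this step
    predicted: str  # what the agent produced at this step


def _match(expected: str, predicted: str) -> bool:
    # Assumed scoring rule: case-insensitive exact match after trimming.
    return expected.strip().lower() == predicted.strip().lower()


def final_accuracy(episodes: List[List[Turn]]) -> float:
    """Fraction of non-empty episodes whose last turn matches the reference."""
    scored = [ep for ep in episodes if ep]
    if not scored:
        return 0.0
    hits = sum(_match(ep[-1].expected, ep[-1].predicted) for ep in scored)
    return hits / len(scored)


def stage_accuracy(episodes: List[List[Turn]]) -> Dict[str, float]:
    """Per-stage accuracy over every turn, showing where episodes go wrong."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for ep in episodes:
        for turn in ep:
            totals[turn.stage] += 1
            hits[turn.stage] += int(_match(turn.expected, turn.predicted))
    return {stage: hits[stage] / totals[stage] for stage in totals}
```

Under these assumptions, `final_accuracy` reports only whether the last answer was right, while `stage_accuracy` separates failures in retrieval from failures in evidence extraction or final reasoning. That separation is the kind of diagnosis a process-level framework is meant to support.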