Plausible but Wrong: A Case Study on Agentic Failures in Astrophysical Workflows
arXiv cs.AI / 4/29/2026
Key Points
- The study evaluates CMBAgent in two agentic workflow paradigms (One-Shot and Deep Research) across 18 astrophysical tasks, showing it performs well on well-specified problems.
- In the One-Shot setting, supplying domain-specific context improves scores roughly six-fold (0.85 with context, versus near zero without).
- The dominant and most concerning failure mode is silent incorrect computation: the agent produces syntactically valid code and plausible-looking outputs that are physically wrong (see the consistency-check sketch after this list).
- In the Deep Research setting, the system often fails silently under stress tests, producing physically inconsistent posteriors without self-diagnosis, and its performance degrades on probes designed to test reasoning limits.
- The authors release an evaluation framework to enable systematic reliability testing of scientific AI agents (a minimal harness sketch follows below).
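
The "plausible but wrong" failure mode is hard to catch precisely because the code runs and the numbers look sane. Below is a minimal sketch of one way to expose it, using a flat ΛCDM Hubble parameter as the test quantity; this example and every name in it are illustrative, not the paper's framework. The idea is to compare an agent-written function against an analytic reference over a grid and flag any tolerance violation.

```python
"""Hedged sketch: detecting silent incorrect computation by checking an
agent-written function against known physics. Illustrative only; not the
paper's evaluation code."""
import numpy as np

H0, OMEGA_M = 70.0, 0.3  # km/s/Mpc; fiducial flat-LCDM values (assumed)


def hubble_reference(z: np.ndarray) -> np.ndarray:
    """Analytic H(z) for flat LCDM, radiation neglected."""
    return H0 * np.sqrt(OMEGA_M * (1 + z) ** 3 + (1 - OMEGA_M))


def hubble_agent(z: np.ndarray) -> np.ndarray:
    """Hypothetical agent output: runs cleanly and looks smooth and
    monotonic, but uses (1+z)**2 where the matter term needs (1+z)**3."""
    return H0 * np.sqrt(OMEGA_M * (1 + z) ** 2 + (1 - OMEGA_M))


def silently_wrong(agent_fn, ref_fn, z_max=3.0, rtol=0.01) -> bool:
    """True if the agent's curve deviates from the reference beyond rtol
    anywhere on the grid -- the check a human reviewer might skip."""
    z = np.linspace(0.0, z_max, 200)
    rel_err = np.abs(agent_fn(z) - ref_fn(z)) / ref_fn(z)
    return bool(rel_err.max() > rtol)


if __name__ == "__main__":
    # The wrong curve is off by ~47% at z = 3 yet raises no exception;
    # only the explicit physics check flags it.
    print("silent failure detected:", silently_wrong(hubble_agent, hubble_reference))
```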
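
On the released framework: the key points describe systematic reliability testing but not its API, so the harness below is an assumption-laden sketch of the general pattern (`Task`, `evaluate`, and the toy checker are all hypothetical names, not the authors' code). The essential design choice is that each task carries an executable checker that validates the physics of the final answer, so crashes and silent errors both score as failures rather than being skipped.

```python
"""Hedged sketch of a task-based reliability harness for scientific agents.
All names are hypothetical; the authors' released framework may differ."""
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    name: str
    prompt: str
    check: Callable[[str], bool]  # validates the agent's final answer


def evaluate(agent: Callable[[str], str], tasks: list[Task]) -> dict[str, float]:
    """Score an agent over all tasks; a task passes only if its checker
    accepts the output. Exceptions count as failures, never as skips."""
    scores: dict[str, float] = {}
    for task in tasks:
        try:
            scores[task.name] = float(task.check(agent(task.prompt)))
        except Exception:
            scores[task.name] = 0.0
    return scores


if __name__ == "__main__":
    # Toy task: the checker enforces the physics (H(z=1) ~ 123.3 km/s/Mpc
    # for H0=70, Om=0.3), not merely that the agent returned something.
    tasks = [
        Task(
            name="hubble_z1",
            prompt="Compute H(z=1) in flat LCDM with H0=70, Omega_m=0.3.",
            check=lambda out: abs(float(out) - 123.3) < 1.0,
        )
    ]

    def mock_agent(prompt: str) -> str:
        return "123.25"  # stands in for a real agent call

    print(evaluate(mock_agent, tasks))
```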


