Plausible but Wrong: A Case Study on Agentic Failures in Astrophysical Workflows

arXiv cs.AI / 4/29/2026


Key Points

  • The study evaluates CMBAgent in two agentic workflow paradigms (One-Shot and Deep Research) across 18 astrophysical tasks, showing it performs well on well-specified problems.
  • In the One-Shot setting, providing domain-specific context yields roughly a 6x performance improvement (a score of 0.85 vs. ~0 without context).
  • The dominant and most concerning failure mode is silent incorrect computation: the agent generates syntactically valid code whose outputs look plausible but are physically incorrect.
  • In the Deep Research setting, the system often fails silently under stress tests, producing physically inconsistent posteriors without self-diagnosis, and performance drops on reasoning-limit probes.
  • The authors release an evaluation framework to enable systematic reliability testing of scientific AI agents.
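The "silent incorrect computation" failure mode described above can be made concrete with a reference-based check. The sketch below is purely illustrative and assumes a hypothetical harness (the names `AgentResult` and `validate_against_reference` are not from the paper's released framework): an agent's numeric output is compared against a trusted ground-truth value, distinguishing overt failures (e.g., NaNs) from results that run cleanly and look plausible yet disagree with known physics beyond tolerance.

```python
# Minimal sketch (hypothetical names) of a reference-based reliability check
# for "silent incorrect computation": output that parses, runs, and looks
# plausible, but is numerically wrong against a trusted reference.
import math
from dataclasses import dataclass


@dataclass
class AgentResult:
    task: str
    value: float  # scalar summary of the agent's computed output


def validate_against_reference(result: AgentResult,
                               reference: float,
                               rel_tol: float = 0.01) -> str:
    """Classify an agent's numeric output relative to a ground-truth value."""
    if not math.isfinite(result.value):
        return "overt_failure"      # NaN/inf: easy to catch automatically
    if math.isclose(result.value, reference, rel_tol=rel_tol):
        return "pass"
    return "silent_incorrect"       # plausible-looking, but wrong physics


# Example: the agent reports a Hubble-like parameter of 74.0 on a task
# whose ground truth (by construction of the test) is 67.4.
print(validate_against_reference(AgentResult("H0_fit", 74.0), 67.4))
# -> silent_incorrect
```

The point of such a check is that silent failures, unlike crashes or NaNs, can only be caught by comparing against independently known answers, which is why well-specified tasks with analytic references are central to reliability testing.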

Abstract

Agentic AI systems are increasingly being integrated into scientific workflows, yet their behavior under realistic conditions remains insufficiently understood. We evaluate CMBAgent across two workflow paradigms and eighteen astrophysical tasks. In the One-Shot setting, access to domain-specific context yields an approximately 6x performance improvement (0.85 vs. ~0 without context), with the primary failure mode being silent incorrect computation: syntactically valid code that produces plausible but inaccurate results. In the Deep Research setting, the system frequently exhibits silent failures across stress tests, producing physically inconsistent posteriors without self-diagnosis. Overall, performance is strong on well-specified tasks but degrades on problems designed to probe reasoning limits, often without visible error signals. These findings highlight that the most concerning failure mode in agentic scientific workflows is not overt failure, but the confident generation of incorrect results. We release our evaluation framework to facilitate systematic reliability analysis of scientific AI agents.