HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark

arXiv cs.LG / 4/16/2026


Key Points

  • The paper argues that agent-safety evaluation has overemphasized externally induced attacks, but that agents can still fail under benign conditions via latent, intrinsic long-horizon risks.
  • It introduces HINTBench, a benchmark containing 629 annotated agent trajectories (523 risky, 106 safe) with ~33 steps per trajectory, designed for non-attack intrinsic risk auditing.
  • HINTBench supports three evaluation tasks: risk detection at the trajectory level, risk-step localization, and intrinsic failure-type identification, with labels organized using a five-constraint taxonomy.
  • Experimental results reveal a large capability gap in current LLM-based agents: strong models do well on trajectory-level risk detection but score below 35 Strict-F1 when localizing the specific risky step.
  • The study finds that existing guard models transfer poorly to this intrinsic (non-attack) risk setting, positioning intrinsic risk auditing as an open challenge for agent safety research.

Abstract

Existing agent-safety evaluation has focused mainly on externally induced risks. Yet agents may still enter unsafe trajectories under benign conditions. We study this complementary but underexplored setting through the lens of *intrinsic* risk, where intrinsic failures remain latent, propagate across long-horizon execution, and eventually lead to high-consequence outcomes. To evaluate this setting, we introduce *non-attack intrinsic risk auditing* and present **HINTBench**, a benchmark of 629 agent trajectories (523 risky, 106 safe; 33 steps on average) supporting three tasks: risk detection, risk-step localization, and intrinsic failure-type identification. Its annotations are organized under a unified five-constraint taxonomy. Experiments reveal a substantial capability gap: strong LLMs perform well on trajectory-level risk detection, but their performance drops to below 35 Strict-F1 on risk-step localization, while fine-grained failure diagnosis proves even harder. Existing guard models transfer poorly to this setting. These findings establish intrinsic risk auditing as an open challenge for agent safety.
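
To make the localization metric concrete, here is a minimal sketch of one plausible reading of "Strict-F1" for risk-step localization: a predicted step counts as a true positive only if it exactly matches an annotated risky step index. The function name, input format, and exact matching rule are assumptions for illustration, not the paper's definition.

```python
def strict_f1(predicted_steps, gold_steps):
    """F1 over exact step-index matches within one trajectory.

    A prediction counts only on an exact index match (the "strict"
    reading); partial or near-miss localization earns no credit.
    Interface is hypothetical -- the paper's exact scoring protocol
    may differ (e.g., micro-averaging across trajectories).
    """
    pred, gold = set(predicted_steps), set(gold_steps)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)          # exactly-matched risky steps
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: auditor flags steps 12 and 27; annotators marked 12 and 30.
# One exact match out of two on each side: P = R = 0.5, so F1 = 0.5.
score = strict_f1([12, 27], [12, 30])
```

Under this strict-match reading, a model that reliably senses *that* a 33-step trajectory is risky can still score near zero if it points at the wrong step, which is consistent with the detection-vs-localization gap the paper reports.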