AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

arXiv cs.AI / 2026/3/24


Key Points

  • The paper introduces AgentHER, which adapts Hindsight Experience Replay (HER) to natural-language LLM agent trajectories by relabeling failed runs as successful demonstrations for alternative achievable goals.
  • AgentHER uses a four-stage pipeline—failure classification, outcome extraction, LLM-guided prompt relabeling with confidence gating, and data packaging—producing offline training data for SFT, DPO, and ShareGPT.
  • Experiments on WebArena and ToolBench show AgentHER improves over success-only training by +7.1 to +11.7 percentage points across multiple model families, while achieving about 2x data efficiency (matching performance with roughly half the successful demonstrations).
  • The method scales consistently across model sizes (about 1.5B to 72B parameters) and further improves under iterative redeployment, indicating it can compound gains across training rounds.
  • Human evaluation reports high relabeling precision (97.7%) using multi-judge verification, supporting the quality of recovered training signal from discarded failures.
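The four-stage pipeline can be sketched as a single relabeling function. This is an illustrative reconstruction, not the paper's implementation: the `Trajectory` class, the `extract_outcome` and `judge` callables, and the 0.9 threshold are all assumptions standing in for the paper's failure classifier, outcome extractor, and LLM/rule-based judge.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Trajectory:
    goal: str           # original instruction given to the agent
    steps: list[str]    # action/observation transcript
    success: bool       # did the run satisfy its original goal?

def relabel(traj: Trajectory,
            extract_outcome: Callable[[Trajectory], Optional[str]],
            judge: Callable[[str, Trajectory], float],
            threshold: float = 0.9) -> Optional[dict]:
    """Hindsight relabeling sketch: a run that failed goal A may be a
    valid demonstration for the alternative goal B it actually achieved."""
    if traj.success:                    # successes are kept as-is
        return {"instruction": traj.goal, "steps": traj.steps}
    achieved = extract_outcome(traj)    # stage 2: what did the run achieve?
    if achieved is None:                # stage 1 filter: unrecoverable failure
        return None
    confidence = judge(achieved, traj)  # stage 3: judge scores the relabel
    if confidence < threshold:          # confidence gating drops weak relabels
        return None
    # stage 4: package under the relabeled instruction
    return {"instruction": achieved, "steps": traj.steps}
```

With a rule-based `judge` this corresponds to the paper's zero-cost variant; swapping in an LLM call gives the LLM-judge variant.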

Abstract

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience. We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation. The key insight is simple: a trajectory that fails goal A is often a correct demonstration for some achievable alternative goal B. AgentHER realises this idea through a four-stage pipeline -- failure classification, outcome extraction, LLM-guided prompt relabeling with confidence gating, and data packaging -- that converts discarded failures into high-quality SFT, DPO, and ShareGPT training data, with both zero-cost rule-based and LLM-judge implementations. On WebArena (Zhou et al., 2024) and ToolBench (Qin et al., 2024), AgentHER improves over success-only SFT by +7.1-11.7 pp across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), while achieving 2x data efficiency -- matching baseline performance with only 50% of successful demonstrations. Gains are consistent from 1.5B to 72B parameters (+5.8-9.2 pp) and compound under iterative redeployment (+2.1 pp over additional rounds). Human evaluation confirms 97.7% relabeling precision under multi-judge verification.
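The final packaging stage emits the relabeled trajectories in standard training formats. A minimal sketch, assuming the common ShareGPT conversation schema and the usual prompt/chosen/rejected layout for DPO; the paper's exact field names may differ, and pairing the relabeled trajectory against the original failed attempt for DPO is an assumption, not a detail confirmed by the abstract.

```python
def to_sharegpt(instruction: str, steps: list[str]) -> dict:
    """Package a relabeled trajectory as a ShareGPT-style conversation.
    Field names follow the common ShareGPT convention (assumed)."""
    conv = [{"from": "human", "value": instruction}]
    conv += [{"from": "gpt", "value": s} for s in steps]
    return {"conversations": conv}

def to_dpo(instruction: str, chosen: list[str], rejected: list[str]) -> dict:
    """DPO preference pair (assumed pairing): the trajectory under its
    relabeled, achieved goal as 'chosen' vs. the failed original as
    'rejected'."""
    return {"prompt": instruction,
            "chosen": "\n".join(chosen),
            "rejected": "\n".join(rejected)}
```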