APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

arXiv cs.AI / 4/1/2026


Key Points

  • The paper introduces APEX-EM, a non-parametric online learning framework for LLM-based autonomous agents that reuses prior procedural plans via structured procedural-episodic experience replay without updating model weights.
  • APEX-EM defines a structured experience representation capturing planning steps, artifacts, iteration history with error analysis, and quality scores, and uses a PRGII workflow with task verifiers to generate multi-dimensional reward signals.
  • It also proposes a dual-outcome experience memory that performs hybrid retrieval using semantic search, structural signature matching, and plan-DAG traversal to enable transfer across tasks with little/no lexical overlap but similar operational structure.
  • Experiments on BigCodeBench, KGQAGen-10k, and Humanity’s Last Exam show large gains in accuracy and success rate (SR) from memory, including 89.6% vs. 41.3% accuracy on KGQAGen-10k and an 83.3% vs. 53.9% SR on BigCodeBench, with ablations indicating that the usefulness of feedback depends on task type.
  • The approach treats successful executions as positive in-context examples and failures as negative examples annotated with structured error information to improve iterative planning and reuse over time.
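The structured experience representation and the dual-outcome positive/negative labeling described above can be sketched as a simple record type. The field names, schema, and rendering format below are illustrative assumptions, not the paper's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Experience:
    """Hypothetical procedural-episodic trace of one task execution.

    Mirrors the components named in the paper (planning steps, artifacts,
    iteration history with error analysis, quality scores), but the exact
    schema is an assumption for illustration.
    """
    task_description: str
    plan_steps: list[str]                  # ordered planning steps
    artifacts: dict[str, str]              # e.g. {"solution.py": "<code>"}
    iterations: list[dict] = field(default_factory=list)  # per-attempt error analysis
    quality_score: float = 0.0             # verifier-derived reward signal
    succeeded: bool = False                # dual outcome: positive vs. negative example


def as_in_context_example(exp: Experience) -> str:
    """Render an experience as a prompt snippet: successes become positive
    in-context examples, failures become negative examples annotated with
    their structured error information."""
    label = "POSITIVE EXAMPLE" if exp.succeeded else "NEGATIVE EXAMPLE"
    lines = [f"[{label}] {exp.task_description}"]
    lines += [f"  step {i + 1}: {step}" for i, step in enumerate(exp.plan_steps)]
    if not exp.succeeded:
        for attempt in exp.iterations:
            lines.append(f"  error: {attempt.get('error_analysis', 'n/a')}")
    return "\n".join(lines)
```

A success would be retrieved and injected verbatim as a worked example, while a failure contributes its error annotations as something for the planner to avoid.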

Abstract

LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present APEX-EM, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a structured experience representation encoding the full procedural-episodic trace of each execution (planning steps, artifacts, iteration history with error analysis, and quality scores); (2) a Plan-Retrieve-Generate-Iterate-Ingest (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a dual-outcome Experience Memory with hybrid retrieval combining semantic search, structural signature matching, and plan-DAG traversal, enabling cross-domain transfer between tasks that share no lexical overlap but have analogous operational structure. Successful experiences serve as positive in-context examples; failures serve as negative examples with structured error annotations. We evaluate on BigCodeBench [zhuo2025bigcodebench], KGQAGen-10k [zhang2025kgqagen], and Humanity's Last Exam [phan2025hle] using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6% accuracy versus 41.3% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9%). On BigCodeBench, it reaches an 83.3% success rate (SR) from a 53.9% baseline (+29.4pp), exceeding MemRL's [memrl2025] +11.0pp gain under comparable frozen-backbone conditions (with backbone differences controlled for in our analysis). On HLE, entity-graph retrieval reaches 48.0% from 25.2% (+22.8pp). Ablations show that component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.
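The hybrid retrieval in the Experience Memory combines three signals: semantic similarity, structural signature matching, and plan-DAG overlap. The sketch below is a minimal illustration of that combination under stated assumptions: token-Jaccard similarity stands in for real embedding search, the signature and DAG encodings are made up, and the weights are arbitrary. None of this is the paper's implementation:

```python
def jaccard(a: set, b: set) -> float:
    """Overlap coefficient used as a cheap stand-in for learned similarity."""
    return len(a & b) / len(a | b) if a | b else 0.0


def semantic_score(query_text: str, exp_text: str) -> float:
    # Assumption: token overlap substitutes for embedding-based semantic search.
    return jaccard(set(query_text.lower().split()), set(exp_text.lower().split()))


def signature_score(sig_q: tuple, sig_d: tuple) -> float:
    # Hypothetical structural signature: an ordered tuple of abstract
    # operation types; exact match scores 1.0, else set overlap.
    return 1.0 if sig_q == sig_d else jaccard(set(sig_q), set(sig_d))


def _edges(dag: dict) -> set:
    return {(u, v) for u, children in dag.items() for v in children}


def dag_score(dag_q: dict, dag_d: dict) -> float:
    # Plan-DAG traversal approximated as the fraction of shared edges.
    return jaccard(_edges(dag_q), _edges(dag_d))


def hybrid_score(query: dict, experience: dict, w=(0.4, 0.3, 0.3)) -> float:
    """Weighted blend of the three retrieval signals (weights are arbitrary)."""
    return (w[0] * semantic_score(query["text"], experience["text"])
            + w[1] * signature_score(query["sig"], experience["sig"])
            + w[2] * dag_score(query["dag"], experience["dag"]))
```

The point of the blend is visible on two tasks with almost no shared vocabulary but identical operational structure (e.g. "fetch weather data and plot it" vs. "download stock prices and chart them"): the semantic term is near zero, yet the signature and DAG terms dominate, so the structurally analogous experience is still retrieved.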