Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation

arXiv cs.CL / 5/4/2026


Key Points

  • The paper proposes a memory-augmented framework for LLM-based agents to learn target classification functions from labeled examples without updating model parameters.
  • It uses semantic memory to convert LLM-generated, label-grounded critiques into reusable task-level guidance, and episodic memory to store instance-level critiques tied to past experiences.
  • Experiments across multiple tasks and models show the best self-critique strategy improves accuracy by 8.1 percentage points over a zero-shot baseline and by 4.6 points over a label-only RAG baseline.
  • The authors introduce a new metric, “suggestibility,” to explain why performance gains differ significantly by model and domain, identifying when memory augmentation helps or fails.
  • They also find that pre-computing critiques reduces inference-time reasoning costs, cutting "thinking" tokens by an average of 31.95% compared with letting the model reason from scratch on its own.

Abstract

We investigate how agents built on pretrained large language models (LLMs) can learn target classification functions from labeled examples without parameter updates. While conventional approaches such as fine-tuning are often costly, inflexible, and opaque, we propose a memory-augmented framework that leverages LLM-generated critiques grounded in labeled data. Our framework uses episodic memory to store instance-level critiques that capture specific past experiences, and semantic memory to distill these into reusable, task-level guidance. Across a diverse set of tasks and models, our best-performing self-critique strategy (utilizing both memory types) yields an average improvement of 8.1 percentage points over the zero-shot baseline, and 4.6 percentage points over a RAG-based baseline that relies only on labels. However, improvements vary substantially across models and domains. To explain this variation, we introduce suggestibility, a novel metric capturing how receptive a model is to external reasoning provided in context, and we use it to illuminate when and why memory augmentation succeeds or falls short. Beyond accuracy gains, we find that pre-computed critiques substantially reduce inference-time computation for reasoning models, cutting thinking tokens by an average of 31.95% across all datasets by substituting for reasoning the model would otherwise perform independently. Our findings highlight the conditions under which memory-driven, reflective learning can serve as a lightweight, interpretable, and efficient strategy for improving LLM adaptability.
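
The two-memory loop described above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: every name here (`EpisodicMemory`, `distill_guidance`, `classify`) is a hypothetical stand-in, the retrieval step uses toy lexical overlap where the paper would presumably use embeddings, and the LLM is stubbed as a plain callable.

```python
# Minimal sketch of a memory-augmented classification loop, assuming a
# simple interface: episodic memory holds (example, label, critique)
# triples; semantic memory is distilled task-level guidance; the LLM is
# passed in as a callable so the sketch runs without any model access.
from dataclasses import dataclass, field


@dataclass
class EpisodicMemory:
    """Instance-level critiques tied to specific past labeled examples."""
    entries: list = field(default_factory=list)  # (text, label, critique)

    def add(self, text: str, label: str, critique: str) -> None:
        self.entries.append((text, label, critique))

    def retrieve(self, query: str, k: int = 2) -> list:
        # Toy lexical-overlap retrieval; a real system would rank by
        # embedding similarity instead.
        q = set(query.lower().split())
        def overlap(entry):
            return len(q & set(entry[0].lower().split()))
        return sorted(self.entries, key=overlap, reverse=True)[:k]


def distill_guidance(episodic: EpisodicMemory, summarize) -> str:
    """Semantic memory: fold instance-level critiques into reusable,
    task-level guidance via a (here stubbed) summarization call."""
    return summarize([critique for _, _, critique in episodic.entries])


def classify(query: str, episodic: EpisodicMemory, guidance: str, llm) -> str:
    """Assemble a prompt from both memories, then ask the LLM."""
    lines = [f"Task guidance: {guidance}"]
    for text, label, critique in episodic.retrieve(query):
        lines.append(f"Past case: {text} -> {label} ({critique})")
    lines.append(f"Classify: {query}")
    return llm("\n".join(lines))
```

Because the critiques are computed once (at memory-building time) and merely retrieved at inference, this structure also illustrates how pre-computed critiques can substitute for reasoning the model would otherwise redo on every query.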