M$^\star$: Every Task Deserves Its Own Memory Harness

arXiv cs.AI / 4/15/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes M$^\star$, an approach for LLM agents that automatically finds task-specific memory “harnesses” rather than using a fixed, one-size-fits-all memory architecture.
  • M$^\star$ represents an agent’s memory system as an executable Python memory program that bundles a data schema, storage logic, and workflow instructions, and then optimizes these components jointly.
  • It uses reflective code evolution with population-based search and feedback from evaluation failures to iteratively refine candidate memory programs.
  • Experiments across four benchmarks covering conversation, embodied planning, and expert reasoning show consistent performance gains over fixed-memory baselines.
  • The evolved memory programs develop structurally distinct processing mechanisms per domain, suggesting task specialization opens a broader design space than general-purpose memory paradigms.
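To make the "memory program" idea concrete, here is a minimal sketch of what a Python object bundling a data schema, storage logic, and workflow instructions might look like. All class and method names below are illustrative assumptions, not the authors' actual API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a "memory program": executable Python that
# bundles a data Schema, storage Logic, and workflow Instructions.
# Names are invented for illustration; M* evolves such programs rather
# than hand-writing them.

@dataclass
class MemoryEntry:
    """Schema: the shape of one stored memory."""
    key: str
    content: str
    tags: list = field(default_factory=list)

class MemoryProgram:
    # Instructions: a prompt fragment telling the agent how to use memory.
    INSTRUCTIONS = "Before acting, retrieve entries whose tags match the task."

    def __init__(self):
        self.store: dict[str, MemoryEntry] = {}

    # Logic: how entries are written and retrieved.
    def write(self, key: str, content: str, tags=None):
        self.store[key] = MemoryEntry(key, content, list(tags or []))

    def retrieve(self, tag: str):
        return [e for e in self.store.values() if tag in e.tags]

mem = MemoryProgram()
mem.write("u1", "User prefers concise answers", tags=["style"])
print([e.content for e in mem.retrieve("style")])
```

Because all three components live in one executable artifact, an evolutionary search can mutate the schema, the retrieval logic, and the instructions together rather than tuning each in isolation.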

Abstract

Large language model agents rely on specialized memory systems to accumulate and reuse knowledge during extended interactions. Recent architectures typically adopt a fixed memory design tailored to a specific domain, such as semantic retrieval for conversations or skill reuse for coding. However, a memory system optimized for one purpose frequently fails to transfer to others. To address this limitation, we introduce M$^\star$, a method that automatically discovers task-optimized memory harnesses through executable program evolution. Specifically, M$^\star$ models an agent's memory system as a memory program written in Python. This program encapsulates the data Schema, the storage Logic, and the agent workflow Instructions. We optimize these components jointly using a reflective code evolution method, which employs a population-based search strategy and analyzes evaluation failures to iteratively refine candidate programs. We evaluate M$^\star$ on four distinct benchmarks spanning conversation, embodied planning, and expert reasoning. Our results demonstrate that M$^\star$ robustly improves performance over existing fixed-memory baselines across all evaluated tasks. Furthermore, the evolved memory programs exhibit structurally distinct processing mechanisms for each domain. This finding indicates that specializing the memory mechanism for a given task opens a broader design space and yields better solutions than general-purpose memory paradigms.
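The reflective evolution loop the abstract describes, population-based search refined by failure feedback, can be sketched in a few lines. This is only an illustration of the loop's shape under toy assumptions (programs stand in as integers, fitness is a scalar, and `mutate` consumes the failure feedback); it is not the paper's actual optimizer:

```python
# Toy sketch of population-based reflective evolution: score each
# candidate, keep the best half, and produce children by "reflecting"
# on each survivor's failure feedback. All details are assumptions.

def evolve(seed_programs, evaluate, mutate, generations=5, pop_size=4):
    population = list(seed_programs)
    for _ in range(generations):
        # evaluate() returns (score, failure_feedback) per candidate
        scored = [(p, *evaluate(p)) for p in population]
        scored.sort(key=lambda t: t[1], reverse=True)
        survivors = [p for p, _, _ in scored[: pop_size // 2]]
        # Reflection step: refine survivors using their failure feedback
        children = [mutate(p, fails) for p, _, fails in scored[: pop_size // 2]]
        population = survivors + children
    best, score, _ = max(((p, *evaluate(p)) for p in population),
                         key=lambda t: t[1])
    return best, score

# Toy stand-ins: a "program" is an integer, fitness is closeness to a
# target, and the failure feedback says which direction to move.
TARGET = 10

def evaluate(p):
    return -abs(p - TARGET), ["too low" if p < TARGET else "too high"]

def mutate(p, fails):
    return p + 1 if "too low" in fails else p - 1

best, score = evolve([0, 3, 7, 20], evaluate, mutate)
print(best, score)  # the population converges toward TARGET
```

In M$^\star$ the analogues would be far richer: candidates are full Python memory programs, evaluation runs agent benchmarks, and the mutation step is an LLM rewriting code in light of the observed failures.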