MemAware benchmark shows that RAG-based agent memory fails on implicit context — search scores 2.8% vs 0.8% with no memory

Reddit r/LocalLLaMA / 3/27/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research

Key Points

  • MemAware introduces a benchmark focused on whether RAG-based agent memory can retrieve relevant past context when the user’s question contains no matching keywords (implicit context).
  • Across 900 questions at three difficulty levels, retrieval accuracy using local BM25 + vector search collapses on the hard tier, reaching 0.7%—essentially identical to using no memory.
  • The benchmark shows that search-based memory fails when cross-domain reasoning is required (e.g., recalling that a user shops at Target when asked about using loyalty discounts for a car-related need).
  • The MemAware dataset and evaluation harness are open source under MIT, enabling teams to plug in their own memory systems and measure performance on implicit-context recall.

Built a benchmark that tests something none of the existing memory benchmarks test: can an AI agent surface relevant past context when the user doesn't ask about it?

Most agent memory systems work like this: user asks something → agent searches memory → retrieves results → answers. This works great when the user asks "what was the database decision?" But what about:

  • User: "Set up the database for the new service" → agent should recall you decided on PostgreSQL last month
  • User: "My transcript was denied, no record under my name" → agent should recall you changed your name
  • User: "What time should I set my alarm for my 8:30 meeting?" → agent should recall your 45-min commute

None of these have keywords that would match in search. MemAware tests 900 of these questions at 3 difficulty levels.

Results with local BM25 + vector search:

  • Easy (keyword overlap): 6.0% accuracy
  • Medium (same domain): 3.7%
  • Hard (cross-domain): 0.7% — literally the same as no memory at all

The hard tier is essentially unsolved by search. "Ford Mustang needs air filter, where can I use my loyalty discounts?" → should recall the user shops at Target. There's no search query that connects car maintenance to grocery store loyalty programs.

The dataset + harness is open source (MIT). You can plug in your own memory system and test: https://github.com/kevin-hs-sohn/memaware

Interested in what approaches people are trying. Seems like you need some kind of pre-loaded overview of the user's full history rather than per-query retrieval.

submitted by /u/Salty-Asparagus-4751
[link] [comments]
広告