Built a benchmark that tests something none of the existing memory benchmarks test: can an AI agent surface relevant past context when the user doesn't ask about it?
Most agent memory systems work like this: user asks something → agent searches memory → retrieves results → answers. This works great when the user asks "what was the database decision?" But what about:
- User: "Set up the database for the new service" → agent should recall you decided on PostgreSQL last month
- User: "My transcript was denied, no record under my name" → agent should recall you changed your name
- User: "What time should I set my alarm for my 8:30 meeting?" → agent should recall your 45-min commute
None of these have keywords that would match in search. MemAware tests 900 of these questions at 3 difficulty levels.
Results with local BM25 + vector search:
- Easy (keyword overlap): 6.0% accuracy
- Medium (same domain): 3.7%
- Hard (cross-domain): 0.7% — literally the same as no memory at all
The hard tier is essentially unsolved by search. "Ford Mustang needs air filter, where can I use my loyalty discounts?" → should recall the user shops at Target. There's no search query that connects car maintenance to grocery store loyalty programs.
The dataset + harness is open source (MIT). You can plug in your own memory system and test: https://github.com/kevin-hs-sohn/memaware
Interested in what approaches people are trying. Seems like you need some kind of pre-loaded overview of the user's full history rather than per-query retrieval.
[link] [comments]




