Sanity check on Milla Jovovich's MemPalace: Mixed metrics, bypassed judges, and that 96.6% LongMemEval score

Reddit r/LocalLLaMA / 4/11/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The author questions the validity of MemPalace’s headline “96.6% LongMemEval” figure, arguing it is a retrieval Recall@5 metric rather than the end-to-end QA accuracy used by other systems.
  • MemPalace’s benchmark code is described as bypassing LongMemEval’s GPT-4o judge and skipping answer-generation entirely, with scoring based only on recall (any/all) at rank k and NDCG.
  • The comparison table in the README is criticized for mixing fundamentally different evaluation types—MemPalace’s retrieval-only Recall@5 versus competitors’ LLM-judged QA accuracy.
  • Prior reported competitor scores are said to vary by which LLM is used for answer generation/judging, supporting the claim that those numbers reflect end-to-end QA evaluation.
  • A “sanity check” is requested from others familiar with LongMemEval, and the author notes a MemPalace retraction (April 7) on other issues, though the metric mismatch is allegedly not addressed there.

Disclosure up front: I work on a different open-source memory system (bitterbot-desktop, ~125 stars vs MemPalace's ~40k, so calibrate accordingly). We're trying to solve the same problem from different angles, and I went and read MemPalace's benchmark code specifically because their headline number is so much higher than the rest of the field, and I wanted to understand the gap.

What I found left me genuinely uncertain about how to read it, and I'd like a sanity check from people who know LongMemEval better than I do.

Here's where I get stuck:

  1. The comparison table is mixing two different metrics

    The README claims: MemPal raw 96.6% > Mastra 94.87% > Hindsight 91.4%.

    If you open benchmarks/longmemeval_bench.py, MemPalace explicitly reimplements its own metrics to avoid the LongMemEval dependency. It skips the answer-generation step and never calls the GPT-4o judge. Here's the entire scoring function:

```python
def evaluate_retrieval(rankings, correct_ids, corpus_ids, k):
    """Evaluate retrieval at rank k."""
    top_k_ids = set(corpus_ids[idx] for idx in rankings[:k])
    recall_any = float(any(cid in top_k_ids for cid in correct_ids))
    recall_all = float(all(cid in top_k_ids for cid in correct_ids))
    ndcg_score = ndcg(rankings, correct_ids, corpus_ids, k)
    return recall_any, recall_all, ndcg_score
```

That's it. No answer generation, no LLM judge, no QA scoring. recall_any@5 is the headline number.

So:

- MemPalace's 96.6% is Recall@5: "Did the gold-evidence session appear in the top 5 retrieved sessions?"

- Mastra's 94.87% and Hindsight's 91.4% are end-to-end QA accuracy: "Did the model produce the right answer to the question, judged by an LLM?"

We know the competitors are reporting QA accuracy because their own research blogs cite scores that vary with the LLM used as the answer model. Mastra reports 84.23% with GPT-4o and 94.87% with GPT-5-mini (https://mastra.ai/research/observational-memory). Hindsight reports 91.4% with Gemini-3 Pro, 89.0% with OSS-120B, and 83.6% with OSS-20B. That variance only happens if you're actually generating answers and judging them; it's not a thing for pure retrieval scores.
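For contrast, here is roughly the shape of an end-to-end QA evaluation on LongMemEval. The `retrieve`, `generate_answer`, and `judge` callables are placeholders of my own, standing in for the memory system, the answer model (GPT-4o, GPT-5-mini, etc.), and the GPT-4o judge; this is a sketch of the evaluation structure, not anyone's actual harness:

```python
def qa_accuracy(questions, retrieve, generate_answer, judge):
    """End-to-end QA accuracy: retrieve, answer, then LLM-judge.

    `retrieve`, `generate_answer`, and `judge` are stand-ins for the
    memory system, the answer model, and the LLM judge, respectively.
    """
    correct = 0
    for q in questions:
        context = retrieve(q["question"], k=5)                # retrieval half
        hypothesis = generate_answer(q["question"], context)  # answer model
        if judge(q["question"], q["answer"], hypothesis):     # LLM judge
            correct += 1
    return correct / len(questions)
```

MemPalace's 96.6% stops at the `retrieve` line; Mastra's and Hindsight's numbers are the return value of this whole loop, which is exactly why they move when you swap the answer model.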

Putting Recall@5 next to end-to-end QA accuracy in a comparison table without an asterisk is a structural mismatch, and the README doesn't flag it.

Worth noting: MemPalace published a dated retraction note on April 7 acknowledging several other issues (the AAAK token-savings example was wrong, AAAK actually regresses retrieval, the "+34% palace boost" is just metadata filtering), but the metric mismatch in the comparison table isn't mentioned. Either nobody has raised it yet, or they don't see it as one. I'd like to know which.
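One more detail for anyone reading the script: recall_any/recall_all are self-explanatory, but the third value comes from a separate ndcg helper that isn't in the excerpt above. For reference, a standard binary-relevance NDCG@k (my own sketch, not MemPalace's implementation) looks like this:

```python
import math

def ndcg_at_k(rankings, correct_ids, corpus_ids, k):
    """Binary-relevance NDCG@k: gain 1 if a retrieved id is gold, else 0."""
    top_k = [corpus_ids[idx] for idx in rankings[:k]]
    correct = set(correct_ids)
    # Discounted cumulative gain over the actual ranking (ranks are 0-based,
    # so the discount for rank r is 1/log2(r + 2))
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, cid in enumerate(top_k) if cid in correct)
    # Ideal DCG: all gold items packed into the top ranks
    ideal_hits = min(len(correct), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Nothing controversial here; I include it only so it's clear that all three reported values are rank-based retrieval metrics, none of them answer-quality metrics.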

  2. The deeper issue: retrieval may not be the bottleneck anymore

Mastra's research blog explicitly notes that their QA accuracy outperforms the oracle (a configuration given only the gold-evidence conversations, no retrieval needed at all). That's a meaningful claim: it implies that for top-tier systems on LongMemEval, the bottleneck is no longer retrieval. It's reading, reasoning, temporal inference, and abstention.

The structural implication: MemPalace is reporting on a part of the benchmark that's no longer the field's bottleneck, then comparing that number against systems being measured on the part that is. We don't know what MemPalace would score under the QA judge (they haven't run it), but the comparison table reads as if the numbers are commensurable when they aren't. They're measuring different halves of the problem.

Where credit is due

I went in hoping to validate MemPalace's actual core finding: that raw verbatim text + ChromaDB default embeddings beats extraction-based memory systems like Mem0, Mastra, and Supermemory at the retrieval step. MemPalace just keeps everything verbatim and lets cosine search find it. If that result holds up, and the 96.6% R@5 has been independently reproduced on an M2 Ultra (https://github.com/milla-jovovich/mempalace/issues/39), then the entire "use an LLM to manage memory" paradigm may be over-engineered. That's a real negative result against a lot of work in the space, including, candidly, parts of my own. It deserves more attention than the leaderboard ranking does, regardless of how the headline is framed.

The engineering is real, and public self-correction (like the AAAK retraction) is rare and good. I just want to make sure we're actually comparing apples to apples before the field updates its priors based on a mixed-metric leaderboard.

What I'm doing about it

I'm working on a retrieval-only runner so I can post a true 1:1 R@5 number against my own system. First attempt is hitting embedding timeouts, so it'll be a few days, but I'll come back with results whichever way they land.
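In case anyone wants to beat me to it, the runner is conceptually just this. `rank_sessions` stands in for whatever system you're testing, and the field names are illustrative, not LongMemEval's exact schema:

```python
def recall_at_k(rank_sessions, dataset, k=5):
    """Retrieval-only Recall@k, "any" variant, matching MemPalace's
    recall_any: score 1 if any gold-evidence session appears in the
    top k retrieved session ids, else 0; report the mean.

    rank_sessions(question, k) -> list of session ids is a placeholder
    for the system under test; "gold_session_ids" is an illustrative
    field name for each question's gold-evidence sessions.
    """
    hits = 0
    for item in dataset:
        top_k = set(rank_sessions(item["question"], k))
        if any(sid in top_k for sid in item["gold_session_ids"]):
            hits += 1
    return hits / len(dataset)
```

Running this against both systems on the same longmemeval_s split is the apples-to-apples number the README currently doesn't give us.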

The actual question

Specifically: am I right that evaluate_retrieval in benchmarks/longmemeval_bench.py never calls an LLM and never compares hypothesized answers to gold answers? And am I right that Mastra and Hindsight are reporting QA accuracy on the same longmemeval_s split, which is a different metric?

If anyone has read the script and the linked competitor blogs and disagrees with that reading, I want to be told.

submitted by /u/DepthOk4115