I've been reviewing how various AI memory systems evaluate their performance and noticed a fundamental issue with cross-system comparison.
Most systems benchmark on LOCOMO (Maharana et al., ACL 2024), but the evaluation methods vary significantly. LOCOMO's official metric (token-overlap F1) puts GPT-4 with full context at 32.1% and human performance at 87.9%. Yet memory-system developers routinely report scores of 60-67% using custom evaluation criteria such as retrieval accuracy or keyword matching rather than the original F1 metric.
Since each system measures something different, the resulting scores are not directly comparable — yet they are frequently presented side by side as if they are.
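To make the divergence concrete, here's a minimal sketch of how token-overlap F1 and a looser keyword-match criterion can score the exact same answer very differently. (Assumptions: this is a simplified scorer — the real LOCOMO/SQuAD-style evaluation also normalizes punctuation and articles, which this omits; the example strings are hypothetical.)

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 (simplified SQuAD-style scoring)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def keyword_match(prediction: str, gold: str) -> float:
    """A looser custom criterion: full credit if the gold answer appears anywhere."""
    return 1.0 if gold.lower() in prediction.lower() else 0.0

gold = "paris"
prediction = "the user mentioned they moved to paris last spring"

print(token_f1(prediction, gold))       # 0.2 — extra tokens crush precision
print(keyword_match(prediction, gold))  # 1.0 — same answer, full credit
```

A verbose but correct answer scores 0.2 under F1 and 1.0 under keyword matching, so a system evaluated with the second metric can look dramatically better without answering any differently.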
Has anyone else noticed this issue? How do you approach evaluating memory systems when there is no standardized scoring methodology?