WMB-100K – open source benchmark for AI memory systems at 100K turns

Reddit r/LocalLLaMA / 3/23/2026



Key Points

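AI memory systems have tended to be evaluated only at small scales (a few hundred to roughly 1,000 turns); WMB-100K instead presents a benchmark at 100,000 turns.
The benchmark includes 3,134 questions across 5 difficulty levels, plus probes that detect "false memory" (misplaced confidence), putting the focus on how severe an error is.
The dataset is openly available and a run reportedly costs about $0.07, so the benchmark is designed to be easy to verify and compare against.
The goal is to encourage performance comparisons across AI memory systems; the GitHub link is shared in the comments.
By treating "I don't know" as acceptable while counting confidently returned misinformation as a failure, the evaluation adopts a testing perspective closer to real-world use.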


Been thinking about how AI memory systems are only ever tested at tiny scales — LOCOMO does 600 turns, LongMemEval does around 1,000. But real usage doesn't look like that.

WMB-100K tests 100,000 turns, with 3,134 questions across 5 difficulty levels. Also includes false memory probes — because "I don't know" is fine, but confidently giving wrong info is a real problem.
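To make the false memory idea concrete, here is a minimal sketch of how a grader could separate "I don't know" from a confident wrong answer. This is not WMB-100K's actual scoring code; the `Probe` type, the `grade` and `summarize` helpers, the outcome labels, and the abstention phrases are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical abstention phrases; a real harness would likely need a more robust check.
ABSTAIN_MARKERS = ("i don't know", "i do not know", "not sure")


@dataclass
class Probe:
    question: str
    # None marks a false memory probe: the fact was never stated in the conversation.
    expected: Optional[str]


def grade(probe: Probe, answer: str) -> str:
    """Return one of: 'correct', 'abstained', 'wrong', 'false_memory'."""
    text = answer.strip().lower()
    abstained = any(marker in text for marker in ABSTAIN_MARKERS)
    if probe.expected is None:
        # Nothing to recall, so any confident answer is a fabricated memory.
        return "abstained" if abstained else "false_memory"
    if abstained:
        return "abstained"
    # Simple substring match against the expected fact (an assumption, not the real metric).
    return "correct" if probe.expected.lower() in text else "wrong"


def summarize(outcomes: list[str]) -> dict[str, float]:
    """Fraction of answers per outcome label."""
    labels = ("correct", "abstained", "wrong", "false_memory")
    total = len(outcomes)
    return {label: outcomes.count(label) / total for label in labels}


if __name__ == "__main__":
    never_told = Probe("What is my sister's name?", None)
    print(grade(never_told, "Her name is Anna."))  # false_memory
    print(grade(never_told, "I don't know, you never told me."))  # abstained
```

Keeping `false_memory` as its own label, rather than folding it into `wrong`, is what lets a report show how often a system invents facts versus simply missing them.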

Dataset's included, costs about $0.07 to run.

Curious to see how different systems perform. GitHub link in the comments.

submitted by /u/Efficient_Joke3384


