GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification

arXiv cs.AI / 4/1/2026


Key Points

  • GISTBench is introduced as a benchmark to measure how well LLMs can infer and verify user interests from interaction histories in recommendation systems, moving beyond pure item-prediction metrics.
  • The paper proposes two metric families—Interest Groundedness (precision/recall to penalize hallucinated categories and reward coverage) and Interest Specificity (to evaluate how distinct the verified user profiles are).
  • A synthetic dataset is released, built from real engagement traces on a global short-form video platform and including both implicit and explicit signals as well as rich textual descriptions.
  • The authors validate dataset fidelity via user surveys and test eight open-weight LLMs (7B–120B), finding notable bottlenecks in accurately counting and attributing engagement signals across diverse interaction types.
  • Overall results suggest current LLMs still struggle with evidence-based verification of user interests, especially when engagement signals vary in type and structure.

Abstract

We introduce GISTBench, a benchmark for evaluating the ability of Large Language Models (LLMs) to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed from real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset's fidelity against user surveys, and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
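To make the Interest Groundedness (IG) decomposition concrete, here is a minimal sketch assuming IG reduces to set-based precision and recall over interest categories; the paper's exact formulation (e.g. weighted or soft matching) may differ, and the function name and example categories are illustrative only.

```python
def interest_groundedness(predicted: set[str], reference: set[str]) -> dict[str, float]:
    """Illustrative set-overlap version of IG precision/recall.

    precision penalizes hallucinated categories (predicted but not grounded);
    recall rewards coverage of the reference interest categories.
    This is an assumption about the metric, not the paper's exact definition.
    """
    if not predicted or not reference:
        return {"precision": 0.0, "recall": 0.0}
    overlap = predicted & reference
    return {
        "precision": len(overlap) / len(predicted),  # hallucination penalty
        "recall": len(overlap) / len(reference),     # coverage reward
    }

# Hypothetical example: one hallucinated category ("crypto") lowers
# precision, and one missed category ("cooking") lowers recall.
scores = interest_groundedness(
    predicted={"gaming", "fitness", "crypto"},
    reference={"gaming", "fitness", "cooking"},
)
```

Separating the two components this way mirrors the paper's stated motivation: a model that lists many plausible-sounding interests can score high coverage while hallucinating, so precision and recall must be reported independently rather than collapsed into a single score.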