HippoCamp: Benchmarking Contextual Agents on Personal Computers

arXiv cs.AI / 4/2/2026


Key Points

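HippoCamp is a new benchmark for evaluating the capabilities of contextual agents on multimodal file management on personal computers.
It builds device-scale file systems (42.4 GB, over 2K files) from diverse real-world user profiles and uses 581 QA pairs to measure search, evidence perception, and multi-step reasoning.
It also provides 46.1K densely annotated, step-level trajectories so that failure points can be diagnosed at a fine granularity.
In the evaluation, even the most advanced commercial multimodal and agentic methods reach only 48.3% accuracy on user profiling, and they struggle particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems.
The failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks, clarifying the challenges for developing next-generation personal AI assistants.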

Abstract

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.
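As a rough illustration of what step-wise failure diagnosis over annotated trajectories could look like, the sketch below attributes each failed trajectory to the stage of its earliest incorrect step. It is not the benchmark's actual tooling or data format: the `Step` record, the stage labels, and the toy trajectories are all hypothetical and chosen only to mirror the three capabilities named in the abstract (search, evidence perception, multi-step reasoning).

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical stage labels; the benchmark's real annotation schema is not given here.
STAGES = ("search", "perception", "reasoning")


@dataclass
class Step:
    stage: str     # one of STAGES (assumed labeling)
    correct: bool  # whether the step matched its annotated reference (assumed)


def first_failure(steps):
    """Return the stage of the earliest incorrect step, or None if every step passes."""
    for step in steps:
        if not step.correct:
            return step.stage
    return None


def diagnose(trajectories):
    """Count, per stage, how many trajectories first fail at that stage."""
    counts = Counter()
    for steps in trajectories:
        stage = first_failure(steps)
        if stage is not None:
            counts[stage] += 1
    return counts


# Toy trajectories, invented for illustration only.
toy = [
    [Step("search", True), Step("perception", False), Step("reasoning", False)],
    [Step("search", True), Step("perception", True), Step("reasoning", True)],
    [Step("search", False)],
]
result = diagnose(toy)
```

Counting only the earliest failing step is one possible design choice: it avoids blaming a later reasoning step for an error that was already introduced during retrieval or perception.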
