ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents

arXiv cs.AI / 4/6/2026


Key Points

  • The paper introduces ESL-Bench, a new event-driven synthetic longitudinal benchmark designed for evaluating “health agents” that must reason over multi-source, time-extended patient trajectories.
  • ESL-Bench generates 100 synthetic users with 1–5 year timelines combining continuous device streams, sparse clinical exams, and episodic life events, while providing explicit ground-truth indicator impact parameters.
  • The framework models each health indicator as a baseline stochastic process perturbed by discrete events via sigmoid-onset, exponential-decay kernels, subject to physiological saturation and projection constraints (see the sketch after this list).
  • A hybrid pipeline uses LLM-based planning for sparse semantic artifacts and algorithmic simulation for dense indicator dynamics, enabling programmatically computable answers for evaluation queries.
  • Experiments with 13 methods show DB-native agents outperform memory-augmented RAG (48–58% vs. 30–38%), with the biggest gains on Comparison and Explanation tasks requiring multi-hop evidence attribution.
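
To make the indicator model concrete, the sketch below simulates one indicator in the way the paper describes: a baseline stochastic process, superposed event responses with sigmoid onset and exponential decay, and hard saturation bounds. This is a minimal illustration, not the paper's code; the AR(1)-style baseline noise, the five-day onset ramp, and all parameter names and values are assumptions.

```python
import numpy as np

def event_kernel(t, t_event, magnitude, onset_rate=0.5, decay_rate=0.02):
    """Sigmoid-onset, exponential-decay response of one event on one indicator.
    Zero before the event, ramps up over ~days, then decays back toward baseline."""
    dt = t - t_event
    onset = 1.0 / (1.0 + np.exp(-onset_rate * (dt - 5.0)))   # ~5-day ramp (assumed)
    decay = np.exp(-decay_rate * np.maximum(dt, 0.0))
    return np.where(dt >= 0.0, magnitude * onset * decay, 0.0)

def simulate_indicator(days, baseline_mu, baseline_sigma, events, lo, hi, seed=0):
    """Baseline stochastic process (mean-reverting noise, an assumption here)
    plus superposed event kernels, clamped to physiological bounds [lo, hi]."""
    rng = np.random.default_rng(seed)
    t = np.arange(days, dtype=float)
    noise = np.zeros(days)
    for i in range(1, days):                      # AR(1)-style drift around the mean
        noise[i] = 0.9 * noise[i - 1] + rng.normal(0.0, baseline_sigma)
    signal = baseline_mu + noise
    for t_event, magnitude in events:             # event log entries: (day, impact)
        signal = signal + event_kernel(t, t_event, magnitude)
    return np.clip(signal, lo, hi)                # saturation constraint

# e.g., resting heart rate over two years with two logged life events
hr = simulate_indicator(days=730, baseline_mu=62.0, baseline_sigma=0.8,
                        events=[(120.0, +8.0), (400.0, -5.0)], lo=40.0, hi=200.0)
```

Because every event's per-indicator impact parameters are recorded in the event log, the same machinery that generates the data also fixes the ground truth for attribution questions.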

Abstract

Longitudinal health agents must reason across multi-source trajectories that combine continuous device streams, sparse clinical exams, and episodic life events, yet evaluating them is hard: real-world data cannot be released at scale, and temporally grounded attribution questions seldom admit definitive answers without structured ground truth. We present ESL-Bench, an event-driven synthesis framework and benchmark providing 100 synthetic users, each with a 1–5 year trajectory comprising a health profile, a multi-phase narrative plan, daily device measurements, periodic exam records, and an event log with explicit per-indicator impact parameters. Each indicator follows a baseline stochastic process driven by discrete events with sigmoid-onset, exponential-decay kernels under saturation and projection constraints; a hybrid pipeline delegates sparse semantic artifacts to LLM-based planning and dense indicator dynamics to algorithmic simulation with hard physiological bounds. Each user is paired with 100 evaluation queries across five dimensions (Lookup, Trend, Comparison, Anomaly, Explanation), stratified into Easy, Medium, and Hard tiers, with all ground-truth answers programmatically computable from the recorded event-indicator relationships. Evaluating 13 methods spanning LLMs with tools, DB-native agents, and memory-augmented RAG, we find that DB agents (48–58%) substantially outperform memory RAG baselines (30–38%), with the gap concentrated on Comparison and Explanation queries where multi-hop reasoning and evidence attribution are required.
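
To illustrate what "programmatically computable" ground truth could look like, the sketch below answers an Explanation-style query directly from logged event-indicator impacts. The EventRecord structure, the query-window semantics, and the dominant-impact scoring rule are hypothetical; the abstract only states that answers are derivable from the recorded relationships.

```python
from dataclasses import dataclass

@dataclass
class EventRecord:
    day: int
    name: str
    impacts: dict[str, float]   # per-indicator impact magnitude, as logged

def explain_change(events: list[EventRecord], indicator: str,
                   window_start: int, window_end: int) -> str | None:
    """Return the event whose recorded impact on `indicator` dominates the
    query window; the attribution is read off the ground truth, not inferred."""
    in_window = [e for e in events
                 if window_start <= e.day <= window_end and indicator in e.impacts]
    if not in_window:
        return None
    return max(in_window, key=lambda e: abs(e.impacts[indicator])).name

events = [
    EventRecord(120, "started marathon training", {"resting_hr": -6.0}),
    EventRecord(130, "new job, high stress",      {"resting_hr": +4.0, "sleep_hours": -0.7}),
]
print(explain_change(events, "resting_hr", 100, 150))  # -> "started marathon training"
```

An evaluator built this way can score multi-hop Explanation answers exactly, which is presumably why the gap between DB-native agents and memory RAG baselines is most visible on those queries.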