If Only My CGM Could Speak: A Privacy-Preserving Agent for Question Answering over Continuous Glucose Data

arXiv cs.AI / April 21, 2026


Key Points

  • The paper introduces CGM-Agent, a privacy-preserving framework that enables question answering over continuous glucose monitor (CGM) data rather than relying on static summaries.
  • In the proposed architecture, an LLM acts only as a reasoning component to choose analytical functions, while all computation runs locally so that users’ health data never leaves their device.
  • The authors create a benchmark of 4,180 questions using parameterized templates plus real user queries, with answers validated via deterministic program execution.
  • Experiments with six leading LLMs show strong performance (94% value accuracy on synthetic queries and 88% on ambiguous real-world queries), with most errors caused by intent and temporal ambiguity.
  • The results indicate that lightweight models can perform competitively within the agent design, and the team releases code and the benchmark to advance trustworthy health agents.
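The privacy-preserving design described above can be illustrated with a short sketch. This is not the authors' implementation: the function names, the tool registry, and the stubbed LLM call are all hypothetical. The key idea it demonstrates is that only the question and the tool schema would ever reach a remote model, while the glucose readings are processed entirely on-device.

```python
# Hypothetical sketch of the CGM-Agent pattern: an LLM (stubbed here)
# only selects an analytical function and its arguments; the CGM
# readings stay local and are processed on-device.
from statistics import mean
from typing import Callable

# Local analytical functions over on-device CGM readings (mg/dL).
def time_in_range(readings: list[float], low: float = 70, high: float = 180) -> float:
    """Fraction of readings inside the [low, high] target range."""
    return sum(low <= r <= high for r in readings) / len(readings)

def average_glucose(readings: list[float]) -> float:
    """Mean glucose over the given readings."""
    return mean(readings)

# Registry the reasoning engine selects from; only the names and
# signatures would be shared with the LLM, never the readings.
TOOLS: dict[str, Callable] = {
    "time_in_range": time_in_range,
    "average_glucose": average_glucose,
}

def stub_llm_select(question: str) -> tuple[str, dict]:
    """Stand-in for a real LLM call mapping a question to a tool + args."""
    if "range" in question.lower():
        return "time_in_range", {"low": 70, "high": 180}
    return "average_glucose", {}

def answer(question: str, readings: list[float]) -> float:
    tool_name, args = stub_llm_select(question)  # remote reasoning, no raw data
    return TOOLS[tool_name](readings, **args)    # computation stays local

readings = [65, 110, 150, 200, 95, 130]
print(answer("What fraction of the day was I in range?", readings))  # 4/6
print(answer("What was my average glucose?", readings))              # 125.0
```

In the paper's framework the stub would be replaced by an actual LLM choosing among the exposed analytical functions; the separation between selection and execution is what keeps health data on the device.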

Abstract

Continuous glucose monitors (CGMs) used in diabetes care collect rich personal health data that could improve day-to-day self-management. However, current patient platforms offer only static summaries, which do not support inquisitive user queries. Large language models (LLMs) could enable free-form inquiries about continuous glucose data, but deploying them over sensitive health records raises privacy and accuracy concerns. In this paper, we present CGM-Agent, a privacy-preserving framework for question answering over personal glucose data. In our design, the LLM serves purely as a reasoning engine that selects analytical functions. All computation occurs locally, and personal health data never leaves the user's device. For evaluation, we construct a benchmark of 4,180 questions combining parameterized question templates with real user queries and ground truth derived from deterministic program execution. Evaluating six leading LLMs, we find that top models achieve 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries. Errors stem primarily from intent and temporal ambiguity rather than computational failures. Additionally, lightweight models achieve competitive performance in our agent design, suggesting opportunities for low-cost deployment. We release our code and benchmark to support future work on trustworthy health agents.
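The evaluation recipe in the abstract (parameterized templates plus ground truth from deterministic execution) can be sketched as follows. All names, the sample data, and the tolerance here are illustrative assumptions, not the authors' benchmark code; the point is that the reference answer comes from running a deterministic program over the data, so value accuracy can be checked exactly.

```python
# Illustrative sketch (not the paper's code): ground truth for a
# parameterized question template is produced by deterministic execution
# over the underlying readings, then compared to a model's answer.
from statistics import mean

# Hypothetical per-day CGM readings (mg/dL), keyed by date string.
DATA = {
    "2026-04-20": [100, 120, 140],
    "2026-04-21": [90, 110, 130, 150],
}

# A parameterized question template; filling the slot yields one benchmark item.
TEMPLATE = "What was my average glucose on {day}?"

def ground_truth(day: str) -> float:
    """Deterministic program whose output is the reference answer."""
    return mean(DATA[day])

def value_accurate(model_answer: float, day: str, tol: float = 0.5) -> bool:
    """Value-accuracy check: model answer within tolerance of ground truth."""
    return abs(model_answer - ground_truth(day)) <= tol

question = TEMPLATE.format(day="2026-04-21")
print(question, "->", ground_truth("2026-04-21"))  # 120.0
print(value_accurate(119.8, "2026-04-21"))         # True
```

Because the reference is computed rather than annotated, scaling the benchmark to thousands of questions is just a matter of enumerating template parameters.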