Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine

arXiv cs.LG / 4/23/2026


Key Points

  • The paper argues that deploying LLMs as autonomous diagnostic agents conflates natural-language communication with probabilistic reasoning, and treats this as an architectural flaw rather than just an engineering limitation.
  • It introduces BMBE (Bayesian Medical Belief Engine), a modular framework that uses an LLM only to parse patient utterances into structured evidence and generate questions, while all diagnostic inference is handled by a deterministic, auditable Bayesian backend.
  • By keeping patient data out of the LLM and isolating the statistical engine as a swappable module, the system is designed to be privacy-preserving by construction and adaptable to different target populations without retraining.
  • The authors claim three capabilities that ordinary autonomous LLMs supposedly cannot provide: calibrated selective diagnosis via an adjustable accuracy–coverage tradeoff, a separation-of-components performance gap where a cheap sensor plus the Bayesian engine beats a frontier standalone model at lower cost, and improved robustness to adversarial or misleading communication styles.
  • Experiments on both empirical and LLM-generated knowledge bases reportedly show that the gains come from the architecture itself rather than from extra information, with the modular system outperforming frontier LLM baselines.

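The core design in the second bullet — the LLM emits only structured evidence, and a deterministic Bayesian backend owns all inference — can be sketched as a naive-Bayes belief update. Note this is an illustrative toy, not the paper's actual engine: the knowledge-base format, disease names, and probabilities below are all assumptions.

```python
# Hedged sketch of BMBE's division of labor: the LLM "sensor" would parse a
# patient utterance into structured (symptom, present) findings; the Bayesian
# engine below then updates a posterior over diagnoses deterministically.
# All priors/likelihoods here are made-up illustrative numbers.

PRIORS = {"flu": 0.05, "cold": 0.20, "covid": 0.03}   # P(disease), hypothetical
LIKELIHOOD = {                                         # P(symptom present | disease)
    "fever": {"flu": 0.90, "cold": 0.10, "covid": 0.70},
    "cough": {"flu": 0.60, "cold": 0.70, "covid": 0.80},
}

def update_beliefs(evidence):
    """Posterior over diseases given structured findings.

    `evidence` is what the sensor would emit, e.g. [("fever", True)];
    symptoms are treated as conditionally independent given the disease.
    """
    posterior = dict(PRIORS)
    for symptom, present in evidence:
        for disease in posterior:
            p = LIKELIHOOD[symptom][disease]
            posterior[disease] *= p if present else (1.0 - p)
    total = sum(posterior.values())
    return {d: v / total for d, v in posterior.items()}

beliefs = update_beliefs([("fever", True), ("cough", True)])
```

Because the backend is just a table of priors and likelihoods, swapping it per target population (as the third bullet claims) amounts to loading a different table — no LLM retraining, and no patient text ever reaches the language model.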
Abstract

Large language models are increasingly deployed as autonomous diagnostic agents, yet they conflate two fundamentally different capabilities: natural-language communication and probabilistic reasoning. We argue that this conflation is an architectural flaw, not an engineering shortcoming. We introduce BMBE (Bayesian Medical Belief Engine), a modular diagnostic dialogue framework that enforces a strict separation between language and reasoning: an LLM serves only as a sensor, parsing patient utterances into structured evidence and verbalising questions, while all diagnostic inference resides in a deterministic, auditable Bayesian engine. Because patient data never enters the LLM, the architecture is private by construction; because the statistical backend is a standalone module, it can be replaced per target population without retraining. This separation yields three properties no autonomous LLM can offer: calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff, a statistical separation gap where even a cheap sensor paired with the engine outperforms a frontier standalone model from the same family at a fraction of the cost, and robustness to adversarial patient communication styles that cause standalone doctors to collapse. We validate across empirical and LLM-generated knowledge bases against frontier LLMs, confirming the advantage is architectural, not informational.
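The "calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff" claimed above has a simple mechanical reading: commit to a diagnosis only when the posterior clears a confidence threshold, and abstain otherwise. A minimal sketch, assuming a posterior dictionary like the one a Bayesian engine would produce (the threshold name `tau` and the example numbers are illustrative, not from the paper):

```python
# Hedged sketch of selective diagnosis: raising tau answers fewer cases
# (lower coverage) but only the high-confidence ones (higher accuracy).
# The abstention branch is where a deployed system would defer to a clinician.

def selective_diagnosis(posterior, tau=0.8):
    """Return the top diagnosis if its posterior mass is at least tau, else None."""
    top = max(posterior, key=posterior.get)
    return top if posterior[top] >= tau else None  # None = abstain / defer

posterior = {"flu": 0.85, "cold": 0.10, "covid": 0.05}
print(selective_diagnosis(posterior, tau=0.8))  # commits: flu
print(selective_diagnosis(posterior, tau=0.9))  # abstains: None
```

Sweeping `tau` from 0 to 1 traces out the accuracy-coverage curve; this kind of explicit, auditable knob is exactly what an end-to-end autonomous LLM, whose "confidence" is not a calibrated posterior, cannot offer by construction.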