Retrieve, Then Classify: Corpus-Grounded Automation of Clinical Value Set Authoring

arXiv cs.CL / 4/17/2026


Key Points

  • The paper introduces Retrieval-Augmented Set Completion (RASC) for clinical value set authoring by first retrieving similar existing value sets from a curated corpus and then classifying candidate codes rather than generating codes directly with an LLM.
  • RASC shrinks the effective output space from the full standardized vocabulary to a much smaller retrieved candidate pool, which the authors argue improves statistical efficiency.
  • Experiments on 11,803 publicly available VSAC value sets (the first large-scale benchmark for this task) show a SAPBert cross-encoder achieving an AUROC of 0.852 and a value-set-level F1 of 0.298, outperforming a simpler MLP and retrieval-only baselines.
  • Zero-shot GPT-4o reaches a value-set-level F1 of only 0.105, with 48.6% of its returned codes absent from VSAC, and the gap to RASC widens as value set size grows. Relative to retrieval alone, the RASC classifiers cut irrelevant candidates per true positive from 12.3 to roughly 3.2 (cross-encoder) and 4.4 (MLP).
  • The authors report consistent gains across multiple classifier types (SAPBert-based cross-encoder and LightGBM) and provide code and benchmark dataset creation scripts on GitHub.
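
The retrieve-then-classify idea in the points above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the toy corpus and its code identifiers are invented, token-overlap (Jaccard) similarity stands in for whatever retrieval model the paper uses, and a simple vote-fraction threshold stands in for the trained classifier.

```python
from collections import Counter

# Hypothetical toy corpus: value set titles mapped to made-up code identifiers.
CORPUS = {
    "Type 2 diabetes mellitus": {"C001", "C002", "C003"},
    "Diabetes mellitus complications": {"C002", "C003", "C004"},
    "Essential hypertension": {"C010", "C011"},
    "Hypertension in pregnancy": {"C011", "C012"},
}


def tokens(text):
    """Lowercased whitespace tokens of a title."""
    return set(text.lower().split())


def similarity(a, b):
    """Jaccard overlap of title tokens (an assumed stand-in for a learned similarity)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query, corpus, k):
    """Return the titles of the k value sets most similar to the query."""
    ranked = sorted(corpus, key=lambda title: similarity(query, title), reverse=True)
    return ranked[:k]


def candidate_pool(titles, corpus):
    """Count, for each code, how many retrieved value sets contain it."""
    counts = Counter()
    for title in titles:
        counts.update(corpus[title])
    return counts


def classify(counts, n_retrieved, threshold):
    """Stand-in classifier: keep codes present in at least `threshold` of the retrieved sets."""
    return {code for code, n in counts.items() if n / n_retrieved >= threshold}


def rasc(query, corpus, k=2, threshold=0.75):
    """Retrieve k similar value sets, pool their codes, then filter the pool."""
    titles = retrieve(query, corpus, k)
    pool = candidate_pool(titles, corpus)
    return classify(pool, len(titles), threshold)


if __name__ == "__main__":
    print(sorted(rasc("diabetes mellitus in adults", CORPUS)))
```

The point of the sketch is the shape of the pipeline: the classifier only ever scores codes in the retrieved pool, never the full vocabulary, so it cannot return a code that does not exist in the corpus.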

Abstract

Clinical value set authoring, the task of identifying all codes in a standardized vocabulary that define a clinical concept, is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version-controlled, and not reliably memorized during pretraining. We propose Retrieval-Augmented Set Completion (RASC): retrieve the K most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve-and-select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the first large-scale benchmark for this task. A cross-encoder fine-tuned on SAPBert achieves an AUROC of 0.852 and a value-set-level F1 of 0.298, outperforming a simpler three-layer multilayer perceptron (AUROC 0.799, F1 0.250). Both models reduce the number of irrelevant candidates per true positive from 12.3 (retrieval-only) to approximately 3.2 and 4.4, respectively. Zero-shot GPT-4o achieves a value-set-level F1 of 0.105, with 48.6% of returned codes absent from VSAC entirely. This performance gap widens with increasing value set size, consistent with RASC's theoretical advantage. We observe similar performance gains across two other classifier model types, namely a cross-encoder initialized from pre-trained SAPBert and a LightGBM model, demonstrating that RASC's benefits extend beyond a single model class. The code to download and create the benchmark dataset, as well as the model training code, is available at https://github.com/mukhes3/RASC.
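
To make the two set-level metrics quoted above concrete, here is a small sketch of how they might be computed. The definitions are assumptions, since the abstract does not spell them out: value-set-level F1 is taken as the mean of per-value-set F1 scores, and irrelevant candidates per true positive is taken as total false positives divided by total true positives. The function names are hypothetical.

```python
def f1_score_sets(predicted, truth):
    """F1 between a predicted code set and the reference code set."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def value_set_level_f1(pairs):
    """Assumed definition: average the per-value-set F1 over all (predicted, truth) pairs."""
    scores = [f1_score_sets(predicted, truth) for predicted, truth in pairs]
    return sum(scores) / len(scores) if scores else 0.0


def irrelevant_per_true_positive(pairs):
    """Assumed definition: pooled false positives divided by pooled true positives."""
    false_pos = sum(len(predicted - truth) for predicted, truth in pairs)
    true_pos = sum(len(predicted & truth) for predicted, truth in pairs)
    return false_pos / true_pos if true_pos else float("inf")


if __name__ == "__main__":
    # Two made-up value sets: (predicted codes, reference codes).
    example = [
        ({"a", "b", "c"}, {"a", "b"}),
        ({"x"}, {"x", "y"}),
    ]
    print(value_set_level_f1(example))
    print(irrelevant_per_true_positive(example))
```

Under these assumed definitions, a ratio of 12.3 means a reviewer would see about twelve irrelevant codes for every correct one, whereas 3.2 means about three.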