Injecting Structured Biomedical Knowledge into Language Models: Continual Pretraining vs. GraphRAG

arXiv cs.CL / 4/21/2026


Key Points

  • The paper compares two ways to inject structured biomedical knowledge from the UMLS Metathesaurus into language models: continual pretraining (embedding knowledge into model parameters) and GraphRAG (querying a knowledge graph at inference time).
  • It builds a large UMLS-derived biomedical knowledge graph (3.4M concepts, 34.2M relations) in Neo4j, and generates an ~100M-token text corpus to continually pretrain BERT-based models (BERTUMLS, BioBERTUMLS).
  • Across six BLURB benchmarks, BERTUMLS outperforms the base BERT, especially on knowledge-intensive QA tasks, while results for BioBERTUMLS are more mixed, likely reflecting diminishing returns when the base model already encodes substantial biomedical knowledge.
  • On QA evaluations (PubMedQA and BioASQ), GraphRAG applied to LLaMA 3-8B improves accuracy by over 3 points on PubMedQA and 5 points on BioASQ without retraining, providing transparent, multi-hop, and easily updatable knowledge access.
  • The authors release the processed UMLS Neo4j graph to enable reproducible research.
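The corpus-generation step above (turning graph relations into a ~100M-token pretraining text) can be sketched roughly as follows. This is a hypothetical illustration, not the paper's actual templates: the relation names (`isa`, `may_treat`) are real UMLS relation types, but the verbalization scheme and `verbalize` helper are assumptions for the sake of example.

```python
# Hypothetical sketch: verbalizing knowledge-graph triples into pretraining text.
# The templates below are illustrative, not the paper's actual scheme.

TEMPLATES = {
    "isa": "{h} is a kind of {t}.",
    "may_treat": "{h} may be used to treat {t}.",
}

def verbalize(triples):
    """Render (head, relation, tail) triples as natural-language sentences."""
    sentences = []
    for head, rel, tail in triples:
        # Fall back to a generic template for relations without a pattern.
        template = TEMPLATES.get(rel, "{h} is related to {t}.")
        sentences.append(template.format(h=head, t=tail))
    return sentences

corpus = verbalize([
    ("Aspirin", "may_treat", "Headache"),
    ("Headache", "isa", "Symptom"),
])
# corpus is a list of short factual sentences suitable for continual pretraining
```

Applied over all 34.2M relations, this kind of template expansion would plausibly yield a corpus on the order of the ~100M tokens the paper reports.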

Abstract

The injection of domain-specific knowledge is crucial for adapting language models (LMs) to specialized fields such as biomedicine. While most current approaches rely on unstructured text corpora, this study explores two complementary strategies for leveraging structured knowledge from the UMLS Metathesaurus: (i) continual pretraining, which embeds knowledge into model parameters, and (ii) Graph Retrieval-Augmented Generation (GraphRAG), which consults a knowledge graph at inference time. We first construct a large-scale biomedical knowledge graph from UMLS (3.4 million concepts and 34.2 million relations), stored in Neo4j for efficient querying. We then derive a ~100-million-token textual corpus from this graph to continually pretrain two models: BERTUMLS (from BERT) and BioBERTUMLS (from BioBERT). We evaluate these models on six BLURB (Biomedical Language Understanding and Reasoning Benchmark) datasets spanning five task types, and evaluate GraphRAG on the two QA (Question Answering) datasets (PubMedQA, BioASQ). On BLURB tasks, BERTUMLS improves over BERT, with the largest gains on knowledge-intensive QA. Effects on BioBERT are more nuanced, suggesting diminishing returns when the base model already encodes substantial biomedical text knowledge. Finally, augmenting LLaMA 3-8B with our GraphRAG pipeline yields accuracy gains of over 3 points on PubMedQA and 5 points on BioASQ without any retraining, delivering transparent, multi-hop, and easily updated knowledge access. We release the processed UMLS Neo4j graph to support reproducibility.
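The multi-hop retrieval that GraphRAG performs at inference time can be sketched as a bounded breadth-first walk over the concept graph. This is a minimal stand-in, assuming an in-memory adjacency structure and toy data; the paper's actual pipeline queries the Neo4j graph (e.g. via Cypher), and the `retrieve` function and graph layout here are illustrative assumptions.

```python
from collections import deque

# Toy stand-in for the UMLS graph: concept -> list of (relation, neighbor).
# A real GraphRAG pipeline would query Neo4j instead of this dict.
GRAPH = {
    "Aspirin": [("may_treat", "Headache")],
    "Headache": [("isa", "Symptom")],
}

def retrieve(seed, max_hops=2):
    """Collect facts within max_hops of a seed concept (breadth-first),
    returning verbalized edges to prepend to the LLM prompt as context."""
    facts = []
    frontier = deque([(seed, 0)])
    seen = {seed}
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop budget
        for rel, neighbor in GRAPH.get(node, []):
            facts.append(f"{node} --{rel}--> {neighbor}")
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return facts

context = retrieve("Aspirin")
```

Prepending such retrieved paths to the question is what makes the approach transparent (the supporting facts are inspectable) and updatable (editing the graph changes answers without retraining).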