OptimusKG: Unifying biomedical knowledge in a modern multimodal graph

arXiv cs.AI / 5/1/2026


Key Points

  • The paper introduces OptimusKG, a multimodal biomedical labeled property graph designed to unify knowledge from structured and semi-structured sources while preserving schema-level constraints and type-specific metadata.
  • OptimusKG is built as an LPG with a top-level schema for nodes and edges and retains granular properties, cross-references, and provenance across molecular, anatomical, clinical, and environmental domains.
  • The released graph is large-scale, containing 190,531 nodes (10 entity types) and 21,813,816 edges (26 relation types) with over 67 million property instances spanning 150 property keys sourced from 18 ontologies and controlled vocabularies.
  • To validate the graph, the authors used a multimodal literature-checking agent (PaperQA3) and found that 70.0% of sampled edges had supporting evidence, while 83.4% of sampled false edges lacked such evidence.
  • The dataset is distributed as Apache Parquet files to support graph-based machine learning and knowledge-grounded retrieval with large language models, including biomedical discovery tasks like hypothesis generation.
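To make the idea of a labeled property graph with an enforced top-level schema concrete, here is a minimal sketch in Python. It is not the OptimusKG implementation: the entity types, relation types, property keys, and identifiers below are hypothetical placeholders chosen for illustration, and the class names (`Node`, `Edge`, `PropertyGraph`) are assumptions rather than anything from the release.

```python
from dataclasses import dataclass, field

# Hypothetical top-level schema. These names are illustrative only and are
# not taken from the OptimusKG release.
NODE_TYPES = {"Gene", "Disease", "Drug"}
EDGE_TYPES = {
    # relation type -> (allowed source node type, allowed target node type)
    "ASSOCIATED_WITH": ("Gene", "Disease"),
    "TREATS": ("Drug", "Disease"),
}


@dataclass
class Node:
    id: str
    label: str
    # Type-specific metadata, cross-references, and provenance live here.
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    relation: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Tiny in-memory LPG that rejects anything outside the schema."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node):
        if node.label not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node.label}")
        self.nodes[node.id] = node

    def add_edge(self, edge):
        if edge.relation not in EDGE_TYPES:
            raise ValueError(f"unknown relation type: {edge.relation}")
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise ValueError("edge endpoints must be existing nodes")
        expected = EDGE_TYPES[edge.relation]
        actual = (self.nodes[edge.source].label, self.nodes[edge.target].label)
        if actual != expected:
            raise ValueError(f"{edge.relation} expects {expected}, got {actual}")
        self.edges.append(edge)


graph = PropertyGraph()
graph.add_node(Node("g1", "Gene", {"symbol": "GENE_A", "xrefs": ["EXAMPLE:0001"]}))
graph.add_node(Node("d1", "Disease", {"name": "disease A"}))
graph.add_edge(Edge("g1", "d1", "ASSOCIATED_WITH", {"source_db": "example_db"}))
```

In this sketch the schema check happens at insertion time, so an edge such as `Edge("d1", "g1", "TREATS")` is rejected because its endpoint types do not match the declared relation, while properties remain free-form per node and edge.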

Abstract

Biomedical knowledge graphs (KGs) are widely used in the life sciences, yet many are derived from unstructured documents and therefore lack schema-level constraints, whereas graphs assembled from structured resources are difficult to harmonize into a unified representation. We present OptimusKG, a multimodal biomedical labeled property graph (LPG) built from structured and semi-structured resources to preserve factual, type-specific metadata across molecular, anatomical, clinical, and environmental domains. OptimusKG contains 190,531 nodes across 10 entity types, 21,813,816 edges across 26 relation types, and 67,249,863 property instances encoding 110,276,843 values across 150 distinct property keys, derived from 18 ontologies and controlled vocabularies. The graph enforces a top-level schema for nodes and edges and retains granular, type-specific properties, cross-references, and provenance across molecular, anatomical, clinical, and environmental domains. We assessed the validity of OptimusKG by evaluating whether graph relationships are supported by evidence from the scientific literature using a multimodal agent, PaperQA3. PaperQA3 identified supporting evidence for 70.0% of sampled edges, whereas 83.4% of sampled false edges received no supporting evidence. Edges without literature support were concentrated in associations derived from experimental and functional genomics resources, suggesting that OptimusKG captures biomedical knowledge that may precede synthesis in the scientific literature. OptimusKG is distributed as Apache Parquet files, providing a standardized resource for graph-based machine learning, knowledge-grounded retrieval with large language models, and biomedical discovery use cases such as hypothesis generation.
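As a quick sanity check on the scale figures in the abstract, the short Python sketch below derives a few ratios from the reported counts. The input numbers come from the abstract; the derived ratios and the variable names are our own arithmetic and are not reported by the authors.

```python
# Counts as reported in the abstract.
NODES = 190_531
EDGES = 21_813_816
PROPERTY_INSTANCES = 67_249_863
PROPERTY_VALUES = 110_276_843

# Derived ratios (our arithmetic, not figures from the paper).
edges_per_node = EDGES / NODES                              # about 114.49
values_per_instance = PROPERTY_VALUES / PROPERTY_INSTANCES  # about 1.64
instances_per_element = PROPERTY_INSTANCES / (NODES + EDGES)  # about 3.06

# Complements of the reported validation rates, in percent.
unsupported_sampled_edges = round(100 - 70.0, 1)  # sampled edges with no evidence found
supported_false_edges = round(100 - 83.4, 1)      # false edges that still drew evidence

print(f"edges per node:               {edges_per_node:.2f}")
print(f"values per property instance: {values_per_instance:.2f}")
print(f"property instances per node or edge: {instances_per_element:.2f}")
print(f"sampled edges without support: {unsupported_sampled_edges}%")
print(f"false edges with support:      {supported_false_edges}%")
```

Note that the edges-per-node ratio is not the mean node degree, since each edge touches two nodes. The ratios suggest a dense graph in which a typical property instance holds more than one value, which is consistent with multi-valued fields such as cross-references.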