A phenotype-driven and evidence-governed framework for knowledge graph enrichment and hypotheses discovery in population data

arXiv cs.AI / 4/21/2026


Key Points

  • The paper argues that today’s knowledge graph (KG) building methods are largely confirmatory and proposes a shift toward phenotype-driven, hypothesis-focused discovery.
  • It presents a unified pipeline that uses GNNs for phenotype discovery and combines causal inference, probabilistic reasoning, and LLM-based hypothesis generation and claim extraction.
  • KG expansion is framed as a multi-objective optimization problem that scores candidate claims by relevance, structural validation, and novelty, using Pareto-optimal selection to avoid redundant or trivial facts.
  • Experiments on heterogeneous population datasets show improved interpretability of phenotypes, discovery of context-dependent causal relationships, and high-quality, evidence-aligned claims.
  • In retrieval-augmented setups, the approach improves retrieval performance (Recall@5 = 0.98) and lowers the hallucination rate (0.05) compared with rule-based and LLM-only baselines.
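
The Pareto-optimal selection mentioned in the key points can be illustrated with a minimal sketch. It assumes each candidate claim already carries three scores in [0, 1] (relevance, structural validation, novelty) where higher is better. The `Claim` type, the example claims and their scores are hypothetical, and the paper's actual scoring functions are not described here.

```python
from typing import List, NamedTuple


class Claim(NamedTuple):
    """Hypothetical candidate claim with three objective scores (higher is better)."""
    text: str
    relevance: float
    validation: float
    novelty: float


def dominates(a: Claim, b: Claim) -> bool:
    """True if `a` is at least as good as `b` on every objective and strictly better on one."""
    a_scores, b_scores = a[1:], b[1:]
    at_least_as_good = all(x >= y for x, y in zip(a_scores, b_scores))
    strictly_better = any(x > y for x, y in zip(a_scores, b_scores))
    return at_least_as_good and strictly_better


def pareto_front(claims: List[Claim]) -> List[Claim]:
    """Keep only the non-dominated claims (simple O(n^2) scan, preserves input order)."""
    return [c for c in claims if not any(dominates(other, c) for other in claims)]


# Illustrative scores only; they are assumptions, not results from the paper.
claims = [
    Claim("A", relevance=0.9, validation=0.8, novelty=0.1),  # well supported, not novel
    Claim("B", relevance=0.6, validation=0.7, novelty=0.8),
    Claim("C", relevance=0.5, validation=0.6, novelty=0.7),  # worse than B on every objective
    Claim("D", relevance=0.3, validation=0.9, novelty=0.9),
]

front = pareto_front(claims)
print([c.text for c in front])
```

In this toy example claim C is dropped because B is better on all three objectives, while A, B and D survive because each is best on at least one trade-off. A single weighted sum would instead force a fixed exchange rate between confirmation and novelty, which is the choice the Pareto formulation avoids.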

Abstract

Current knowledge graph (KG) construction methods are confirmatory, focusing on recovering known relationships rather than identifying novel or context-dependent nodes. This paper proposes a phenotype-driven and evidence-governed framework that shifts the paradigm toward structured hypothesis discovery and controlled KG expansion. The approach integrates graph neural networks (GNNs) for phenotype discovery, causal inference, probabilistic reasoning and large language models (LLMs) for hypothesis generation and claim extraction within a unified pipeline. The framework prioritizes relationships that are both structurally supported by data and underexplored in the literature. KG expansion is formulated as a multi-objective optimization problem, where candidate claims are jointly evaluated in terms of relevance, structural validation and novelty. Pareto-optimal selection enables the identification of non-dominated claims that balance confirmation and discovery, avoiding trivial or redundant knowledge inclusion. Experiments on heterogeneous population datasets demonstrate that the proposed framework produces more interpretable phenotypes, reveals context-dependent causal structures and generates high-quality claims that align with both data and scientific evidence. Compared to rule-based and LLM-only baselines, the method achieves the best trade-off across plausibility, novelty, validation and relevance. In retrieval-augmented settings, it significantly improves performance (Recall@5=0.98) while reducing hallucination rates (0.05), highlighting its effectiveness in grounding LLM outputs.
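
For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of one common definition of Recall@k: the fraction of a query's relevant items that appear among the top k retrieved items. The abstract does not say which variant the authors use (some works report a hit rate instead), so this definition, the function name and the example identifiers are assumptions.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of relevant items found in the top-k retrieved items."""
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant items")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranked retrieval result and ground-truth set for a single query.
retrieved = ["c1", "c7", "c3", "c9", "c4", "c2"]
relevant = {"c1", "c2", "c3"}

score = recall_at_k(retrieved, relevant, k=5)
print(round(score, 3))
```

Here two of the three relevant claims ("c1" and "c3") fall within the first five results, while "c2" is ranked sixth and is therefore missed. A reported figure such as Recall@5 = 0.98 would typically be the average of this quantity over all evaluation queries.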