Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent

arXiv cs.AI / 4/13/2026

Opinion | Ideas & Deep Analysis | Models & Research

Key Points

The paper addresses a common problem in biomedical research datasets: legacy metadata are often incomplete or noncompliant with community standards, reducing findability, interoperability, and reuse.
It proposes an ontology-constrained LLM system for metadata standardization. Prior approaches pass constraints to the model as static prompt text and rely on its training knowledge alone; this system instead makes the constraints actionable at inference time.
The system queries authoritative biomedical terminology services in real time to fetch canonically correct vocabulary terms, rather than relying solely on the LLM’s training knowledge.
In an exact-match evaluation on 839 legacy HuBMAP records against an expert-curated gold standard, adding real-time tool access consistently improves prediction accuracy over the LLM alone, for both ontology-constrained and non-ontology-constrained fields.
The results suggest a practical and scalable path toward producing FAIR datasets by combining LLMs, ontology constraints, and live terminology tooling.
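
The lookup idea in the points above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Term` class, the `search_terms` stub, the `standardize` helper, and the `EX:` identifiers are all hypothetical, and an in-memory vocabulary stands in for a live terminology service. A legacy value is replaced only when the lookup returns a single canonical term.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Term:
    label: str            # canonical preferred label
    term_id: str          # placeholder identifier, not a real ontology ID
    synonyms: tuple = ()  # alternative spellings that map to the label


# Hypothetical in-memory vocabulary standing in for a live terminology service.
_VOCABULARY = (
    Term("kidney", "EX:0001", ("renal", "kidneys")),
    Term("heart", "EX:0002", ("cardiac",)),
)


def search_terms(query: str) -> list:
    """Stub tool call: return terms whose label or synonym equals the query."""
    q = query.strip().lower()
    return [t for t in _VOCABULARY if q == t.label or q in t.synonyms]


def standardize(raw_value: str, search: Callable[[str], list]) -> Optional[str]:
    """Return the canonical label for a legacy value, or None if unresolved.

    In a full system an LLM would choose among several candidates; this
    sketch only accepts an unambiguous single match.
    """
    candidates = search(raw_value)
    if len(candidates) == 1:
        return candidates[0].label
    return None
```

Passing the search function as a parameter keeps the resolution logic independent of any particular service, so the stub could be swapped for a real client without changing `standardize`.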

Abstract

Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. When reporting guidelines exist, they typically lack machine-actionable representations. Producing FAIR datasets requires encoding metadata standards as machine-actionable templates with rich field specifications and precise value constraints. Recent work has shown that LLMs guided by field names and ontology constraints can improve metadata standardization, but these approaches treat constraints as static text prompts, relying on the model's training knowledge alone. We present an LLM-based metadata standardization system that queries authoritative biomedical terminology services in real time to retrieve canonically correct vocabulary terms on demand. We evaluate this approach on 839 legacy metadata records from the Human BioMolecular Atlas Program (HuBMAP) using an expert-curated gold standard for exact-match assessment. Our evaluation shows that augmenting the LLM with real-time tool access consistently improves prediction accuracy over the LLM alone across both ontology-constrained and non-ontology-constrained fields, demonstrating a practical, scalable approach to automated standardization of biomedical metadata.
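
The exact-match assessment mentioned in the abstract can be sketched as a per-field comparison between predicted and gold records. This is an illustrative reading, not the authors' evaluation code: the function name, the field names, and the toy records are assumptions, and the numbers below are not results from the paper.

```python
from collections import defaultdict


def exact_match_accuracy(predicted: list, gold: list) -> dict:
    """Per-field fraction of predicted values identical to the gold value."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of records")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pred_rec, gold_rec in zip(predicted, gold):
        for field, gold_value in gold_rec.items():
            totals[field] += 1
            # Exact match: no case folding or whitespace normalization.
            if pred_rec.get(field) == gold_value:
                hits[field] += 1
    return {field: hits[field] / totals[field] for field in totals}


# Toy records with hypothetical field names, for illustration only.
gold = [
    {"organ": "kidney", "assay": "RNA-seq"},
    {"organ": "heart", "assay": "RNA-seq"},
]
llm_only = [
    {"organ": "Kidney", "assay": "RNA-seq"},
    {"organ": "heart", "assay": "rnaseq"},
]
with_tools = [
    {"organ": "kidney", "assay": "RNA-seq"},
    {"organ": "heart", "assay": "rnaseq"},
]

llm_only_scores = exact_match_accuracy(llm_only, gold)
with_tools_scores = exact_match_accuracy(with_tools, gold)
```

Because the comparison is strict, a value such as "Kidney" counts as wrong against a gold value of "kidney", which is why retrieving the canonical spelling matters under this metric.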