BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

arXiv cs.AI / 4/25/2026

Key Points

The paper introduces BioMiner, a multi-modal framework that automates extraction of protein–ligand bioactivity data from scientific literature by explicitly separating “bioactivity semantics” from “ligand structure” reconstruction.
BioMiner infers bioactivity meaning via direct reasoning, while ligand structures are resolved through chemically grounded visual semantic reasoning using multi-modal LLMs, with exact molecular construction handled by chemistry domain tools.
It also presents BioVista, a benchmark dataset containing 16,457 curated bioactivity entries from 500 publications, enabling rigorous evaluation and development.
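BioMiner reports an F1 score of 0.32 for bioactivity triplets and demonstrates practical utility through three applications: building a pre-training database from 82,262 data points extracted from 11,683 papers (3.9% improvement in downstream model performance); supporting a human-in-the-loop workflow that doubles the amount of high-quality NLRP3 bioactivity data (contributing to a 38.6% improvement over 28 QSAR models and the identification of 16 hit candidates with novel scaffolds); and accelerating protein–ligand complex bioactivity annotation (5.59× faster with a 5.75% accuracy improvement over manual workflows on the PoseBusters dataset).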
Overall, the work addresses a key bottleneck in automated bioactivity extraction by combining semantic understanding across text/tables/figures with chemistry-grounded structure reconstruction.
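The separation described in the points above can be pictured as two independent tracks that are merged at the end. The following is only a minimal sketch of that idea, not BioMiner's implementation: the class names, field names, the `merge_tracks` helper, and the toy records are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ActivityRecord:
    """Output of the (hypothetical) semantics track: what was measured."""
    protein: str
    ligand_label: str   # label used in the paper, e.g. a compound number
    activity_type: str  # e.g. "IC50"
    value: float
    unit: str


@dataclass(frozen=True)
class BioactivityEntry:
    """Merged record: semantics plus a resolved ligand structure."""
    protein: str
    smiles: str
    activity_type: str
    value: float
    unit: str


def merge_tracks(
    records: List[ActivityRecord],
    structures: Dict[str, str],
) -> Tuple[List[BioactivityEntry], List[str]]:
    """Join semantic records with structures keyed by ligand label.

    Returns the merged entries and the labels whose structure is missing.
    """
    merged: List[BioactivityEntry] = []
    unresolved: List[str] = []
    for rec in records:
        smiles = structures.get(rec.ligand_label)
        if smiles is None:
            unresolved.append(rec.ligand_label)
            continue
        merged.append(
            BioactivityEntry(rec.protein, smiles, rec.activity_type, rec.value, rec.unit)
        )
    return merged, unresolved


# Toy data for illustration only; none of these values come from the paper.
records = [
    ActivityRecord("NLRP3", "1", "IC50", 12.0, "nM"),
    ActivityRecord("NLRP3", "2", "IC50", 340.0, "nM"),
]
structures = {"1": "CCO"}  # label -> SMILES from the structure track

entries, unresolved = merge_tracks(records, structures)
print(len(entries), unresolved)
```

Keeping the two tracks separate means that a failure to resolve a structure leaves the extracted activity value intact and flagged, rather than silently dropping or corrupting the record.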

Abstract

Protein-ligand bioactivity data published in the literature are essential for drug discovery, yet manual curation struggles to keep pace with the rapidly growing literature. Automated bioactivity extraction remains challenging because it requires not only interpreting biochemical semantics distributed across text, tables, and figures, but also reconstructing chemically exact ligand structures (e.g., Markush structures). To address this bottleneck, we introduce BioMiner, a multi-modal extraction framework that explicitly separates bioactivity semantic interpretation from ligand structure construction. Within BioMiner, bioactivity semantics are inferred through direct reasoning, while chemical structures are resolved via a chemical-structure-grounded visual semantic reasoning paradigm, in which multi-modal large language models operate on chemically grounded visual representations to infer inter-structure relationships, and exact molecular construction is delegated to domain chemistry tools. For rigorous evaluation and method development, we further establish BioVista, a comprehensive benchmark comprising 16,457 bioactivity entries curated from 500 publications. BioMiner validates its extraction ability and provides a quantitative baseline, achieving an F1 score of 0.32 for bioactivity triplets. BioMiner's practical utility is demonstrated via three applications: (1) extracting 82,262 data points from 11,683 papers to build a pre-training database that improves downstream model performance by 3.9%; (2) enabling a human-in-the-loop workflow that doubles the amount of high-quality NLRP3 bioactivity data, contributing to a 38.6% improvement over 28 QSAR models and the identification of 16 hit candidates with novel scaffolds; and (3) accelerating protein-ligand complex bioactivity annotation, achieving a 5.59-fold speed increase and a 5.75% accuracy improvement over manual workflows on the PoseBusters dataset.
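To make the reported triplet-level F1 score concrete, the sketch below computes F1 under exact-match scoring over sets of triplets. It assumes a triplet is a (protein, ligand, activity) tuple and that a prediction counts only if all three fields match; the paper's actual matching criteria (for example, structure canonicalization or value tolerance) are not stated in this summary, and the `triplet_f1` helper and toy data are hypothetical.

```python
from typing import Iterable, Tuple

# Assumed triplet layout: (protein, ligand, activity); not taken from the paper.
Triplet = Tuple[str, str, str]


def triplet_f1(predicted: Iterable[Triplet], gold: Iterable[Triplet]) -> float:
    """Exact-match F1 between predicted and gold triplet sets."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0  # also covers empty predictions or empty gold
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy data for illustration only.
gold = [
    ("NLRP3", "CCO", "IC50=12 nM"),
    ("NLRP3", "CCN", "IC50=340 nM"),
    ("NLRP3", "CCC", "IC50=5 uM"),
]
predicted = [
    ("NLRP3", "CCO", "IC50=12 nM"),   # correct
    ("NLRP3", "CCN", "IC50=34 nM"),   # wrong activity value, so no match
]

score = triplet_f1(predicted, gold)
print(round(score, 2))
```

In the toy case, one of two predictions is correct (precision 0.5) and one of three gold triplets is recovered (recall 1/3), giving an F1 of 0.4. Because a single wrong field invalidates the whole triplet, exact-match scores of this kind tend to be considerably lower than per-field accuracy.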