PolyReal: A Benchmark for Real-World Polymer Science Workflows

arXiv cs.CV / 4/6/2026


Key Points

  • The paper introduces PolyReal, a new benchmark designed to test multimodal large language models (MLLMs) on real-world polymer science workflows rather than abstract knowledge questions alone.
  • PolyReal evaluates five practice-grounded capabilities spanning the polymer experimentation lifecycle: foundational knowledge application, lab safety analysis, experiment mechanism reasoning, raw data extraction, and performance/application exploration.
  • Results on leading MLLMs show a capability imbalance: models do well on knowledge-intensive tasks (e.g., experiment mechanism reasoning) but decline sharply on practice-based tasks such as lab safety analysis and extracting information from raw data.
  • The findings suggest a significant gap between an MLLM’s ability to reason about science and its ability to apply that knowledge in context-dependent, operational laboratory settings.
  • PolyReal is positioned as a more practical evaluation tool for assessing AI systems intended for real scientific experimentation workflows.

Abstract

Multimodal Large Language Models (MLLMs) excel in general domains but struggle with complex, real-world science. We posit that polymer science, an interdisciplinary field spanning chemistry, physics, biology, and engineering, is an ideal high-stakes testbed due to its diverse multimodal data. Yet, existing benchmarks related to polymer science largely overlook real-world workflows, limiting their practical utility and failing to systematically evaluate MLLMs across the full, practice-grounded lifecycle of experimentation. We introduce PolyReal, a novel multimodal benchmark grounded in real-world scientific practices to evaluate MLLMs on the full lifecycle of polymer experimentation. It covers five critical capabilities: (1) foundational knowledge application; (2) lab safety analysis; (3) experiment mechanism reasoning; (4) raw data extraction; and (5) performance & application exploration. Our evaluation of leading MLLMs on PolyReal reveals a capability imbalance. While models perform well on knowledge-intensive reasoning (e.g., Experiment Mechanism Reasoning), they drop sharply on practice-based tasks (e.g., Lab Safety Analysis and Raw Data Extraction). This exposes a severe gap between abstract scientific knowledge and its practical, context-dependent application, showing that these real-world tasks remain challenging for MLLMs. Thus, PolyReal helps address this evaluation gap and provides a practical benchmark for assessing AI systems in real-world scientific workflows.