BatteryPass-12K: The First Dataset for the Novel Digital Battery Passport Conformance Task

arXiv cs.CL / 5/1/2026

Key Points

  • The paper introduces a new “digital battery passport (DBP) conformance” classification task and provides the first public benchmark dataset, BatteryPass-12K, built synthetically from real pilot samples.
  • With EU DBP regulations coming into effect soon and no prior public datasets available, the authors release BatteryPass-12K under a permissive CC-BY-4.0 license to enable evaluation and research.
  • The study evaluates 22 language models in zero-shot inference, spanning small LMs (SLMs), mixture-of-experts (MoE) models, and dense LLMs, and reports that thinking models perform best (e.g., GPT-5.4 reaches an average F1 of 0.98 on the validation set and 0.71 on the test set).
  • Additional experiments show that few-shot examples improve performance significantly, generally capable frontier models still find the task challenging, scaling parameters alone does not guarantee better performance (some SLMs outperformed some LLMs), and prompt-injection attacks degrade performance.
  • Although BatteryPass-12K focuses on pilot samples, the authors suggest it could be leveraged for other battery-domain tasks such as lifecycle reasoning.
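
To illustrate the difference between the zero-shot and few-shot settings mentioned above, here is a minimal sketch of how a classification prompt might be assembled. The label names, the prompt wording, and the `build_prompt` helper are hypothetical; the summary does not describe the paper's actual label set or prompt template.

```python
# Hypothetical label set; the real BatteryPass-12K labels are not given here.
LABELS = ("conformant", "non-conformant")


def build_prompt(record, examples=()):
    """Assemble a classification prompt for one passport record.

    With no examples this is a zero-shot prompt; passing (record, label)
    pairs as `examples` turns it into a few-shot prompt.
    """
    parts = [
        "Classify the digital battery passport record as one of: "
        + ", ".join(LABELS) + "."
    ]
    # Few-shot demonstrations come before the record to be classified.
    for ex_record, ex_label in examples:
        parts.append(f"Record: {ex_record}\nLabel: {ex_label}")
    # The target record ends with an open label slot for the model to fill.
    parts.append(f"Record: {record}\nLabel:")
    return "\n\n".join(parts)


zero_shot = build_prompt("capacity=75kWh; chemistry=NMC")
few_shot = build_prompt(
    "capacity=75kWh; chemistry=NMC",
    examples=[
        ("capacity=60kWh; chemistry=LFP", "conformant"),
        ("capacity=; chemistry=LFP", "non-conformant"),
    ],
)
```

The record strings are made-up placeholders, not samples from the dataset.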

Abstract

We introduce the novel task of digital battery passport (DBP) conformance classification and the first public benchmark for it: BatteryPass-12K, created synthetically from real pilot samples. The work is motivated by the EU's battery regulation on DBPs coming into effect soon and by the absence of any public dataset. We evaluated 22 language models (LMs) in zero-shot inference, spanning small LMs (SLMs), mixture-of-experts (MoE) models, and dense LLMs. We also conducted further analysis and additional evaluations of few-shot inference and prompt-injection attacks. We find that (1) thinking models perform best, with GPT-5.4 scoring an average F1 of 0.98 (0.03) on the validation set and 0.71 (0.22) on the test set, where the parentheses give the 95% confidence interval, (2) few-shot examples improve performance significantly, (3) generally capable frontier models find the task challenging, (4) merely scaling model parameters does not necessarily improve performance, as some SLMs outperformed some LLMs, and (5) prompt-injection attacks degrade performance. We note that BatteryPass-12K, though limited to real pilot samples, may be useful for other known or emerging tasks in the battery domain, e.g., lifecycle reasoning. We publicly release the dataset under a permissive licence (CC-BY-4.0).
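
The abstract reports F1 as an average with a 95% confidence interval. The following sketch shows one common way to produce such a figure from several evaluation runs. It assumes a binary label, a normal-approximation interval, and that the parenthesized value is the interval half-width; the abstract does not state how the authors computed their intervals, and the function names are illustrative.

```python
import math
from statistics import mean, stdev


def binary_f1(y_true, y_pred, positive=1):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_with_ci95(scores):
    """Return (mean, half-width) of a normal-approximation 95% interval.

    Assumption: scores come from independent runs; with a single run
    no interval can be estimated, so the half-width is reported as 0.0.
    """
    m = mean(scores)
    if len(scores) < 2:
        return m, 0.0
    half_width = 1.96 * stdev(scores) / math.sqrt(len(scores))
    return m, half_width


# Toy usage with made-up per-run F1 scores (not results from the paper).
run_scores = [0.6, 0.8]
avg_f1, ci_half_width = mean_with_ci95(run_scores)
```

With only a handful of runs, a t-distribution or bootstrap interval would usually be wider and more appropriate than the normal approximation used here.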