Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation

arXiv cs.CL / 4/1/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • Spark-LLM-Eval introduces a distributed, Spark-native framework that evaluates large language models as a data-parallel workload to handle evaluation datasets of hundreds of thousands to millions of samples.
  • The framework is designed for statistical rigor by attaching bootstrap confidence intervals to metrics and using appropriate significance tests (e.g., paired t-tests, McNemar’s test, or Wilcoxon signed-rank) for model comparisons.
  • It improves evaluation iteration speed and reduces inference cost via content-addressable response caching stored in Delta Lake, enabling metric-definition changes without re-running model calls.
  • The paper describes the system architecture and statistical methodology, and reports benchmark results indicating linear scaling with cluster size.
  • The evaluation framework and associated code are released as open source for wider adoption and reproducible large-scale LLM benchmarking.

Abstract

Evaluating large language models at scale remains a practical bottleneck for many organizations. While existing evaluation frameworks work well for thousands of examples, they struggle when datasets grow to hundreds of thousands or millions of samples. This scale is common when assessing model behavior across diverse domains or conducting comprehensive regression testing. We present Spark-LLM-Eval, a distributed evaluation framework built natively on Apache Spark. The system treats evaluation as a data-parallel problem, partitioning examples across executors and aggregating results with proper statistical accounting. Beyond raw throughput, we emphasize statistical rigor: every reported metric includes bootstrap confidence intervals, and model comparisons come with appropriate significance tests (paired t-tests, McNemar's test, or Wilcoxon signed-rank, depending on the metric type). The framework also addresses the cost problem inherent in LLM evaluation through content-addressable response caching backed by Delta Lake, which allows iterating on metric definitions without re-running inference. We describe the system architecture and statistical methodology, and report benchmark results showing linear scaling with cluster size. The framework and all evaluation code are available as open source.
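
The abstract does not show the framework's API, but the data-parallel pattern it describes can be sketched roughly as follows. The table path, column names, and the exact-match metric are illustrative assumptions, not Spark-LLM-Eval's actual interface, and the Delta read assumes the delta-spark package is configured on the session.

```python
# Minimal sketch: score evaluation examples as a data-parallel Spark job.
# Columns (model, prompt, reference, response) are assumed for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("llm-eval-sketch").getOrCreate()

# One row per evaluation example: prompt, gold reference, cached model response.
examples = spark.read.format("delta").load("/tables/eval_examples")

# Expressing a per-row metric as a column expression keeps the work on executors.
scored = examples.withColumn(
    "exact_match",
    (F.trim(F.lower(F.col("response"))) == F.trim(F.lower(F.col("reference")))).cast("double"),
)

# Aggregate per model; confidence intervals would be attached in a separate step.
summary = scored.groupBy("model").agg(
    F.avg("exact_match").alias("accuracy"),
    F.count("*").alias("n"),
)
summary.show()
```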
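The statistical layer described in the abstract, bootstrap confidence intervals plus paired significance tests, follows standard recipes. The sketch below shows a percentile bootstrap and a paired t-test on per-example scores; the resample count, alpha, and example data are illustrative defaults, not values from the paper.

```python
# Sketch of the statistical layer: percentile-bootstrap CI for a mean metric
# and a paired significance test between two models on the same examples.
import numpy as np
from scipy import stats

def bootstrap_ci(scores: np.ndarray, n_resamples: int = 10_000, alpha: float = 0.05):
    """Percentile bootstrap confidence interval for the mean of per-example scores."""
    rng = np.random.default_rng(0)
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    means = scores[idx].mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

# Continuous per-example scores (e.g., judge scores in [0, 1]) for two models.
model_a = np.array([0.80, 0.70, 0.90, 0.60, 0.75, 0.85, 0.70, 0.65])
model_b = np.array([0.70, 0.70, 0.85, 0.50, 0.70, 0.80, 0.60, 0.60])

print("model A mean, 95% CI:", model_a.mean(), bootstrap_ci(model_a))
# Paired t-test on per-example differences; per the paper, McNemar's test
# suits binary metrics and Wilcoxon signed-rank suits non-normal differences.
print(stats.ttest_rel(model_a, model_b))
```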
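Content-addressable caching, as described, keys each response by everything that determines it, so changing a metric definition never invalidates cached inference. A minimal sketch of that idea is below; the table path, column names, and `cache_key` helper are hypothetical, not the framework's actual schema.

```python
# Sketch of content-addressable response caching backed by a Delta table.
import hashlib
import json

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("llm-eval-cache-sketch").getOrCreate()

def cache_key(model: str, prompt: str, params: dict) -> str:
    """Hash of the fully specified request; identical requests share one key."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True, ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key = cache_key("my-model-v2", "Summarize the following passage...", {"temperature": 0.0})

# Check the Delta-backed cache before calling the model; only misses trigger
# inference, and new responses are appended to the same table afterwards.
cached = (
    spark.read.format("delta").load("/tables/response_cache")
    .where(F.col("key") == key)
    .limit(1)
    .collect()
)
response = cached[0]["response"] if cached else None  # call the model on a miss
```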