TokenArena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

arXiv cs.AI / 5/4/2026


Key Points

  • TokenArena introduces a continuous AI inference benchmark that evaluates systems at the deployment-relevant endpoint level (provider, model, and SKU/serving configuration), rather than only at model or provider level.
  • It measures performance along five axes—output speed, time to first token, workload-blended price, effective context, and endpoint quality—and combines these with modeled energy into composites: joules per correct answer, dollars per correct answer, and endpoint fidelity.
  • Results across 78 endpoints spanning 12 model families show that the same model can vary significantly by endpoint: up to 12.5 points in mean math/code accuracy, up to 12 points in distribution “fingerprint” similarity, up to 10x in tail latency, and 6.2x in modeled joules per correct answer.
  • The benchmark’s workload-aware blended pricing substantially reshapes leaderboards: 7 of the 10 top-ranked endpoints under the chat preset (3:1 input:output) fall out of the top 10 under the retrieval-augmented preset (20:1), while the reasoning preset (1:5) favors frontier closed models that the chat preset penalizes on price.
  • The team releases the framework, schema, probe/eval harness, and a v1.0 leaderboard snapshot under CC BY 4.0, positioning TokenArena as a replicable methodology rather than a single fixed ranking.
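The workload-blended pricing above can be sketched as a simple weighted average of per-token input and output prices under each preset's input:output ratio. A minimal illustration, using hypothetical prices (not drawn from the paper's data):

```python
def blended_price_per_mtok(input_price: float, output_price: float,
                           input_ratio: float, output_ratio: float) -> float:
    """Blend per-million-token input/output prices by a workload's
    input:output token ratio, e.g. 3:1 (chat), 20:1 (RAG), 1:5 (reasoning)."""
    total = input_ratio + output_ratio
    return (input_price * input_ratio + output_price * output_ratio) / total

# Hypothetical endpoint priced at $0.50/M input and $1.50/M output tokens:
chat      = blended_price_per_mtok(0.50, 1.50, 3, 1)   # chat preset (3:1)   -> 0.75
rag       = blended_price_per_mtok(0.50, 1.50, 20, 1)  # RAG preset (20:1)   -> ~0.548
reasoning = blended_price_per_mtok(0.50, 1.50, 1, 5)   # reasoning (1:5)     -> ~1.333
```

Because output tokens are typically priced higher than input tokens, the output-heavy reasoning preset inflates the blended price while the input-heavy RAG preset deflates it, which is why the presets reorder the leaderboard.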

Abstract

Public inference benchmarks compare AI systems at the model and provider level, but the unit at which deployment decisions are actually made is the endpoint: the (provider, model, stock-keeping-unit) tuple at which a specific quantization, decoding strategy, region, and serving stack is exposed. We introduce TokenArena, a continuous benchmark that measures inference at endpoint granularity along five core axes (output speed, time to first token, workload-blended price, effective context, and quality on the live endpoint) and synthesizes them, together with a modeled energy estimate, into three headline composites: joules per correct answer, dollars per correct answer, and endpoint fidelity (output-distribution similarity to a first-party reference). The framework's novelty is empirical and methodological. Across 78 endpoints serving 12 model families, the same model on different endpoints differs in mean accuracy by up to 12.5 points on math and code, in fingerprint similarity to first party by up to 12 points, in tail latency by an order of magnitude, and in modeled joules per correct answer by a factor of 6.2. We further show that workload-aware blended pricing reorders the leaderboard substantially: 7 of 10 top-ranked endpoints under the chat preset (3:1 input:output) fall out of the top 10 under the retrieval-augmented preset (20:1), and the reasoning preset (1:5) elevates frontier closed models that the chat preset penalizes on price. We release the framework, schema, probe and eval harness, and a v1.0 leaderboard snapshot under CC BY 4.0. TokenArena is a methodology, not a single ranking; we publish full provenance and limitations and welcome external replication.
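The headline composites reduce to dividing a per-query resource estimate (dollars, or modeled joules) by the endpoint's accuracy on the eval set. A minimal sketch with illustrative numbers; the function name and figures are assumptions, not the paper's implementation:

```python
def per_correct_answer(resource_per_query: float, accuracy: float) -> float:
    """Normalize a per-query resource (dollars, or modeled joules) by
    accuracy, yielding resource spent per *correct* answer. Lower is
    better; accuracy is the fraction of eval items the endpoint solves."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return resource_per_query / accuracy

# Hypothetical endpoint: $0.002 and ~40 J of modeled energy per query,
# solving 80% of eval items:
dollars_per_correct = per_correct_answer(0.002, 0.80)  # 0.0025
joules_per_correct  = per_correct_answer(40.0, 0.80)   # 50.0
```

This normalization is why endpoint-level accuracy gaps (up to 12.5 points for the same model) compound with energy and price differences into spreads as large as the reported 6.2x in modeled joules per correct answer.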