eTPS — Effective Tokens Per Second: A Better Way to Measure Local LLM Performance

Reddit r/artificial / 5/7/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis

Key Points

  • The article argues that raw tokens per second (TPS) measures token throughput rather than how quickly users get a correct, usable answer, especially in multi-turn tasks.
  • It proposes eTPS (Effective Tokens Per Second), a metric that weights the final accepted output by how clean the path was to reach it, so correction loops, hallucinations, and repeated explanations reduce the score.
  • Example runs on the same prompt and hardware show large disagreements between raw TPS and eTPS, where a faster model can lose if it produces partial or incorrect results.
  • The author emphasizes eTPS is complementary to raw TPS, not a replacement, and notes key limitations: scoring involves human judgment, one task may not represent sustained workflows, and the metric could be gamed without full prompt logging.
  • A full specification is planned, including methodology, a task library, scoring protocol, and reproducibility standards, with open questions about how to penalize confident falsehood vs vague answers and whether hardware normalization should be built into the formula.

We're obsessed with raw tokens per second. Every hardware post leads with it. Every quantization comparison is ranked by it. It's the one number everyone agrees to report.

It's also measuring the wrong thing.

Raw TPS tells you how fast tokens hit the screen. It tells you almost nothing about how quickly you get a correct, usable answer. On sustained, multi-turn workflows, that gap becomes massive.

A faster model that hallucinates, requires multiple corrections, and forgets context you gave it earlier can easily be less useful than a slower model that gets it right the first time.

eTPS (Effective Tokens Per Second) is a complementary metric that measures actual progress toward a useful answer, not just token throughput.

The basic idea: weight the final accepted output by how clean the path to that answer was — first-pass correct scores highest — then divide by total time. Correction loops, hallucinations, and repeated explanations all reduce the score. A response that never reaches a correct answer scores zero regardless of speed.

It doesn't replace raw TPS. It sits next to it.
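
To make the arithmetic concrete, here's a minimal sketch in Python. The weight values, outcome labels, and function name are my own illustration, not anything from the post or the forthcoming spec:

```python
# Hypothetical path-cleanliness weights (placeholders, not the spec's values).
WEIGHTS = {
    "first_pass_correct": 1.0,  # clean first answer scores highest
    "corrected": 0.7,           # needed one or more correction loops
    "partial": 0.5,             # only reached a partially correct answer
    "never_correct": 0.0,       # never reached a correct answer -> eTPS is 0
}

def etps(accepted_tokens: int, total_seconds: float, outcome: str) -> float:
    """Effective tokens per second: the final accepted output tokens,
    weighted by how clean the path to the answer was, divided by total
    wall-clock time (including every retry and correction turn)."""
    return WEIGHTS[outcome] * accepted_tokens / total_seconds
```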

Results — same prompt, four runs, same hardware:

  • gemma-4-e2b (4.6B): 53.2 raw TPS → eTPS 53.18 ✓
  • qwen3.5-0.8b: 173.1 raw TPS → eTPS 86.57 ✗ partial
  • qwen3.5-9b (optimized): 1.8 raw TPS → eTPS 1.78 ✓
  • qwen3.5-9b (baseline): 0.5 raw TPS → eTPS 0.32 ✗ partial

The 0.8B led on raw speed by a wide margin and still lost. Raw TPS said it won. eTPS said it didn't.
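
The reported numbers look roughly consistent with a partial-credit weight near 0.5: the clean passes keep almost their full raw TPS (53.2 → 53.18), while the partial 0.8B run is cut about in half. Plugging invented counts into the sketch above reproduces that; the token and time values below are chosen only to match the reported 173.1 raw TPS, not taken from the actual runs:

```python
# Hypothetical counts for the 0.8B run, chosen to match its 173.1 raw TPS.
accepted_tokens, total_seconds = 1731, 10.0

raw_tps = accepted_tokens / total_seconds                    # 173.1
effective = etps(accepted_tokens, total_seconds, "partial")
print(raw_tps, effective)                                    # 173.1 86.55 (reported: 86.57)
```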

Hardware: RTX 5060 Laptop, 8GB VRAM. eTPS scores aren't portable across hardware — always report your full setup.

Known limitations (v0.1):

  • Scoring requires human judgment. The line between "needed clarification" and "was factually wrong" isn't always clean. Code generation with objective pass/fail criteria is a cleaner target and the focus of the next benchmark run.
  • One task isn't representative of sustained multi-turn workflows — that's where the metric gets most interesting and where I'm headed next.
  • Easy to game without full system prompt logging. The spec will require it.

These are acknowledged constraints, not hidden flaws.
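
On the gaming point: a run log that's hard to game would need at least the verbatim system prompt, every turn, and the hardware context. A sketch of what such a record might contain; the field names here are my own guess, not the spec's:

```python
import json

# Hypothetical per-run log record. All field names are illustrative.
run_log = {
    "model": "qwen3.5-0.8b",
    "hardware": "RTX 5060 Laptop, 8GB VRAM",
    "system_prompt": "<logged verbatim, required to prevent gaming>",
    "turns": [
        {"role": "user", "content": "<task prompt>"},
        {"role": "assistant", "content": "<response>"},
    ],
    "outcome": "partial",        # scoring label from the protocol
    "accepted_tokens": 1731,
    "total_seconds": 10.0,
}
print(json.dumps(run_log, indent=2))
```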

A full specification is coming soon, covering methodology, a task library, a scoring protocol, and reproducibility standards. Before I lock the final weights, I'd genuinely like input on two open questions:

How should the penalty differ between a model that confidently states something false and one that's just vague enough that you had to ask a follow-up? And should hardware normalization live in the core formula or be reported separately?

Thoughts welcome.

submitted by /u/axendo