Beyond Test-Time Compute Strategies: Advocating Energy-per-Token in LLM Inference
arXiv cs.CL / 2026/3/24
Key Points
- The paper argues that LLM inference can be inefficient because real-world tasks often require far less capability, allowing Small Language Models (SLMs) to achieve strong performance with lower compute.
- It analyzes how test-time compute strategies such as Chain-of-Thought prompting and Majority Voting create an energy–accuracy trade-off by increasing reasoning compute even as they may reduce the need for larger models.
- Using MMLU experiments and an analysis of transformer input–output token dynamics, it shows that hardware energy consumption scales nonlinearly with token counts, motivating inference methods that account for physical energy curves.
- The authors propose energy-aware evaluation metrics—especially Energy-per-Token—to complement accuracy benchmarks, and suggest dynamically regulating reasoning depth during CoT generation.
- The work envisions an energy-aware routing approach that jointly chooses the model and inference strategy to balance accuracy with sustainable AI deployment goals.
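The ideas above can be made concrete with a small sketch: an Energy-per-Token metric (total inference energy divided by output tokens) and a toy router that picks the cheapest model meeting an accuracy floor. This is an illustrative reading of the proposal, not the paper's implementation; all names, thresholds, and numbers are hypothetical.

```python
def energy_per_token(total_energy_joules: float, tokens_generated: int) -> float:
    """Energy-per-Token: total inference energy divided by output tokens."""
    if tokens_generated <= 0:
        raise ValueError("tokens_generated must be positive")
    return total_energy_joules / tokens_generated


def route(candidates, accuracy_floor=0.7):
    """Pick a model by energy under an accuracy constraint.

    candidates: list of (name, expected_accuracy, energy_per_token) tuples,
    with expected accuracy and per-token energy assumed to be profiled
    offline for the task at hand (hypothetical inputs).
    """
    viable = [c for c in candidates if c[1] >= accuracy_floor]
    if not viable:
        # No model clears the floor: fall back to the most accurate one.
        return max(candidates, key=lambda c: c[1])[0]
    # Among viable models, choose the lowest Energy-per-Token.
    return min(viable, key=lambda c: c[2])[0]


# Hypothetical numbers: 120 J spent generating 400 tokens.
print(energy_per_token(120.0, 400))  # 0.3 J/token

# An SLM that clears the accuracy floor wins on energy over a larger LLM.
print(route([("slm-1b", 0.72, 0.05), ("llm-70b", 0.88, 0.60)]))
```

In this sketch, the accuracy floor stands in for the paper's observation that many real-world tasks need far less capability than a frontier model provides; the router only escalates to the costlier model when the small one falls short.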