Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers
arXiv cs.AI / April 29, 2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Models & Research
Key Points
- The paper examines why running LLM inference locally on single-board computers is harder than in cloud deployments, a gap that matters for privacy-, latency-, and cost-sensitive environments such as defense and operational technology (OT).
- It argues that current edge LLM benchmarking falls short: it often uses CPU-only setups, gives single-board computers little coverage, and relies on evaluation tasks that do not measure hardware effectiveness along multiple dimensions.
- The authors propose a multi-dimensional benchmarking methodology that evaluates both inference performance and hardware efficiency across four IoT-suitable edge configurations using the latest available accelerators.
- The results show that hardware accelerators such as NPUs and GPUs shift the practical deployment trade-offs in the edge's favor, with measurements capturing power efficiency, device size, and token throughput.
- The study provides actionable guidance for deploying generative AI in privacy-sensitive and connectivity-limited scenarios, including unmanned vehicles and portable rugged operations.
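The throughput and power-efficiency metrics above reduce to simple per-run arithmetic. As a minimal sketch (not the paper's harness; the function name, device labels, and all numbers are illustrative assumptions), two of the core figures of merit can be derived from raw samples like this:

```python
# Illustrative sketch: derive tokens/s and joules-per-token from raw
# benchmark samples (token count, wall-clock time, average power draw).
# All names and numbers here are hypothetical, not results from the paper.

def summarize(tokens_generated: int, elapsed_s: float, avg_power_w: float):
    """Return (tokens_per_s, joules_per_token) for one inference run."""
    tokens_per_s = tokens_generated / elapsed_s
    # Energy = average power * time; divide by tokens for per-token cost.
    joules_per_token = (avg_power_w * elapsed_s) / tokens_generated
    return tokens_per_s, joules_per_token

# Hypothetical runs: (tokens generated, seconds elapsed, average watts)
runs = {
    "cpu_only": (128, 64.0, 6.0),
    "npu":      (128, 16.0, 8.0),
}
for name, sample in runs.items():
    tps, jpt = summarize(*sample)
    print(f"{name}: {tps:.1f} tok/s, {jpt:.2f} J/token")
```

Note how the two metrics can disagree: the hypothetical NPU run draws more instantaneous power than the CPU run, yet finishes so much faster that its energy per token is lower, which is exactly why a multi-dimensional view matters.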
