Benchmarking Local Language Models for Social Robots using Edge Devices

arXiv cs.RO / 5/6/2026


Key Points

  • The paper addresses a lack of systematic benchmarks for evaluating open-source local LLMs on edge devices for social-educational robots, focusing on responsiveness and privacy under tight compute constraints.
  • It benchmarks 25 models on edge hardware (primarily the Raspberry Pi 4, with checks on the Raspberry Pi 5 and a laptop GPU) across three evaluation dimensions: inference efficiency (see the metric sketch after this list), general knowledge (MMLU subset), and teaching effectiveness (LLM-rated quality validated by human raters).
  • Results show large model-to-model trade-offs: inference throughput and energy efficiency vary by more than an order of magnitude, MMLU accuracy ranges from near-random to 57.2%, and teaching effectiveness does not monotonically track either efficiency or knowledge scores.
  • Granite4 Tiny Hybrid (7B) is identified as a strong overall choice, balancing efficiency and knowledge while achieving high teaching-relevant performance, and human validation largely confirms the automated ranking.
  • The authors use the findings to propose a three-tier local inference architecture for the Robot Study Companion (RSC) to better balance latency, accuracy, and compute limits on resource-constrained hardware.
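
To make the efficiency dimension concrete, here is a minimal Python sketch of the two metrics the paper reports, tokens per second and tokens per joule. The `InferenceRun` container and the power-measurement source are illustrative assumptions, not the authors' tooling; only the metric definitions follow from the abstract.

```python
from dataclasses import dataclass

@dataclass
class InferenceRun:
    tokens_generated: int    # completion tokens produced in one run
    wall_seconds: float      # end-to-end generation time
    avg_power_watts: float   # mean power draw during generation (e.g., from a USB power meter)

def tokens_per_second(run: InferenceRun) -> float:
    return run.tokens_generated / run.wall_seconds

def tokens_per_joule(run: InferenceRun) -> float:
    # energy (J) = mean power (W) * elapsed time (s)
    energy_joules = run.avg_power_watts * run.wall_seconds
    return run.tokens_generated / energy_joules

# Illustrative numbers only, chosen to land near the reported Granite4 figures.
run = InferenceRun(tokens_generated=128, wall_seconds=51.2, avg_power_watts=2.8)
print(f"{tokens_per_second(run):.2f} tok/s, {tokens_per_joule(run):.2f} tok/J")
```

With these example numbers, 128 tokens over 51.2 s at 2.8 W works out to 2.50 tok/s and 0.89 tok/J, close to the 2.5 tok/s and 0.90 tok/J reported for Granite4 Tiny Hybrid.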

Abstract

Social-educational robots designed for socially interactive pedagogical support, such as the Robot Study Companion (RSC), rely on responsive, privacy-preserving interaction despite severely limited compute. However, systematic benchmarking of language models for edge computing in pedagogical applications remains lacking. This paper benchmarks 25 open-source language models for local deployment on edge hardware. We evaluate each model along three dimensions: inference efficiency (tokens per second, energy consumption), general knowledge (a six-category MMLU subset), and teaching effectiveness (LLM-rated pedagogical quality validated against five independent human raters). The Raspberry Pi 4 (RPi4) serves as the primary platform, with additional comparisons on the RPi5 and a laptop GPU. Results reveal pronounced trade-offs: throughput and energy efficiency vary by over an order of magnitude across models, MMLU accuracy ranges from near-random to 57.2%, and teaching effectiveness does not correlate monotonically with either metric. Among the evaluated models, Granite4 Tiny Hybrid (7B) achieves a strong overall balance, reaching 2.5 tokens per second, 0.90 tokens per joule, and 54.6% MMLU accuracy; high MMLU accuracy does not appear necessary for strong teaching scores. Human validation on four representative models preserved the automated rank ordering (Pearson r = 0.967, n = 4). Based on these findings, we propose a three-tier local inference architecture for the RSC that balances responsiveness and accuracy on resource-constrained hardware.
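
The abstract's validation step, correlating automated teaching-effectiveness scores with human ratings across the four representative models, reduces to a Pearson correlation over n = 4 model-level means. The sketch below shows that computation; the score values are placeholders, not the paper's data, and only the method (Pearson r, n = 4) is taken from the abstract.

```python
import numpy as np

# Placeholder model-level means for four representative models.
llm_scores = np.array([4.5, 3.8, 3.1, 2.2])    # automated (LLM-rated) scores
human_scores = np.array([4.4, 3.9, 2.9, 2.0])  # mean of five human raters

# Pearson correlation between the two rankings.
r = np.corrcoef(llm_scores, human_scores)[0, 1]
print(f"Pearson r = {r:.3f} (paper reports r = 0.967, n = 4)")
```

With only four points, r is a coarse check; the stronger claim in the abstract is that the human ratings preserve the automated rank ordering.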