Benchmarking Local Language Models for Social Robots using Edge Devices

arXiv cs.RO / 5/6/2026


Key Points

  • The paper addresses a lack of systematic benchmarks for evaluating open-source local LLMs on edge devices for social-educational robots, focusing on responsiveness and privacy under tight compute constraints.
  • It benchmarks 25 models on edge hardware (primarily the Raspberry Pi 4, with checks on the Raspberry Pi 5 and a laptop GPU) across three evaluation dimensions: inference efficiency (see the metric sketch after this list), general knowledge (MMLU subset), and teaching effectiveness (LLM-rated quality validated by human raters).
  • Results show large model-to-model trade-offs: inference throughput and energy efficiency vary by more than an order of magnitude, MMLU accuracy ranges from near-random to 57.2%, and teaching effectiveness does not monotonically track either efficiency or knowledge scores.
  • Granite4 Tiny Hybrid (7B) is identified as a strong overall choice, balancing efficiency and knowledge while achieving high teaching-relevant performance, and human validation largely confirms the automated ranking.
  • The authors use the findings to propose a three-tier local inference architecture for the Robot Study Companion (RSC) to better balance latency, accuracy, and compute limits on resource-constrained hardware.
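
To make the efficiency dimension concrete, here is a minimal Python sketch of the two metrics the paper reports, tokens per second and tokens per joule. The `InferenceRun` container and the power-measurement source are illustrative assumptions, not the authors' tooling; only the metric definitions follow from the abstract.

```python
from dataclasses import dataclass

@dataclass
class InferenceRun:
    tokens_generated: int    # completion tokens produced in one run
    wall_seconds: float      # end-to-end generation time
    avg_power_watts: float   # mean power draw during generation (e.g., from a USB power meter)

def tokens_per_second(run: InferenceRun) -> float:
    return run.tokens_generated / run.wall_seconds

def tokens_per_joule(run: InferenceRun) -> float:
    # energy (J) = mean power (W) * elapsed time (s)
    energy_joules = run.avg_power_watts * run.wall_seconds
    return run.tokens_generated / energy_joules

# Illustrative numbers only, chosen to land near the reported Granite4 figures.
run = InferenceRun(tokens_generated=128, wall_seconds=51.2, avg_power_watts=2.8)
print(f"{tokens_per_second(run):.2f} tok/s, {tokens_per_joule(run):.2f} tok/J")
```

With these example numbers, 128 tokens over 51.2 s at 2.8 W works out to 2.50 tok/s and 0.89 tok/J, close to the 2.5 tok/s and 0.90 tok/J reported for Granite4 Tiny Hybrid.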

Abstract

Social-educational robots designed for socially interactive pedagogical support, such as the Robot Study Companion (RSC), rely on responsive, privacy-preserving interaction despite severely limited compute. However, systematic benchmarking of language models for edge computing in pedagogical applications remains lacking. This paper benchmarks 25 open-source language models for local deployment on edge hardware. We evaluate each model along three dimensions: inference efficiency (tokens per second, energy consumption), general knowledge (a six-category MMLU subset), and teaching effectiveness (LLM-rated pedagogical quality validated against five independent human raters). The Raspberry Pi 4 (RPi4) serves as the primary platform, with additional comparisons on the RPi5 and a laptop GPU. Results reveal pronounced trade-offs: throughput and energy efficiency vary by over an order of magnitude across models, MMLU accuracy ranges from near-random to 57.2%, and teaching effectiveness does not correlate monotonically with either metric. Among the evaluated models, Granite4 Tiny Hybrid (7B) achieves a strong overall balance, reaching 2.5 tokens per second, 0.90 tokens per joule, and 54.6% MMLU accuracy; high MMLU accuracy does not appear necessary for strong teaching scores. Human validation on four representative models preserved the automated rank ordering (Pearson r = 0.967, n = 4). Based on these findings, we propose a three-tier local inference architecture for the RSC that balances responsiveness and accuracy on resource-constrained hardware.
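
The abstract's validation step, correlating automated teaching-effectiveness scores with human ratings across the four representative models, reduces to a Pearson correlation over n = 4 model-level means. The sketch below shows that computation; the score values are placeholders, not the paper's data, and only the method (Pearson r, n = 4) is taken from the abstract.

```python
import numpy as np

# Placeholder model-level means for four representative models.
llm_scores = np.array([4.5, 3.8, 3.1, 2.2])    # automated (LLM-rated) scores
human_scores = np.array([4.4, 3.9, 2.9, 2.0])  # mean of five human raters

# Pearson correlation between the two rankings.
r = np.corrcoef(llm_scores, human_scores)[0, 1]
print(f"Pearson r = {r:.3f} (paper reports r = 0.967, n = 4)")
```

With only four points, r is a coarse check; the stronger claim in the abstract is that the human ratings preserve the automated rank ordering.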