A Framework for Low-Latency, LLM-driven Multimodal Interaction on the Pepper Robot
arXiv cs.AI / 3/24/2026
Key Points
- The paper introduces an open-source Android framework for Pepper that addresses the high latency and loss of paralinguistic cues typical of cascaded STT→LLM→TTS pipelines.
- It uses end-to-end Speech-to-Speech (S2S) models to support low-latency interaction while preserving prosody and enabling adaptive intonation.
- The framework extends LLM usage with robust Function Calling, letting the LLM act as an agentic planner that coordinates navigation, gaze control, and tablet interaction (a minimal dispatch sketch follows this list).
- It integrates multimodal feedback channels, including vision, touch, and system state, to improve embodied HRI control and perception (see the state sketch after this list).
- The system is designed to run on Pepper’s tablet but is also portable to standard Android devices, easing development and experimentation independent of robot hardware.
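
To make the Function Calling point concrete, here is a minimal Kotlin sketch of how a tool call emitted by the LLM could be dispatched to robot actions. It assumes an OpenAI-style call payload (`{"name": ..., "arguments": "{...}"}`) and a hypothetical `RobotActions` interface with `navigateTo`, `setGazeTarget`, and `showOnTablet`; these names are illustrative assumptions, not the paper's actual API.

```kotlin
import org.json.JSONObject

// Hypothetical robot action interface; the framework's real API is not shown
// in the summary, so these methods are placeholders.
interface RobotActions {
    fun navigateTo(x: Double, y: Double)
    fun setGazeTarget(yawDeg: Double, pitchDeg: Double)
    fun showOnTablet(url: String)
}

// Dispatch one tool call of the common OpenAI-style shape
// {"name": "...", "arguments": "{...}"} to a robot action,
// returning a short result string to feed back to the LLM.
fun dispatchToolCall(call: JSONObject, robot: RobotActions): String {
    val name = call.getString("name")
    val args = JSONObject(call.getString("arguments"))
    return when (name) {
        "navigate_to" -> {
            robot.navigateTo(args.getDouble("x"), args.getDouble("y"))
            "ok: navigating"
        }
        "set_gaze" -> {
            robot.setGazeTarget(args.getDouble("yaw_deg"), args.getDouble("pitch_deg"))
            "ok: gaze updated"
        }
        "show_on_tablet" -> {
            robot.showOnTablet(args.getString("url"))
            "ok: tablet updated"
        }
        else -> "error: unknown tool '$name'"
    }
}
```

The error branch matters for the agentic-planner loop: returning a string rather than throwing lets the LLM see the failure and re-plan on the next turn.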
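And for the multimodal feedback point, one plausible way to expose vision, touch, and system state to a text-based planner is to fold a state snapshot into the prompt. The `RobotState` fields below are illustrative assumptions, not the framework's actual schema.

```kotlin
// Hypothetical snapshot of the feedback channels named in the key points;
// field names are illustrative, not taken from the paper.
data class RobotState(
    val personDetected: Boolean,   // vision channel
    val headTouched: Boolean,      // touch sensor channel
    val batteryPercent: Int        // system-state channel
)

// Render the snapshot as a short context line prepended to the LLM prompt,
// a simple way to make embodied feedback visible to a text-based planner.
fun stateToContext(s: RobotState): String =
    "[state] person_detected=${s.personDetected}, " +
    "head_touched=${s.headTouched}, battery=${s.batteryPercent}%"
```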