Micro Language Models Enable Instant Responses

arXiv cs.CL / April 22, 2026


Key Points

  • Edge devices like smartwatches and smart glasses struggle to run even the smallest large language models due to strict power and compute limits.
  • The paper proposes micro language models (μLMs), ultra-compact 8M–30M parameter models that generate the first 4–8 words on-device instantly.
  • A cloud model then completes the response, with a collaborative generation framework designed to hand off mid-sentence smoothly while the user perceives instant responsiveness.
  • The authors show that useful language generation still works at this extreme scale, with μLMs matching performance of several existing 70M–256M class models.
  • They provide model checkpoints and a demo, plus three error-correction methods for graceful recovery when the on-device opener is wrong.
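The handoff described above can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's implementation: `local_opener` and `cloud_continuation` are hypothetical stand-ins for the on-device μLM and the cloud model, and the fallback shown is an assumed simplification, not one of the paper's three error-correction methods.

```python
# Sketch of the local-opener / cloud-continuator handoff (assumptions only).

def local_opener(prompt: str) -> str:
    """Stand-in for the 8M-30M on-device μLM: emits the first 4-8 words
    of a contextually grounded reply with near-zero latency."""
    return "Sure, the nearest coffee shop is"

def cloud_continuation(prompt: str, opener: str) -> str:
    """Stand-in for the cloud model, reframed as a *continuator*: it gets
    the prompt plus the opener and picks up mid-sentence rather than
    starting a fresh response."""
    return opener + " two blocks north on Main Street."

def respond(prompt: str) -> str:
    opener = local_opener(prompt)              # displayed to the user instantly
    full = cloud_continuation(prompt, opener)  # arrives after cloud latency
    if full.startswith(opener):
        return full                            # seamless mid-sentence handoff
    # Hypothetical fallback when the local opener went wrong (the paper
    # proposes three structured recovery methods; this is not one of them):
    # drop the opener and show the cloud response alone.
    return full

print(respond("Where can I get coffee?"))
```

The key design point the sketch captures is that the cloud model is prompted as a continuator of the locally generated prefix, so the user sees text immediately and the multi-second cloud latency is hidden behind the opener.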

Abstract

Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B parameter language models due to power and compute constraints, yet cloud inference introduces multi-second latencies that break the illusion of a responsive assistant. We introduce micro language models (μLMs): ultra-compact models (8M-30M parameters) that instantly generate the first 4-8 words of a contextually grounded response on-device while a cloud model completes it, thus masking the cloud latency. We show that useful language generation survives at this extreme scale, with our models matching several existing 70M-256M-class models. We design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid-sentence handoffs and structured graceful recovery via three error-correction methods when the local opener goes wrong. Empirical results show that μLMs can initiate responses that larger models complete seamlessly, demonstrating that orders-of-magnitude asymmetric collaboration is achievable and unlocking responsive AI for extremely resource-constrained devices. The model checkpoint and demo are available at https://github.com/Sensente/micro_language_model_swen_project.