EmbBERT: Attention Under 2 MB Memory

arXiv cs.CL / 3/25/2026


Key Points

  • EmbBERT is a new transformer-based tiny language model designed to run on ultra-constrained edge devices with only ~2 MB of total memory.
  • The architecture combines a compact embedding layer, simplified feed-forward blocks, and an efficient attention mechanism to maintain competitive accuracy despite the extreme memory budget.
  • Experiments on TinyNLP and GLUE show EmbBERT achieves accuracy comparable to state-of-the-art models that use roughly 10x more memory, while consistently outperforming downsized BERT and MAMBA variants of similar size.
  • The model is resilient to 8-bit quantization, which reduces memory usage to ~781 kB, and the paper reports that the design scales across the sub-megabyte to tens-of-megabytes range.
  • An ablation study indicates all major components and the pre-training procedure contribute positively, and the authors release code, scripts, and checkpoints for reproducibility.
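To make the quantization numbers concrete, a back-of-the-envelope weight-storage estimate can be sketched as below. The parameter count used here is a hypothetical round figure chosen for illustration, not EmbBERT's actual count; the paper's 2 MB budget also covers more than raw weights.

```python
def model_memory_kb(num_params: int, bytes_per_param: float) -> float:
    """Estimate weight-storage memory in kB at a given numeric precision."""
    return num_params * bytes_per_param / 1024

# Hypothetical parameter count, for illustration only.
params = 800_000

fp32_kb = model_memory_kb(params, 4)  # 32-bit floats: 4 bytes per weight
int8_kb = model_memory_kb(params, 1)  # 8-bit quantized: 1 byte per weight

print(f"fp32: {fp32_kb:.0f} kB, int8: {int8_kb:.0f} kB")
```

Moving from 32-bit to 8-bit storage cuts weight memory by 4x; a model of roughly this size stored at one byte per weight lands near the ~781 kB figure the paper reports after quantization.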

Abstract

Transformer architectures based on the attention mechanism have revolutionized natural language processing (NLP), driving major breakthroughs across virtually every NLP task. However, their substantial memory and computational requirements still hinder deployment on ultra-constrained devices such as wearables and Internet-of-Things (IoT) units, where available memory is limited to just a few megabytes. To address this challenge, we introduce EmbBERT, a tiny language model (TLM) architecturally designed for extreme efficiency. The model integrates a compact embedding layer, streamlined feed-forward blocks, and an efficient attention mechanism that together enable optimal performance under strict memory budgets. Through this redesign for the extreme edge, we demonstrate that highly simplified transformer architectures remain remarkably effective under tight resource constraints. EmbBERT requires only 2 MB of total memory, yet achieves accuracy comparable to that of state-of-the-art (SotA) models that require a 10× larger memory budget. Extensive experiments on the curated TinyNLP benchmark and the GLUE suite confirm that EmbBERT achieves competitive accuracy, comparable to that of larger SotA models, and consistently outperforms downsized versions of BERT and MAMBA of similar size. Furthermore, we demonstrate the model's resilience to 8-bit quantization, which further reduces memory usage to just 781 kB, and the scalability of the EmbBERT architecture across the sub-megabyte to tens-of-megabytes range. Finally, we perform an ablation study demonstrating the positive contributions of all components and of the pre-training procedure. All code, scripts, and checkpoints are publicly released to ensure reproducibility: https://github.com/RiccardoBravin/tiny-LLM.
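The paper does not detail its efficient attention variant here, but the standard scaled dot-product attention it streamlines can be sketched in plain Python for reference. This is the textbook formulation, not EmbBERT's actual implementation; names and inputs are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Standard scaled dot-product attention on plain lists.

    Q, K, V: lists of d-dimensional vectors, one per token.
    Each output row is a softmax-weighted mix of the value vectors.
    """
    d = len(Q[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        # Weighted sum of value vectors, component by component.
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

# One query over two keys; since both value rows are identical and the
# attention weights sum to 1, the output equals that shared value row.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[5.0, 5.0], [5.0, 5.0]])
```

Memory-efficient TLM designs like EmbBERT's reduce the cost of exactly these score and mixing computations, along with the embedding and feed-forward layers around them.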