A Family of LLMs Liberated from Static Vocabularies
arXiv cs.CL / March 18, 2026
Key Points
- The paper introduces the HAT architecture, a hierarchical autoregressive transformer that converts bytes into word embeddings with an encoder, uses a backbone for autoregressive modeling, and then decodes back into bytes.
- The authors demonstrate how to reuse pretrained Llama 3.1 backbones by adapting them to handle word embeddings, creating byte-level models such as Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT.
- They also present Llama-TFree-HAT-Pretrained, a 7B model trained from scratch on nearly four trillion words.
- The HAT approach reduces the number of sequence positions required, improves text compression, and increases robustness to intra-word variations; English and German benchmarks show improvements over the original Llama 3.1.
- The authors release the models (including about 200 pre-training checkpoints) on Hugging Face.
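The bytes → word embeddings → backbone → bytes pipeline described in the first key point can be sketched in a few lines. The sketch below is purely illustrative and not the paper's implementation: HAT's word encoder and byte decoder are small transformers, which are replaced here by mean pooling and linear maps, and all widths, the whitespace word-splitting rule, and the uniform causal averaging are assumptions for the sake of a runnable toy example.

```python
# Toy sketch of the HAT three-stage pipeline (hypothetical shapes; the real
# encoder/decoder are small transformers, stubbed here with pooling/linears).
import numpy as np

rng = np.random.default_rng(0)
D = 16  # word-embedding width (assumed for illustration)
W_in = rng.standard_normal((256, D)) * 0.02   # byte-embedding table
W_out = rng.standard_normal((D, 256)) * 0.02  # byte-logit projection

def encode_words(text):
    """Stage 1: bytes -> one embedding per whitespace-delimited word."""
    words = text.split()
    embs = []
    for w in words:
        byte_embs = W_in[list(w.encode("utf-8"))]  # (n_bytes, D)
        embs.append(byte_embs.mean(axis=0))        # pool bytes into one vector
    return np.stack(embs), words                   # (n_words, D)

def backbone(word_embs):
    """Stage 2: causal mixing across word positions (stand-in for the LLM)."""
    n = len(word_embs)
    mask = np.tril(np.ones((n, n)))                # causal mask
    attn = mask / mask.sum(axis=1, keepdims=True)  # uniform causal averaging
    return attn @ word_embs

def decode_bytes(word_states):
    """Stage 3: word states -> byte logits (one decoding step shown)."""
    return word_states @ W_out                     # (n_words, 256)

text = "byte level models need no tokenizer"
embs, words = encode_words(text)
states = backbone(embs)
logits = decode_bytes(states)
# The backbone sees len(words) positions instead of len(text.encode()) bytes,
# which is the sequence-position reduction the paper reports.
print(len(words), len(text.encode("utf-8")), logits.shape)
```

Note how the backbone operates on one position per word rather than one per byte; that gap (6 positions versus 35 bytes in this toy input) is where the compression and speed gains come from.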