A Family of LLMs Liberated from Static Vocabularies
arXiv cs.CL / 3/18/2026
Key Points
- The paper introduces the HAT architecture, a hierarchical autoregressive transformer that encodes bytes into word embeddings with a byte-level encoder, models the resulting word sequence autoregressively with a backbone, and decodes predictions back into bytes (a minimal sketch follows this list).
- The authors demonstrate how to reuse pretrained Llama 3.1 backbones by adapting them to handle word embeddings, creating byte-level models such as Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT.
- They also present Llama-TFree-HAT-Pretrained, a 7B model trained from scratch on nearly 4 trillion words.
- The HAT approach reduces the number of sequence positions required, improves text compression, and increases robustness to intra-word variations; English and German benchmarks show improvements over the original Llama 3.1.
- The authors release the models (including about 200 pre-training checkpoints) on Hugging Face.
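
Conceptually, the hierarchy works in three stages: a small byte-level encoder pools each word's bytes into a single embedding, the backbone attends causally over the much shorter word sequence, and a decoder maps the backbone's states back into bytes. The sketch below illustrates that flow in PyTorch. It is not the authors' implementation: the module sizes, the mean-pooling step, and the single linear byte head are simplifying assumptions (the paper's decoder generates bytes autoregressively, conditioned on the backbone's word states).

```python
# Illustrative sketch of the hierarchical byte -> word -> byte flow.
# NOT the authors' code; sizes and pooling are assumptions for brevity.
import torch
import torch.nn as nn

class ByteEncoder(nn.Module):
    """Compresses the bytes of each word into one word embedding."""
    def __init__(self, d_model: int = 512, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_model)  # one row per byte value
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, n_words, bytes_per_word), padded per word
        b, w, n = byte_ids.shape
        x = self.encoder(self.byte_emb(byte_ids.view(b * w, n)))
        return x.mean(dim=1).view(b, w, -1)  # pool bytes -> word embedding

class HATSketch(nn.Module):
    """Byte encoder -> autoregressive word-level backbone -> byte logits."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.word_encoder = ByteEncoder(d_model)
        layer = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.byte_head = nn.Linear(d_model, 256)  # next-byte logits

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        words = self.word_encoder(byte_ids)          # (b, n_words, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(words.size(1))
        h = self.backbone(words, mask=mask)          # causal over *words*
        return self.byte_head(h)                     # (b, n_words, 256)

# Usage: a batch of 2 sequences, 3 words each, up to 8 bytes per word.
model = HATSketch()
byte_ids = torch.randint(0, 256, (2, 3, 8))
print(model(byte_ids).shape)  # torch.Size([2, 3, 256])
```

Because the backbone runs over words rather than raw bytes, the number of expensive attention positions shrinks by roughly the average word length in bytes, which is where the reduction in required sequence positions claimed above comes from.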