Token Distillation: Attention-aware Input Embeddings For New Tokens
arXiv cs.CL / 3/16/2026
📰 News · Models & Research
Key Points
- The paper identifies the limitations of static vocabularies in language models and the high cost of adding new tokens through retraining or extra modules.
- It introduces Token Distillation, a method that learns high-quality input embeddings for new tokens by distilling representations from the original tokenization (see the sketch after this list).
- The approach enables rapid initialization of new embeddings and reduces training time while maintaining strong performance.
- Experiments show Token Distillation outperforms strong baselines across a wide range of open-weight models, indicating practical benefits for adapting existing NLP systems.
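The paper's exact attention-aware objective is not reproduced here, but the core idea, fitting a single input embedding so that a frozen model behaves as it does on the new token's original multi-token spelling, can be sketched in a few lines. The snippet below is a minimal illustration assuming a Hugging Face causal LM; the model name (`gpt2`), the example token and context, the MSE loss on the final hidden state, and the optimizer settings are all illustrative assumptions, not the paper's recipe.

```python
# Hypothetical sketch of distilling an input embedding for a new token.
# Assumes a Hugging Face causal LM; all concrete choices are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any open-weight causal LM would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # the base model stays frozen throughout

new_token = " photosynthesis"  # string we want to represent as one token
sub_ids = tok(new_token, return_tensors="pt").input_ids  # original multi-token spelling
context_ids = tok("The process of", return_tensors="pt").input_ids

# Teacher pass: final hidden state after the full subword sequence.
with torch.no_grad():
    teacher_ids = torch.cat([context_ids, sub_ids], dim=1)
    teacher_h = model(teacher_ids,
                      output_hidden_states=True).hidden_states[-1][:, -1]

# Student: one trainable vector standing in for the whole subword span.
embed = model.get_input_embeddings()
ctx_embs = embed(context_ids)  # [1, T, d]; constant, no grad needed
new_emb = torch.nn.Parameter(
    torch.zeros(1, 1, embed.embedding_dim).normal_(std=0.02)
)
opt = torch.optim.Adam([new_emb], lr=1e-2)

for step in range(200):
    student_in = torch.cat([ctx_embs, new_emb], dim=1)
    student_h = model(inputs_embeds=student_in,
                      output_hidden_states=True).hidden_states[-1][:, -1]
    loss = torch.nn.functional.mse_loss(student_h, teacher_h)
    opt.zero_grad()
    loss.backward()
    opt.step()

# new_emb can now seed the new token's row in an expanded embedding matrix.
```

In practice one would presumably distill over many contexts rather than a single prompt, then write the learned vector into the resized embedding table before any further tuning; the paper's attention-aware variant shapes this objective using the model's attention behavior rather than a plain hidden-state match.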