Token Distillation: Attention-aware Input Embeddings For New Tokens
arXiv cs.CL / 3/16/2026
📰 News · Models & Research
Key Points
- The paper identifies the limitations of static vocabularies in language models and the high cost of adding new tokens through retraining or extra modules.
- It introduces Token Distillation, a method that learns high-quality input embeddings for new tokens by distilling representations from the original tokenization (see the sketch after this list).
- The approach enables rapid initialization of new embeddings and reduces training time while maintaining strong performance.
- Experiments show Token Distillation outperforms strong baselines across a wide range of open-weight models, indicating practical benefits for adapting existing NLP systems.
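The paper's exact attention-aware objective is not reproduced here, but the core idea, fitting a single input embedding so that a frozen model behaves as it does on the new token's original multi-token spelling, can be sketched in a few lines. The snippet below is a minimal illustration assuming a Hugging Face causal LM; the model name (`gpt2`), the example token and context, the MSE loss on the final hidden state, and the optimizer settings are all illustrative assumptions, not the paper's recipe.

```python
# Hypothetical sketch of distilling an input embedding for a new token.
# Assumes a Hugging Face causal LM; all concrete choices are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any open-weight causal LM would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # the base model stays frozen throughout

new_token = " photosynthesis"  # string we want to represent as one token
sub_ids = tok(new_token, return_tensors="pt").input_ids  # original multi-token spelling
context_ids = tok("The process of", return_tensors="pt").input_ids

# Teacher pass: final hidden state after the full subword sequence.
with torch.no_grad():
    teacher_ids = torch.cat([context_ids, sub_ids], dim=1)
    teacher_h = model(teacher_ids,
                      output_hidden_states=True).hidden_states[-1][:, -1]

# Student: one trainable vector standing in for the whole subword span.
embed = model.get_input_embeddings()
ctx_embs = embed(context_ids)  # [1, T, d]; constant, no grad needed
new_emb = torch.nn.Parameter(
    torch.zeros(1, 1, embed.embedding_dim).normal_(std=0.02)
)
opt = torch.optim.Adam([new_emb], lr=1e-2)

for step in range(200):
    student_in = torch.cat([ctx_embs, new_emb], dim=1)
    student_h = model(inputs_embeds=student_in,
                      output_hidden_states=True).hidden_states[-1][:, -1]
    loss = torch.nn.functional.mse_loss(student_h, teacher_h)
    opt.zero_grad()
    loss.backward()
    opt.step()

# new_emb can now seed the new token's row in an expanded embedding matrix.
```

In practice one would presumably distill over many contexts rather than a single prompt, then write the learned vector into the resized embedding table before any further tuning; the paper's attention-aware variant shapes this objective using the model's attention behavior rather than a plain hidden-state match.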