I personally feel that tokenisers are one of the least discussed aspects of LM training, especially considering how big an impact they have. We discuss this in quite some detail in our new article "Reframing Tokenisers & Building Vocabulary". https://longformthoughts.substack.com/p/reframing-the-processes-of-tokenisers
Reframing Tokenisers & Building Vocabulary
Reddit r/LocalLLaMA / 4/7/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The post argues that tokenisers are a relatively under-discussed but highly influential component of language model training.
- It points readers to a Substack article titled “Reframing Tokenisers & Building Vocabulary,” positioning the piece as a deeper examination of the tokenisation process.
- The content frames tokenisation as closely tied to how vocabulary is built and represented, implying practical consequences for training quality and downstream behavior; a sketch of what vocabulary building typically involves follows this list.
- By emphasizing “reframing,” the article suggests readers reconsider common assumptions about tokenisers rather than treating them as a fixed implementation detail.
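The linked article's contents aren't reproduced in the post, so as a concrete anchor for what "building vocabulary" usually means, below is a minimal sketch of byte-pair encoding (BPE), the vocabulary-construction algorithm behind most modern tokenisers. The names (`build_vocab`, `num_merges`) and the toy corpus are illustrative assumptions, not taken from the article itself.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the (word -> frequency) corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def build_vocab(corpus, num_merges=10):
    """Learn a BPE vocabulary: start from characters, greedily merge pairs."""
    # Each word becomes a tuple of characters, weighted by its frequency.
    words = Counter(tuple(w) for w in corpus.split())
    vocab = {c for w in words for c in w}
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        vocab.add(pair[0] + pair[1])
    return vocab

if __name__ == "__main__":
    text = "low lower lowest low low newer newest"
    print(sorted(build_vocab(text, num_merges=5)))
```

On this toy corpus, frequent substrings such as "low" are merged into single tokens within the first few merges, which is exactly why the merge schedule and training corpus determine what the model sees as atomic units downstream.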
Related Articles

You can now fine-tune Gemma 4 locally (8GB VRAM) + Bug Fixes
Reddit r/LocalLLaMA

Your AI Is a Black Box Because You Didn’t Document It
Dev.to

When AI Uses Stale Government Data: Why Explicit Timestamping Becomes Necessary
Dev.to

From Chaos to Cuts: AI as Your Story Editor
Dev.to

Training a 1.1B SLM at home
Reddit r/LocalLLaMA