What Is a Token
An LLM processes text in units called tokens rather than words. A token is often smaller than a word: frequent words are usually a single token, while rare words and symbols are split into several tokens.
Example (GPT-4-family tokenizer; counts are approximate)
- "hello" → 1 token
- "electricity" → 1 token
- "prestidigitation" → 4 tokens ("pre", "stid", "ig", "itation")
- "こんにちは" → 3-4 tokens (Japanese is tokenized at roughly character granularity)
- "AI" → 1 token
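To make the splitting behaviour above concrete, here is a toy sketch. It is not how a production tokenizer works: real tokenizers learn their vocabulary from data (for example with byte-pair encoding), whereas this example uses a tiny hand-written vocabulary and a greedy longest-match rule. The names `TOY_VOCAB` and `toy_tokenize` are hypothetical and exist only for illustration.

```python
from __future__ import annotations

# Hypothetical toy vocabulary: a few whole frequent words plus some subword pieces.
TOY_VOCAB = {"hello", "electricity", "AI", "pre", "stid", "ig", "itation"}


def toy_tokenize(text: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Split text by greedy longest match against vocab.

    Characters that match no vocabulary entry become single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until an entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched at this position: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


if __name__ == "__main__":
    for word in ["hello", "AI", "prestidigitation"]:
        pieces = toy_tokenize(word)
        print(word, len(pieces), pieces)
```

A word that is in the vocabulary comes back as one token, while a rare word is covered by several smaller pieces, which mirrors the pattern in the list above.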
Token-Count Guide
| Language | Approx. text per token |
|---|---|
| English | ~0.75 words (~4 characters) |
| Japanese | ~0.5-1 character |
| Chinese | ~0.5-1 character |
| Code | ~3-5 characters |
For text with the same meaning, Japanese typically consumes about 1.5-2x as many tokens as English, so account for this when estimating costs.
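For rough budgeting, the guide ratios can be turned into a small estimator. The sketch below is a heuristic only: it assumes the midpoint of each characters-per-token range from the table, and the names `CHARS_PER_TOKEN`, `estimate_tokens`, and `estimate_cost`, as well as the price used in the usage lines, are hypothetical. For exact counts, run the actual tokenizer of the model you are using.

```python
import math

# Assumed characters per token: midpoints of the ranges in the guide table.
CHARS_PER_TOKEN = {
    "english": 4.0,
    "japanese": 0.75,
    "chinese": 0.75,
    "code": 4.0,
}


def estimate_tokens(text: str, kind: str) -> int:
    """Estimate the token count of text from its character count (rounded up)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN[kind])


def estimate_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Estimate cost, given a price per one million tokens (a placeholder value)."""
    return tokens * price_per_million_tokens / 1_000_000


if __name__ == "__main__":
    english = "a" * 400   # stand-in for 400 characters of English text
    japanese = "あ" * 75   # stand-in for 75 characters of Japanese text
    for text, kind in [(english, "english"), (japanese, "japanese")]:
        n = estimate_tokens(text, kind)
        # The price of 2.0 per million tokens is a made-up example value.
        print(kind, n, estimate_cost(n, price_per_million_tokens=2.0))
```

Because the estimate depends only on character counts, it is best treated as an order-of-magnitude check before measuring with a real tokenizer.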




