3 Concepts to Pin Down Before LLM Development
Understanding tokens, context, and pricing before you call an LLM API makes both coding and cost estimation go more smoothly.
Tokens
A token is the smallest unit an LLM processes. Tokens are not whole words: they also cover word fragments, punctuation, and symbols. API usage is billed by token count.
- English: 1 word ≈ 1.3 tokens (1 token ≈ 4 characters)
- Japanese: 1 character ≈ 1-2 tokens (varies between kana and kanji); roughly 1.5-2x the English token count for the same meaning
- Code: symbols and whitespace also count as tokens
- Images, audio, and video are also converted to tokens, but under separate counting rules that vary by API
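The ratios above can be turned into a quick estimator for rough budgeting. The following is a minimal sketch, not a tokenizer: the function name `estimate_tokens`, the character ranges used to detect Japanese, and the decision to ignore punctuation are all assumptions made for illustration.

```python
import re

# Ratios taken from the list above; rough averages, not real tokenizer output.
TOKENS_PER_ENGLISH_WORD = 1.3
TOKENS_PER_JAPANESE_CHAR = (1.0, 2.0)  # (low, high)

# Assumption: hiragana, katakana, and the common kanji block stand in for "Japanese".
_JAPANESE_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")
_ENGLISH_WORD = re.compile(r"[A-Za-z0-9']+")


def estimate_tokens(text: str) -> tuple[int, int]:
    """Return a rough (low, high) token estimate for English/Japanese text.

    Punctuation and whitespace are ignored, so real counts will usually
    be somewhat higher, especially for source code.
    """
    japanese_chars = len(_JAPANESE_CHAR.findall(text))
    english_words = len(_ENGLISH_WORD.findall(text))

    english_tokens = english_words * TOKENS_PER_ENGLISH_WORD
    low = english_tokens + japanese_chars * TOKENS_PER_JAPANESE_CHAR[0]
    high = english_tokens + japanese_chars * TOKENS_PER_JAPANESE_CHAR[1]
    return round(low), round(high)
```

For example, `estimate_tokens("こんにちは")` returns `(5, 10)` and `estimate_tokens("The quick brown fox")` returns `(5, 5)`. Treat the result as a planning figure only; an actual tokenizer is the authority.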
How to Check
- OpenAI: the tiktoken library or the web-based Tokenizer tool
- Anthropic: the count_tokens API (token ratios are roughly in line with the English figures above)
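For the OpenAI side, an exact count can be obtained locally with tiktoken. This is a hedged sketch: the helper name `count_openai_tokens` is hypothetical, `cl100k_base` is only one of several encodings and may not match the model you use, and the function returns `None` when tiktoken is not installed or its encoding data cannot be loaded (for example, when offline).

```python
from typing import Optional


def count_openai_tokens(text: str, encoding_name: str = "cl100k_base") -> Optional[int]:
    """Count tokens with tiktoken, or return None if it is unavailable.

    Assumption: the chosen encoding matches the target model. Check the
    provider's documentation for the encoding your model actually uses.
    """
    try:
        import tiktoken  # third-party: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
    except Exception:
        # tiktoken is missing, or the encoding file could not be loaded.
        return None
    return len(encoding.encode(text))


if __name__ == "__main__":
    print(count_openai_tokens("Understanding tokens makes cost estimation easier."))
```

Counting for Anthropic models goes through a network call that needs an API key, so it is not sketched here.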
Ballpark Figures
| Text (Japanese) | Approx. tokens |
|---|---|
| Short greeting ("Hello" in Japanese) | 3-5 |
| 1 paragraph (500 chars) | 500-800 |
| 1 article (3000 chars) | 3,000-5,000 |
| 1 book | 80,000-150,000 |
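Because billing follows token count, the figures in the table translate directly into a cost estimate. The sketch below assumes separate per-million-token prices for input and output, as the major APIs charge; the two price constants are hypothetical placeholders, not real rates, and `estimate_cost_usd` is an illustrative name.

```python
# Hypothetical prices in USD per 1 million tokens.
# Replace with the current rates from your provider's pricing page.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price: float = INPUT_PRICE_PER_MTOK,
    output_price: float = OUTPUT_PRICE_PER_MTOK,
) -> float:
    """Estimate the cost of one request from its token counts."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Summarizing one article: about 5,000 input tokens, 500 output tokens.
    print(f"${estimate_cost_usd(5_000, 500):.4f}")
```

At these placeholder prices, summarizing one 5,000-token article into 500 tokens costs about $0.0225, so 1,000 such articles would come to roughly $22.50.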