Tokenization
The process of splitting text into subword units (tokens) that LLMs consume — a word like "tokenization" might become two or three tokens depending on the model's tokenizer.
Tokens are the atomic units of LLM input and output. Every model has its own tokenizer that splits text into pieces, typically subword units that balance vocabulary size against representational efficiency. GPT-4 averages ~4 characters per token in English; the same content in other languages often takes 2–3× as many tokens, and sometimes more for non-Latin scripts.
Tokenization matters for cost (you pay per token), context limits (windows are measured in tokens, roughly 200K to 1M on current models), and behavior (some tokenizers handle numbers, code, or special characters better than others). For agent builders, knowing your tokenizer helps with prompt sizing, cost forecasting, and debugging unexpected model behavior.
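The sizing and cost points above can be sketched with the ~4 characters per token rule of thumb. This is a rough heuristic, not a real tokenizer: the function names, the price of 3.0 USD per million tokens, and the 200,000-token window are all placeholder assumptions for illustration. For real budgeting, count tokens with the tokenizer your model vendor ships.

```python
import math

CHARS_PER_TOKEN = 4.0  # rough English average; an approximation, not exact


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Rough token estimate from character count (heuristic only)."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost_usd(tokens, usd_per_million_tokens):
    """Cost of `tokens` tokens at a hypothetical per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


def fits_context(prompt_tokens, reserved_output_tokens, context_window):
    """True if the prompt plus the reserved output budget fits in the window."""
    return prompt_tokens + reserved_output_tokens <= context_window


prompt = "Summarize the following report in three bullet points. " * 200
tokens = estimate_tokens(prompt)
print("estimated tokens:", tokens)
# Placeholder price; substitute your vendor's actual rate.
print("estimated cost (USD):", estimate_cost_usd(tokens, usd_per_million_tokens=3.0))
# Placeholder window size and output reservation.
print("fits:", fits_context(tokens, reserved_output_tokens=1_000, context_window=200_000))
```

Reserving an output budget before checking the window matters because the context limit covers input and output tokens together.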
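In 2026 most major models use Byte Pair Encoding (BPE) or tokenizers built with SentencePiece, a library that implements both BPE and unigram language-model tokenization. Specialized tokenizers exist for code (CodeBPE) and multilingual content. The choice is usually fixed by the model vendor, so in practice you work with whatever tokenizer ships with the model.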
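To make BPE concrete, here is a toy character-level sketch of the merge loop: count adjacent symbol pairs, merge the most frequent pair into one symbol, and repeat. It is illustrative only. Production tokenizers typically work on bytes, pre-split text with regular expressions, and learn tens of thousands of merges; the function names, the tiny corpus, and the tie-breaking rule are assumptions of this example.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    words = dict(Counter(tuple(w) for w in corpus))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair({symbols: 1}, pair)))
    return list(symbols)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, _ = train_bpe(corpus, num_merges=4)
print(merges)
print(encode("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the toy corpus, yet it encodes into two learned subwords. That is the property that lets subword vocabularies cover unseen words without an unknown-token fallback.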
Frequently asked
How many tokens is a typical sentence?
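In English, a typical 100-word paragraph is ~130 tokens, and a 1,000-word document is ~1,300 tokens. At that rate, a 15–20 word sentence comes to roughly 20–26 tokens. Code uses more tokens per character than prose: JSON or terse code can run 0.5–1 token per character, versus about 0.25 for ordinary English text.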
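The figures in this answer translate into a quick back-of-the-envelope calculator. This is a sketch under the stated ratios only (about 1.3 tokens per English word, 0.5–1 token per character for terse code); the function names are made up for this example and real counts vary by tokenizer.

```python
import math

TOKENS_PER_WORD = 1.3  # from ~130 tokens per 100 English words


def estimate_prose_tokens(text):
    """Estimate tokens for English prose from its whitespace-separated word count."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def estimate_code_token_range(source):
    """Return a (low, high) token estimate for terse code or JSON,
    using 0.5 to 1 token per character."""
    n = len(source)
    return (math.ceil(0.5 * n), n)


paragraph = "word " * 100
print(estimate_prose_tokens(paragraph))

snippet = '{"id":7,"ok":true}'
print(estimate_code_token_range(snippet))
```

Returning a range for code rather than a single number reflects how much the ratio depends on the tokenizer and on how much punctuation the snippet contains.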
Why do non-English languages cost more in tokens?
Tokenizer vocabularies are usually learned from corpora dominated by English text, so common English words tokenize efficiently (1–2 tokens each). Languages with different scripts (Chinese, Japanese, Arabic) get fewer dedicated vocabulary entries and often need 2–4× more tokens for the same content, which drives up cost.
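One contributing factor can be checked without any tokenizer. This goes slightly beyond what the answer above states and applies to byte-level tokenizers (GPT-2-style byte-level BPE, for example), which start from UTF-8 bytes before merging. Non-Latin scripts need more bytes per character, so text that has few learned merges begins from a longer sequence. The sample strings and the function name below are assumptions of this sketch, and it measures bytes, not actual token counts.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes used per character of `text`."""
    return len(text.encode("utf-8")) / len(text)


# Short greetings chosen only as illustrative samples.
samples = {
    "English": "hello",
    "Arabic": "مرحبا",
    "Chinese": "你好",
    "Japanese": "こんにちは",
}

for language, text in samples.items():
    print(language, len(text), "chars,", utf8_bytes_per_char(text), "bytes/char")
```

The English sample uses 1 byte per character, the Arabic sample 2, and the Chinese and Japanese samples 3. A tokenizer with many learned merges for a script closes much of that gap; one trained mostly on English does not.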