Tokenization
The process of converting raw text into numerical tokens that neural networks can understand.

The First Step in Every LLM
Before any language model can process your text, it needs to convert human-readable words into numbers. This conversion — tokenization — is the very first operation in every LLM pipeline, from ChatGPT to Claude to Gemini.
Understanding tokenization explains why models sometimes struggle with spelling, math, and rare words. It's also why "token count" matters for pricing and context windows.
Every major LLM uses a tokenizer: GPT-4, Claude, Gemini, Llama, and Mistral all do.
Common words map to a single token; rare words split into multiple tokens (e.g. "walking" → "walk" + "ing").
Token count affects cost, speed, and context limits: efficient prompts use fewer tokens and cost less.
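To see how token count translates into cost, a rough sketch like the one below works. It assumes the open-source tiktoken package (OpenAI's BPE tokenizer library) and a made-up per-token price, since real rates vary by model and provider.

```python
# Rough sketch: count tokens in a prompt and estimate its cost.
# Assumes the `tiktoken` package (pip install tiktoken); the price
# below is a hypothetical placeholder, not a real rate.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-style BPE vocabulary

prompt = "Summarize the following article in three bullet points."
n_tokens = len(enc.encode(prompt))

PRICE_PER_1K_TOKENS = 0.01  # hypothetical rate, for illustration only
print(f"{n_tokens} tokens, estimated cost ${n_tokens / 1000 * PRICE_PER_1K_TOKENS:.5f}")
```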
Try It: Watch Text Become Tokens
Run any text through a tokenizer and you can watch it split into subword tokens: common words map to single tokens, while rare words break into several pieces, each with its own numeric ID.
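If you want to try this yourself, a few lines of Python will do it. The sketch below again assumes the tiktoken package; the IDs and splits you see come from its cl100k_base vocabulary, so other tokenizers will slice the same text differently.

```python
# Sketch: print each token and its ID for a piece of text.
# Assumes `tiktoken`; splits and IDs are specific to cl100k_base.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "The defenestration was unexpected."
for tid in enc.encode(text):
    piece = enc.decode_single_token_bytes(tid).decode("utf-8", errors="replace")
    print(f"{tid:>7}  {piece!r}")
```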
How Byte-Pair Encoding Works
The most common tokenization algorithm is Byte-Pair Encoding (BPE). It builds a vocabulary by starting from individual characters (or raw bytes) and repeatedly merging the most frequent adjacent pair of symbols into a new, longer token, until the vocabulary reaches a target size.
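The toy implementation below shows the core training loop under simplifying assumptions: it works on characters rather than bytes, uses a tiny made-up corpus, and runs only a handful of merges. Real tokenizers apply the same idea to gigabytes of text.

```python
# Toy sketch of BPE training: repeatedly find the most frequent
# adjacent pair of symbols and merge it into a new symbol.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # merge the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny made-up corpus: word -> frequency, each word starts as a tuple of characters.
corpus = {"walking": 10, "walked": 8, "talking": 6, "talked": 5}
words = {tuple(w): f for w, f in corpus.items()}

for step in range(6):  # a handful of merges for illustration
    pair = get_pair_counts(words).most_common(1)[0][0]
    words = merge_pair(words, pair)
    print(f"merge {step + 1}: {pair} -> {''.join(pair)}")

print("final segmentations:", list(words.keys()))
```

Run it and pieces like "walk" and "ing" emerge within a few merges, which is why BPE often ends up mirroring the morphology of common words.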
Real-World Examples
GPT-4: uses a vocabulary of roughly 100K tokens; "walking" → ["walk", "ing"]; API pricing is per token.
Other frontier models (e.g. Claude, Gemini): a similar BPE approach; context windows are measured in tokens, not characters.
Numbers: "12345" might tokenize as "123" + "45", so the model never sees it as a single number (see the sketch below).
Test Your Understanding
Q1. What does tokenization convert text into?
Q2. In BPE, what happens to common words like "the"?
Q3. Why might an LLM struggle with the word "defenestration"?
Q4. What is the main algorithm used for tokenization in modern LLMs?