Tokenization
The process of converting raw text into numerical tokens that neural networks can understand.

The First Step in Every LLM
Before any language model can process your text, it needs to convert human-readable words into numbers. This conversion — tokenization — is the very first operation in every LLM pipeline, from ChatGPT to Claude to Gemini.
Understanding tokenization explains why models sometimes struggle with spelling, math, and rare words. It's also why "token count" matters for pricing and context windows.
Every major LLM uses a tokenizer: GPT-4, Claude, Gemini, Llama, and Mistral all do.
Common words map to a single token; rare words split into multiple tokens (e.g. "walking" → "walk" + "ing").
Token count affects cost, speed, and context limits: efficient prompts use fewer tokens and cost less.
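To see how token count translates into cost, a rough sketch like the one below works. It assumes the open-source tiktoken package (OpenAI's BPE tokenizer library) and a made-up per-token price, since real rates vary by model and provider.

```python
# Rough sketch: count tokens in a prompt and estimate its cost.
# Assumes the `tiktoken` package (pip install tiktoken); the price
# below is a hypothetical placeholder, not a real rate.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-style BPE vocabulary

prompt = "Summarize the following article in three bullet points."
n_tokens = len(enc.encode(prompt))

PRICE_PER_1K_TOKENS = 0.01  # hypothetical rate, for illustration only
print(f"{n_tokens} tokens, estimated cost ${n_tokens / 1000 * PRICE_PER_1K_TOKENS:.5f}")
```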
Try It: Watch Text Become Tokens
Run any text through a tokenizer and you can watch it split into subword tokens: common words map to single tokens, while rare words break into several pieces, each with its own numeric ID.
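If you want to try this yourself, a few lines of Python will do it. The sketch below again assumes the tiktoken package; the IDs and splits you see come from its cl100k_base vocabulary, so other tokenizers will slice the same text differently.

```python
# Sketch: print each token and its ID for a piece of text.
# Assumes `tiktoken`; splits and IDs are specific to cl100k_base.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "The defenestration was unexpected."
for tid in enc.encode(text):
    piece = enc.decode_single_token_bytes(tid).decode("utf-8", errors="replace")
    print(f"{tid:>7}  {piece!r}")
```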
How Byte-Pair Encoding Works
The most common tokenization algorithm is Byte-Pair Encoding (BPE). It builds a vocabulary by starting from individual characters (or raw bytes) and repeatedly merging the most frequent adjacent pair of symbols into a new, longer token, until the vocabulary reaches a target size.
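The toy implementation below shows the core training loop under simplifying assumptions: it works on characters rather than bytes, uses a tiny made-up corpus, and runs only a handful of merges. Real tokenizers apply the same idea to gigabytes of text.

```python
# Toy sketch of BPE training: repeatedly find the most frequent
# adjacent pair of symbols and merge it into a new symbol.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # merge the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny made-up corpus: word -> frequency, each word starts as a tuple of characters.
corpus = {"walking": 10, "walked": 8, "talking": 6, "talked": 5}
words = {tuple(w): f for w, f in corpus.items()}

for step in range(6):  # a handful of merges for illustration
    pair = get_pair_counts(words).most_common(1)[0][0]
    words = merge_pair(words, pair)
    print(f"merge {step + 1}: {pair} -> {''.join(pair)}")

print("final segmentations:", list(words.keys()))
```

Run it and pieces like "walk" and "ing" emerge within a few merges, which is why BPE often ends up mirroring the morphology of common words.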
Real-World Examples
GPT-4: uses a vocabulary of roughly 100K tokens; "walking" → ["walk", "ing"]; API pricing is per token.
Other frontier models (e.g. Claude, Gemini): a similar BPE approach; context windows are measured in tokens, not characters.
Numbers: "12345" might tokenize as "123" + "45", so the model never sees it as a single number (see the sketch below).
Test Your Understanding
Q1. What does tokenization convert text into?
Q2. In BPE, what happens to common words like "the"?
Q3. Why might an LLM struggle with the word "defenestration"?
Q4. What is the main algorithm used for tokenization in modern LLMs?