Understanding Tokenization: Why GPT Chose the Middle Ground
The complete guide to word-level, character-level, and subword tokenization in modern AI
|
|
Every time you interact with ChatGPT, Claude, or any other large language model, your text first goes through tokenization. The decision about how to break text into pieces fundamentally shapes what these models can and cannot do.
Modern AI models do not use either of the obvious approaches. They use something in between, and understanding why reveals deep insights into the tradeoffs that define artificial intelligence.
|
The Three Fundamental Approaches
1. Word-Level Tokenization
"I love machine learning" → ["I", "love", "machine", "learning"]
Pros:
Natural, simple, fast processing
Cons:
Vocabulary explosion (170,000+ English words). Unknown words become [UNK]. No learned relationship between "run", "running", and "runs".
|
2. Character-Level Tokenization
"I love AI" → ["I", " ", "l", "o", "v", "e", " ", "A", "I"] (9 tokens)
Pros:
Tiny vocabulary (26-256 chars). No unknown tokens. Works for all languages.
Cons:
A 100-word paragraph becomes 500-600 tokens: sequences are 5-6x longer, and with attention's quadratic cost, far more computation is required.
|
3. Subword-Level (The Goldilocks Solution)
"unhappiness" → ["un", "happiness"] or ["un", "happy", "ness"] (3 tokens)
Used by GPT-3, GPT-4, Claude, LLaMA, and virtually every modern LLM.
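For the first two approaches, the split is trivial to compute in Python; subword splits, by contrast, need a learned vocabulary (see the BPE sketch further down). A minimal illustration:
text = "I love machine learning"
word_tokens = text.split()   # word-level: ['I', 'love', 'machine', 'learning']
char_tokens = list(text)     # character-level: ['I', ' ', 'l', 'o', 'v', 'e', ...]
print(len(word_tokens), len(char_tokens))   # 4 words vs. 23 characters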
|
|
Why GPT Uses Byte-Pair Encoding
How BPE Works
BPE learns its vocabulary from the training data by repeatedly merging the most frequent pairs of adjacent symbols.
|
Corpus: "low low low lower lowest"
Step 1: l o w _ l o w _ l o w _ l o w e r _ l o w e s t
Step 2: Merge "l"+"o" → lo w _ lo w _ lo w _ lo w e r _ lo w e s t
Step 3: Merge "lo"+"w" → low _ low _ low _ low e r _ low e s t
|
Continue until the vocabulary reaches ~50,000 tokens. Common words become single tokens; rare words break into subwords.
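Here is a minimal sketch of the merge loop (illustrative only, not tiktoken's actual implementation), assuming the word-end marker "_" from the example above:
from collections import Counter

corpus = ["low", "low", "low", "lower", "lowest"]
words = [list(w) + ["_"] for w in corpus]   # start from individual characters

def most_frequent_pair(words):
    # Count every adjacent symbol pair across the corpus
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    return pairs.most_common(1)[0][0]

def apply_merge(words, pair):
    # Replace every occurrence of the pair with a single merged symbol
    a, b = pair
    merged_words = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(a + b)
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged_words.append(out)
    return merged_words

for _ in range(3):                      # real tokenizers run tens of thousands of merges
    pair = most_frequent_pair(words)
    words = apply_merge(words, pair)
    print("merged", pair, "->", [" ".join(w) for w in words])
After three merges this toy corpus matches the walkthrough: "low" plus its end marker is a single symbol, while "lower" and "lowest" still break into "low" plus their suffix characters.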
|
Why BPE is Optimal
1. Vocabulary Efficiency
GPT-4 uses ~100,000 tokens. Manageable yet comprehensive.
2. No Unknown Tokens
BPE breaks unknown words into subwords and, ultimately, into raw bytes, so no input ever maps to an unknown token (see the sketch after this list).
3. Morphological Awareness
"unhappiness", "happiness", "unhappy" share "happy" subword.
4. Compression
4:1 ratio for English. 20 characters become 5 tokens.
5. Multilingual
Single vocabulary handles all languages simultaneously.
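To see point 2 in practice, here is a quick check with tiktoken (the made-up word below is just an arbitrary out-of-vocabulary example):
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("flibbertigibbetization")              # an invented word
print(ids)                                              # a few subword/byte token IDs, never [UNK]
print([enc.decode_single_token_bytes(i) for i in ids])  # the raw byte pieces behind each ID
assert enc.decode(ids) == "flibbertigibbetization"      # round-trip is lossless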
|
GPT-4 Example
|
"Understanding tokenization is crucial for AI."
→ ["Under"]["standing"][" token"]["ization"][" is"][" crucial"][" for"][" AI"]["."]
45 characters → 9 tokens (5:1 ratio)
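You can reproduce this split with tiktoken; the exact pieces you see may differ slightly from the ones shown above, but the character-to-token ratio will be similar:
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Understanding tokenization is crucial for AI."
ids = enc.encode(text)
print([enc.decode([i]) for i in ids])                  # the individual token strings
print(f"{len(text)} characters -> {len(ids)} tokens")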
|
|
|
The Hidden Problems with BPE
1. Character Blindness
"strawberry" → ["straw"]["berry"]. Model sees 2 tokens, not 10 characters. Cannot count r's. Affects spelling, character counting, word reversal.
|
2. Language Bias and Cost Inequality
English: "Hello, how are you?" → 6 tokens
Thai: "สวัสดี คุณเป็นอย่างไร" → 60 tokens (10x more)
Thai speakers pay 10x more for API calls. Fundamental fairness issue.
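A rough cost comparison with tiktoken; the exact counts depend on the encoding, but the disparity is the point:
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for label, text in [("English", "Hello, how are you?"),
                    ("Thai", "สวัสดี คุณเป็นอย่างไร")]:
    n = len(enc.encode(text))
    print(f"{label}: {n} tokens -> ${n / 1000 * 0.03:.5f} of GPT-4 input")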
|
3. Context-Dependent Tokenization
"ice cream" → ["ice"]["cream"] but "I love ice cream" → ["I"]["love"][" ice"]["cream"]. Different token IDs for "ice" and " ice".
|
|
2025 Research and Solutions
STOCHASTOK (March 2025)
Randomly varies splits during training. "strawberry" → ["s"]["traw"]["berry"] or ["straw"]["b"]["erry"]. Model learns internal character structure.
Result: Character-counting improved from 42% to 89% accuracy.
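A toy sketch of the underlying idea, occasionally re-splitting a token into smaller pieces so the model is exposed to its internal structure; this illustrates the concept only and is not STOCHASTOK's actual algorithm:
import random

def stochastic_split(token: str, p: float = 0.3) -> list[str]:
    # With probability p, cut the token at a random position and recurse
    if len(token) > 1 and random.random() < p:
        cut = random.randint(1, len(token) - 1)
        return [token[:cut]] + stochastic_split(token[cut:], p)
    return [token]

random.seed(0)
for _ in range(3):
    print(stochastic_split("strawberry"))   # a different split on (some) passes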
|
VidTok (April 2025)
Extends BPE-style merging to video and images: spatial patches are merged based on visual similarity, giving a unified vocabulary for text, images, and video and representing video with 70% fewer tokens. GPT-5 likely uses a similar approach.
|
MambaByte (Token-Free)
Processes raw bytes using state space models. No vocabulary needed. Perfect language fairness. 2.6x faster inference. Still experimental at scale.
|
Balanced Multilingual BPE
Train with balanced language data. Reduces disparity from 10x to 1.5x. Slightly less English efficiency but dramatically improved fairness.
|
|
Production Implementation
tiktoken (GPT Models)
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")   # the encoding used by GPT-4 and GPT-3.5-turbo
tokens = enc.encode("Your text")
print(len(tokens))                           # how many tokens the API will bill you for
|
|
Cost Calculation
input_tokens = len(enc.encode(prompt))       # prompt: the text you send
output_tokens = len(enc.encode(response))    # response: the text the model returns
# GPT-4 pricing: $0.03 per 1K input tokens, $0.06 per 1K output tokens
cost = (input_tokens / 1000) * 0.03 + (output_tokens / 1000) * 0.06
|
|
Token Truncation
def truncate(text, max_tokens):
    # Encode, cut to the token budget, then decode back to text
    tokens = enc.encode(text)
    return enc.decode(tokens[:max_tokens])
|
|
SentencePiece (LLaMA, Multilingual)
from transformers import AutoTokenizer
# LLaMA's tokenizer is SentencePiece-based; this repo is gated, so accept
# Meta's license on Hugging Face before downloading
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokens = tok.encode("Your text")
|
|
|
Key Takeaways
1. Tokenization is a tradeoff: word-level explodes the vocabulary, character-level explodes sequence length, and subword-level is the practical middle ground.
2. BPE learns from data: it builds its vocabulary by repeatedly merging the most frequent adjacent symbol pairs in the training corpus.
3. Real limitations exist: Character blindness, language bias, and context sensitivity create tangible problems.
4. 2025 research offers solutions: STOCHASTOK, VidTok, token-free models, and balanced vocabularies.
5. Production requires care: Always count tokens, implement truncation, monitor costs, use appropriate tokenizers.
|
|
ResearchAudio.io
Deep technical analysis of AI research for DevOps engineers
|
|