Understanding Tokenization: Why GPT Chose the Middle Ground

The complete guide to word-level, character-level, and subword tokenization in modern AI

Every time you interact with ChatGPT, Claude, or any other large language model, your text first passes through a tokenizer. How the text is broken into pieces fundamentally shapes what these models can and cannot do.

Modern AI models do not use either of the obvious approaches. They choose something in between, and understanding why reveals deep insights about the tradeoffs that define artificial intelligence.

The Three Fundamental Approaches

1. Word-Level Tokenization

"I love machine learning" → ["I", "love", "machine", "learning"]

Pros:

Natural, simple, fast processing

Cons:

Vocabulary explosion (170,000+ English words alone). Unknown words become [UNKNOWN]. No shared representation between "run", "running", and "runs".
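
A minimal sketch of word-level tokenization, using a hypothetical four-word vocabulary to show the unknown-word problem:

# Word-level tokenization: split on whitespace, look each word up in a fixed vocabulary.
# The tiny vocabulary below is a toy example, not a real model's vocabulary.
vocab = {"i", "love", "machine", "learning"}

def word_tokenize(text):
    return [w if w in vocab else "[UNKNOWN]" for w in text.lower().split()]

print(word_tokenize("I love machine learning"))  # ['i', 'love', 'machine', 'learning']
print(word_tokenize("I love tokenization"))      # ['i', 'love', '[UNKNOWN]'] -- the unseen word is lost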

2. Character-Level Tokenization

"I love AI" → ["I", " ", "l", "o", "v", "e", " ", "A", "I"] (9 tokens)

Pros:

Tiny vocabulary (26-256 chars). No unknown tokens. Works for all languages.

Cons:

A 100-word paragraph becomes 500-600 tokens. Sequences are 5-6x longer, and attention cost grows quadratically with sequence length.
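
A character-level tokenizer is a one-liner, which is part of its appeal; the cost shows up entirely in sequence length. A quick sketch:

# Character-level tokenization: every character, including spaces, becomes a token.
def char_tokenize(text):
    return list(text)

print(char_tokenize("I love AI"))        # ['I', ' ', 'l', 'o', 'v', 'e', ' ', 'A', 'I']
print(len(char_tokenize("I love AI")))   # 9 tokens for a 9-character string

# Sequence length scales with characters, not words:
paragraph = "word " * 100                # a 100-word paragraph
print(len(char_tokenize(paragraph)))     # 500 tokens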

3. Subword-Level (The Goldilocks Solution)

"unhappiness" → ["un", "happiness"] or ["un", "happy", "ness"] (3 tokens)

Used by GPT-3, GPT-4, Claude, LLaMA, and virtually every modern LLM.

Why GPT Uses Byte-Pair Encoding

How BPE Works

BPE builds its vocabulary from training data by repeatedly merging the most frequent adjacent character pairs.

Corpus: "low low low lower lowest"
Step 1: l o w _   l o w _   l o w _   l o w e r _   l o w e s t _
Step 2: Merge "l"+"o" → lo w _   lo w _   lo w _   lo w e r _   lo w e s t _
Step 3: Merge "lo"+"w" → low _   low _   low _   low e r _   low e s t _

Continue until vocabulary reaches ~50,000 tokens. Common words become single tokens, rare words break into subwords.
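
The merge loop can be reproduced in a few dozen lines of Python. This is a simplified training sketch on the toy corpus above; real tokenizers such as tiktoken and SentencePiece add byte-level fallback, pre-tokenization rules, and train on vastly larger corpora:

from collections import Counter

# Represent each word as a tuple of symbols; "_" marks the end of a word.
corpus = "low low low lower lowest".split()
words = dict(Counter(tuple(w) + ("_",) for w in corpus))

def pair_counts(words):
    # Count how often each adjacent symbol pair appears, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge(pair, words):
    # Replace every occurrence of the chosen pair with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for step in range(4):
    pairs = pair_counts(words)
    best = max(pairs, key=pairs.get)
    words = merge(best, words)
    print(f"Merge {step + 1}: {best[0]} + {best[1]} -> {best[0] + best[1]}")
# Merge 1: l + o -> lo
# Merge 2: lo + w -> low
# ...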

Why BPE Works So Well

1. Vocabulary Efficiency

GPT-4's vocabulary (cl100k_base) contains roughly 100,000 tokens: manageable yet comprehensive.

2. No Unknown Tokens

BPE breaks unknown words into subwords, ultimately into bytes.
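
You can check this with tiktoken (assuming it is installed): made-up words and emoji encode and decode cleanly, with no special unknown token.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Rare strings fall back to smaller subwords and ultimately to raw bytes, never to [UNKNOWN].
for text in ["flibbertigibbetization", "🦖"]:
    ids = enc.encode(text)
    print(text, "->", ids, "->", enc.decode(ids))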

3. Morphological Awareness

"unhappiness", "happiness", "unhappy" share "happy" subword.

4. Compression

4:1 ratio for English. 20 characters become 5 tokens.

5. Multilingual

Single vocabulary handles all languages simultaneously.

GPT-4 Example

"Understanding tokenization is crucial for AI."
→ ["Under"]["standing"][" token"]["ization"][" is"][" crucial"][" for"][" AI"]["."]
45 characters → 9 tokens (5:1 ratio)
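
The exact split depends on the tokenizer version, so treat the breakdown above as illustrative. You can inspect any string yourself with tiktoken by decoding the tokens one at a time:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 tokenizer
text = "Understanding tokenization is crucial for AI."
ids = enc.encode(text)

# Decode each token id individually to see the pieces the model actually works with.
print([enc.decode([i]) for i in ids])
print(f"{len(text)} characters -> {len(ids)} tokens")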

The Hidden Problems with BPE

1. Character Blindness

"strawberry" → ["straw"]["berry"]. Model sees 2 tokens, not 10 characters. Cannot count r's. Affects spelling, character counting, word reversal.

2. Language Bias and Cost Inequality

English: "Hello, how are you?" → 6 tokens

Thai: "สวัสดี คุณเป็นอย่างไร" → 60 tokens (10x more)

Thai speakers pay roughly 10x more for equivalent API calls, a fundamental fairness issue.
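
The gap is easy to measure. Exact counts vary with the tokenizer version, but a comparison with tiktoken consistently shows Thai costing several times more tokens than English for the same greeting:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, how are you?",
    "Thai": "สวัสดี คุณเป็นอย่างไร",
}

# Roughly the same greeting, very different token counts -- and API pricing is per token.
for lang, text in samples.items():
    print(f"{lang}: {len(enc.encode(text))} tokens")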

3. Context-Dependent Tokenization

"ice cream" → ["ice"]["cream"] but "I love ice cream" → ["I"]["love"][" ice"]["cream"]. Different token IDs for "ice" and " ice".

2025 Research and Solutions

STOCHASTOK (March 2025)

Randomly varies token splits during training: "strawberry" → ["s"]["traw"]["berry"] or ["straw"]["b"]["erry"]. The model learns the internal character structure of its tokens.

Result: Character-counting improved from 42% to 89% accuracy.
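
This is not the paper's implementation, just a minimal sketch of the underlying idea: with some probability, re-split a token into smaller pieces so the model is regularly exposed to its internal characters.

import random

def stochastic_split(tokens, p=0.3):
    # With probability p, break a token at a random interior position.
    out = []
    for tok in tokens:
        if len(tok) > 1 and random.random() < p:
            cut = random.randint(1, len(tok) - 1)
            out.extend([tok[:cut], tok[cut:]])
        else:
            out.append(tok)
    return out

# The same word gets different splits on different passes through the training data.
print(stochastic_split(["straw", "berry"]))
print(stochastic_split(["straw", "berry"]))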

VidTok (April 2025)

Extends BPE to video and images. Merges spatial patches based on visual similarity. Unified vocabulary for text, images, and video. Represents video with 70% fewer tokens. GPT-5 likely uses a similar approach.

MambaByte (Token-Free)

Processes raw bytes using state space models. No vocabulary needed. Perfect language fairness. 2.6x faster inference. Still experimental at scale.
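
The byte-level view is simple to picture: text becomes a sequence of integers from 0 to 255, identical machinery for every language, with no vocabulary to train.

# UTF-8 bytes are the model's input: a fixed "alphabet" of 256 values for any language.
print(list("Hello".encode("utf-8")))    # [72, 101, 108, 108, 111]
print(list("สวัสดี".encode("utf-8")))     # Thai maps to plain bytes too (just more of them)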

Balanced Multilingual BPE

Train with balanced language data. Reduces disparity from 10x to 1.5x. Slightly less English efficiency but dramatically improved fairness.

Production Implementation

tiktoken (GPT Models)

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4 and GPT-3.5-turbo
tokens = enc.encode("Your text")
print(len(tokens))  # how many tokens the model will actually see (and bill for)

Cost Calculation

input_tokens = len(enc.encode(prompt))  # prompt is your input string
cost = (input_tokens / 1000) * 0.03 + (output_tokens / 1000) * 0.06
# GPT-4 pricing: $0.03 per 1K input tokens, $0.06 per 1K output tokens
# (output_tokens is the completion length reported back by the API)

Token Truncation

def truncate(text, max_tokens):
    # Encode, cut to the token budget, then decode back to text.
    tokens = enc.encode(text)
    return enc.decode(tokens[:max_tokens])

SentencePiece (LLaMA, Multilingual)

from transformers import AutoTokenizer
# Llama 2 ships a SentencePiece-based BPE tokenizer with a 32,000-token vocabulary
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokens = tok.encode("Your text")

Key Takeaways

1. Tokenization is a tradeoff: word-level vocabularies explode, character-level sequences grow far too long, subword-level is the practical middle ground.

2. BPE learns from data: it builds its vocabulary by repeatedly merging the most frequent character pairs in the training corpus.

3. Real limitations exist: Character blindness, language bias, and context sensitivity create tangible problems.

4. 2025 research offers solutions: STOCHASTOK, VidTok, token-free models, and balanced vocabularies.

5. Production requires care: Always count tokens, implement truncation, monitor costs, use appropriate tokenizers.

ResearchAudio.io

Deep technical analysis of AI research for DevOps engineers
