Why LLMs Count Tokens, Not Words

When you send text to GPT-4 or Claude, you're not charged by the word. You're charged by the token. And if you've ever wondered why "hello" costs the same as "antidisestablishmentarianism," the answer lies in how language models actually process text.

Tokens aren't words. They're subword units — pieces of text that the model's tokenizer breaks your input into before processing. Understanding tokenization is the difference between accidentally spending $50 on a single API call and building cost-efficient LLM applications.

What Tokenization Actually Does

Before a language model can process text, it needs to convert that text into numbers. Words don't work because there are too many of them — English alone has hundreds of thousands of words, and models need to work across multiple languages.

Tokenization solves this by breaking text into smaller units. Common words like "the" or "is" become single tokens. Uncommon words get split into multiple tokens. "Antidisestablishmentarianism" might become five tokens: "anti", "dis", "establish", "ment", "arianism".

This approach balances vocabulary size with coverage. GPT-4's tokenizer has about 100,000 tokens in its vocabulary. That's enough to represent any text in any language, but small enough to be computationally manageable.
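The splitting idea above can be sketched in a few lines. This is a toy greedy longest-match tokenizer, not a real BPE implementation: production tokenizers learn their vocabulary from training data, and the tiny hand-picked vocabulary here is invented purely for illustration.

```python
# Toy greedy longest-match subword tokenizer. Real tokenizers (BPE)
# learn their vocabulary from data; this hand-picked vocabulary is
# only for illustration.
TOY_VOCAB = {"anti", "dis", "establish", "ment", "arian", "ism",
             "the", "is", "hello"}

def toy_tokenize(word: str) -> list[str]:
    """Split a word into the longest known pieces, falling back to
    single characters for anything not in the vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest possible piece first.
        for j in range(len(word), i, -1):
            if word[i:j] in TOY_VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character -> its own token
            i += 1
    return tokens

print(toy_tokenize("the"))
# ['the'] -- a common word stays one token
print(toy_tokenize("antidisestablishmentarianism"))
# ['anti', 'dis', 'establish', 'ment', 'arian', 'ism'] -- rare word splits
```

The exact pieces depend entirely on the vocabulary, which is why the same word can have different token counts under different models' tokenizers.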

Why This Matters for Costs

API pricing is per token, not per word, because tokens are what the model actually processes. Every token requires computation — attention calculations, matrix multiplications, and memory access. More tokens mean more compute, which means higher costs.
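The per-token arithmetic is simple enough to sketch. The rates below are illustrative placeholders, not real prices; actual pricing varies by model and changes over time, so check your provider's pricing page.

```python
# Rough API cost estimate from token counts. The per-token rates
# below are ILLUSTRATIVE PLACEHOLDERS, not real prices.
INPUT_RATE_PER_1K = 0.01   # hypothetical $ per 1,000 input tokens
OUTPUT_RATE_PER_1K = 0.03  # hypothetical $ per 1,000 output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the illustrative rates above."""
    return (input_tokens / 1000 * INPUT_RATE_PER_1K
            + output_tokens / 1000 * OUTPUT_RATE_PER_1K)

# A request with 2,000 input tokens and 500 output tokens:
print(f"${estimate_cost(2000, 500):.3f}")  # $0.035
```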

This creates counterintuitive pricing. With many tokenizers, "I'm happy" comes out a token shorter than "I am happy". That apostrophe can save you money, because contractions often tokenize more compactly.

Code is especially expensive. Programming languages use lots of special characters, which often tokenize poorly. A 100-word Python function might be 300 tokens, while a 100-word English paragraph might be 130 tokens.

Token efficiency isn't about writing shorter text. It's about writing text that tokenizes well.

The Language Bias

English tokenizes efficiently because most tokenizers are trained primarily on English text. The word "hello" is one token, but "你好" (hello in Chinese) typically takes two or more tokens, even though it carries the same meaning.

This creates a cost disparity. Users working in non-English languages pay more for the same amount of semantic content. A Chinese user might pay 2-3x more than an English user for equivalent text.

Model providers are aware of this and are working on more language-balanced tokenizers. For now, though, English is the most cost-efficient language for LLM APIs.

Why Not Just Count Words?

Words don't work because language is messy. Is "don't" one word or two? Is "New York" one word or two? What about "COVID-19" or "user@example.com"?

Tokens sidestep these questions by treating text as a sequence of subword units. The tokenizer learns these units from training data, so it naturally handles contractions, compound words, and special characters without explicit rules.

This flexibility is why models can handle code, URLs, and multilingual text without special preprocessing. The tokenizer just breaks everything into tokens, and the model processes those tokens the same way regardless of what they represent.

How to Estimate Token Counts

A rough rule: English text is about 4 characters per token. So a 1,000-character paragraph is roughly 250 tokens. But this varies widely based on the text type.
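The rule of thumb above is easy to encode. This is only a heuristic for typical English prose, as the next paragraph explains; the 4-characters-per-token divisor is an assumption you should adjust for code or non-English text.

```python
def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate token count from character length.

    The ~4 chars/token rule only holds for typical English prose;
    code and non-English text usually need a lower divisor.
    """
    return max(1, round(len(text) / chars_per_token))

paragraph = "x" * 1000  # stand-in for a 1,000-character paragraph
print(rough_token_estimate(paragraph))        # 250 (prose heuristic)
print(rough_token_estimate(paragraph, 1.5))   # a pessimistic divisor for code
```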

Technical writing with lots of jargon tokenizes poorly. Casual conversation tokenizes efficiently. Code is the worst — often 1-2 characters per token for languages with lots of special symbols.

The only accurate way to count tokens is to use the actual tokenizer. OpenAI provides tiktoken, a Python library that uses the same tokenizer as GPT-4. Anthropic has similar tools for Claude.

Optimizing for Token Efficiency

Use contractions: "I'm" is often cheaper than "I am". Use common words: "use" is cheaper than "utilize". Avoid unnecessary formatting: extra spaces, line breaks, and special characters all cost tokens.
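The formatting advice can be automated for plain-prose prompts. A small sketch of a whitespace squeezer; note it deliberately should not be applied to code or Markdown tables, which rely on their whitespace.

```python
import re

def squeeze_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and excess blank lines.

    Only safe for plain prose prompts -- code blocks and Markdown
    tables rely on their whitespace, so skip those.
    """
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs -> one space
    text = re.sub(r"\n{3,}", "\n\n", text)  # 3+ newlines -> one blank line
    return text.strip()

print(squeeze_whitespace("Summarize   this\n\n\n\nreport,  please."))
# Summarize this
#
# report, please.
```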

But don't sacrifice clarity for token savings. A prompt that saves 10 tokens but produces a worse output will cost you more in the long run when you have to retry the request.

The real optimization is in prompt design. A well-structured prompt that gets the right answer on the first try is cheaper than a poorly-structured prompt that requires multiple attempts, even if the poorly-structured prompt uses fewer tokens per request.

Count tokens accurately before you send them. LLM Utils Token Counter shows you exactly how your text tokenizes across different models.