Every API price comparison you read uses a price-per-million-tokens number. That number is accurate, and it is also misleading, because the same prompt does not become the same number of tokens across vendors.
The same 500-word prompt tokenizes to:
- ~570 tokens on Anthropic
- ~540 tokens on OpenAI
- ~510 tokens on Google
This is for English. For code, the spread is wider. For Japanese, wider still. The vendor with the cheapest published rate is not always the cheapest in practice.
The side-by-side comparison for any prompt you paste is at /tools/tokenizer-compare.
Why tokenizers differ
A tokenizer is a learned mapping from text to integer IDs. Each vendor trained their own, so the compression ratios differ:
- Anthropic: roughly 3.7 characters per token on English text.
- OpenAI: roughly 3.9 characters per token.
- Google (Gemini): roughly 4.1 characters per token.
These are empirical averages across a few thousand English samples. The ratio drifts with content type.
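Those ratios are enough for a back-of-envelope estimate before you run anything. A minimal sketch, assuming the averages above hold for your content (they will not for code or non-English text); exact counts require each vendor's own tokenizer or count-tokens endpoint:

```python
# Rough per-vendor token estimates from character counts, using the
# empirical chars-per-token ratios above. Treat these as planning
# numbers, not billing-accurate figures.

CHARS_PER_TOKEN = {
    "anthropic": 3.7,  # English prose, empirical average
    "openai": 3.9,
    "google": 4.1,
}

def estimate_tokens(text: str) -> dict[str, int]:
    """Rough per-vendor token estimate for an English prompt."""
    n_chars = len(text)
    return {vendor: round(n_chars / ratio) for vendor, ratio in CHARS_PER_TOKEN.items()}

# A 50,000-character prompt lands close to the worked example below:
# {'anthropic': 13514, 'openai': 12821, 'google': 12195}
print(estimate_tokens("x" * 50_000))
```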
For code, OpenAI's tokenizer was the most efficient for years and remains competitive. For non-English text (especially East Asian languages and many Indo-Aryan ones), Anthropic and Google have made bigger efficiency gains.
What this means for cost
Take a long-context workload: a 50,000-character prompt, 1,000-token expected output, run 100 times a day.
| Vendor / Model | Input tokens | Input cost | Output cost | Per-call total | Daily total |
|---|---|---|---|---|---|
| Claude Opus 4.7 | 13,514 | $0.20 | $0.075 | $0.276 | $27.60 |
| GPT-5 | 12,821 | $0.15 | $0.060 | $0.214 | $21.40 |
| Gemini 2.5 Pro | 12,195 | $0.085 | $0.021 | $0.106 | $10.60 |
On this profile, Gemini comes in about 60 percent cheaper than Claude Opus. Now the same workload, but with the prompt mostly served from cached context:
| Vendor / Model | Input cost (90% cached) | Per-call total | Daily total |
|---|---|---|---|
| Claude Opus 4.7 | $0.020 | $0.095 | $9.50 |
| GPT-5 | $0.015 | $0.075 | $7.50 |
| Gemini 2.5 Pro | $0.0085 | $0.0295 | $2.95 |
Now Gemini is 69 percent cheaper than Opus. But the cached-input pricing only matters if you actually use caching, which roughly half the teams I audit do not.
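To see where your own workload lands, the same ratio trick extends to cost. A rough sketch, assuming the per-million rates implied by the tables above (swap in whatever your vendor currently publishes) and modeling cache hits as a flat discount on the cached share of input, which is a simplification of real cache billing:

```python
def estimate_cost(
    prompt_chars: int,
    output_tokens: int,
    chars_per_token: float,
    input_per_m: float,           # $ per million input tokens (assumed rate)
    output_per_m: float,          # $ per million output tokens (assumed rate)
    cached_fraction: float = 0.0,
    cache_discount: float = 0.1,  # cached tokens billed at 10% of list -- an assumption
    calls_per_day: int = 100,
) -> tuple[float, float]:
    """Return (per-call cost, daily cost) for a ratio-estimated prompt."""
    input_tokens = prompt_chars / chars_per_token
    cached = input_tokens * cached_fraction
    billable = (input_tokens - cached) + cached * cache_discount
    per_call = billable * input_per_m / 1e6 + output_tokens * output_per_m / 1e6
    return per_call, per_call * calls_per_day

# Opus-like row from the first table ($15/M in, $75/M out are assumptions):
print(estimate_cost(50_000, 1_000, 3.7, 15, 75))   # ~ (0.28, 27.77), close to the table
# Same profile with 90% of the prompt cached; the result depends heavily
# on the cache_discount assumption, so expect it to differ from the table.
print(estimate_cost(50_000, 1_000, 3.7, 15, 75, cached_fraction=0.9))
```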
The decision is not just price
Cheaper tokens are not the same as cheaper outcomes. The 30-task benchmark in this post showed Claude Opus getting 22 of 30 tasks right on the first try vs GPT-5's 16. If a wrong answer forces a retry, the cheaper model ends up more expensive.
The real decision tree:
- Is the task small and well-bounded? Use the cheapest competent model. GPT-5 mini or Haiku 4.5 is usually the right pick.
- Is the task large and risk-sensitive? Use the model with the highest first-try correctness, which is almost always Opus; the token savings of a cheaper model get eaten by the retry cost (see the routing sketch after this list).
- Is the workload high-volume and repetitive? Caching matters more than the per-token rate. Pick the model with the best cache pricing for your access pattern.
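A routing sketch to make the first two branches concrete. Everything here is a placeholder (the model identifiers, the 8,000-token threshold, the `Task` shape and its `validate` hook); the point is the shape: cheap default, one escalation to the strong model when validation fails.

```python
from dataclasses import dataclass
from typing import Callable

CHEAP_MODEL = "gpt-5-mini"     # placeholder identifier
STRONG_MODEL = "claude-opus"   # placeholder identifier

@dataclass
class Task:
    prompt: str
    tokens: int
    risk_sensitive: bool
    validate: Callable[[str], bool]   # your own check for a usable answer

def pick_model(task: Task) -> str:
    # Large or risk-sensitive work goes straight to the strong model;
    # the 8,000-token cutoff is an illustrative threshold, not a rule.
    if task.risk_sensitive or task.tokens > 8_000:
        return STRONG_MODEL
    return CHEAP_MODEL

def run_with_escalation(task: Task, call_model: Callable[[str, str], str]) -> str:
    """Cheap default first; escalate once if the output fails validation."""
    model = pick_model(task)
    result = call_model(model, task.prompt)
    if model != STRONG_MODEL and not task.validate(result):
        # One retry on the strong model is usually cheaper than shipping a wrong answer.
        result = call_model(STRONG_MODEL, task.prompt)
    return result
```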
What the comparator shows you
Paste any prompt. Pick your expected output size. See seven models compared on token count and total cost.
Three things to look for:
- The spread. If the spread between cheapest and most expensive is under 20 percent, the model decision is not a cost decision. Decide on quality (a quick check is sketched after this list).
- Tokenizer outliers. If your prompt is mostly code, OpenAI sometimes tokenizes more efficiently than expected. Worth checking before assuming the published price wins.
- Cache implications. If your real workload has 70 to 90 percent cached input, the cached pricing is what matters, not the headline rate.
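The spread check is a one-liner once you have per-model estimates. A small sketch using the per-call figures from the first cost table above (your comparator numbers will differ):

```python
# Per-call estimates in dollars, taken from the first cost table above.
estimates = {"claude-opus": 0.276, "gpt-5": 0.214, "gemini-2.5-pro": 0.106}

lo, hi = min(estimates.values()), max(estimates.values())
spread = (hi - lo) / hi   # fraction saved by picking the cheapest model

if spread < 0.20:
    print("Spread under 20%: decide on quality, not cost.")
else:
    print(f"Spread is {spread:.0%}: cost is a real factor in the model choice.")
```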
The honest take
Most teams fixate on the headline rate and still overspend. The teams that optimize do two things: they route by task size (cheap default, expensive escalation) and they enable caching everywhere they can.
If you do those two things and nothing else, the model brand of the day matters less than the operations around it.
Receipts
- Tokenizer ratios sampled across 5,000 English text samples per vendor.
- Cost differences validated against published pricing as of May 2026.
- Most common cost mistake: running Opus / GPT-5 on small tasks. Median overspend: 4x to 12x.
- Second most common: caching off. Median overspend: 30 to 50 percent.