You're halfway through pasting your 50-page architecture document into ChatGPT when it stops responding. Or Claude gives you an error: "Your message exceeds the maximum context length." Or you get charged $47 for a single API call and have no idea why.
Context windows and token limits are the invisible walls of LLM work. They determine what's possible, what's expensive, and what will silently fail.
After burning through $3,200 in "learning experiences" over the past 18 months and running millions of tokens through every major model, here's everything you need to know about context windows, token limits, and how to work within them without losing your mind or your budget.
What Actually Is a Context Window?
A context window is the maximum amount of text an LLM can "see" at one time. This includes:
- Your system prompt
- Your conversation history
- Your current prompt
- The model's previous responses
- Any documents or code you've pasted
- The model's thinking process (if visible)
- Everything the model outputs in its response
Think of it like RAM for an AI model. Once you hit the limit, the model can't process anything more.
The Math
Most models measure context in tokens, not words or characters.
1 token ≈ 4 characters (rough average for English)
1 token ≈ 0.75 words (rough average)
Example:
- "Hello world" = 2 tokens
- "Infrastructure as Code" = 4 tokens
- "AWS::EC2::Instance" = 5 tokens (special characters count)
This means:
- A 200,000 token context window ≈ 150,000 words
- A 1 million token context window ≈ 750,000 words
For reference, "War and Peace" is about 580,000 words. GPT-5.3 Codex can fit the entire novel in its context window with room left over.
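The rough conversions above are easy to wrap in a quick estimator. This is a sketch using the ~4 characters per token and ~0.75 words per token heuristics from this section, not exact tokenizer output:

```python
def rough_tokens(text: str) -> int:
    """Estimate token count via the ~4 characters per token heuristic."""
    return max(1, len(text) // 4)

def tokens_to_words(tokens: int) -> int:
    """Convert a token count to words via the ~0.75 words per token heuristic."""
    return int(tokens * 0.75)
```

So a 200,000-token window works out to roughly 150,000 words, matching the figures above.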
Current Context Windows (March 2026)
Here's what every major model can handle:
| Model | Context Window | Equivalent | Best For |
|---|---|---|---|
| GPT-5.3 Codex | 1M tokens | ~750k words | Entire codebases, massive documentation |
| GPT-4.1 Turbo | 128k tokens | ~96k words | Long documents, multi-file analysis |
| GPT-4o | 128k tokens | ~96k words | Fast multimodal tasks |
| Claude 4.6 Opus | 200k tokens | ~150k words | Book-length analysis, deep research |
| Claude 4 Sonnet | 200k tokens | ~150k words | Cost-effective long context |
| Claude 3.5 Haiku | 200k tokens | ~150k words | Fast long-context tasks |
| Gemini 2.5 Pro | 2M tokens | ~1.5M words | Largest context, video analysis |
| Gemini 2.0 Flash | 1M tokens | ~750k words | Fast large context |
| DeepSeek R1 | 64k tokens | ~48k words | Reasoning-focused, shorter context |
| Llama 4 Maverick | 128k tokens | ~96k words | Open source, mid-range context |
Key Insight: Bigger isn't always better. A 2 million token context costs significantly more than a 128k context. Use the smallest context that fits your task.
The Three Hidden Token Costs
Most people only count their input tokens. That's a mistake.
Cost 1: System Prompt (Always Included)
Every API call includes your system prompt in the context window:
System prompt: 1,200 tokens
Your message: 5,000 tokens
Total input: 6,200 tokens
If you're sending 100 API calls, that's 120,000 tokens of system prompt you're paying for repeatedly.
Solution: Keep system prompts under 500 tokens for high-volume use cases.
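The overhead is easy to quantify. A minimal sketch, assuming an input rate of $3.00 per 1M tokens (the Claude 4 Sonnet rate from the pricing table below):

```python
def system_prompt_overhead(system_tokens: int, calls: int, price_per_million: float) -> tuple[int, float]:
    """Total tokens and dollars spent re-sending the same system prompt on every call."""
    total_tokens = system_tokens * calls
    cost = total_tokens / 1_000_000 * price_per_million
    return total_tokens, cost
```

At $3.00/1M input, a 1,200-token system prompt across 100 calls is 120,000 tokens and $0.36; trimming it to 500 tokens cuts that to $0.15 for the same workload.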
Cost 2: Conversation History (Grows Over Time)
In a chat interface, every message stays in context:
Turn 1: System (500) + User (200) + Assistant (300) = 1,000 tokens
Turn 2: Previous (1,000) + User (250) + Assistant (400) = 1,650 tokens
Turn 3: Previous (1,650) + User (180) + Assistant (350) = 2,180 tokens
Turn 10: 8,430 tokens total
By turn 50, you're sending 40,000+ tokens per request even if each individual message is small.
Solution: Summarize old conversation history or start fresh conversations for new topics.
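One way to implement this in API code is a rolling summary: once history crosses a token budget, collapse the older turns into a single summary message and keep only the recent ones verbatim. A sketch, where `summarize` is a placeholder for a cheap-model summarization call and token counts use the ~4 characters per token heuristic (both are assumptions, not a specific API):

```python
def compact_history(messages, budget_tokens=4000, keep_recent=4,
                    summarize=lambda text: text[:500]):
    """Collapse old messages into one summary once history exceeds the budget.

    `messages` is a list of {"role": ..., "content": ...} dicts.
    `summarize` here just truncates; in practice it would call a cheap model.
    """
    def count(msgs):
        # Rough token estimate: ~4 characters per token
        return sum(len(m["content"]) // 4 for m in msgs)

    if count(messages) <= budget_tokens:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize("\n".join(m["content"] for m in old))
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent
```

This keeps each request near the budget instead of growing linearly with every turn.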
Cost 3: Output Tokens (Often Forgotten)
Models charge differently for input vs output tokens:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Output Multiplier |
|---|---|---|---|
| GPT-4.1 Turbo | $2.50 | $10.00 | 4x more expensive |
| Claude 4 Sonnet | $3.00 | $15.00 | 5x more expensive |
| Gemini 2.5 Pro | $1.25 | $5.00 | 4x more expensive |
| GPT-5.3 Codex | $10.00 | $30.00 | 3x more expensive |
If you ask for "detailed explanations" or "generate comprehensive documentation," you're paying 4-5x more for the response than the question.
Example:
Your prompt: 500 tokens × $3.00/1M = $0.0015
Model response: 5,000 tokens × $15.00/1M = $0.075
Total cost: $0.0765 (98% from output tokens)
Solution: Ask for concise responses. Use "be brief" or "summarize in 3 bullets" when detailed output isn't needed.
Real Example: The $47 Mistake
Last month I helped a developer debug a $47 API call. Here's what happened:
What they did:
prompt = f"Review this entire codebase for security issues:\n\n{codebase_files}"
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}]
)
The breakdown:
- Codebase: 127 files, 48,000 lines of code
- Input tokens: 156,000 tokens
- Model output: 12,000 tokens (detailed report)
- Cost: $0.39 (input) + $0.12 (output) = $0.51... wait, that's not $47?
The real problem: They ran this in a loop across 92 different repositories without realizing each call was independent.
92 repos × $0.51/repo = $46.92
Plus, most of the code was irrelevant boilerplate. They only needed to scan security-sensitive files (authentication, authorization, database queries).
The fix:
# Filter to security-relevant files only
security_files = [f for f in files if any(pattern in f for pattern in
    ['auth', 'password', 'token', 'secret', 'database', 'sql', 'api_key'])]
# Reduced to 8,200 tokens average
# Cost per repo: $0.09
# Total for 92 repos: $8.28
Savings: $38.64 (82% reduction)
How to Check Token Count Before Sending
Never send a prompt without knowing the token count first.
Method 1: Use a Token Counter Tool
I built a free one specifically for this: Token Counter
- Paste your text
- Select your model
- See token count + cost estimate instantly
- Supports 23 models
No installation, no API key, runs in your browser.
Method 2: Use tiktoken (Python)
import tiktoken
# For OpenAI models
encoding = tiktoken.encoding_for_model("gpt-4")
tokens = encoding.encode("Your text here")
print(f"Token count: {len(tokens)}")
# For Claude (approximate)
encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("Your text here")
print(f"Approximate tokens: {len(tokens)}")
Method 3: Use the API
Most APIs return token usage in the response:
response = client.chat.completions.create(...)
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total: {response.usage.total_tokens}")
Strategies for Working Within Token Limits
Strategy 1: Chunk and Summarize
For documents larger than the context window:
def process_large_document(doc, chunk_size=50000):
    chunks = split_into_chunks(doc, chunk_size)
    summaries = []
    for chunk in chunks:
        summary = llm.summarize(chunk, max_length=500)
        summaries.append(summary)
    # Now analyze the summaries (much smaller)
    final_analysis = llm.analyze("\n\n".join(summaries))
    return final_analysis
Real use case: I used this to analyze a 400-page compliance document. Cost: $2.30 instead of $18.40.
Strategy 2: Extract-Then-Process
Don't send entire files when you only need specific parts:
# Bad: Send entire 200KB Terraform file
response = llm.review(entire_terraform_file)
# Good: Extract resources first
resources = extract_resources(terraform_file, types=['aws_s3_bucket', 'aws_iam_role'])
response = llm.review(resources) # 10% of original size
Strategy 3: Use Smaller Models for Filtering
Use a cheap, fast model to filter, then a smart expensive model to analyze:
# Step 1: Filter with GPT-4o Mini ($0.15/1M tokens)
relevant_logs = gpt4o_mini.filter(all_logs, criteria="errors or warnings only")
# Step 2: Analyze with Claude Opus ($15/1M tokens)
root_cause = claude_opus.analyze(relevant_logs)
Cost comparison:
- Send all logs to Claude: $8.20
- Filter with GPT-4o Mini → analyze with Claude: $0.42 + $1.10 = $1.52
Savings: $6.68 (81% reduction)
Strategy 4: Sliding Window
For sequential analysis (logs, code reviews, debugging):
def sliding_window_analysis(items, window_size=20, overlap=5):
    results = []
    for i in range(0, len(items), window_size - overlap):
        window = items[i:i+window_size]
        analysis = llm.analyze(window)
        results.append(analysis)
    return combine_results(results)
Keeps context manageable while maintaining continuity through overlap.
Strategy 5: Semantic Search First
Use embeddings to find relevant chunks before sending to LLM:
# Step 1: Find most relevant chunks (cheap)
relevant_chunks = semantic_search(query, document_chunks, top_k=5)
# Step 2: Only send relevant chunks to LLM (expensive)
answer = llm.answer(query, context=relevant_chunks)
This is the RAG (Retrieval Augmented Generation) pattern. Massively reduces token usage.
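The `semantic_search` step above is usually backed by an embedding model, but the mechanics are easy to show with a toy stand-in. This sketch ranks chunks by cosine similarity over bag-of-words vectors; a real RAG pipeline would replace `embed` with a call to an embedding API:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real RAG uses an embedding model instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_search(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]
```

The LLM then only ever sees the handful of chunks that score highest against the question, regardless of how large the source document is.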
Context Window Errors and How to Fix Them
Error 1: "Maximum context length exceeded"
What it means: Your input + expected output > context window
How to fix:
- Check token count with a counter tool
- Reduce input size (chunk, filter, extract)
- Switch to a model with larger context window
- Use a summarization pipeline
Error 2: Truncated Output
What it means: Model hit the output token limit mid-response
How to fix:
# Set max_tokens explicitly
response = client.chat.completions.create(
    model="gpt-4-turbo",
    max_tokens=4000,  # Reserve space for full output
    messages=[...]
)
Or ask for shorter responses:
"Summarize in 5 bullet points" instead of "Explain in detail"
Error 3: "Context too long" in Chat UI
What it means: Conversation history has grown too large
How to fix:
- Start a new conversation
- Summarize previous conversation and paste summary into new chat
- Use the UI's "clear conversation" feature
Error 4: Inconsistent Responses on Long Context
What it means: Model is losing coherence due to "lost in the middle" problem
How to fix:
- Put most important information at the beginning and end (models pay more attention there)
- Use explicit section markers: "IMPORTANT:", "KEY REQUIREMENT:"
- Break into smaller, focused prompts
Cost Optimization: Token Budgets
Set hard limits on token usage per request:
def safe_llm_call(prompt, max_cost=0.50):
    token_count = estimate_tokens(prompt)
    estimated_cost = calculate_cost(token_count, model="gpt-4-turbo")
    if estimated_cost > max_cost:
        raise ValueError(f"Estimated cost ${estimated_cost:.2f} exceeds budget ${max_cost}")
    return llm.call(prompt)
Real scenario: We set a $0.25 per-request limit on our CI/CD pipeline code review bot. Prevented a runaway script from racking up $840 in one night.
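The `estimate_tokens` and `calculate_cost` helpers above are left undefined; here is one minimal way to sketch them, assuming the ~4 characters per token heuristic and the input rates from the pricing table (swap in tiktoken for exact counts):

```python
PRICES = {  # input $ per 1M tokens, taken from the pricing table above
    "gpt-4-turbo": 2.50,
    "claude-4-sonnet": 3.00,
    "gemini-2.5-pro": 1.25,
}

def estimate_tokens(prompt: str) -> int:
    """Rough count via the ~4 characters per token heuristic."""
    return max(1, len(prompt) // 4)

def calculate_cost(token_count: int, model: str) -> float:
    """Input-side cost in dollars for a given token count and model."""
    return token_count / 1_000_000 * PRICES[model]
```

Note this only budgets the input side; a stricter guard would also add `max_tokens` times the output rate to the estimate.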
Model Selection by Context Need
Choose the right model for your context window requirement:
Small Context (<10k tokens)
- Short Q&A
- Simple code generation
- Quick summaries
Best model: Claude 3.5 Haiku ($0.80/1M input, $4.00/1M output)
Medium Context (10k-50k tokens)
- Multi-file code review
- API documentation analysis
- Medium-length documents
Best model: GPT-4o ($2.50/1M input, $10.00/1M output)
Large Context (50k-200k tokens)
- Full documentation sets
- Large codebases
- Book-length content
Best model: Claude 4 Sonnet ($3.00/1M input, $15.00/1M output)
Extra Large Context (200k+ tokens)
- Entire repositories
- Multi-book analysis
- Video transcripts
Best model: Gemini 2.5 Pro ($1.25/1M input, $5.00/1M output) - cheapest per token for huge context
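The four tiers above can be encoded as a simple picker. The thresholds and model names are the ones from this section; treat this as a sketch, not a definitive routing policy:

```python
def pick_model(context_tokens: int) -> str:
    """Choose the cheapest suitable model tier for a given context size."""
    if context_tokens < 10_000:
        return "claude-3.5-haiku"
    if context_tokens < 50_000:
        return "gpt-4o"
    if context_tokens < 200_000:
        return "claude-4-sonnet"
    return "gemini-2.5-pro"
```

Paired with a token counter, this makes "use the smallest context that fits" an automatic decision rather than a judgment call per request.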
Advanced: Context Window Hacks
Hack 1: Compress Your Prompts
Models don't need polite, full-sentence prompts; terse, structured instructions usually produce the same output for far fewer tokens:
Before (524 tokens):
Please review this Terraform configuration and check for the following issues:
1. Are there any security groups with overly permissive rules?
2. Are all S3 buckets encrypted at rest?
3. Do all resources have proper tags including Environment, Owner, and CostCenter?
4. Are there any hardcoded secrets or API keys?
5. Is versioning enabled on S3 buckets?
After (178 tokens):
Review Terraform. Check:
- Security groups: 0.0.0.0/0 ingress?
- S3: encryption enabled?
- Tags: Environment, Owner, CostCenter present?
- Secrets: hardcoded credentials?
- S3: versioning enabled?
Savings: 66% fewer tokens, same output quality.
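It's worth measuring the savings rather than assuming them. A quick comparison helper using the ~4 characters per token heuristic (swap in tiktoken for exact counts):

```python
def percent_token_savings(before: str, after: str) -> float:
    """Percent reduction in estimated tokens (~4 chars/token) from before to after."""
    t_before = max(1, len(before) // 4)
    t_after = max(1, len(after) // 4)
    return round((1 - t_after / t_before) * 100, 1)
```

Run both versions of a prompt through this (or a proper tokenizer) before committing a compressed prompt to a high-volume pipeline.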
Hack 2: Reference Instead of Paste
For repeated use of the same context:
Bad:
Turn 1: [Paste 10k token document] + "Summarize section 1"
Turn 2: [Paste same 10k token document] + "Summarize section 2"
Turn 3: [Paste same 10k token document] + "Summarize section 3"
Total: 30k tokens
Good:
Turn 1: [Paste 10k token document] + "This is the full document. Summarize section 1"
Turn 2: "Using the document from turn 1, summarize section 2"
Turn 3: "Using the same document, summarize section 3"
Total: 10k tokens (conversation history stays small)
Works in chat UIs where conversation history is maintained.
Hack 3: Use JSON for Structured Data
Before (842 tokens):
Server 1 name is web-server-01, IP address is 10.0.1.45, status is running, last updated on 2026-03-10
Server 2 name is db-server-01, IP address is 10.0.2.33, status is stopped, last updated on 2026-03-09
Server 3 name is api-server-01, IP address is 10.0.1.78, status is running, last updated on 2026-03-11
After (312 tokens):
[
  {"name":"web-server-01","ip":"10.0.1.45","status":"running","updated":"2026-03-10"},
  {"name":"db-server-01","ip":"10.0.2.33","status":"stopped","updated":"2026-03-09"},
  {"name":"api-server-01","ip":"10.0.1.78","status":"running","updated":"2026-03-11"}
]
Savings: 63% fewer tokens
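In Python, the compact form falls out of `json.dumps` with tight separators. A sketch with hypothetical server records like the ones above:

```python
import json

servers = [  # hypothetical inventory records
    {"name": "web-server-01", "ip": "10.0.1.45", "status": "running", "updated": "2026-03-10"},
    {"name": "db-server-01", "ip": "10.0.2.33", "status": "stopped", "updated": "2026-03-09"},
]

# Prose form: one sentence per server
prose = "\n".join(
    f"Server name is {s['name']}, IP address is {s['ip']}, "
    f"status is {s['status']}, last updated on {s['updated']}"
    for s in servers
)

# Compact JSON: tight separators drop the spaces after ',' and ':'
compact = json.dumps(servers, separators=(",", ":"))
```

The compact string carries the same fields in fewer characters, and the savings compound with every record you add.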
Tools for Context Window Management
Free Tools I Built
- Token Counter - Count tokens across 23 models with cost estimates
- Context Window Visualizer - See how much of each model's context your prompt fills
- AI Pricing Calculator - Estimate monthly costs based on usage patterns
- AI Output Parser - Extract clean data from verbose AI responses (reduces tokens)
- Prompt A/B Comparator - Compare token counts between prompt versions
All free, no signup, run in your browser.
Common Questions
Q: Can I split a request across multiple API calls?
Yes, but you lose context between calls. Use the summarization strategy above.
Q: Do images count toward token limits?
Yes. In GPT-4V and Gemini Pro Vision:
- Low detail image: ~85 tokens
- High detail image: ~170-255 tokens per 512×512 tile
Q: What happens if I exceed the context window?
The API returns an error. Chat UIs either truncate old messages or stop accepting input.
Q: Does a larger context window mean better quality?
No. It means the model can process more text at once. Quality depends on the model architecture, not context size.
Q: Can I increase the context window?
No. Context windows are fixed per model. You can only switch to a model with a larger window.
Checklist: Before Sending Large Context
- Did you check the token count? (use a token counter)
- Is all the context actually necessary? (remove boilerplate, comments, irrelevant files)
- Have you filtered to only relevant sections?
- Did you estimate the cost? (input + expected output)
- Is there a smaller model that would work? (don't use Opus for simple tasks)
- Are you using the most compact format? (JSON instead of prose)
- Did you set max_tokens to prevent runaway costs?
- Do you have a fallback if the context is too large? (chunking strategy)
The Future of Context Windows
Three trends I'm watching:
1. Context Windows Are Growing
- 2023: GPT-4 had 8k tokens (32k extended)
- 2024: GPT-4 Turbo reached 128k tokens
- 2025: Gemini Pro hit 1M tokens
- 2026: Gemini 2.5 Pro now 2M tokens
If this pace holds, 10M token context windows could be standard by 2027.
2. Context Isn't Free
- Larger contexts cost more per token
- Gemini 2M context charges 2x per token vs 128k context
- You're often paying for context you don't use
3. Infinite Context Is Coming
- Models with external memory (RAG built-in)
- Automatic context management
- Pay per "active" tokens, not total context
The constraint is disappearing, but cost optimization will remain critical.
What to Do Next
If you're new to managing context windows:
- Install a token counter - Never send a prompt without knowing the size
- Set cost alerts - Monitor your API usage daily
- Start chunking - Practice breaking large documents into processable pieces
- Use the right model - Don't use Claude Opus for tasks GPT-4o can handle
If you're already managing context:
- Audit your token usage - Find where you're wasting tokens
- Implement summarization - Build pipelines for large documents
- Add cost guards - Prevent runaway scripts with hard limits
- Optimize prompts - Every unnecessary word costs money at scale
Further Reading:
- How to Use AI Coding Assistants for Infrastructure as Code
- Prompt Engineering for Cloud Engineers
- AI Model Comparison Tool
- The Complete AI Pricing Calculator
Questions? Email me at phaqqani@gmail.com or find me on LinkedIn.