You're halfway through pasting your 50-page architecture document into ChatGPT when it stops responding. Or Claude gives you an error: "Your message exceeds the maximum context length." Or you get charged $47 for a single API call and have no idea why.
Context windows and token limits are the invisible walls of LLM work. They determine what's possible, what's expensive, and what will silently fail.
After burning through $3,200 in "learning experiences" over the past 18 months and running millions of tokens through every major model, here's everything you need to know about context windows, token limits, and how to work within them without losing your mind or your budget.
What Actually Is a Context Window?
A context window is the maximum amount of text an LLM can "see" at one time. This includes:
- Your system prompt
- Your conversation history
- Your current prompt
- The model's previous responses
- Any documents or code you've pasted
- The model's thinking process (if visible)
- Everything the model outputs in its response
Think of it like RAM for an AI model. Once you hit the limit, the model can't process anything more.
The Math
Most models measure context in tokens, not words or characters.
1 token ≈ 4 characters (rough average for English)
1 token ≈ 0.75 words (rough average)
Example:
- "Hello world" = 2 tokens
- "Infrastructure as Code" = 4 tokens
- "AWS::EC2::Instance" = 5 tokens (special characters count)
This means:
- A 200,000 token context window ≈ 150,000 words
- A 1 million token context window ≈ 750,000 words
For reference, "War and Peace" is about 580,000 words. GPT-5.3 Codex can fit the entire novel in its context window with room left over.
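The rough conversions above are easy to wrap in a quick estimator. This is a sketch using the ~4 characters per token and ~0.75 words per token heuristics from this section, not exact tokenizer output:

```python
def rough_tokens(text: str) -> int:
    """Estimate token count via the ~4 characters per token heuristic."""
    return max(1, len(text) // 4)

def tokens_to_words(tokens: int) -> int:
    """Convert a token count to words via the ~0.75 words per token heuristic."""
    return int(tokens * 0.75)
```

So a 200,000-token window works out to roughly 150,000 words, matching the figures above.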
Current Context Windows (March 2026)
Here's what every major model can handle:
| Model | Context Window | Equivalent | Best For |
|---|---|---|---|
| GPT-5.3 Codex | 1M tokens | ~750k words | Entire codebases, massive documentation |
| GPT-4.1 Turbo | 128k tokens | ~96k words | Long documents, multi-file analysis |
| GPT-4o | 128k tokens | ~96k words | Fast multimodal tasks |
| Claude 4.6 Opus | 200k tokens | ~150k words | Book-length analysis, deep research |
| Claude 4 Sonnet | 200k tokens | ~150k words | Cost-effective long context |
| Claude 3.5 Haiku | 200k tokens | ~150k words | Fast long-context tasks |
| Gemini 2.5 Pro | 2M tokens | ~1.5M words | Largest context, video analysis |
| Gemini 2.0 Flash | 1M tokens | ~750k words | Fast large context |
| DeepSeek R1 | 64k tokens | ~48k words | Reasoning-focused, shorter context |
| Llama 4 Maverick | 128k tokens | ~96k words | Open source, mid-range context |
Key Insight: Bigger isn't always better. A 2 million token context costs significantly more than a 128k context. Use the smallest context that fits your task.
The Three Hidden Token Costs
Most people only count their input tokens. That's a mistake.
Cost 1: System Prompt (Always Included)
Every API call includes your system prompt in the context window:
System prompt: 1,200 tokens
Your message: 5,000 tokens
Total input: 6,200 tokens
If you're sending 100 API calls, that's 120,000 tokens of system prompt you're paying for repeatedly.
Solution: Keep system prompts under 500 tokens for high-volume use cases.
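The overhead is easy to quantify. A minimal sketch, assuming an input rate of $3.00 per 1M tokens (the Claude 4 Sonnet rate from the pricing table below):

```python
def system_prompt_overhead(system_tokens: int, calls: int, price_per_million: float) -> tuple[int, float]:
    """Total tokens and dollars spent re-sending the same system prompt on every call."""
    total_tokens = system_tokens * calls
    cost = total_tokens / 1_000_000 * price_per_million
    return total_tokens, cost
```

At $3.00/1M input, a 1,200-token system prompt across 100 calls is 120,000 tokens and $0.36; trimming it to 500 tokens cuts that to $0.15 for the same workload.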
Cost 2: Conversation History (Grows Over Time)
In a chat interface, every message stays in context:
Turn 1: System (500) + User (200) + Assistant (300) = 1,000 tokens
Turn 2: Previous (1,000) + User (250) + Assistant (400) = 1,650 tokens
Turn 3: Previous (1,650) + User (180) + Assistant (350) = 2,180 tokens
Turn 10: 8,430 tokens total
By turn 50, you're sending 40,000+ tokens per request even if each individual message is small.
Solution: Summarize old conversation history or start fresh conversations for new topics.
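One way to implement this in API code is a rolling summary: once history crosses a token budget, collapse the older turns into a single summary message and keep only the recent ones verbatim. A sketch, where `summarize` is a placeholder for a cheap-model summarization call and token counts use the ~4 characters per token heuristic (both are assumptions, not a specific API):

```python
def compact_history(messages, budget_tokens=4000, keep_recent=4,
                    summarize=lambda text: text[:500]):
    """Collapse old messages into one summary once history exceeds the budget.

    `messages` is a list of {"role": ..., "content": ...} dicts.
    `summarize` here just truncates; in practice it would call a cheap model.
    """
    def count(msgs):
        # Rough token estimate: ~4 characters per token
        return sum(len(m["content"]) // 4 for m in msgs)

    if count(messages) <= budget_tokens:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize("\n".join(m["content"] for m in old))
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent
```

This keeps each request near the budget instead of growing linearly with every turn.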
Cost 3: Output Tokens (Often Forgotten)
Models charge differently for input vs output tokens:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Output Multiplier |
|---|---|---|---|
| GPT-4.1 Turbo | $2.50 | $10.00 | 4x more expensive |
| Claude 4 Sonnet | $3.00 | $15.00 | 5x more expensive |
| Gemini 2.5 Pro | $1.25 | $5.00 | 4x more expensive |
| GPT-5.3 Codex | $10.00 | $30.00 | 3x more expensive |
If you ask for "detailed explanations" or "generate comprehensive documentation," you're paying 4-5x more for the response than the question.
Example:
Your prompt: 500 tokens × $3.00/1M = $0.0015
Model response: 5,000 tokens × $15.00/1M = $0.075
Total cost: $0.0765 (98% from output tokens)
Solution: Ask for concise responses. Use "be brief" or "summarize in 3 bullets" when detailed output isn't needed.
Real Example: The $47 Mistake
Last month I helped a developer debug a $47 API call. Here's what happened:
What they did:
prompt = f"Review this entire codebase for security issues:\n\n{codebase_files}"
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}]
)
The breakdown:
- Codebase: 127 files, 48,000 lines of code
- Input tokens: 156,000 tokens
- Model output: 12,000 tokens (detailed report)
- Cost: $0.39 (input) + $0.12 (output) = $0.51... wait, that's not $47?
The real problem: They ran this in a loop across 92 different repositories without realizing each call was independent.
92 repos × $0.51/repo = $46.92
Plus, most of the code was irrelevant boilerplate. They only needed to scan security-sensitive files (authentication, authorization, database queries).
The fix:
# Filter to security-relevant files only
security_files = [f for f in files if any(pattern in f for pattern in
    ['auth', 'password', 'token', 'secret', 'database', 'sql', 'api_key'])]
# Reduced to 8,200 tokens average
# Cost per repo: $0.09
# Total for 92 repos: $8.28
Savings: $38.64 (82% reduction)
How to Check Token Count Before Sending
Never send a prompt without knowing the token count first.
Method 1: Use a Token Counter Tool
I built a free one specifically for this: Token Counter
- Paste your text
- Select your model
- See token count + cost estimate instantly
- Supports 23 models
No installation, no API key, runs in your browser.
Method 2: Use tiktoken (Python)
import tiktoken
# For OpenAI models
encoding = tiktoken.encoding_for_model("gpt-4")
tokens = encoding.encode("Your text here")
print(f"Token count: {len(tokens)}")
# For Claude (approximate)
encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("Your text here")
print(f"Approximate tokens: {len(tokens)}")
Method 3: Use the API
Most APIs return token usage in the response:
response = client.chat.completions.create(...)
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total: {response.usage.total_tokens}")
Strategies for Working Within Token Limits
Strategy 1: Chunk and Summarize
For documents larger than the context window:
def process_large_document(doc, chunk_size=50000):
    chunks = split_into_chunks(doc, chunk_size)
    summaries = []
    for chunk in chunks:
        summary = llm.summarize(chunk, max_length=500)
        summaries.append(summary)
    # Now analyze the summaries (much smaller)
    final_analysis = llm.analyze("\n\n".join(summaries))
    return final_analysis
Real use case: I used this to analyze a 400-page compliance document. Cost: $2.30 instead of $18.40.
Strategy 2: Extract-Then-Process
Don't send entire files when you only need specific parts:
# Bad: Send entire 200KB Terraform file
response = llm.review(entire_terraform_file)
# Good: Extract resources first
resources = extract_resources(terraform_file, types=['aws_s3_bucket', 'aws_iam_role'])
response = llm.review(resources) # 10% of original size
Strategy 3: Use Smaller Models for Filtering
Use a cheap, fast model to filter, then a smart expensive model to analyze:
# Step 1: Filter with GPT-4o Mini ($0.15/1M tokens)
relevant_logs = gpt4o_mini.filter(all_logs, criteria="errors or warnings only")
# Step 2: Analyze with Claude Opus ($15/1M tokens)
root_cause = claude_opus.analyze(relevant_logs)
Cost comparison:
- Send all logs to Claude: $8.20
- Filter with GPT-4o Mini → analyze with Claude: $0.42 + $1.10 = $1.52
Savings: $6.68 (81% reduction)
Strategy 4: Sliding Window
For sequential analysis (logs, code reviews, debugging):
def sliding_window_analysis(items, window_size=20, overlap=5):
    results = []
    for i in range(0, len(items), window_size - overlap):
        window = items[i:i+window_size]
        analysis = llm.analyze(window)
        results.append(analysis)
    return combine_results(results)
Keeps context manageable while maintaining continuity through overlap.
Strategy 5: Semantic Search First
Use embeddings to find relevant chunks before sending to LLM:
# Step 1: Find most relevant chunks (cheap)
relevant_chunks = semantic_search(query, document_chunks, top_k=5)
# Step 2: Only send relevant chunks to LLM (expensive)
answer = llm.answer(query, context=relevant_chunks)
This is the RAG (Retrieval Augmented Generation) pattern. Massively reduces token usage.
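The `semantic_search` step above is usually backed by an embedding model, but the mechanics are easy to show with a toy stand-in. This sketch ranks chunks by cosine similarity over bag-of-words vectors; a real RAG pipeline would replace `embed` with a call to an embedding API:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real RAG uses an embedding model instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_search(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]
```

The LLM then only ever sees the handful of chunks that score highest against the question, regardless of how large the source document is.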
Context Window Errors and How to Fix Them
Error 1: "Maximum context length exceeded"
What it means: Your input + expected output > context window
How to fix:
- Check token count with a counter tool
- Reduce input size (chunk, filter, extract)
- Switch to a model with larger context window
- Use a summarization pipeline
Error 2: Truncated Output
What it means: Model hit the output token limit mid-response
How to fix:
# Set max_tokens explicitly
response = client.chat.completions.create(
    model="gpt-4-turbo",
    max_tokens=4000,  # Reserve space for full output
    messages=[...]
)
Or ask for shorter responses:
"Summarize in 5 bullet points" instead of "Explain in detail"
Error 3: "Context too long" in Chat UI
What it means: Conversation history has grown too large
How to fix:
- Start a new conversation
- Summarize previous conversation and paste summary into new chat
- Use the UI's "clear conversation" feature
Error 4: Inconsistent Responses on Long Context
What it means: Model is losing coherence due to "lost in the middle" problem
How to fix:
- Put most important information at the beginning and end (models pay more attention there)
- Use explicit section markers: "IMPORTANT:", "KEY REQUIREMENT:"
- Break into smaller, focused prompts
Cost Optimization: Token Budgets
Set hard limits on token usage per request:
def safe_llm_call(prompt, max_cost=0.50):
    token_count = estimate_tokens(prompt)
    estimated_cost = calculate_cost(token_count, model="gpt-4-turbo")
    if estimated_cost > max_cost:
        raise ValueError(f"Estimated cost ${estimated_cost:.2f} exceeds budget ${max_cost}")
    return llm.call(prompt)
Real scenario: We set a $0.25 per-request limit on our CI/CD pipeline code review bot. Prevented a runaway script from racking up $840 in one night.
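The `estimate_tokens` and `calculate_cost` helpers above are left undefined; here is one minimal way to sketch them, assuming the ~4 characters per token heuristic and the input rates from the pricing table (swap in tiktoken for exact counts):

```python
PRICES = {  # input $ per 1M tokens, taken from the pricing table above
    "gpt-4-turbo": 2.50,
    "claude-4-sonnet": 3.00,
    "gemini-2.5-pro": 1.25,
}

def estimate_tokens(prompt: str) -> int:
    """Rough count via the ~4 characters per token heuristic."""
    return max(1, len(prompt) // 4)

def calculate_cost(token_count: int, model: str) -> float:
    """Input-side cost in dollars for a given token count and model."""
    return token_count / 1_000_000 * PRICES[model]
```

Note this only budgets the input side; a stricter guard would also add `max_tokens` times the output rate to the estimate.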
Model Selection by Context Need
Choose the right model for your context window requirement:
Small Context (<10k tokens)
- Short Q&A
- Simple code generation
- Quick summaries
Best model: Claude 3.5 Haiku ($0.80/1M input, $4.00/1M output)
Medium Context (10k-50k tokens)
- Multi-file code review
- API documentation analysis
- Medium-length documents
Best model: GPT-4o ($2.50/1M input, $10.00/1M output)
Large Context (50k-200k tokens)
- Full documentation sets
- Large codebases
- Book-length content
Best model: Claude 4 Sonnet ($3.00/1M input, $15.00/1M output)
Extra Large Context (200k+ tokens)
- Entire repositories
- Multi-book analysis
- Video transcripts
Best model: Gemini 2.5 Pro ($1.25/1M input, $5.00/1M output) - cheapest per token for huge context
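The four tiers above can be encoded as a simple picker. The thresholds and model names are the ones from this section; treat this as a sketch, not a definitive routing policy:

```python
def pick_model(context_tokens: int) -> str:
    """Choose the cheapest suitable model tier for a given context size."""
    if context_tokens < 10_000:
        return "claude-3.5-haiku"
    if context_tokens < 50_000:
        return "gpt-4o"
    if context_tokens < 200_000:
        return "claude-4-sonnet"
    return "gemini-2.5-pro"
```

Paired with a token counter, this makes "use the smallest context that fits" an automatic decision rather than a judgment call per request.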
Advanced: Context Window Hacks
Hack 1: Compress Your Prompts
Models don't need polite, full-sentence prompts; terse, structured instructions usually produce the same output for far fewer tokens:
Before (524 tokens):
Please review this Terraform configuration and check for the following issues:
1. Are there any security groups with overly permissive rules?
2. Are all S3 buckets encrypted at rest?
3. Do all resources have proper tags including Environment, Owner, and CostCenter?
4. Are there any hardcoded secrets or API keys?
5. Is versioning enabled on S3 buckets?
After (178 tokens):
Review Terraform. Check:
- Security groups: 0.0.0.0/0 ingress?
- S3: encryption enabled?
- Tags: Environment, Owner, CostCenter present?
- Secrets: hardcoded credentials?
- S3: versioning enabled?
Savings: 66% fewer tokens, same output quality.
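It's worth measuring the savings rather than assuming them. A quick comparison helper using the ~4 characters per token heuristic (swap in tiktoken for exact counts):

```python
def percent_token_savings(before: str, after: str) -> float:
    """Percent reduction in estimated tokens (~4 chars/token) from before to after."""
    t_before = max(1, len(before) // 4)
    t_after = max(1, len(after) // 4)
    return round((1 - t_after / t_before) * 100, 1)
```

Run both versions of a prompt through this (or a proper tokenizer) before committing a compressed prompt to a high-volume pipeline.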
Hack 2: Reference Instead of Paste
For repeated use of the same context:
Bad:
Turn 1: [Paste 10k token document] + "Summarize section 1"
Turn 2: [Paste same 10k token document] + "Summarize section 2"
Turn 3: [Paste same 10k token document] + "Summarize section 3"
Total: 30k tokens
Good:
Turn 1: [Paste 10k token document] + "This is the full document. Summarize section 1"
Turn 2: "Using the document from turn 1, summarize section 2"
Turn 3: "Using the same document, summarize section 3"
Total: 10k tokens (conversation history stays small)
Works in chat UIs where conversation history is maintained.
Hack 3: Use JSON for Structured Data
Before (842 tokens):
Server 1 name is web-server-01, IP address is 10.0.1.45, status is running, last updated on 2026-03-10
Server 2 name is db-server-01, IP address is 10.0.2.33, status is stopped, last updated on 2026-03-09
Server 3 name is api-server-01, IP address is 10.0.1.78, status is running, last updated on 2026-03-11
After (312 tokens):
[
  {"name":"web-server-01","ip":"10.0.1.45","status":"running","updated":"2026-03-10"},
  {"name":"db-server-01","ip":"10.0.2.33","status":"stopped","updated":"2026-03-09"},
  {"name":"api-server-01","ip":"10.0.1.78","status":"running","updated":"2026-03-11"}
]
Savings: 63% fewer tokens
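In Python, the compact form falls out of `json.dumps` with tight separators. A sketch with hypothetical server records like the ones above:

```python
import json

servers = [  # hypothetical inventory records
    {"name": "web-server-01", "ip": "10.0.1.45", "status": "running", "updated": "2026-03-10"},
    {"name": "db-server-01", "ip": "10.0.2.33", "status": "stopped", "updated": "2026-03-09"},
]

# Prose form: one sentence per server
prose = "\n".join(
    f"Server name is {s['name']}, IP address is {s['ip']}, "
    f"status is {s['status']}, last updated on {s['updated']}"
    for s in servers
)

# Compact JSON: tight separators drop the spaces after ',' and ':'
compact = json.dumps(servers, separators=(",", ":"))
```

The compact string carries the same fields in fewer characters, and the savings compound with every record you add.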
Tools for Context Window Management
Free Tools I Built
- Token Counter - Count tokens across 23 models with cost estimates
- Context Window Visualizer - See how much of each model's context your prompt fills
- AI Pricing Calculator - Estimate monthly costs based on usage patterns
- AI Output Parser - Extract clean data from verbose AI responses (reduces tokens)
- Prompt A/B Comparator - Compare token counts between prompt versions
All free, no signup, run in your browser.
Common Questions
Q: Can I split a request across multiple API calls?
Yes, but you lose context between calls. Use the summarization strategy above.
Q: Do images count toward token limits?
Yes. In GPT-4V and Gemini Pro Vision:
- Low detail image: ~85 tokens
- High detail image: ~170-255 tokens per 512×512 tile
Q: What happens if I exceed the context window?
The API returns an error. Chat UIs either truncate old messages or stop accepting input.
Q: Does a larger context window mean better quality?
No. It means the model can process more text at once. Quality depends on the model architecture, not context size.
Q: Can I increase the context window?
No. Context windows are fixed per model. You can only switch to a model with a larger window.
Checklist: Before Sending Large Context
- Did you check the token count? (use a token counter)
- Is all the context actually necessary? (remove boilerplate, comments, irrelevant files)
- Have you filtered to only relevant sections?
- Did you estimate the cost? (input + expected output)
- Is there a smaller model that would work? (don't use Opus for simple tasks)
- Are you using the most compact format? (JSON instead of prose)
- Did you set max_tokens to prevent runaway costs?
- Do you have a fallback if the context is too large? (chunking strategy)
The Future of Context Windows
Three trends I'm watching:
1. Context Windows Are Growing
- 2023: GPT-4 had 8k tokens (32k extended)
- 2024: GPT-4 Turbo reached 128k tokens
- 2025: Gemini Pro hit 1M tokens
- 2026: Gemini 2.5 Pro now 2M tokens
If this pace holds, 10M token context windows could be standard by 2027.
2. Context Isn't Free
- Larger contexts cost more per token
- Gemini 2M context charges 2x per token vs 128k context
- You're often paying for context you don't use
3. Infinite Context Is Coming
- Models with external memory (RAG built-in)
- Automatic context management
- Pay per "active" tokens, not total context
The constraint is disappearing, but cost optimization will remain critical.
What to Do Next
If you're new to managing context windows:
- Install a token counter - Never send a prompt without knowing the size
- Set cost alerts - Monitor your API usage daily
- Start chunking - Practice breaking large documents into processable pieces
- Use the right model - Don't use Claude Opus for tasks GPT-4o can handle
If you're already managing context:
- Audit your token usage - Find where you're wasting tokens
- Implement summarization - Build pipelines for large documents
- Add cost guards - Prevent runaway scripts with hard limits
- Optimize prompts - Every unnecessary word costs money at scale
Further Reading:
- How to Use AI Coding Assistants for Infrastructure as Code
- Prompt Engineering for Cloud Engineers
- AI Model Comparison Tool
- The Complete AI Pricing Calculator
Questions? Email me at phaqqani@gmail.com or find me on LinkedIn.