I ran the same 30 production-shaped tasks through GPT-5 (via Codex CLI) and Claude Opus 4.7 (via Claude Code, 1M context, native plan mode). Same prompts. Same repos. Same machine. Same network. Same week.
Then I scored every output on four axes: correctness, plan quality, cost, and time-to-first-useful-diff. No vibes. A spreadsheet.
Here is the result. The headline is not what the loudest accounts on X want you to think.
TL;DR
- Correctness across 30 tasks: Claude Opus 4.7 was right on the first try 22 times, GPT-5 16 times.
- Plan quality (when in plan mode): Claude produced a high-quality plan on 24 of 30 tasks, GPT-5 on 8.
- Cost per task (median): GPT-5 was 18 percent cheaper.
- Time to first useful diff (median): GPT-5 was 22 percent faster on tasks under 5 files. Claude was 31 percent faster on tasks above 12 files.
- The verdict: They are not the same tool anymore. They are not even competing for the same job.
The setup
Thirty tasks, drawn from three real repos I had open at the same time. None were synthetic. All were tasks I would have done myself this week.
- 10 small. Single-file bug fixes, regex authorship, one-shot data transforms, doc edits.
- 10 medium. Cross-file refactors, new endpoint plus tests, schema migration plus backfill script.
- 10 large. Net-new module of 400+ lines, redesign of a flow across 8+ files, full E2E test suite for a new feature.
Same exact prompts to both. I disabled all tool access except read, edit, write, and bash. No MCP. No skill libraries. No external search. Both models had a CLAUDE.md or AGENTS.md sized identically.
Scoring:
- Correctness: does the diff merge cleanly, do the tests pass, does the feature work end-to-end on the first try? Yes or no.
- Plan quality: does the plan name every file, every test, every risk? Three-tier rubric.
- Cost: raw token spend at published pricing on the day of the run.
- Time to first useful diff: stopwatch from prompt submit to a diff I would not throw away.
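The correctness check is the only one of the four that is fully mechanical, so it is the easiest to reproduce. Here is a minimal sketch of that kind of check in Python, assuming each model's output is captured as a git patch against a clean checkout; the paths, patch names, and test command are placeholders, not my actual harness, and it covers only the merge and test halves of the rubric. The end-to-end check, the plan rubric, cost, and the stopwatch were all recorded separately.

```python
# Minimal sketch of the yes/no correctness check, assuming each model's
# output is saved as a git patch against a clean checkout. Paths, patch
# names, and the test command are illustrative, not the actual harness.
import subprocess
from pathlib import Path


def correct_on_first_try(repo: Path, patch: Path, test_cmd: list[str]) -> bool:
    """Return True only if the patch applies cleanly AND the tests pass."""
    # "Merges cleanly": the diff must apply without conflicts.
    check = subprocess.run(
        ["git", "apply", "--check", str(patch)],
        cwd=repo, capture_output=True,
    )
    if check.returncode != 0:
        return False

    subprocess.run(["git", "apply", str(patch)], cwd=repo, check=True)
    try:
        # "Tests pass": one run, no retries, matching the yes/no rubric.
        tests = subprocess.run(test_cmd, cwd=repo, capture_output=True)
        return tests.returncode == 0
    finally:
        # Reset the working tree (including new files) before the next task.
        subprocess.run(["git", "checkout", "--", "."], cwd=repo, check=True)
        subprocess.run(["git", "clean", "-fd"], cwd=repo, check=True)


if __name__ == "__main__":
    ok = correct_on_first_try(Path("./repo"), Path("task-01.patch"), ["pytest", "-q"])
    print("correct first try" if ok else "miss")
```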
The scoreboard
Small tasks (10)
| Result | GPT-5 | Claude 4.7 |
|---|---|---|
| Correct on first try | 9 | 9 |
| Plan quality (high) | 4 | 7 |
| Median cost per task | $0.11 | $0.13 |
| Median time to first diff | 14s | 19s |
Read: dead heat on correctness. GPT-5 is faster and slightly cheaper. Claude plans more carefully than the task probably needed.
Medium tasks (10)
| Result | GPT-5 | Claude 4.7 |
|---|---|---|
| Correct on first try | 5 | 7 |
| Plan quality (high) | 3 | 8 |
| Median cost per task | $0.84 | $1.02 |
| Median time to first diff | 41s | 47s |
Read: Claude pulls ahead. The plan-mode advantage starts showing. Two of GPT-5's misses were "looks right, fails the test." That kind of miss is expensive.
Large tasks (10)
| Result | GPT-5 | Claude 4.7 |
|---|---|---|
| Correct on first try | 2 | 6 |
| Plan quality (high) | 1 | 9 |
| Median cost per task | $4.10 | $4.80 |
| Median time to first diff | 2m 18s | 1m 35s |
Read: the 1M context window earns its keep here. Claude can hold the repo in its head. GPT-5 starts asking for files it has already been given. On the four tasks where GPT-5 failed and Claude succeeded, the failure mode was the same every time: confident edits in the wrong file.
Totals
| Axis | GPT-5 | Claude 4.7 |
|---|---|---|
| Correct first try (out of 30) | 16 | 22 |
| High plan quality (out of 30) | 8 | 24 |
| Total cost across 30 tasks | $50.40 | $58.90 |
| Total time across 30 tasks | 28m 11s | 27m 30s |
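A different cut of the same totals, and it is the arithmetic behind the "first-try correctness compounds" point further down: divide total spend by first-try successes and the cost ranking flips. This framing is mine, not a row from the spreadsheet.

```python
# Cost per correct-first-try diff, computed from the totals table above.
# Same raw numbers, different denominator.
gpt5 = 50.40 / 16    # ≈ $3.15 per correct first-try diff
claude = 58.90 / 22  # ≈ $2.68 per correct first-try diff
print(f"GPT-5:  ${gpt5:.2f}")
print(f"Claude: ${claude:.2f}")
```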
Where each one actually wins
GPT-5 wins
- Small, well-bounded tasks. Faster, cheaper, equal correctness. If your team's day is mostly small PRs, this is your default.
- Speed-of-thought edits. When the loop is "write, run, eyeball, repeat," GPT-5's lower latency compounds.
- Languages with sparse training data. GPT-5 had cleaner output on two Elixir tasks. I do not have a clean read on why.
Claude Opus 4.7 wins
- Large changes across many files. The 1M context window is not marketing. It is the reason the large-task correctness gap is 6 to 2.
- Anything requiring a plan before code. Plan mode is structurally better. GPT-5 plans when asked. Claude plans because it cannot stop itself, and on hard tasks that is a feature.
- Risk-sensitive work. When the task is "do not break X while you change Y," Claude's tendency to over-spec the test bench is the asset, not the bug.
Both stink at
- Anything with under-specified acceptance criteria. Both will write code. Neither will tell you the spec is the problem.
- Cross-language refactors. Both wandered. Both shipped diffs that compiled in one language and broke the other.
- UI judgment. Both will produce a working component. Neither will tell you it looks bad.
The cost picture
If you are reading this thinking "what does this mean for my budget," the honest version is this:
- For a team whose daily work skews to small PRs (mobile shops, marketing pages, internal tools), GPT-5 is the cheaper default. Claude on standby for the hard stuff.
- For a team whose daily work skews to large refactors, infrastructure, or cross-cutting features (platform teams, backend at scale, data engineering), Claude is the cheaper default, because first-try correctness compounds. Going from 16-out-of-30 to 22-out-of-30 on the first try is roughly 38 percent more tasks landing without a retry, and roughly 43 percent fewer wasted runs.
- For a team doing both, route by task size. It is not hard. A CLAUDE.md or AGENTS.md line that says "use Sonnet or Haiku for under 5 files, escalate to Opus for over 5 files" cuts the bill 25 to 35 percent without changing the workflow. A sketch of that routing is below.
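Here is a minimal sketch of what that routing can look like in Python, assuming you drive tasks from a wrapper that knows roughly how many files each one will touch. The threshold and the tier labels are placeholders; wire the return value to whatever invocation your team already uses.

```python
# Minimal sketch of size-based routing. The threshold and tier labels are
# illustrative; how you actually invoke each tool is up to your setup.
from pathlib import Path

SMALL_TASK_MAX_FILES = 5  # rough cut-off between "small" and "escalate"


def route_by_size(files_to_touch: list[Path]) -> str:
    """Pick a model tier from the number of files a task is expected to touch.

    Small, well-bounded tasks go to the faster, cheaper tier; anything that
    spans many files goes to the model that held up on large changes.
    """
    if len(files_to_touch) <= SMALL_TASK_MAX_FILES:
        return "small-task tier (cheap, fast)"
    return "large-task tier (plans first, holds the whole repo)"


# Example: a three-file bug fix routes cheap; a 14-file refactor escalates.
print(route_by_size([Path("app/models.py"), Path("app/views.py"), Path("tests/test_views.py")]))
print(route_by_size([Path(f"src/module_{i}.py") for i in range(14)]))
```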
The thing nobody on X will say out loud
The benchmark results that go viral are the ones where one model "destroys" the other. That is not what this benchmark says.
What it says is: in 2026, the two best coding models are good at different jobs. The team that picks one and ignores the other is the team paying 20 to 40 percent more than it has to. The team that routes is the team that ships.
Pick the right tool for the task. Or stop picking, and let routing do it for you.
Receipts
- 30 tasks across 3 repos, run May 1 to May 5, 2026.
- GPT-5 via Codex CLI 2.3.1. Claude Opus 4.7 via Claude Code 2.x with 1M context enabled.
- Pricing as published on the run dates. No discounts, no enterprise rates.
- Spreadsheet available on request. I will not publish it raw because the repos are private.
If you want help routing these models on your team, that is what I do. Half-day engagement is usually enough.