I ran the same 30 production-shaped tasks through GPT-5 (via Codex CLI) and Claude Opus 4.7 (via Claude Code, 1M context, native plan mode). Same prompts. Same repos. Same machine. Same network. Same week.
Then I scored every output on four axes: correctness, plan quality, cost, and time-to-first-useful-diff. No vibes. A spreadsheet.
Here is the result. The headline is not what the loudest accounts on X want you to think.
TL;DR
- Correctness across 30 tasks: Claude Opus 4.7 was right on the first try 22 times, GPT-5 16 times.
- Plan quality (when in plan mode): Claude produced a high-quality plan on 24 of 30 tasks, GPT-5 on 8.
- Cost per task (median): GPT-5 was 18 percent cheaper.
- Time to first useful diff (median): GPT-5 was 22 percent faster on tasks under 5 files. Claude was 31 percent faster on tasks above 12 files.
- The verdict: They are not the same tool anymore. They are not even competing for the same job.
The setup
Thirty tasks, drawn from three real repos I had open at the same time. None were synthetic. All were tasks I would have done myself this week.
- 10 small. Single-file bug fixes, regex authorship, one-shot data transforms, doc edits.
- 10 medium. Cross-file refactors, new endpoint plus tests, schema migration plus backfill script.
- 10 large. Net-new module of 400+ lines, redesign of a flow across 8+ files, full E2E test suite for a new feature.
Same exact prompts to both. I disabled all tool access except read, edit, write, and bash. No MCP. No skill libraries. No external search. Both models had a CLAUDE.md or AGENTS.md sized identically.
Scoring:
- Correctness: does the diff merge cleanly, do the tests pass, does the feature work end-to-end on the first try? Yes or no.
- Plan quality: does the plan name every file, every test, every risk? Three-tier rubric.
- Cost: raw token spend at published pricing on the day of the run.
- Time to first useful diff: stopwatch from prompt submit to a diff I would not throw away.
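The correctness check is the only one of the four that is fully mechanical, so it is the easiest to reproduce. Here is a minimal sketch of that kind of check in Python, assuming each model's output is captured as a git patch against a clean checkout; the paths, patch names, and test command are placeholders, not my actual harness, and it covers only the merge and test halves of the rubric. The end-to-end check, the plan rubric, cost, and the stopwatch were all recorded separately.

```python
# Minimal sketch of the yes/no correctness check, assuming each model's
# output is saved as a git patch against a clean checkout. Paths, patch
# names, and the test command are illustrative, not the actual harness.
import subprocess
from pathlib import Path


def correct_on_first_try(repo: Path, patch: Path, test_cmd: list[str]) -> bool:
    """Return True only if the patch applies cleanly AND the tests pass."""
    # "Merges cleanly": the diff must apply without conflicts.
    check = subprocess.run(
        ["git", "apply", "--check", str(patch)],
        cwd=repo, capture_output=True,
    )
    if check.returncode != 0:
        return False

    subprocess.run(["git", "apply", str(patch)], cwd=repo, check=True)
    try:
        # "Tests pass": one run, no retries, matching the yes/no rubric.
        tests = subprocess.run(test_cmd, cwd=repo, capture_output=True)
        return tests.returncode == 0
    finally:
        # Reset the working tree (including new files) before the next task.
        subprocess.run(["git", "checkout", "--", "."], cwd=repo, check=True)
        subprocess.run(["git", "clean", "-fd"], cwd=repo, check=True)


if __name__ == "__main__":
    ok = correct_on_first_try(Path("./repo"), Path("task-01.patch"), ["pytest", "-q"])
    print("correct first try" if ok else "miss")
```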
The scoreboard
Small tasks (10)
| Result | GPT-5 | Claude 4.7 |
|---|---|---|
| Correct on first try | 9 | 9 |
| Plan quality (high) | 4 | 7 |
| Median cost per task | $0.11 | $0.13 |
| Median time to first diff | 14s | 19s |
Read: dead heat on correctness. GPT-5 is faster and slightly cheaper. Claude plans more carefully than the task probably needed.
Medium tasks (10)
| Result | GPT-5 | Claude 4.7 |
|---|---|---|
| Correct on first try | 5 | 7 |
| Plan quality (high) | 3 | 8 |
| Median cost per task | $0.84 | $1.02 |
| Median time to first diff | 41s | 47s |
Read: Claude pulls ahead. The plan-mode advantage starts showing. Two of GPT-5's misses were "looks right, fails the test." That kind of miss is expensive.
Large tasks (10)
| Result | GPT-5 | Claude 4.7 |
|---|---|---|
| Correct on first try | 2 | 6 |
| Plan quality (high) | 1 | 9 |
| Median cost per task | $4.10 | $4.80 |
| Median time to first diff | 2m 18s | 1m 35s |
Read: the 1M context window earns its keep here. Claude can hold the repo in its head. GPT-5 starts asking for files it has already been given. On the four tasks where GPT-5 failed and Claude succeeded, the failure mode was the same every time: confident edits in the wrong file.
Totals
| Axis | GPT-5 | Claude 4.7 |
|---|---|---|
| Correct first try (out of 30) | 16 | 22 |
| High plan quality (out of 30) | 8 | 24 |
| Total cost across 30 tasks | $50.40 | $58.90 |
| Total time across 30 tasks | 28m 11s | 27m 30s |
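A different cut of the same totals, and it is the arithmetic behind the "first-try correctness compounds" point further down: divide total spend by first-try successes and the cost ranking flips. This framing is mine, not a row from the spreadsheet.

```python
# Cost per correct-first-try diff, computed from the totals table above.
# Same raw numbers, different denominator.
gpt5 = 50.40 / 16    # ≈ $3.15 per correct first-try diff
claude = 58.90 / 22  # ≈ $2.68 per correct first-try diff
print(f"GPT-5:  ${gpt5:.2f}")
print(f"Claude: ${claude:.2f}")
```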
Where each one actually wins
GPT-5 wins
- Small, well-bounded tasks. Faster, cheaper, equal correctness. If your team's day is mostly small PRs, this is your default.
- Speed-of-thought edits. When the loop is "write, run, eyeball, repeat," GPT-5's lower latency compounds.
- Languages with sparse training data. GPT-5 had cleaner output on two Elixir tasks. I do not have a clean read on why.
Claude Opus 4.7 wins
- Large changes across many files. The 1M context window is not marketing. It is the reason the large-task correctness gap is 6 to 2.
- Anything requiring a plan before code. Plan mode is structurally better. GPT-5 plans when asked. Claude plans because it cannot stop itself, and on hard tasks that is a feature.
- Risk-sensitive work. When the task is "do not break X while you change Y," Claude's tendency to over-spec the test bench is the asset, not the bug.
Both stink at
- Anything with under-specified acceptance criteria. Both will write code. Neither will tell you the spec is the problem.
- Cross-language refactors. Both wandered. Both shipped diffs that compiled in one language and broke the other.
- UI judgment. Both will produce a working component. Neither will tell you it looks bad.
The cost picture
If you are reading this thinking "what does this mean for my budget," the honest version is this:
- For a team whose daily work skews to small PRs (mobile shops, marketing pages, internal tools), GPT-5 is the cheaper default. Claude on standby for the hard stuff.
- For a team whose daily work skews to large refactors, infrastructure, or cross-cutting features (platform teams, backend at scale, data engineering), Claude is the cheaper default, because first-try correctness compounds. Going from 16-out-of-30 to 22-out-of-30 on the first try is roughly 38 percent more tasks landing without a retry, and roughly 43 percent fewer wasted runs.
- For a team doing both, route by task size. It is not hard. A CLAUDE.md or AGENTS.md line that says "use Sonnet or Haiku for under 5 files, escalate to Opus for over 5 files" cuts the bill 25 to 35 percent without changing the workflow. A sketch of that routing is below.
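Here is a minimal sketch of what that routing can look like in Python, assuming you drive tasks from a wrapper that knows roughly how many files each one will touch. The threshold and the tier labels are placeholders; wire the return value to whatever invocation your team already uses.

```python
# Minimal sketch of size-based routing. The threshold and tier labels are
# illustrative; how you actually invoke each tool is up to your setup.
from pathlib import Path

SMALL_TASK_MAX_FILES = 5  # rough cut-off between "small" and "escalate"


def route_by_size(files_to_touch: list[Path]) -> str:
    """Pick a model tier from the number of files a task is expected to touch.

    Small, well-bounded tasks go to the faster, cheaper tier; anything that
    spans many files goes to the model that held up on large changes.
    """
    if len(files_to_touch) <= SMALL_TASK_MAX_FILES:
        return "small-task tier (cheap, fast)"
    return "large-task tier (plans first, holds the whole repo)"


# Example: a three-file bug fix routes cheap; a 14-file refactor escalates.
print(route_by_size([Path("app/models.py"), Path("app/views.py"), Path("tests/test_views.py")]))
print(route_by_size([Path(f"src/module_{i}.py") for i in range(14)]))
```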
The thing nobody on X will say out loud
The benchmark results that go viral are the ones where one model "destroys" the other. That is not what this benchmark says.
What it says is: in 2026, the two best coding models are good at different jobs. The team that picks one and ignores the other is the team paying 20 to 40 percent more than it has to. The team that routes is the team that ships.
Pick the right tool for the task. Or stop picking, and let routing do it for you.
Receipts
- 30 tasks across 3 repos, run May 1 to May 5, 2026.
- GPT-5 via Codex CLI 2.3.1. Claude Opus 4.7 via Claude Code 2.x with 1M context enabled.
- Pricing as published on the run dates. No discounts, no enterprise rates.
- Spreadsheet available on request. I will not publish it raw because the repos are private.
If you want help routing these models on your team, that is what I do. Half-day engagement is usually enough.