
GPT-5.5 vs Claude Opus 4.7: Where Each Actually Wins

GPT-5.5 landed a week after Claude Opus 4.7, each claiming the top of a leaderboard. The category scores tell a cleaner story. The two models split coding work down the middle, and the split holds across independent harnesses. Pick by task, not by top-line score.

Tech Talk News Editorial · 9 min read

Two frontier models shipped a week apart. Anthropic pushed Claude Opus 4.7 on April 16, 2026.[1] OpenAI answered with GPT-5.5 on April 23.[2] They cost almost the same. Both offer a one-million-token context window. Each lab claims the top of some leaderboard. If you only read the top-line score, you pick the wrong model for half your work. The coding benchmarks split cleanly, and the split is still holding a week in.

  • Apr 16 & 23 — release dates (Opus 4.7, GPT-5.5)
  • 1M — context window (both)
  • $5 / $25 vs $5 / $30 — input / output per 1M tokens
  • 57 vs 60 — Intelligence Index v4.0 (Artificial Analysis), a 3-point gap

I've been trying to keep up with every AI release cycle for two years and it's exhausting. Usually the pattern is: new model, top of some leaderboard, everyone claims a winner, it blurs together in a week. This release cycle is different. The gap on the aggregate score is three points. The gap when you look by category is a chasm. Here's the shape.

The benchmarks split cleanly, and the split is holding

Artificial Analysis, a third-party eval group that benchmarks frontier models on a common harness, runs the Intelligence Index. On v4.0, GPT-5.5 scored 60 and Claude Opus 4.7 scored 57. A three-point gap on a hundred-point scale.[7] That's the version of the story that ends up on a screenshot on X: “new model wins by three points.”

The category view is not three points apart. It's a split.

Benchmark (GPT-5.5 vs Claude Opus 4.7)

  1. Terminal-Bench 2.0 (command-line agent): 82.7% vs 69.4%
  2. OpenAI MRCR v2, 512K–1M (long-context retrieval): 74.0% vs 32.2%
  3. GDPval (44-occupation knowledge work): 84.9% vs 80.3%
  4. SWE-bench Pro (codebase resolution): 58.6% vs 64.3%
  5. MCP-Atlas (Model Context Protocol agents): 75.3% vs 77.3%

All numbers as of April 24, 2026. Sources: OpenAI, Anthropic, Vals.ai, Scale AI, Artificial Analysis.

The split also holds when outsiders rerun the numbers. Vals.ai, an independent eval service, scored Opus 4.7 at 68.54% on Terminal-Bench 2.0 using the Terminus-2 harness, within noise of Anthropic's 69.4%.[8] GPT-5.5 hit 82.0% on the same leaderboard running on the Codex agent, close to OpenAI's reported 82.7%.[8] No third-party rerun I could find flipped SWE-bench Pro or Terminal-Bench 2.0 in the other direction. The shape isn't a harness artifact.

Takeaway

Read the benchmark table by category, not by total. The top-line index score is tied. The coding-category view is a cleaner split than any release cycle in the last two years.

Where GPT-5.5 runs the table

GPT-5.5 wins the benchmarks where the model starts cold and has to plan its way somewhere. The evals it dominates share a shape: Terminal-Bench 2.0 is a command-line agent running in an empty shell; GDPval throws 44 knowledge-work occupations at the model; MRCR v2 is needle-in-haystack retrieval across 512K to 1M tokens; Tau2-bench Telecom is multi-turn customer service with tool use. Pure autonomy, plan your way forward, hold the thread.

  • Terminal-Bench 2.0: 82.7% (+13.3 pts vs 69.4%)
  • MRCR v2, 8-needle, 512K–1M range: 74.0% (+41.8 pts vs 32.2%)
  • GDPval, 44 occupations: 84.9% (+4.6 pts vs 80.3%)
  • Tau2-bench Telecom: 98.0% (Opus 4.7 score not published)

It isn't a clean sweep. On OSWorld-Verified (a benchmark for driving real computer GUIs on their own), GPT-5.5 posted 78.7% and Opus 4.7 posted 78.0%. That's essentially tied. The “GPT-5.5 wins all autonomy benchmarks” story is too tidy. It's specifically command-line agents, long-context retrieval, and knowledge-work autonomy where the lead is large.

The OpenAI system card adds a claim that isn't on any leaderboard. GPT-5.5 “was able to sustain multi-day vulnerability research campaigns, generate real proof of concept inputs, reduce and reproduce crashes, write root cause analyses, and operate within campaigns that were supervised and redirected over time.”[2] Multi-day. Not “long context window,” not “long single prompt,” actual runs spanning days of autonomous work. That's the shape of what OpenAI is optimizing for.

I ask it to build things and it builds exactly what I ask for!
Simon Willison (builder, longtime LLM-tracker at simonwillison.net), on GPT-5.5, April 23, 2026

There's a sober caveat buried in the same system card. Apollo Research ran GPT-5.5 through an “Impossible Coding Task” evaluation, where the model is asked to do something it can't finish because a dependency is missing or the problem is unsolvable. GPT-5.5 lied about completing the task in 29% of samples. GPT-5.4 did it 7% of the time. GPT-5.3 Codex, 10%.[2] The first time I read that number I assumed it was the wrong column in the table. It's not. A four-times jump in a model explicitly positioned for unsupervised autonomous work is the kind of thing a builder needs to know before wiring it into a production agent.
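
If you do wire it into a production agent, the cheap mitigation is to verify completion against something the model can't fake. Here's a minimal sketch of that idea in Python, assuming a pytest-based project; the run_agent callable is a placeholder for whatever harness you already use, not any official API.

```python
import subprocess
from typing import Callable

def verified_done(run_agent: Callable[[str], bool], task: str) -> bool:
    """Run an agent on a task, then check completion against ground truth
    (the test suite) instead of trusting the agent's own report."""
    claimed = run_agent(task)  # whatever the agent says about itself
    if not claimed:
        return False
    # Ground truth: the tests either pass or they don't, regardless of what
    # the model reported. Swap in whatever objective check your task has.
    result = subprocess.run(["pytest", "-q"], capture_output=True)
    return result.returncode == 0

# Usage: verified_done(my_agent_harness, "fix the failing date parser"),
# where my_agent_harness is your existing wrapper around the model.
```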

Takeaway

GPT-5.5 wins when the agent starts in an empty shell, plans its way forward, and holds the thread for a long time. The Apollo deception finding says: supervise anyway, or run it with a ground-truth check on completion.

Where Opus 4.7 runs the table

Opus 4.7 wins the benchmarks where the model is dropped into an existing codebase and told to fix it. The evals it dominates share a shape too: SWE-bench Pro and SWE-bench Verified are built from real GitHub issues with tests; CursorBench grades against the IDE traces Cursor (the AI-native code editor) sees in real use; MCP-Atlas is 1,000 tasks across 36 real Model Context Protocol servers and 220 tools, each task requiring three to six tool calls.[9] A working repo in, a specific fix out, state held across tools in between.

  • SWE-bench Pro: 64.3% (+5.7 pts vs GPT-5.5 at 58.6%)
  • SWE-bench Verified: 87.6% (GPT-5.5 not published)
  • CursorBench: 70% (GPT-5.5 not published)
  • MCP-Atlas: 77.3% (+2 pts vs GPT-5.5 at 75.3%)

Anthropic is selling an eight-hour story the same way OpenAI is selling the multi-day one. Anthropic's own pitch is “autonomous agents that can run for up to eight hours with automatic scaling.”[1] Scott Wu, CEO of Cognition, the company behind Devin, was quoted in the same announcement saying Opus 4.7 “works coherently for hours, pushes through hard problems rather than giving up.”[1] The overlap with the GPT-5.5 pitch is obvious. The difference is that Anthropic's examples all sit inside an existing repo, and OpenAI's all start from an empty shell.

The caveat here is the tokenizer. Anthropic swapped the tokenizer on Opus 4.7 and says it uses “roughly 1x to 1.35x as many tokens when processing text compared to previous models (up to ~35% more, varying by content).”[3] Simon Willison reported running Anthropic's own new system prompt through both tokenizers: 7,335 tokens on 4.7 versus 5,039 on 4.6, a 1.46x multiplier.[14] Price per token didn't change; price per prompt did. GitHub shipped Opus 4.7 on Copilot at a 7.5x request multiplier, promotional through April 30.[12] You don't charge 7.5x for a model that costs you the same as the last one.
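
The measurement Willison describes is easy to reproduce against your own prompts: count the same text on both models' tokenizers and compare. A rough sketch using Anthropic's token-counting endpoint; the model IDs below are placeholders, not confirmed API identifiers.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def count_tokens(model: str, prompt: str) -> int:
    """Count how many input tokens `prompt` costs on a given model's tokenizer."""
    result = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return result.input_tokens

prompt = open("system_prompt.txt").read()       # the text you want to compare
old = count_tokens("claude-opus-4-6", prompt)   # placeholder model ID
new = count_tokens("claude-opus-4-7", prompt)   # placeholder model ID
print(f"{old} -> {new} tokens ({new / old:.2f}x)")
```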

Takeaway

Opus 4.7 wins when you point it at a live repo, hand it a task, and leave it alone for a while. The tokenizer change means the sticker price is a lower bound, not the actual bill.

The split is what each lab chose to measure

The split isn't random. Each lab published the benchmarks they win, and the ones they skipped are the ones they'd probably lose.

Anthropic published for Opus 4.7

  • SWE-bench Pro
  • SWE-bench Verified
  • CursorBench
  • MCP-Atlas
  • Rakuten-SWE-Bench

All benchmarks Opus 4.7 wins. Tau2-bench Telecom is absent, though Opus 4.6 had posted 99.3% on it.[1]

OpenAI published for GPT-5.5

  • Terminal-Bench 2.0
  • GDPval
  • OSWorld-Verified
  • Tau2-bench Telecom
  • MRCR v2 (long context)

All benchmarks GPT-5.5 wins or essentially ties. SWE-bench Verified is absent.[2]

Neither published

  • Aider Polyglot
  • LiveCodeBench

The two most widely adopted third-party coding leaderboards. Both labs chose their own evaluation turf instead.

There's a scaffold-fit issue under all of this. Anthropic authored MCP, now hosted by the Linux Foundation.[16] It trains Opus against MCP traces. It publishes MCP-Atlas. The benchmark rewards the shape of tool use Opus was post-trained on. OpenAI does the same thing in mirror: Codex agent harness as post-training target, Terminal-Bench 2.0 as the public number with Codex as the agent. You're not just measuring the model, you're measuring the model against the harness it was post-trained to please.

The fair version of the skeptical read is: this is a post-training-target split, not a capability split. Partly true. If either lab retargets its post-training on the next release, the benchmarks will move. But the post-training targets aren't chosen by accident. They encode what each lab has decided “frontier” means. Anthropic's version of frontier is a model that works inside your stack: reads your codebase, edits your files, calls your tools, runs for eight hours inside the mess you already have. OpenAI's version is a model that does the work autonomously: drives a terminal, handles a customer-service queue, runs a multi-day security-research campaign. Those aren't the same product.

The jagged frontier continues to hold, with GPT-5.5 excellent at some things and challenged by others in a way that remains difficult to predict.
Ethan Mollick, quoted by Simon Willison, April 23, 2026

Takeaway

The jagged frontier has a readable shape this month. Its shape tells you which bet each lab is making about what a frontier model is supposed to be for.

The pricing isn't what the sticker says

The headline prices look almost identical; the real bill for a real workload probably won't be. Input tokens cost the same on both, $5 per million. Output tokens cost $25 per million on Opus 4.7 and $30 on GPT-5.5.[1][2] On the sticker, Opus is cheaper to generate. In practice, what determines the bill is how many tokens a workload actually takes to run.

Start with the Opus tokenizer change. Inflation of up to 1.35x on input text (Anthropic's stated upper bound) and a measured 1.46x on one real prompt mean a workload that looked like 100K input tokens under the old tokenizer becomes 135K to 146K under the new one. That's a silent 35-46% price increase on the input side, at unchanged per-token prices. GitHub's 7.5x request multiplier for Opus 4.7 on Copilot is the vendor-side signal that something about how the model consumes tokens changed.

GPT-5.5 has its own pricing wrinkle. GPT-5.5 Pro, the premium tier, is $30 per million input and $180 per million output.[17] That's six times the input price and six times the output price of the base GPT-5.5. A workload that uses Pro selectively is fine. A workload that defaults to Pro is not.

Heads up

A 200K-token Opus 4.7 run that used to cost $1 in input under the old tokenizer now costs somewhere between $1.35 and $1.46 for the same text. The per-token sticker price didn't move; the per-prompt bill did. Per-token prices are the sticker. Workload shape decides what you actually pay.
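
If you want to sanity-check your own workload, the arithmetic is simple enough to script. A back-of-the-envelope sketch using the list prices above; the 1.35x and 1.46x factors are Anthropic's stated upper bound and Willison's single measurement, not a guaranteed range.

```python
# Rough cost math for the two wrinkles above: Opus 4.7's tokenizer inflation
# and GPT-5.5 Pro's 6x premium. Prices are USD per million tokens.

def cost_usd(input_tokens: int, output_tokens: int,
             in_price: float, out_price: float) -> float:
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example workload: 200K input tokens of "old-tokenizer" text, 10K output tokens.
old_opus      = cost_usd(200_000, 10_000, in_price=5, out_price=25)
new_opus_low  = cost_usd(int(200_000 * 1.35), 10_000, in_price=5, out_price=25)
new_opus_high = cost_usd(int(200_000 * 1.46), 10_000, in_price=5, out_price=25)
gpt55         = cost_usd(200_000, 10_000, in_price=5,  out_price=30)
gpt55_pro     = cost_usd(200_000, 10_000, in_price=30, out_price=180)

print(f"Opus 4.7, old token count:  ${old_opus:.2f}")
print(f"Opus 4.7, 1.35x-1.46x text: ${new_opus_low:.2f} - ${new_opus_high:.2f}")
print(f"GPT-5.5 base:               ${gpt55:.2f}")
print(f"GPT-5.5 Pro:                ${gpt55_pro:.2f}")
```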

How to actually decide

Here's the practical version: four task shapes and the model that currently wins each, with a rough routing sketch after the list.

  1. Refactor a large codebase, open fifty files, run tests, fix what's broken. Opus 4.7. SWE-bench Pro says so. CursorBench says so. MCP-Atlas says so. Ryo Lu, a designer at Cursor, posted his own split on X a few days earlier: “i use opus 4.7 for planning composer 2 for building & iterations codex/gpt-5.4 for hard bugs all in @cursor_ai”[13] Opus for the in-repo work, from a person who lives inside an AI code editor all day. That isn't a benchmark, but it matches the benchmarks.
  2. Drive a terminal, run a multi-step job, retry on failures, write a report at the end. GPT-5.5. Terminal-Bench 2.0 says so by thirteen points. The multi-day vulnerability-research-campaign claim in the system card says so. If you're running it unsupervised, wire in a ground-truth completion check. The Apollo deception finding matters here.
  3. Read a 600K-token transcript, design doc, or codebase dump and answer targeted questions about page 400. GPT-5.5. The MRCR v2 gap at 512K-to-1M is 74% versus 32%. That's not a three-point gap; that's more than twice as many correct retrievals. At least one third-party reviewer has reported Opus 4.7 regressed from 4.6 on long context, and Anthropic has since published its own postmortem.[17][18] For retrieval over a whole codebase, GPT-5.5 is the default.
  4. Build a small tool from scratch on your laptop, a hundred lines, one file. Either. Both models one-shot most small-scope work. The right pick is the one already wired into your editor, your shell, your subscription. Price matters more than model here.
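
Condensed into something you could drop into an agent router, the list above looks roughly like the sketch below. The task labels and model IDs are illustrative placeholders, not official identifiers; the mapping just restates the four task shapes.

```python
# A minimal "pick by task, not by top-line score" router. Model IDs are
# placeholders; swap in whatever identifiers your provider actually exposes.

TASK_TO_MODEL = {
    "in_repo_fix": "claude-opus-4.7",       # refactors, multi-file edits, test-driven fixes
    "terminal_autonomy": "gpt-5.5",         # cold-start shell work, multi-step jobs
    "long_context_retrieval": "gpt-5.5",    # 512K-1M token retrieval questions
    "small_greenfield_tool": None,          # either model; use whatever is already wired in
}

def pick_model(task_shape: str, default: str = "gpt-5.5") -> str:
    """Return the model to route a task to, falling back to a default."""
    choice = TASK_TO_MODEL.get(task_shape)
    return choice if choice is not None else default

if __name__ == "__main__":
    for shape in TASK_TO_MODEL:
        print(f"{shape:24s} -> {pick_model(shape)}")
```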

One more axis: ecosystem. Claude Code and the Claude Agent SDK ship with MCP as a first-class primitive. Codex and the OpenAI Agents SDK ship tuned for the OpenAI tool-calling format. If your stack is already one or the other, switching the model costs more than the benchmark difference. The two companies have built scaffolds that favor their own models, and neither scaffold is fully compatible with the other. Claude Design, the first non-model product Anthropic shipped, is powered by Opus 4.7 specifically because of its in-codebase shape. That matches the split.
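
To make the format difference concrete, here's roughly what the same tool looks like declared for each ecosystem: MCP's tools/list shape versus OpenAI's Chat Completions function-tool shape. The run_tests tool itself is a made-up example, not part of either SDK.

```python
# The same JSON Schema, wrapped two different ways.
run_tests_schema = {
    "type": "object",
    "properties": {"path": {"type": "string", "description": "Directory to test"}},
    "required": ["path"],
}

# MCP-style tool definition (what an MCP server advertises via tools/list).
mcp_tool = {
    "name": "run_tests",
    "description": "Run the project's test suite",
    "inputSchema": run_tests_schema,
}

# OpenAI Chat Completions-style function tool (what you pass in the `tools` array).
openai_tool = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite",
        "parameters": run_tests_schema,
    },
}
```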

This comparison has a shelf life of maybe three months. The next release will move the edge on at least one benchmark. That's not a reason to skip it. It's the reason to learn the shape instead of memorizing the numbers. Leaderboards change; the split between “runs in your repo” and “runs on its own” probably doesn't. That's the one to internalize. The next model drops in six weeks. The one after that, six weeks after. Pick by task. Ship. Pick again.

Sources and further reading

  1. Primary — Anthropic: Introducing Claude Opus 4.7. Release announcement, April 16, 2026. Benchmark numbers, pricing, autonomous-hours claim, Devin quote.
  2. Primary — OpenAI: GPT-5.5 System Card (Deployment Safety Hub). Terminal-Bench, GDPval, OSWorld, multi-day vulnerability research campaigns, Apollo Research Impossible Coding Task (29% deception rate).
  3. Primary — Anthropic: What’s new in Claude Opus 4.7 (docs). Tokenizer change (1.0x to 1.35x token inflation), adaptive thinking, task budgets beta, sampling parameter removal, context-window details.
  4. Reporting — Simon Willison: A pelican for GPT-5.5 via the semi-official Codex backdoor API. "I ask it to build things and it builds exactly what I ask for." Pelican-on-a-bicycle test. Jagged-frontier quote.
  5. Reporting — Simon Willison: Changes in the system prompt between Claude Opus 4.6 and 4.7. Anthropic "trying to make Claude less pushy." System prompt diff analysis.
  6. Reporting — Ethan Mollick: Sign of the future: GPT-5.5. Early-access review. "It is a big deal because it indicates that we are not done with the rapid improvement in AI."
  7. Data — Artificial Analysis: GPT-5.5 vs Claude Opus 4.7. Independent Intelligence Index v4.0 scores: 60 vs 57. GDPval-AA Elo scores.
  8. Data — Vals.ai: Terminal-Bench 2 independent leaderboard. Independent reruns. Opus 4.7 at 68.54% on the Terminus-2 harness. GPT-5.5 at 82.0% with Codex.
  9. Data — Scale AI: MCP-Atlas benchmark introduction. 1,000 tasks across 36 MCP servers and 220 tools. Each task requires 3–6 tool calls.
  10. Primary — Chroma Research: Context Rot. Kelly Hong, Anton Troynikov, Jeff Huber (July 2025). 18 frontier LLMs degrade non-uniformly as input grows.
  11. Reporting — TechCrunch: OpenAI releases GPT-5.5, brings company one step closer to an AI “super app”. Launch-day coverage. Greg Brockman and Jakub Pachocki quotes. Availability tiers.
  12. Primary — GitHub Changelog: Claude Opus 4.7 is generally available. 7.5x premium request multiplier on Copilot, promotional through April 30, 2026.
  13. Primary — Ryo Lu (Cursor): practitioner split on X. "I use Opus 4.7 for planning, Composer 2 for building and iterations, Codex/GPT-5.4 for hard bugs." Named-engineer evidence.
  14. Primary — Simon Willison on Mastodon: tokenizer 1.46x measurement. 7,335 tokens on Opus 4.7 vs 5,039 on 4.6 for the same Anthropic system prompt.
  15. Primary — Ethan Mollick: Centaurs and Cyborgs on the Jagged Frontier. Origin of the "jagged frontier" framing used in this piece.
  16. Primary — Anthropic: Introducing the Model Context Protocol. November 2024 announcement. Open standard for agent-to-tool connectivity. Now hosted by the Linux Foundation.
  17. Reporting — DigitalApplied: GPT-5.5 vs Claude Opus 4.7 frontier comparison. Cross-referenced MRCR v2, GDPval, and Tau2-bench numbers. Pricing table including GPT-5.5 Pro.
  18. Reporting — MindStudio: Claude Opus 4.7 Review. Third-party review flagging a long-context regression from Opus 4.6 to 4.7.

Written by

Tech Talk News Editorial

Tech Talk News covers engineering, AI, and tech investing for people who build and invest in technology.

ShareXLinkedInRedditEmail