
AI in 15 — April 24, 2026

April 24, 2026 · 19m 12s
Kate

OpenAI just doubled the price of its flagship model and topped every benchmark on the board. Eighteen hours later, a Chinese lab open-sourced a one-point-six-trillion-parameter model that runs for pennies. And Anthropic spent the same day publishing a postmortem admitting Claude Code really did get dumber for a month. That's one Thursday in frontier AI.

Kate

Welcome to AI in 15 for Friday, April 24, 2026. I'm Kate, your host.

Marcus

And I'm Marcus, your co-host.

Kate

Friday show, Marcus, and it's a genuinely historic twenty-four hours. OpenAI shipped GPT-5.5 and pitched a super-app future. DeepSeek dropped V4 under Apache 2.0 and reset the open-weights price floor. Anthropic published a rare detailed postmortem on Claude Code's quality collapse. Security researchers dismantled the Mythos hype narrative the same week. Claude Desktop got caught installing browser bridges for browsers you don't own. SpaceX told the SEC it's going to manufacture its own GPUs. An open-source mesh-networking project split in two over undisclosed AI-generated code. And ICLR 2026 kicked off in Rio. Let's go.

Kate

OpenAI takes the benchmark crown and doubles the price.

Kate

DeepSeek V4 drops open-weights eighteen hours later.

Kate

And Anthropic admits Claude Code genuinely got worse for a month.

Kate

Lead story, Marcus. OpenAI shipped GPT-5.5 yesterday. Sam Altman calls it their smartest model yet. The first fully retrained base model since GPT-4.5. Walk me through what's actually new.

Marcus

Three things matter, Kate. First, the benchmarks. GPT-5.5 tops the Artificial Analysis Intelligence Index at sixty, three points ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview, both at fifty-seven. Terminal-Bench 2.0 at eighty-two-point-seven percent versus sixty-nine for Opus and sixty-eight for Gemini. On GDPval, which measures knowledge work across forty-four occupations, GPT-5.5 matches or beats human professionals in eighty-five percent of comparisons. SWE-Bench Pro at fifty-eight-point-six. And the spicy one, offensive-security benchmarks. A ten percent miss rate on vulnerability detection, versus GPT-5 at forty and Opus 4.6 at eighteen.

Kate

Second?

Marcus

The price. Five dollars per million input tokens, thirty per million output for standard. Double GPT-5.4's rate. GPT-5.5 Pro is thirty and a hundred and eighty. That's the first meaningful price increase on a flagship model in over a year, and it explodes the narrative that inference costs only go down.

Kate

And third, this super-app framing.

Marcus

Altman and Greg Brockman are selling the workflow, not the model. ChatGPT, Codex, and the Atlas browser fused into one agentic surface that, quote, moves across tools until a task is finished. Mark Chen, their chief research officer, highlighted drug-discovery assistance. They shipped a Codex-built 3D dungeon demo as proof of end-to-end agentic capability.

Kate

Why does the framing matter?

Marcus

Because OpenAI has stopped competing on model quality alone. They're competing for the whole developer-and-knowledge-worker surface. If your agent lives inside ChatGPT, browses through Atlas, and writes code in Codex, OpenAI doesn't need to have the single best model on every benchmark. They need to have the place where your work happens. The doubled API price tells you compute demand on the new model is significant. The super-app pitch tells you they're done selling wholesale. They want to be the retail layer too.

Kate

Quick hits, Marcus. And the first one is the most geopolitically loaded. DeepSeek.

Marcus

Less than eighteen hours after GPT-5.5 shipped, DeepSeek dropped the V4 series under an Apache 2.0 license. V4-Pro is one-point-six trillion total parameters, forty-nine billion active per token, Mixture-of-Experts. V4-Flash is a smaller two-hundred-and-eighty-four-billion-parameter variant. Both ship with a one-million-token context window as standard. Both are multimodal.

Kate

And the benchmarks.

Marcus

V4-Pro hits eighty-seven-point-five MMLU-Pro, ninety-point-one GPQA Diamond, ninety-three-point-five on LiveCodeBench which is best-in-class, a Codeforces rating of three thousand two hundred and six, and eighty-one on SWE-bench. Ninety-seven percent needle-in-a-haystack accuracy across one million tokens. That puts it roughly two months behind GPT-5.5 and Opus 4.7 on reasoning, but ahead of every other open-weight model on the planet.

Kate

And the price.

Marcus

Here's where it gets uncomfortable for Western labs. V4-Pro is a dollar seventy-four per million input, three dollars forty-eight output. V4-Flash is twenty-eight cents on output. GPT-5.5 is thirty dollars on output. That's not a pricing difference. That's two orders of magnitude.
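For listeners who want to check Marcus's arithmetic, here's the claim as a quick back-of-the-envelope sketch in Python, using only the per-million-token output list prices quoted in this episode (real bills will vary with caching, batching, and negotiated discounts):

```python
# Output-token list prices quoted in the episode, in USD per million tokens.
PRICES_OUT = {
    "gpt-5.5": 30.00,
    "deepseek-v4-pro": 3.48,
    "deepseek-v4-flash": 0.28,
}

def output_cost(model: str, tokens: int) -> float:
    """Dollar cost of generating `tokens` output tokens on `model`."""
    return PRICES_OUT[model] * tokens / 1_000_000

# The ratio behind "two orders of magnitude":
ratio = PRICES_OUT["gpt-5.5"] / PRICES_OUT["deepseek-v4-flash"]
# 30 / 0.28 is roughly 107, i.e. a bit over 100x.
```

At ten million output tokens, a plausible month for a busy coding agent, that's three hundred dollars on GPT-5.5 versus two dollars eighty on V4-Flash.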

Kate

And this is the third cycle in a row DeepSeek has done this.

Marcus

Close the gap in roughly eight weeks, release the weights openly, set the floor for global AI pricing. The architectural tricks they're disclosing, hybrid Compressed Sparse Attention, manifold-constrained hyper-connections, Engram Conditional Memory, are things US labs keep proprietary. Kate, I want to be honest with listeners about the context here. A Chinese lab isn't open-sourcing a trillion-parameter model because they love the free software movement. Commoditizing Western frontier R&D is a coordinated strategic objective, and the open weights are the weapon. For enterprises in regulated industries the Flash tier at twenty-eight cents is genuinely transformative. For the business case behind every Western AI subscription it's a direct assault. Both things are true at once.

Kate

Anthropic published a postmortem yesterday, Marcus. Developers have been complaining for weeks that Claude Code got dumber. Anthropic now admits they were right.

Marcus

Three separate issues compounding on each other, Kate. First, a March fourth change lowered the default reasoning effort from high to medium to reduce latency. It turned out users preferred the extra intelligence over the lower latency. Reverted April seventh. Second, and this is the damaging one, a March twenty-sixth prompt-caching optimization designed to clear Claude's stale thinking once per idle session instead, quote, kept happening every turn for the rest of the session due to a bug. That made Claude feel forgetful and repetitive for nearly a month. Third, an April sixteenth system-prompt directive capping inter-tool text at twenty-five words caused a measured three percent drop on coding evals for both Opus 4.6 and 4.7. Reverted April twentieth. All fixes shipped in v2.1.116. Anthropic reset usage limits for all subscribers on April twenty-third as a goodwill gesture.
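To make the second bug concrete: the failure class being described, a once-per-idle-session cleanup whose one-shot guard is never consulted, is easy to sketch. This is a hypothetical reconstruction of the bug's shape, not Anthropic's actual code; all names here are invented:

```python
class IdleCacheCleaner:
    """Hypothetical sketch of the bug class: a cleanup meant to run once
    when a session resumes from idle instead runs on every turn, because
    the buggy path never checks (or sets) its one-shot guard."""

    def __init__(self, buggy: bool):
        self.buggy = buggy
        self.resumed_from_idle = True   # session just woke from idle
        self.already_cleared = False
        self.clears = 0                 # each clear wipes cached "thinking"

    def on_turn(self) -> None:
        if self.buggy:
            # BUG: guard flag ignored, so this fires on every turn
            # for the rest of the session.
            if self.resumed_from_idle:
                self.clears += 1
        else:
            # FIX: fire at most once per idle resume.
            if self.resumed_from_idle and not self.already_cleared:
                self.clears += 1
                self.already_cleared = True

buggy, fixed = IdleCacheCleaner(True), IdleCacheCleaner(False)
for _ in range(20):
    buggy.on_turn()
    fixed.on_turn()
# buggy.clears == 20, fixed.clears == 1
```

Twenty turns, twenty cache wipes on the buggy path versus one on the fixed path, which is exactly the "forgetful and repetitive" behavior users reported.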

Kate

The timing is extraordinary.

Marcus

Same day as GPT-5.5. The charitable reading is accountability. The cynical reading is damage control disguised as transparency. What's not in dispute is that three interacting bugs on the flagship coding product in six weeks is a serious quality-assurance failure for a lab that markets itself on engineering rigor. And it validates a concern developers have been raising all year, that frontier labs silently A-B test cost-saving optimizations in production and users only find out when behavior visibly breaks. Reporting on Hacker News says OpenAI is now offering unlimited tokens until summer to pull enterprise accounts off Anthropic, and at least one large shop is already testing GPT-5.5 in production as a result.

Kate

Mythos, Marcus. Anthropic's flagship cybersecurity model. The one gated behind Project Glasswing because it was supposedly too dangerous to release. How's that narrative holding up?

Marcus

It's collapsing. Patrick Garrity, a vulnerability researcher, disputed Anthropic's claim of, quote, thousands of additional high-severity vulnerabilities, estimating the real number at, quote, maybe forty, or maybe none at all. Analyst Devansh found the thousands figure was extrapolated from just one hundred ninety-eight manually-reviewed reports. Mozilla's CTO said they found no category of vulnerability that humans can't also discover. A key Linux-kernel bug Anthropic credited to Mythos was actually found by publicly-available Opus 4.6. Tim Mackey at Black Duck called the marketing, quote, effectively a challenge, not dissimilar to a capture-the-flag exercise.

Kate

And XBOW's piece.

Marcus

That's the twist. XBOW published research arguing GPT-5.5, generally available today to anyone with an API key, already matches or exceeds Mythos-class offensive-security performance. Ten percent miss rate on vulnerability detection. Black-box results beating GPT-5 even when GPT-5 had source code access. XBOW writes that with source code provided, GPT-5.5, quote, effectively killed our benchmark.

Kate

So Anthropic gated a model for safety reasons while a competitor shipped the equivalent openly.

Marcus

Which pulls the rug out from both claims simultaneously. The safety framing looks like marketing. The capability framing looks exaggerated. And the practical implication for every defender is unambiguous. Assume an attacker with a twenty-dollar API key has a world-class vulnerability researcher on staff starting today. That changes patch timelines, threat modeling, and the economics of defensive security for every company listening.

Kate

Speaking of Anthropic not matching its own positioning. A privacy story broke this week about Claude Desktop.

Marcus

Privacy consultant Alexander Hanff disclosed that installing Claude Desktop silently drops a Native Messaging manifest into the profile directories of seven Chromium-based browsers, Brave, Edge, Arc, and Vivaldi among them, including variants Anthropic's own docs say aren't supported, and even browsers the user hasn't installed yet. The manifest pre-authorizes three extension IDs to communicate with a local binary at user-privilege level, outside the browser sandbox, with no permission prompt.
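For context on the mechanism, here's what a Chromium Native Messaging host manifest looks like and how to read the pre-authorized extension IDs out of one. The field names follow Chromium's documented manifest schema; the host name, binary path, and extension ID below are invented for illustration:

```python
import json

# A made-up manifest in the documented Chromium Native Messaging format.
MANIFEST = json.loads("""
{
  "name": "com.example.desktop_bridge",
  "description": "Bridge between a browser extension and a local binary",
  "path": "/usr/local/bin/example-native-host",
  "type": "stdio",
  "allowed_origins": [
    "chrome-extension://aaaabbbbccccddddeeeeffffgggghhhh/"
  ]
}
""")

def preauthorized_ids(manifest: dict) -> list[str]:
    """Extension IDs this manifest allows to open a stdio channel to the
    native binary, i.e. the scope of the pre-authorized bridge."""
    prefix = "chrome-extension://"
    return [origin[len(prefix):].rstrip("/")
            for origin in manifest.get("allowed_origins", [])
            if origin.startswith(prefix)]

# preauthorized_ids(MANIFEST) -> ["aaaabbbbccccddddeeeeffffgggghhhh"]
```

Dropping a file like this into a browser profile's NativeMessagingHosts directory is the entire "installation" Native Messaging requires, which is why a manifest written for a browser you haven't installed yet still takes effect the day you install it.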

Kate

Why does that matter?

Marcus

It's a persistent, pre-authorized bridge from browser extensions into a local executable. Hanff calls it a potential breach of Article 5(3) of the EU ePrivacy Directive, which requires explicit consent before storing information on a user's device. Anthropic's own documentation puts Claude-for-Chrome's prompt-injection success rate at twenty-three-point-six percent without mitigations, eleven-point-two with them. Meaning a successful injection could pivot through the bridge into the local binary. Anthropic didn't respond to press requests.

Kate

Second Anthropic trust-boundary story this week.

Marcus

Paired with the Claude Code postmortem, the pattern is unflattering. A company that markets safety and transparency is shipping production behavior that contradicts both. The defensive take is, quote, this is how Native Messaging works, and there's some truth to that. The novel part is pre-authorizing browsers the user doesn't own, and doing it silently. Expect regulatory pickup in the EU within weeks.

Kate

Compute buildout update, Marcus. We covered Google's TPU 8t and 8i yesterday in depth. What's the new angle today?

Marcus

SpaceX's S-1 IPO filing dropped, reportedly targeting a one-point-seven-five-trillion-dollar valuation. And buried in the capital-expenditure section, SpaceX explicitly lists, quote, manufacturing our own GPUs. This is Terafab, the joint venture with xAI, Tesla, and Intel announced in March, targeting one terawatt of AI compute output annually on Intel's 14A process. SpaceX told investors it may not have enough chip supply to fuel growth and is positioning for what it calls a twenty-eight-point-five-trillion-dollar enterprise AI opportunity.

Kate

One terawatt. What does that even mean in context?

Marcus

It means nobody can underwrite the ceiling anymore, Kate. The AI compute buildout is structural now, not speculative. Four hyperscalers plus Terafab plus Stargate all building parallel compute stacks. The most interesting data point on Google's TPU announcement was actually that Nvidia was co-marketing the Vera Rubin partnership the same day. Nobody can displace Nvidia at scale yet, so everyone is hedging. SpaceX making its own GPUs is the most speculative bet of the bunch, but it's a signal that the Musk ecosystem is vertically consolidating across launch, compute, model, and chip fabrication. That's a lot of balls in the air for one man.

Kate

Open-source governance story. MeshCore just split in two, Marcus.

Marcus

MeshCore is an open-source mesh-networking firmware project. Core contributor Andy Kirby was found to have quietly used Claude Code to build large portions of the ecosystem, standalone devices, a mobile app, a web flasher, while simultaneously applying for the MeshCore trademark on March twenty-ninth without telling the rest of the team, and rebranding his work as the, quote, official MeshOS line. A Discord community poll showed overwhelming concern about undisclosed AI-generated firmware. The remaining team relaunched official channels at meshcore.io.

Kate

This is a preview of what's coming.

Marcus

It's one of the first high-profile open-source splits explicitly over undisclosed AI-generated code. Every FOSS project is going to face this governance question within twelve months. Is AI-assisted code acceptable, and does the community have a right to know the provenance? In security-sensitive domains like mesh radio firmware, the answer matters a lot. A mesh network that communities rely on for emergency communications is not the place to discover that most of the codebase was generated in a tool the maintainer never disclosed. Expect more of these fractures.

Kate

Quick last one. ICLR 2026 is running this week, Marcus. What should listeners know?

Marcus

April twenty-third through twenty-seventh at Riocentro Convention Center in Rio de Janeiro. Microsoft has over a hundred accepted papers. Apple, Google, and Naver Labs are sponsors. Notable sessions I'd flag. Visual Planning, Let's Think Only with Images, which is a direct challenge to text-centric reasoning chains. BIRD-INTERACT on text-to-SQL evaluation. And OSCAR, Online Soft Compression for RAG, which is relevant to the long-context work DeepSeek just shipped in V4.

Kate

Why should anyone outside research care?

Marcus

Because ICLR is where next year's frontier-model ideas are being debated right now. Soft-compression RAG and pure-vision planning feed directly into the next wave of multimodal, long-context systems. The kind of research that ends up in production models twelve to eighteen months later. If you want a preview of what GPT-5.6 or Claude Opus 5 will be arguing about, the papers being presented in Rio today are it.

Kate

Friday big picture, Marcus. What do today's stories add up to?

Marcus

Three frontier-model stories landed in one twenty-four-hour window, Kate, and they point in opposite directions. GPT-5.5 takes the benchmark crown and doubles the price, which says frontier compute is getting more expensive, not less. DeepSeek V4 resets the open-weights price floor at a fraction of that, which says commoditization pressure is relentless and mostly coming from China. And Anthropic publishes a postmortem admitting a month of silent degradation on its flagship coding product, which says the trust layer underneath all these subscriptions is thinner than the marketing suggests.

Kate

One sentence to close.

Marcus

The competitive picture has never been more fluid, and the lab that was positioning itself as the safety-and-engineering gold standard is having the worst week of any frontier lab in eighteen months, Kate. Whoever owns the surface where work actually happens, chips, clouds, IDEs, browsers, trust, wins the next chapter. And right now that race is wide open.

Kate

That's your AI in 15 for Friday, April 24, 2026. See you tomorrow.