AI in 15 — April 12, 2026
Eight out of eight. That's how many small, cheap AI models detected the same flagship vulnerability that Anthropic said was too dangerous to release. Turns out you don't need a restricted frontier model to find a 27-year-old bug. You might just need eleven cents.
Welcome to AI in 15 for Sunday, April 12, 2026. I'm Kate, your host.
And I'm Marcus, your co-host.
Happy Sunday, Marcus. We've got a great show. A viral counter-study challenges everything Anthropic claimed about Mythos being uniquely dangerous. Berkeley researchers score perfect marks on top AI benchmarks without solving a single task. Google drops Gemma 4 and it's beating models ten times its size. OpenAI acquires another company and kills the product. Tesla gets its first European self-driving approval. And Iran is making AI propaganda with Lego minifigures. Let's get into it.
Small models match Mythos on vulnerability detection and the cybersecurity moat debate explodes.
AI benchmarks are fundamentally broken.
And Tesla's Full Self-Driving lands in the Netherlands.
Marcus, we've been covering Anthropic's Mythos all week. The zero-days, Project Glasswing, the Treasury meeting. But now a security firm called AISLE has dropped a counter-study that's gone completely viral. What did they find?
AISLE took the specific vulnerable code snippets that Anthropic showcased, the ones from OpenBSD and FreeBSD, and ran them through small open-weight models. Models as small as 3.6 billion parameters costing eleven cents per million tokens. And Kate, eight out of eight models detected the flagship FreeBSD exploit. A 5.1 billion parameter model recovered the core analysis of that 27-year-old OpenBSD bug in a single API call.
Wait. So the vulnerabilities that Anthropic deemed too dangerous to release publicly, that triggered an emergency meeting at Treasury, those same vulnerabilities were found by models you could run on a laptop?
That's their claim. They tested GPT-OSS-20b, Kimi K2, DeepSeek R1, Qwen3, Gemma 4, among others. On some tasks, small open models actually outperformed frontier models. Their central conclusion is that the moat in AI cybersecurity is the system, not the model.
But there's been serious pushback on this, right?
Significant pushback. Security researcher tptacek made what I think is the strongest counterargument on Hacker News. He compared it to Heartbleed. If you isolate the vulnerable code and hand it to any competent programmer, the bug is obvious. The hard part was never understanding the bug. It was finding it in a massive codebase with millions of lines of code. AISLE essentially gave the models the answer key and asked if they could read it.
So it's the difference between finding a needle in a haystack and being handed the needle and asked what it is.
Exactly. And Charlie Eriksen added important nuance in Fortune. He said smaller models require more technical skill, careful prompting, and better-designed tooling. Whereas Mythos democratizes the capability for anyone with a computer to develop powerful offensive cyber tools. So the question isn't just can small models do this. It's how much expertise do you need around them to make it work.
Where do you land on this, Marcus?
I think both sides are partly right, which is the uncomfortable answer. Anthropic probably oversold the uniqueness of Mythos's detection capabilities. But AISLE undersold the importance of the scaffolding, the agentic pipeline that finds and chains vulnerabilities across real codebases. The truth is in the middle. The era of AI-assisted vulnerability hunting has arrived, and it's not limited to one company's frontier model. But building the full system that operates at scale is still genuinely hard.
So the moat isn't the model, but the system around it might still be a moat.
That's the nuanced take, yes. And honestly, either way, the defensive implications are the same. Everyone needs to be hardening their systems right now.
From AI security to AI credibility. Berkeley researchers just published something that should make every AI company very nervous. They scored perfect or near-perfect on eight of the most prominent AI benchmarks without solving a single actual task. Marcus, how?
The exploits ranged from embarrassingly simple to genuinely clever. On FieldWorkArena, the validator literally never checked whether the answer was correct. You could submit an empty JSON object and get a perfect score. On SWE-bench, the one every coding AI company loves to cite, they created a pytest hook that rewrote every test result to passing. The tests never actually ran.
You're telling me the benchmark that companies put in their pitch decks could be gamed by rewriting the test harness?
One hundred percent on SWE-bench Verified. One hundred percent on SWE-bench Pro. WebArena was even worse. Agents could navigate to file URLs and read the answer keys directly from the task configuration files. And for OSWorld, the gold reference files were sitting on public HuggingFace URLs embedded in the task metadata. The evaluator compared gold versus gold. Perfect match every time.
This is devastating.
They identified seven recurring vulnerability patterns across these benchmarks. No isolation between agent and evaluator environments. Answers shipped alongside tests. Eval called on untrusted input. Unsanitized LLM judge inputs. The list goes on. And they're building an automated scanner called BenchJack to probe evaluation pipelines before publication.
So the next time a company announces state-of-the-art SWE-bench scores...
Take it with a massive grain of salt. Look, I'm not saying every company is gaming benchmarks. But the infrastructure that's supposed to verify AI capability is itself fundamentally broken. If the ruler is made of rubber, every measurement is suspect. This demands a complete rethinking of how we evaluate AI agents, and investors and enterprise buyers should be far more skeptical of benchmark-driven claims.
Speaking of models proving their worth, Google released Gemma 4 under Apache 2.0. Four model sizes, from tiny to thirty-one billion parameters. Marcus, the benchmarks here are striking even given what we just discussed about benchmarks.
Fair point on the irony. But the numbers are worth noting. The 31B Dense model outperforms Meta's Llama 4 on AIME math, LiveCodeBench, GPQA Diamond, and agentic task benchmarks. All four variants support multimodal inputs, text plus images, and the edge models handle audio too. Context windows up to 256K tokens. And being Apache 2.0 means completely unrestricted commercial use.
A 31 billion parameter model beating models with hundreds of billions of parameters. That's the efficiency story again.
This ties directly to the Mythos counter-study. We're seeing across multiple domains that smaller, more efficient models are becoming surprisingly competitive with frontier systems. Google is deploying models that run on Raspberry Pis. The intelligence-per-parameter ratio is improving so fast that raw model size is becoming less predictive of capability. That's good news for anyone who can't afford a hundred-billion-dollar compute budget.
And Google, the company everyone accused of keeping AI behind APIs, is now the most aggressive open-source player. The irony is thick.
Everyone's switching lanes, Kate. And Google seems to be winning this particular lane.
OpenAI has acquired Cirrus Labs, the company behind Cirrus CI, a continuous integration platform used by PostgreSQL, SciPy, and other major open-source projects. And unlike the Astral acquisition, the product is dying. Cirrus CI shuts down June 1.
The team joins OpenAI's Agent Infrastructure group, citing the shift to agentic engineering as their motivation. They are relicensing some of their open-source tools under more permissive licenses, so the community can maintain them. But the CI platform itself is gone.
PostgreSQL and SciPy have already filed migration issues.
And this is now a pattern. OpenAI has completed nearly as many acquisitions in 2026 as in all of 2025. Thirteen deals tracked by Tracxn. The strategy is clear: acqui-hire engineering talent to build agentic infrastructure. But the collateral damage to the open-source ecosystem is real. When AI companies can extract entire teams at premium prices, the projects that depend on those teams are left scrambling.
It's a sustainability question for open source itself.
Absolutely. The talent market for AI infrastructure engineers is so hot that maintaining a viable open-source business may become impossible. Why charge five dollars a month for a CI service when OpenAI will buy your company for your team?
Tesla news. The Dutch vehicle authority just granted Tesla its first European approval for Full Self-Driving Supervised. Marcus, this is Level 2, not full autonomy, but it's a milestone.
The RDW approved it after more than eighteen months of testing, making the Netherlands the first European country to authorize FSD on public roads. Critically, this is supervised. Drivers must stay focused, can't use phones, and remain fully responsible at all times. The RDW emphasized that the EU version is not comparable to the US version, though they didn't specify exactly what differs.
The Netherlands is an interesting test case. Dense cities, bike lanes everywhere, mixed traffic.
Hacker News commenters raised exactly that point. One wrote, in the Netherlands roads are mixed with old people walking their dog and children on bicycles going to school. We're not in Phoenix anymore. This is a vastly more complex driving environment than Tesla's US training data, so it'll be a rigorous real-world test. A broader European rollout is targeted for summer 2026, but that depends on individual country approvals and an EU-wide vote that hasn't even started.
Quick hit on a lighter story. AI-generated Lego videos depicting Trump and Netanyahu as minifigures have been flooding social media, racking up millions of views. These are coming from an account linked to Iranian media.
The account calls itself Akhbar Enfejari, which translates to Explosive News. The videos feature Iranian military commanders rapping over Lego-style animations mocking American military operations. The production quality is high enough that analysts question whether this has government backing. YouTube has pulled at least one video and banned the producer's account. The BBC spoke to the man behind it.
This feels like a new frontier for state propaganda.
It is. And I'd note the sophistication here warrants healthy skepticism about the claimed independence from the Iranian regime. The bandwidth and AI capabilities needed suggest at minimum unofficial cooperation. It's the most vivid example yet of AI-generated content being weaponized for information warfare. The cost and skill barrier for producing polished propaganda has essentially collapsed.
Sunday big picture. Marcus, small models match Mythos, benchmarks turn out to be gameable, Google gives away competitive models for free, and the efficiency revolution keeps accelerating. What's the takeaway this weekend?
The walls are coming down. The idea that AI capability is concentrated in a few frontier labs behind closed APIs is crumbling in real time. AISLE showed that vulnerability detection isn't a moat. Berkeley showed that benchmarks, the primary tool for claiming superiority, are broken. Google's Gemma 4 showed that a 31 billion parameter open model can beat closed models with ten times the parameters. The competitive advantage is shifting from who has the biggest model to who builds the best system around it.
And that should worry companies whose entire strategy is being the model provider.
It should worry companies whose pitch is trust us, our model is better, here are the benchmarks to prove it. When the benchmarks are unreliable and smaller models keep closing the gap, that pitch gets harder to sustain. The winners in the next phase won't be the ones with the most parameters. They'll be the ones with the best tooling, the best integration, and frankly, the most trust from their users.
Trust again. It keeps coming back to trust.
Every week, Kate. Every week.
That's your AI in 15 for Sunday, April 12, 2026. Enjoy the rest of your weekend, and we'll see you tomorrow.