AI in 15 — March 06, 2026
Seventy-five percent. That's the score OpenAI's brand new model just hit on a computer use benchmark, beating the seventy-two percent that actual human testers typically achieve. Your AI can now use your computer better than you can.
Welcome to AI in 15 for Friday, March 6, 2026. I'm Kate, your host.
And I'm Marcus, your co-host.
Happy Friday, Marcus. We've got a packed one to close out the week. OpenAI just dropped GPT-5.4 with a million-token context window and native computer use. The Pentagon officially designated Anthropic a supply chain risk, and Anthropic is fighting back. An MIT study says ChatGPT users are showing weaker brain connectivity after four months. Andrej Karpathy trained GPT-2 in two hours for under fifty bucks. A prompt injection in a GitHub issue title compromised four thousand developer machines. And open source licensing may be broken beyond repair. Here's the quick preview.
GPT-5.4 launches with a million-token context window, native computer use that beats humans, and mid-response steering.
The Pentagon formally labels Anthropic a supply chain risk, the first time an American company has ever received that designation.
An MIT brain study finds ChatGPT users show significantly weaker neural connectivity compared to people who just use Google or nothing at all.
And a single GitHub issue title compromised four thousand developer machines through an AI coding tool. Let's get into it.
Marcus, GPT-5.4. OpenAI is calling it their most capable and efficient model for professional work. Walk us through what's actually new here.
There's a lot to unpack, but three things really stand out. First, a one-million-token context window through the API, with no surcharge for requests beyond two hundred thousand tokens. That's OpenAI matching the long-context windows Gemini and Claude have offered, without the price premium. Second, native computer use. This model can autonomously operate applications on your machine. And on the OSWorld benchmark it scored seventy-five percent, an industry record and notably higher than the seventy-two point four percent that human testers achieve.
Wait, it literally outperformed humans at using a computer?
On that specific benchmark, yes. Now, benchmarks are benchmarks, real-world usage is messier. But it's a significant milestone. And the third headline feature is mid-response steering. You can interrupt the model while it's generating and redirect it. So if it starts going down the wrong path, you don't have to wait for it to finish and re-prompt. You just course-correct in real time.
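A quick sketch for transcript readers of what that interaction pattern could look like in code. To be clear, none of this is a published OpenAI interface; the stream, steer, and watch callables are hypothetical stand-ins for whatever GPT-5.4 actually exposes.

```python
# Hypothetical sketch of mid-response steering. None of this is a real
# OpenAI API; stream, steer, and watch are illustrative stand-ins.

def generate_with_steering(stream, steer, watch) -> str:
    """Stream tokens; let `watch` inject a correction without re-prompting."""
    draft = []
    for token in stream:
        draft.append(token)
        correction = watch("".join(draft))  # e.g. a human eyeing the stream
        if correction:
            steer(correction)  # redirect generation in flight and keep going
    return "".join(draft)

# Toy run with stand-ins: flag the draft the moment it drifts off course.
tokens = iter(["Plan: ", "rewrite ", "it ", "in ", "COBOL."])
result = generate_with_steering(
    tokens,
    steer=lambda msg: print(f"[steer] {msg}"),
    watch=lambda d: "stick to Python" if "COBOL" in d else None,
)
print(result)
```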
That actually sounds like a genuinely useful feature rather than just a benchmark number.
It is. And there's a pragmatic story underneath the flashy features. GPT-5.4 is thirty-three percent less likely to make errors in individual claims than GPT-5.2. Token efficiency is significantly improved, meaning the same problems get solved with fewer tokens, which translates directly to lower cost. API pricing is two dollars fifty per million input tokens and twelve dollars per million output. That's competitive. And Sam Altman teased a new fast mode that he says people will like.
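For transcript readers, here's the quoted pricing as plain arithmetic. The rates are as stated on the show; the helper function is purely illustrative, not an official pricing calculator.

```python
# Cost arithmetic at the quoted GPT-5.4 rates: $2.50 per million input
# tokens, $12.00 per million output tokens.

INPUT_USD_PER_M = 2.50
OUTPUT_USD_PER_M = 12.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API call at the quoted per-million-token rates."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000

# Filling the full million-token context and getting a 5K-token answer:
# $2.50 in, $0.06 out, about $2.56 per call.
print(f"${request_cost(1_000_000, 5_000):.2f}")
```

Output tokens are the expensive side of that equation, which is why the token-efficiency gains translate so directly into lower bills.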
This comes at an interesting moment for OpenAI. We've been covering the Pentagon fallout all week, the QuitGPT movement, internal employee revolt.
The timing is not accidental, Kate. OpenAI needs a win right now. One and a half million people signed up to quit ChatGPT. Their own employees were publicly siding with Anthropic. Altman admitted the Pentagon deal was rushed and sloppy. Dropping your most capable model ever is a very effective way to change the conversation. And the computer use capability is a direct response to Anthropic, which pioneered that category. OpenAI is essentially saying, we can do what they do, and we can do it better.
Competition working as intended.
Exactly. Users win when companies are trying to outdo each other on capability rather than just on marketing.
Speaking of Anthropic, the Pentagon story we've been following all week just reached its formal conclusion. Marcus, the supply chain risk designation is now official.
This is the first time an American company has ever received this classification. Historically, this designation has been reserved for foreign adversaries, companies with ties to hostile nations. The Department of Defense delivered the formal letter on March 4th, and it requires defense vendors and contractors to certify they don't use Claude in their work with the Pentagon.
And as we reported yesterday, Dario Amodei's leaked memo called OpenAI's Pentagon messaging "straight up lies." Now he's published a formal public response.
The public response is more measured than the leaked memo. Amodei says Anthropic plans to challenge the designation in court, arguing it isn't legally sound. He clarified that the scope is narrow: it applies only to Claude's use as a direct part of Department of Defense contracts, not to all customer use. And he apologized for the tone of the leaked memo.
But here's the part that really raised eyebrows. CNBC reports the Pentagon is still using Anthropic's AI in Iran even as it designates the company a risk?
The contradiction is remarkable. The same government that just labeled Anthropic a supply chain risk continues to use their technology in active operations overseas. And that tells you something important about this designation. It's not really about capability or security. It's about leverage. Anthropic refused to drop two specific guardrails: no fully autonomous weapons and no mass domestic surveillance. That's it. Those are the two conditions that created this impasse. And frankly, the fact that a company is being blacklisted for refusing to enable mass surveillance of Americans should make every US company nervous about government contracting.
Amodei also said disagreement with the government is "the most American thing in the world."
Which is a line that plays well in court and in public opinion. Anthropic is in a strange position. They drew a principled line, got punished for it, watched OpenAI rush to take the contract and then walk back toward their position, and now they're challenging the designation legally while still offering models to the DOD at nominal cost. Whatever you think of the strategy, you can't say it lacks conviction.
Let's shift to something that's been going viral this week. An MIT Media Lab study put people in brain scanners while they wrote essays using ChatGPT, Google, or nothing at all over four months. Marcus, the results are striking.
Fifty-four subjects, aged eighteen to thirty-nine, writing SAT essays over four months. The ChatGPT group showed the weakest brain connectivity of all three groups. Brain connectivity systematically scaled down with the amount of external support. The brain-only group had the strongest neural networks, the search engine group was intermediate, and the ChatGPT group was weakest. They also remembered significantly less of their own essays and showed weaker alpha and theta brain waves.
And the essays themselves?
Two English teachers assessed them and described the ChatGPT group's output as largely soulless, with everyone delivering extremely similar essays lacking original thought. Which makes intuitive sense: if the AI is doing the heavy cognitive lifting, your brain isn't forming the deep memory traces that come from struggling through the writing process yourself.
Now, there are important caveats here.
Significant ones. The sample size is fifty-four, which is small. The preprint hasn't been peer-reviewed yet. And the lead author cautioned against calling this brain damage, which is the headline some outlets ran with. This is cognitive offloading, not neural deterioration. But it's going viral at exactly the moment when every school and workplace is embedding these tools into daily workflows. The question it raises is legitimate even if the specific numbers need more validation. What happens to our cognitive capacity when we outsource thinking at scale?
It reminds me of the GPS and spatial navigation research.
Same pattern. People who rely on GPS show weaker spatial navigation skills over time; same with calculators and mental arithmetic. The difference is scope. GPS affects one cognitive function. AI tools are being applied to writing, analysis, coding, problem-solving, creative work. The surface area of potential cognitive offloading is vastly larger than anything we've delegated to technology before.
On a more optimistic note, Andrej Karpathy is continuing to push the boundaries of training efficiency. Nanochat can now train a GPT-2-level model in just two hours on a single node.
Down from three hours a month ago and four hours before that. The whole pipeline fits in about a thousand lines of code, running on eight H100 GPUs. At spot instance pricing, you're looking at roughly forty-eight to seventy-three dollars to train a GPT-2-grade model. That's about a six-hundred-fold cost reduction from the original GPT-2 training seven years ago.
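Those figures check out with back-of-the-envelope arithmetic. Note that the spot rates below are our assumption, not a number from Karpathy.

```python
# Sanity-checking the nanochat numbers: 8 H100s for ~2 hours.
# The spot-price range is an assumption (roughly $3.00-$4.50 per
# GPU-hour); the roughly $48-$73 quoted on the show falls out of this math.

GPUS, HOURS = 8, 2
SPOT_LOW, SPOT_HIGH = 3.00, 4.50  # assumed USD per H100 spot GPU-hour

gpu_hours = GPUS * HOURS  # 16 GPU-hours total
print(f"~${gpu_hours * SPOT_LOW:.0f} to ~${gpu_hours * SPOT_HIGH:.0f} per run")
```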
Under fifty bucks and two hours for something that used to cost tens of thousands and take weeks.
And Karpathy's follow-up comment was fascinating. He mused that the new meta in AI research is, quote, what is the research org agent code that produces improvements on nanochat the fastest? He's basically asking when AI agents will start optimizing their own training pipelines. That recursive improvement loop is the thing that has both AI optimists and AI safety researchers paying very close attention.
Now Marcus, this security story is genuinely alarming. A supply chain attack called Clinejection compromised four thousand developer machines through an absurdly simple vector.
A prompt injection hidden in a GitHub issue title. That's it. The AI coding tool Cline had an automated triage bot that used Claude to read and categorize incoming issues. An attacker crafted an issue title that tricked the bot into running an npm install command from a forked repository. On February 17th, this exploit chain was used to publish an unauthorized version of Cline to npm, which installed a rogue AI agent called OpenClaw on roughly four thousand developer machines during an eight-hour window.
The attack vector was literally just opening a GitHub issue with a carefully worded title?
No novel techniques required. Indirect prompt injection, GitHub Actions cache poisoning, and weaknesses in the credential model, all composed into a single chain. And here's the uncomfortable truth one Hacker News commenter highlighted: the agent's built-in sandboxing has it asking permission for everything, to the point where developers just automatically answer yes. That permission fatigue is the real vulnerability.
So the security measure designed to protect developers actually trained them to ignore security prompts.
Classic alert fatigue. And as AI coding assistants become ubiquitous, this attack surface is enormous. If your AI agent has shell access and reads untrusted input, you have a supply chain vulnerability. Full stop.
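For the developers listening, the vulnerable pattern is worth seeing in miniature. This sketch is not Cline's actual code; llm is an assumed callable standing in for any model-backed triage bot.

```python
import subprocess

# Illustrative only -- not Cline's actual code. The anti-pattern: untrusted
# text (a GitHub issue title) flows into an LLM prompt, and the model's
# output is then allowed to drive a shell command.

def vulnerable_triage(issue_title: str, llm) -> None:
    # The attacker controls issue_title, so they indirectly control whatever
    # "fix command" the model emits -- e.g. an npm install from a forked repo.
    suggestion = llm(f"Triage this issue and suggest a fix command: {issue_title}")
    subprocess.run(suggestion, shell=True)  # the fatal step

def safer_triage(issue_title: str, llm) -> str:
    # Treat model output as data, never as instructions: constrain it to a
    # fixed label set, and execute nothing.
    allowed = {"bug", "feature", "question", "duplicate"}
    label = llm(f"Classify as one of {sorted(allowed)}: {issue_title}").strip().lower()
    return label if label in allowed else "needs-human-review"
```

The structural fix is the one Marcus points at: an agent that reads untrusted input shouldn't have shell access at all, because no amount of permission prompting survives alert fatigue.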
Last story. Armin Ronacher, the creator of Flask, published an essay called "AI and the Ship of Theseus" about open source licensing. Marcus, what's the argument?
If an AI can rewrite any GPL-licensed library from scratch based on exposure to the code, and US courts have ruled machine-generated code can't be copyrighted, then copyleft licenses like GPL may be effectively unenforceable. AI is collapsing the open source licensing spectrum into two choices: fully permissive or fully proprietary. And a simultaneous GitHub dispute over the chardet library illustrated the problem perfectly. Maintainers relicensed from LGPL to MIT, the original author argued the rewrite was derivative work, and the community couldn't agree on where the legal lines are.
This feels like it could undermine the entire foundation of open source.
The incentive structure that produced Linux, GCC, and thousands of foundational projects relied on copyleft licenses ensuring contributions flow back to the community. If AI makes those licenses unenforceable, that social contract breaks down. One commenter noted a SaaS company's engineer was able to recreate their entire platform by letting Claude reverse-engineer their open source mobile apps. That's not hypothetical. It's happening now.
Friday big picture, Marcus. GPT-5.4 beats humans at computer use. The Pentagon blacklists a company for refusing to enable surveillance. ChatGPT may be weakening our brains. A GitHub issue title compromised four thousand machines. And open source licensing might be broken. What's the thread?
The thread is that AI is outgrowing the systems we built to manage it. Our security models assume humans review what agents execute, but developers click yes until something breaks. Our licensing frameworks assume code has identifiable human authors, but AI rewrites dissolve that assumption. Our cognitive frameworks assume tools augment thinking, but this MIT study suggests they might be replacing it. And our government contracting frameworks assume supply chain risks come from foreign adversaries, not domestic companies with ethical objections. Every one of today's stories is about an existing system, whether legal, cognitive, security, or political, encountering something it wasn't designed for. The AI isn't breaking these systems maliciously. It's simply revealing that they were built for a world that no longer exists.
Built for a world that no longer exists. That's a good note for the weekend. Maybe use it to think through some things without AI assistance. Give those neural networks a workout.
Your brain will thank you. Apparently by about fifty-five percent.
That's your AI in 15 for Friday, March 6, 2026. Have a great weekend. See you Monday.