Hey y’all, Alex here with your weekly AI news catch-up. It’s one of those Thursdays where no matter how well I prep, the big AI labs seem hell-bent on shipping ahead of each other. Alibaba dropped Qwen 3.6 under Apache 2.0, confirming their commitment to open source; then Anthropic released Claude Opus 4.7 (not quite Mythos); and OpenAI followed with a huge Codex update that includes Computer Use, among other things. The highlight of Computer Use is the background usage, more on that below. This is all just from today!

Earlier in the week we had two incredible 3D world generators, Lyra 2.0 from Nvidia and HYWorld 2 from Tencent; Windsurf dropping version 2.0 with Devin integration; Google releasing Gemini TTS, with support for over 90 languages and an incredible emotional range; and Baidu open-sourcing Ernie Image, rivaling Nano Banana.

Today on the show we had three awesome guests: Theodor from Cognition joined to cover the new Windsurf, Kwindla is back on the show to talk about “the side project that escaped containment,” Gradient-Bang, a multi-agent, voice-based space game, and Trevor from Marimo joined to talk about pairing your agents with a Marimo notebook.

Let’s dive in! 👇

ThursdAI - We’re over 16K on YT today, my goal is to get to parity with Substack, please subscribe.

Codex can now really use your computer: OpenAI updates Codex with CUA, Image Generation, Browser, SSH (X, Blog)

Codex has been the major focus inside OpenAI for a while now. We’ve reported previously that OpenAI is winding down Sora and other “side quests” to focus, and that they will merge Codex, ChatGPT and the Atlas browser into one “superapp”, and today, it seems, we’ve gotten an early glimpse of what that app will be.

The Codex team (which seems to be growing by the day) has been on a TEAR feature-wise lately, trying to beat Claude Code, and they pushed an update with a LOT of features, among them a new memory system, an internal browser and image generation.
The highlight for me, though, was absolutely the polished computer use experience. Computer use is not new: Claude has had a computer-use feature flag, and so have many others. Hell, we told you about computer use with Open Interpreter back in September of 2023. But this... this feels different.

You see, OpenAI quietly purchased a company called Software Applications Incorporated, which almost launched a macOS AI companion called Sky a year ago. This team is obsessed with the Mac, and somehow they were able to build a magical experience, a huge part of which is the fact that they control the Mac in the background. This is black-magic stuff. You work on one document while Codex clicks buttons and does things in another, without interrupting you.

You may ask: Alex, why do you even care so much about computer use, when most of the work happens in the browser anyway, and Claude (and Codex) can already control my browser? Well, true, but not ALL work happens there. Take file system integration: uploading and downloading files is a notoriously brittle part of browser automation. I’ve spent countless cycles trying to get this to work with OpenClaw, and this just does it. It closes the loop between knowledge work in the browser (yes, this thing can use your browser) and the broader OS.

It’s so, so polished, I truly recommend you try it. It’s as easy as @-tagging any app that you have running and asking Codex to do stuff there. Pro tip: enable fast mode for a much smoother experience.

Anthropic Opus 4.7 is here, not quite Mythos, 64.3% SWE-bench Pro, tuned for long-running tasks (X, System Card)

What is there to say? Is this the model we expected from Anthropic after the news about Claude Mythos last week? No. But hey, we’ll take it. A new Claude Opus, with significantly improved multimodal capabilities and better long-horizon coding? For the same price? Well, not quite!
Apparently, this could be a trained-from-scratch model, given that the tokenizer (the thing that converts words into tokens for the LLM to understand) is a different one. It also uses 1.3x more tokens for the same tasks, which means the new default model from Anthropic became effectively more expensive. (Anthropic acknowledged this by raising usage limits on their subscription plans, by an unspecified amount, but it will still be a token tax on API use.)

How about performance? Well, it’s hard to judge on evals alone, but they are great. A huge jump on SWE-bench Pro, over a 10% improvement, puts this model as the best out there, except Mythos. It’s also the best at real-world knowledge via GPQA Diamond (except Mythos). Are you seeing a trend here? Anthropic released a preview of a model, but for the first time it’s not their “absolute best” model, and in a weird move they compared it on evals to an unreleased model (presumably 10x the size?).

As far as we’ve tested it, it gave an incredibly detailed response on the Mars question we constantly test, for both me and Nisten; Opus 4.7 produced an incredibly detailed 3D-rendered result, much better than our previous tries. I’ll be keeping an eye on this model and keep you guys up to date on what else we find. Vibe checks so far: it’s more expensive, long-context behavior is unclear, but it’s a great vibe model.

Alibaba is back - Qwen 3.6 is Apache 2.0, 35B with 3B active parameters (X, HF, Blog)

The coolest thing about this release is not the evals (though they claim to outperform the much denser Qwen 3.5-27B on multiple benchmarks); it’s that Alibaba is putting out models with open weights and an Apache 2.0 license! We previously reported on rumors from inside Alibaba that internal restructuring caused many of us to doubt whether they would stay committed to open source, and they answered!
Another highlight for me in this model is that Alibaba has an OpenClaw bench (which they are promising to release soon), and this model matches the dense model and beats Gemma 4 by a wide margin on that task. This model is also natively multimodal, with 262K context extensible to 1M via YaRN.

MiniMax M2.7 Open Weights - 230B MoE with only 10B active (X, HF)

Our friends at MiniMax finally dropped M2.7 in open weights (technically not fully Apache: commercial use requires their authorization, but it’s free for research, personal use, and coding agents). It’s a 230B parameter MoE with only 10B active parameters, and it’s matching GPT-5.3-Codex on SWE-Pro at 56.22%. On Terminal-Bench 2 it hits 57%.

But the real story here, the part that made me stop scrolling, is the self-evolution piece. They let an internal version of M2.7 run its own RL optimization loop for 100+ rounds with zero human intervention. The model analyzed its own failure trajectories, modified its own scaffold code, ran evals, and decided whether to keep or revert changes. It got a 30% performance improvement on internal metrics. The model improved itself. Shoutout to the MiniMax team, longtime friends of the pod, and they keep delivering (they promised to release the weights for this one, and they did).

This week’s buzz - news from Weights & Biases from CoreWeave

This week was a very big one in our corner of the AI world. Our parent company CoreWeave announced not one, not two, but three major deals, including one with Anthropic, a renewed commitment from Meta and a renewal from Jane Street. CoreWeave now serves 9 out of the top 10 AI model providers in the world. 🎉 Oh, and a small plug: if you want tokens powered by the same infrastructure, our CoreWeave Inference service is open and very cheap, and we’ve recently added both Gemma 4 and GLM 5.1 to it.
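A quick aside on that MiniMax self-evolution loop, because the pattern is simpler than it sounds: propose a change to your own scaffold, re-run the evals, keep the change if the score improved, revert otherwise. Here is a minimal Python sketch of that keep-or-revert pattern. Everything in it is hypothetical, not MiniMax’s actual code: the scaffold is just a dict of parameters, and `evaluate` is a toy scoring function standing in for a real eval harness.

```python
import random

def evaluate(scaffold):
    """Stand-in for running an eval suite; higher is better."""
    # Toy objective: reward more self-checks, penalize retry delay.
    return scaffold["self_checks"] * 2 - scaffold["retry_delay"]

def propose_change(scaffold, rng):
    """Stand-in for the model editing its own scaffold code."""
    candidate = dict(scaffold)
    key = rng.choice(list(candidate))      # pick one parameter to mutate
    candidate[key] += rng.choice([-1, 1])  # nudge it up or down
    return candidate

def self_improve(scaffold, rounds=100, seed=0):
    """Run `rounds` of propose -> evaluate -> keep-or-revert."""
    rng = random.Random(seed)
    best_score = evaluate(scaffold)
    for _ in range(rounds):
        candidate = propose_change(scaffold, rng)
        score = evaluate(candidate)
        if score > best_score:             # keep the change...
            scaffold, best_score = candidate, score
        # ...otherwise revert, i.e. simply drop the candidate
    return scaffold, best_score

final, score = self_improve({"self_checks": 1, "retry_delay": 3})
print(final, score)
```

The real system replaces the toy pieces with expensive ones (failure-trajectory analysis to propose edits, full eval runs to score them), but the control flow, a greedy accept-if-better loop with automatic revert, is the part that lets it run 100+ rounds unattended.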
This week on the pod, I chatted with Trevor, founding engineer at Marimo Notebooks (also part of CW), about their recent highlight of pairing an AI agent with Marimo notebooks. They went quite viral on Hacker News and I wanted to understand why. I understood why, it’s really cool. Check Trevor out on the pod starting around the 01:05:00 mark.

Tools & Agentic Engineering

Windsurf 2.0 - Agent Command Center + Devin in the IDE - interview with Theodor Marcu (X, Blog)

The first big post-Cognition-acquisition move for Windsurf dropped this week, and I got to chat with Theodor Marcu from Cognition about it on the show. The headline: Windsurf 2.0 brings an Agent Command Center, think Kanban-style mission control for all your agents, plus native Devin integration baked right into the IDE, and Spaces (persistent project containers that group your agent sessions, PRs, files, and context).

The framing Theodor gave me: local agents are pair programmers bounded by your attention (they stop when you close the laptop), while cloud agents are independent hires. Windsurf 2.0 tries to unify both paradigms in one interface. You can plan locally with Cascade using the Socratic method, going back and forth, challenging assumptions, building up context, and then with one click hand off execution to Devin, which runs in its own cloud VM, opens PRs, runs tests, and even tests its own work using computer use on its own Linux desktop. You can close your laptop and it keeps shipping.

One reality check from the community: Devin is great but not cheap. One early tester burned $25 in credits for a 15-20 minute bug fix that produced “okay” results. Something to watch on the Max plan economics. Devin access is rolling out gradually to Windsurf users over 48 hours from launch. Shoutout to Swyx, who helped design Spaces three months ago while at Cognition!

Warp terminal now supports any CLI agent with vertical tabs and mobile control (X, Blog)

This one is for the terminal enjoyers.
Warp, which in my opinion is the best terminal experience out there, just shipped first-class support for any CLI agent — Claude Code, Codex, OpenCode, Gemini CLI, all running side by side in vertical tabs with live status indicators. The killer