ThursdAI - The top AI news from the past week

From Weights & Biases: join AI Evangelist Alex Volkov and a panel of experts as they cover everything important that happened in the world of AI over the past week

Every ThursdAI, Alex Volkov hosts a panel of experts, AI engineers, data scientists, and prompt spellcasters on Twitter Spaces to discuss everything major and important that happened in the world of AI over the past week. Topics include LLMs, open source, new capabilities, OpenAI, competitors in the AI space, new LLM models, AI art and diffusion, and much more. sub.thursdai.news

  1. 2D AGO

    ThursdAI - May 14 - TML Interaction Models, Musk v Altman Disclosures, CW Sandboxes & /goal Takes Over

    Hey everyone, Alex here 👋 I am back live on ThursdAI after a week off, and yes, I am now a married man! Thank you for all the congrats, and also thank you to Ryan and Yam for holding down the fort last week while I tried very hard to disconnect. This week was a relatively chill one in AI land (no, really, for once), which actually let us go deep on some really fascinating stuff. We’ve got Thinking Machines Lab finally shipping their first real research with these wild interaction models, Meta Muse Spark showing up in actual products (and it’s surprisingly good!), the Musk v. Altman trial dropping juicy disclosures, and probably the biggest narrative shift on the show today: all of us are quitting OpenClaw. Yeah, you read that right. We’ll get into why. Also! and this is breaking news from this morning, CoreWeave just launched Sandboxes for your agents. I’ll cover that in This Week’s Buzz, but if you’ve been waiting for production-grade sandbox infrastructure that powers 9 out of 10 major AI labs, today’s your day. Oh, and we had Vic Perez from Krea on to talk about Krea 2, their first foundation image model trained completely from scratch. Let’s dig in.

    The Great OpenClaw Exodus towards Hermes 🫠

    I’m going to start with what was honestly the most emotional thread of the entire show, because three of us (me, Ryan, AND Wolfram) all independently switched away from OpenClaw this week. And we kicked off the show literally processing this together on air. The story is the same across all of us. OpenClaw was magical back in February when we first brought it to you. Things just worked. But after Anthropic’s pricing changes (we covered this — they made Max-tier subscription usage of Opus through OpenClaw significantly more expensive), and after months of the constant Lego-construction-style breakage on every update, the magic faded. Ryan said it best on the show: he was “constantly fixing OpenClaw” instead of using it. So Ryan went to Codex. Wolfram and I both went to Hermes from Nous Research. And folks, things just work again. That February feeling is back, and with GPT 5.5, it’s an incredible assistant!

    Why Hermes? A few things:

    * It’s now the #1 most-used CLI agent on OpenRouter globally, passing OpenClaw and even Claude Code in OpenRouter usage. That’s a massive milestone for Nous Research and shows we’re not alone in this migration.
    * It has /goal (more on this in a sec), steering, and background computer use via the TryCUA integration.
    * It’s open! Which means if you’ve built a system like Wolfram’s “Amy” or my “Wolfred” or Ryan’s “R2” (yes, we know each other’s assistants’ names better than each other’s kids’ names at this point 😅), you can port your memories, profile, and soul files seamlessly.

    The migration was so smooth that Wolfram literally had Codex talk to Hermes to plan and execute the migration of his home assistant agent. Two agents collaborating to migrate themselves. We are living in 2026 and it’s easier than ever to switch. If you haven’t tried Hermes, give it a go!

    Steering is maybe the most underrated addition to Hermes. It’s a Codex feature, but it exists in Hermes too: with GPT 5.5 you can send a follow-up message and the agent will see it after the next tool call, not after the whole chain of thought has completed (which is what OpenClaw defaults to). This makes the conversation much more natural!
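    To make that concrete, here is a minimal sketch of the difference, with invented names (this is not Hermes or Codex source): an agent loop that drains queued user follow-ups after every tool call, instead of only once the whole chain finishes.

    ```python
    import queue

    # Illustrative sketch only, not any real agent's source. "Steering" here
    # means user follow-ups are injected into context after EVERY tool call,
    # rather than only after the full reasoning/tool chain completes.
    user_messages: "queue.Queue[str]" = queue.Queue()

    def agent_turn(task, llm, tools):
        context = [{"role": "user", "content": task}]
        while True:
            step = llm(context)  # model returns either a tool call or a final answer
            if step.get("final"):
                return step["content"]
            result = tools[step["tool"]](**step["args"])
            context.append({"role": "tool", "content": result})
            # Steering: surface queued follow-ups NOW, between tool calls,
            # instead of waiting for the whole chain of thought to finish.
            while not user_messages.empty():
                context.append({"role": "user", "content": user_messages.get()})
    ```

    The non-steering default is the same loop with the inner while moved outside the outer one, which is exactly why follow-ups feel laggy there.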
    Agents buying wedding gifts using Stripe wallet!

    Real quick story: Two weeks ago we covered Stripe’s new wallet APIs that let your agents have actual budgets to spend money on the web. I told my agent (back when it was still OpenClaw) to “go buy us a wedding present, don’t tell me what it is.” It half-worked, half-broke. This week, a giant custom map of our travels arrived in the mail. I approved one Stripe push notification and the rest just happened. I’ve also had Hermes pay traffic tickets for me via screenshots (HOV lane ones, not like... DUI; 80% of my drive is Tesla FSD). So so happy that my AI assistant got us a present of his own choosing! And it arrived in physical form. Not perfect (the date there is our proposal date ha), but it’s still cool!

    Codex gets remote control! (X)

    While Wolfram and I moved to Hermes, Ryan Carson moved to Codex, and during the show, I wondered: how does he communicate with his R2? Well, just a few minutes after we concluded the live show, OpenAI dropped some breaking news! Codex is now on mobile, and it connects to any Mac (for now), from any iOS/Android device, and you can control your Codex, your whole Mac with Computer Use, your browser with the Chrome extension, and everything else Codex can do... on the go! This is a huge unlock for many folks, and for many, I assume this will nearly replace the need for something like OpenClaw/Hermes, be much more secure by default, and work flawlessly out of the box! The setup is super easy: after updating your ChatGPT app, you now have a new “Codex” window, and after updating the Codex Mac app, you will be able to pair them, and voila, all your Codex local sessions are on the iOS app as well. This works way better than Claude remote btw, significantly so. The fact that you can now add multiple Macs (+ SSH servers; they also added the ability to remote control other servers via SSH) is a huge deal. OpenAI is quickly leapfrogging Anthropic, and many are noticing this and switching away from Claude Code.

    Big Companies & APIs

    Meta Muse Spark: The Voice AI That Actually Does Things 🎤

    Let’s start with the one I actually got to play with: Meta launched Muse Spark-powered voice conversations across the Meta AI app, WhatsApp, Instagram, Facebook, and the Ray-Ban Meta glasses (X, Announcement). And folks, I was honestly surprised by how good this is. I recorded a 5-minute live test and it’s not cut at all. The voice mode reacts almost instantaneously. It’s multilingual (it correctly identified Russian and Hebrew even if it can’t respond in them yet). It can search the Meta network mid-conversation — I showed it a screenshot of one of my own Instagram Reels and within half a second it found the exact reel and explained what we were discussing. Half a second. It also does live camera AI, where it watches what your phone sees. The only thing it failed to identify? My Meta Ray-Ban glasses. The Meta AI didn’t know what Meta Ray-Bans look like. That was the funniest moment of the whole demo. The team at Meta’s Superintelligence Labs spent 4.5 months building this, and the thing that really stood out to me from the announcement is this line: “Our models are scaling predictably. Muse Spark is an early data point on our trajectory, and we have larger models in development.” Translation: this is the small one. Bigger Muse models are coming. Meta’s superpower here, as always, is distribution. They can shove this into the daily product surface of billions of users.

    ChatGPT advanced voice mode (still on the GPT-4o family) has gotten genuinely worse lately — I barely use it anymore. Meanwhile Meta is shipping good real-time voice across WhatsApp and Instagram. This is the speed-of-product-integration game, and Meta is winning it.

    Thinking Machines Lab Previews Full Duplex Interaction Models 🤯

    This is the one Wolfram and I really geeked out on. Mira Murati’s Thinking Machines Lab finally released real research — and it’s a fundamentally different bet than what anyone else is making (X, Blog). They’re calling them interaction models, and TML-Interaction-Small is a 276B parameter MoE with 12B active, trained from scratch for native real-time human-AI collaboration. Note: they announced it, they didn’t release weights or an API yet; a limited research preview is coming “in the next few months.” Here’s why this matters and what makes it different from Meta’s voice mode (which is also impressive!): the architecture is 200ms micro-turns where the model is continuously perceiving audio, video, AND text WHILE simultaneously generating output. There’s no turn boundary detection, no VAD harness — the model itself handles all of that natively. It’s full duplex baked into the weights (sketched below).

    The demos are fire. The model can:

    * Speak while listening (live translation in real-time)
    * Watch you do pushups and proactively count them out loud as you go
    * Wait silently until someone enters the frame, then say “friend”
    * Generate a chart while continuing to explain a concept to you

    The benchmarks: 77.8 on FD-bench v1.5 vs GPT Realtime 2.0 at 46.8, and 0.40s turn-taking latency vs over a second for everyone else. Nisten was unimpressed (he pointed out 1.2 seconds for a 12B-active model on a B300 rack is not exactly snappy), and that’s a fair take — but the capabilities here, particularly visual proactivity and time-awareness, are genuinely novel. The philosophical split is really interesting. While every other lab is racing toward full autonomy, Mira is saying interactivity should scale with intelligence. That’s the bet. And given the all-star team she’s pulled together (people from ChatGPT, Character.ai, Mistral, PyTorch, OpenAI Gym, Fairseq, SAM)... I’m here for it. What I really hope happens: someone leaks the weights. A 276B MoE with 12B active is exactly the kind of model we need to be able to quantize to run on something like the Richie Mini for a fully offline, always-present home assistant. Wolfram, I know you’re thinking the same thing 👀
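    Here is the promised sketch of the micro-turn loop shape the announcement describes: perceive and emit in the same ~200ms step, with no separate voice-activity-detection stage. All the names here (capture_inputs, model.step, speak) are invented placeholders, not TML’s actual API.

    ```python
    # Hedged sketch of a full-duplex "micro-turn" loop. Run it under asyncio,
    # e.g. asyncio.run(duplex_loop(model, capture_inputs, speak)).
    MICRO_TURN_SECONDS = 0.2  # the ~200ms micro-turn from the announcement

    async def duplex_loop(model, capture_inputs, speak):
        state = model.initial_state()
        while True:
            # Gather whatever arrived in the last micro-turn, all modalities at once.
            audio, video, text = await capture_inputs(MICRO_TURN_SECONDS)
            # Perceive and generate in the SAME step: the model may stay silent,
            # keep talking, or interject, even while the user is still speaking.
            state, output = model.step(state, audio=audio, video=video, text=text)
            if output:  # e.g. "3... 4..." while watching you do pushups
                await speak(output)
    ```

    The contrast with classic voice stacks: there is no "user finished, now respond" boundary anywhere in this loop; silence is just the model choosing to emit nothing this step.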
    Musk v. Altman: The Trial Drops Some Wild Disclosures and Testimony

    Okay this one is half drama, half disclosure goldmine. The trial is happening live as we record, closing statements are TODAY (I transcribed both of them here and

    1h 43m
  2. MAY 8

    📅 ThursdAI - May 7 - Interviews with Sunil Pai, Sally Ann O'Malley from AI Engineer Europe

    Hey yall, Alex here (with a scheduled post). I’m taking this week off to get married and celebrate life with family, and touch some grass, but I wanted to share the awesome chats I had with some great folks at AI Engineer Europe last week. BTW - Yam and Ryan took over the live show today; if you didn’t happen to catch that, please check out the live show on our YouTube channel! Ok, now to the actual content. The best thing about the AI Engineer conferences for me is the people I meet. I often have a chance to bring them to the live show (in fact, the live show we recorded there had the most guests yet on an episode! 4 guests including Swyx, Omar Sanseviero, VB from OpenAI and Peter Gostev). But oftentimes I also have an offline chat. I find these conversations to be less about the week’s news, and more about the state of AI Engineering, and the guests themselves. Not quite Lex Fridman pod level, but a different vibe from our live shows.

    Sunil Pai - Cloudflare (@threepointone)

    The first conversation in today’s pod is with Sunil Pai, Principal Engineer at Cloudflare. Long time followers of ThursdAI know that I love Cloudflare, they gave me my first big break when I was building Targum (which still runs on Workers), so I had a great time chatting with Sunil! This guy has had several lives. React.js core team at Meta (he self-deprecates — "I'm the one nobody talks about, there's a testing API I shipped that pisses people off"). Then did developer tooling and the CLI at Cloudflare the first time. Left to found PartyKit — an open-source deployment platform for real-time multiplayer apps and AI agents, built on Cloudflare Durable Objects. Backed by Sequoia. Acquired by Cloudflare in 2024, and he came back as a Principal Systems Engineer (per his bio: "Worked at Cloudflare once, left and created PartyKit, came back wiser"). Also plays guitar (Les Pauls — it's all over his blog). Co-hosts a live show called Dry Run on Cloudflare TV with Craig Dennis. Our conversation was a very fun one, ranging from Cloudflare’s agentic offerings to how engineers should think about writing/reading code in 2026. I had a great time chatting with Sunil and I hope you enjoy getting to know him!

    Sally Ann O'Malley - Red Hat

    Then I had the pleasure of chatting with Sally, who’s a Principal Engineer at Red Hat and a contributor to OpenClaw. Sally has one of the more unusual paths in the speaker lineup. Started as a schoolteacher, did a stint at Trader Joe's, then moved to Westford, MA, discovered Red Hat's HQ across the street, and went back to school for a second bachelor's in software engineering at UMass Lowell. Joined Red Hat in 2015, has been there a decade. Worked across OpenShift teams, integrating Kubernetes and Podman into the platform. Recent projects span Image Based Operating Systems, Podman, OpenTelemetry, and Sigstore. Also an instructor at Boston University's Faculty of Computing and Data Sciences and an organizer for DevConf.US. Won the 2025 Paul Cormier Trailblazer Award at Red Hat. Currently a founding contributor on the llm-d project — distributed, scalable, high-performance AI inferencing built on K8s. Heavily involved in Red Hat's InstructLab collaboration with IBM (the small-model distillation system using IBM Granite + Llama). Sally and I had a great conversation, two high energy personalities met! We geeked out about our OpenClaw agents, securing your Clankers, what it’s like to maintain OpenClaw, and everything in between!
    She was so stressed about the recording, but dare I say, this was one of the more natural guests I had on the show! I hope you enjoyed this format, please let me know in the comments, and I’ll see you next week! — Alex

    53 min
  3. MAY 1

    📅 ThursdAI - Apr 30 - DeepSeek V4 (1.6T MoE), Cursor SDK Wins WolfBench, Mayo's REDMOD Saves Lives, Stripe Gives Agents a Wallet & more

    Hey everyone, Alex here 👋 Tomorrow is May. May! I genuinely cannot believe we’re four months into 2026 already, and the AI news cycle is showing zero signs of slowing down. This week’s show was a wild one! We opened with what is genuinely one of the most important AI stories I’ve ever covered (Mayo Clinic AI detecting pancreatic cancer THREE YEARS before human radiologists), we covered the return of the Chinese whale with DeepSeek V4, OpenAI got caught in their own system prompt begging GPT-5.5 to please stop talking about goblins, and I literally gave my coding agent a credit card and asked it to buy my fiancée a wedding gift with the new Stripe Link skill and CLI! Oh yeah, I’m getting married next Tuesday! 💍 So next week’s show will be a little different. I’ll be back the week after to catch you up on whatever drops in my absence (almost certainly something major, knowing this industry). Lots to get through, so let’s dive in. (Also, at the end I have a full month recap of every major launch, don’t miss it.)

    Mayo Clinic’s REDMOD: AI Detects Pancreatic Cancer 3 Years Early 🔥 (X, Blog, Announcement)

    I know we usually cover models, parameter sizes, MoEs and big companies. But this is important. This is the use case that justifies the entire AI revolution, the GPU burns, the buildouts. I want humans to WIN, and cancer to be fixed! Mayo Clinic just published a study in Gut (BMJ) validating an AI model called REDMOD that detects pancreatic cancer on routine CT scans up to three years before clinical diagnosis. The numbers are jaw-dropping: they show 73% sensitivity for catching prediagnostic cancers, compared to 39% for experienced human radiologists (while looking at the same exact CT scans). And maybe the most important bit: at scans taken more than 2 years before diagnosis, the AI catches nearly 3x as many cases as specialists. For context: pancreatic cancer has less than 15% five-year survival specifically because 85% of patients are diagnosed after the disease has already spread. This is the cancer that took Steve Jobs. Imagine if Jobs had access to this AI three years before his diagnosis. That’s the impact we’re talking about. As Dr. Ajit Goenka from Mayo Clinic put it, the greatest barrier to saving lives from pancreatic cancer has been the inability to see the disease when it’s still curable. This AI can now identify the signature of cancer from a normal-appearing pancreas. Even better: it runs on CT scans people are already getting for other reasons. No extra screening protocol, no new imaging required. Just smarter analysis of existing data. The model also showed remarkably stable performance across institutions, imaging systems, and protocols, with 90-92% test-retest concordance over serial scans. Mayo Clinic is now moving this into prospective clinical testing through a study called AI-PACED (Artificial Intelligence for Pancreatic Cancer Early Detection). When we say “let’s f*****g go” that’s what we mean. Yeah, getting more intelligence is cool, but I want a world without disease! Let’s f*****g go, Mayo Clinic!

    Agentic Commerce - Giving OpenClaw My Credit Card - Safely! Stripe Link Wallet and Infrastructure CLI (X, Announcement, Blog, Announcement)

    Ok, give an LLM your credit card, what can go wrong... right? Well, it’s clear that this, increasingly, is the future of commerce.
    Agents will be shopping for us, and we need solutions here. Well, this week Stripe Sessions (Stripe’s annual product conference) delivered. Link Wallet is a new... API? CLI? Skill? Definitely a skill for your agents: it connects to your Stripe Link (the thing that stores your credit cards safely), and after you give your agent a budget, it can go and make purchases on your behalf. The trick here is that for every purchase you get a notification to approve, and the agent never sees your actual credit card number! This, I think, is the biggest win here (there’s a small illustrative sketch of this flow at the end of this section). To test it out, first, I showed Wolfred the install instructions, which are literally this: Read link.com/skill.md and get me set up with Link. And then I asked Wolfred, my OpenClaw assistant, to buy me a present of its choice for my upcoming wedding, telling it that I don’t want to know what the present is, but I can approve the spend! OpenClaw installed this, sent me a link to connect to my Link.com account, I also downloaded the Link app to receive notifications (and had to enable them by hand, which was a bit annoying to discover, but they said they will fix the onboarding) and... voila, my agent can now go spend my money, and I get these approval notifications: The kicker? The present Wolfred sent us is due to arrive like 2 months after the wedding 😂 But hey, it’s still something! My agent went, chose a wedding gift in budget, asked for my approval to purchase, filled out the details (asked me for a few of them) and voila, the first agentic purchase that did not require exposing my credit card! Stripe announced a whole bunch of other Agentic Commerce Suite features, like Shared Payment Tokens, which are scoped to the seller and protected by Radar, MPP (machine payment protocol), and streaming payments using stablecoins that are pretty slick, and a bunch of other interesting things. This is where the world is moving to, and Stripe is innovating hard here, definitely worth keeping an eye on what they’re doing. Speaking of agents and Stripe, they also opened up the waitlist for projects.dev - which is a way for agents to provision accounts fully on their own, get API keys, and set everything up from scratch. I think it’s a wonderful addition to the agentic tools and the agentic internet! Your agent just runs something like stripe projects add cloudflare/workers and boom, you have a Workers deployment, with credentials synced, no dashboard clicking or API key creation!
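    For intuition, here is a purely illustrative Python shape of that budget-plus-approval flow. These class and method names are NOT real Stripe Link APIs (the actual integration is the skill.md install above); the point is the security property: budget cap, human approval gate, and no raw card number ever reaching the agent.

    ```python
    from dataclasses import dataclass

    # Hypothetical shapes only; nothing here is the real Stripe SDK.
    @dataclass
    class SpendRequest:
        merchant: str
        amount_usd: float
        description: str

    class AgentWallet:
        def __init__(self, budget_usd: float, request_approval):
            self.budget_usd = budget_usd
            self.request_approval = request_approval  # e.g. fires a push notification

        def purchase(self, req: SpendRequest) -> bool:
            if req.amount_usd > self.budget_usd:
                return False                       # over budget: hard stop
            if not self.request_approval(req):     # human taps approve/deny
                return False
            self.budget_usd -= req.amount_usd
            # A scoped payment token would be passed to the merchant here;
            # the agent itself never handles the underlying card number.
            return True
    ```

    The design choice worth noticing: the approval gate sits outside the agent, so even a prompt-injected agent can only propose spends, never complete them.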
    Big Companies & APIs

    GPT-5.5 Goblin Mode: The Funniest Bug Report in AI History (X, Blog)

    Someone on X noticed that the Codex system message for GPT 5.5, which launched last week, has this interesting addition: “Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query” — and it has it twice! This created a bunch of memes, questions and wondering about... why would OpenAI care so much about goblins? And they finally posted a long writeup on why. The TL;DR: GPT 5.5 absolutely LOVES talking about goblins, trolls and other nerdy creatures. This is a result of them favoring the “nerdy” personality archetype and reinforcing this reward via RL. OpenAI admitted that “Unfortunately, 5.5 started training before we found the root cause of goblins” and so, now, we get a 5.5 that LOVES to talk about goblins and can’t stop talking about goblins (unless asked to stop by a system prompt). OpenAI also posted the exact instructions for how to “unleash“ goblin mode on the blog, which I find hilarious; a company that leans into the meme is a company to be celebrated 👏

    GPT 5.5 is as good as Claude Mythos on cybersecurity

    According to the AI Security Institute, GPT 5.5 (not the GPT 5.5 - Cyber version that was announced; the one you have access to) is as good as Claude Mythos at vulnerability finding. We previously reported that Anthropic deemed Claude Mythos “too dangerous to release publicly,” and it turns out that was either a marketing “Myth,” or Anthropic’s inability to serve this huge model like they serve Opus.

    OpenAI Ends Microsoft Azure Exclusivity

    This piece of news sent quite a shockwave throughout the industry: somehow, Sam Altman and OpenAI have been able to negotiate through the very strict deal with Microsoft, and OpenAI models are now available on AWS as well as Microsoft Azure! Apparently the AGI clause is now gone as well! For many startups who are locked into AWS and Bedrock, this is great news: they are now able to use GPT 5.5 and other OpenAI models directly, applying their credits.

    Other Big Company News

    * xAI released Grok 4.3 - in a quiet release in their API docs, no blogpost, not even an X announcement. The only way I knew about this was that Artificial Analysis, Arena and Vals AI all posted that it jumped in scores. With the same price as the previous Grok, but only 1M tokens, it seems significantly better than its predecessor. (X)
    * Gemini can now generate and export Docs, Sheets, Slides, PDFs directly from chat — available globally for free. Google literally put Microsoft Word and Excel icons in the announcement. They’re giving away what Microsoft charges for with Copilot to 750 million users. (X, Blog)
    * Mistral Medium 3.5 dropped as a 128B dense model with 256K context, 77.6% on SWE-Bench Verified, and configurable reasoning effort. Their Vibe coding agent now supports remote parallel agents and session teleportation. $1.5/$7.5 per million tokens. (X, HF, Blog)
    * Baidu’s ERNIE 5.1 Preview landed at #13 on Arena’s Text leaderboard, making it #1 among all Chinese labs. Speculated to be an 800B/36B active MoE using only 6% of comparable pretraining compute. (X, Announcement)

    Open Source AI

    The Whale returns - DeepSeek drops V4 with insane attention innovations (X, Arxiv, HF, HF)

    Folks, DeepSeek just dropped V4! Two models: V4-Pro at a whopping 1.6 trillion params with 49 billion active, and V4-Flash at 284B total with only 13 billion active. Both support 1 million token context natively! V4-Pro-Max gets 93.5% on LiveCodeBench, beating every other model including Gemini-3.1-Pro. Codeforces rating of 3206, that’s a new record, beating GPT-5.4’s 3168. SWE-Bench Verified at 80.6%, that’s basically tied with Opus-4.6 at 80.8%. But here’s the thing, this model doesn’t overwhelm with evals performance, it’s at par with ot

    1h 37m
  4. APR 24

    📅 Apr 23: OpenAI's Week: GPT-5.5, GPT-Image-2, Codex CUA + Chronicle + Claude Design, Kimi K2.6, Qwen 3.6-27B

    Hey, Alex here. I’ll try to catch you up, but it’s one of the more intense weeks in AI in recent memory. Here’s the TL;DR - OpenAI dominates across the board this week! They finally launched “spud”, called it GPT 5.5 (and 5.5 Pro), and it’s SOTA on most things, nearly matching the mysterious Claude Mythos but released, and we can actually use it (we tested it extensively). OpenAI also took the crown in image generation with the incredible GPT-image-2 release, beating Nano Banana 2 and Pro by a significant margin; the images are incredible, this model can generate working QR codes and 360 images, it’s quite bonkers. Codex was updated with Computer Use (which I told you about last week), an in-app browser and a bunch of other tools that match GPT 5.5 intelligence. Meanwhile, Anthropic launched an incredible research preview of Claude Design, finally admitted that Claude was dumb and reset quotas across the board, while breaking the trust of the community by removing Claude Code from the Pro plan. We’ve also got great open source updates: Kimi K2.6 and Qwen 3.6 27B are both great performers! We were live on the stream for almost 4 hours today waiting for GPT 5.5 and finally got it and tested it live on the show + had Peter Gostev on from Arena, who had early access and shared his insights with us. Let’s get into it!

    OpenAI’s GPT 5.5 is here - SOTA AI intelligence you can actually use (Release Blog)

    OpenAI finally gave us all access to their latest intelligence boost, GPT 5.5 thinking (and GPT 5.5 Pro). These models take the crown across many benchmarks, including TerminalBench (82.7%), GPDval (84%) and more. You can see the highlighted versions in the image above. Though, it’s not uncommon for OpenAI to do some chart crimes, so @d4m1n created a chart that also shows the full benchmarks, including the ones where GPT 5.5 is not beating Opus; as you can see below, it underperforms on Humanity’s Last Exam and scaled tool use. But benchmarks don’t tell the full story. GPT 5.5 uses significantly fewer tokens compared to 5.4, about 40% less. It’s also more expensive, but given the lower token usage, it nets out at about a ~20% price increase, while being more intelligent and faster. Tons of folks who had early access are reporting the same things: this model excels in long running tasks. Peter Gostev from Arena, who joined our live stream, showed us an incredible demo that ran overnight for over 8h! This model can work until the task is done, no longer just pausing in the middle asking for your input. The real highlight is, paired with the recent GPT-image-2 (which I’ll expand on later in this newsletter), GPT 5.5 becomes an excellent UI designer. This is a big area in which Claude still has a moat and OpenAI is trying to catch up, and the real alpha now is to use both the image gen and 5.5 in tandem to create beautiful visuals and UIs. The main thing is, after testing it quite a few times, this only works if you generate the image outside of the session that builds the actual UI. We tried a couple of times to do it in 1 session, and the resulting UI doesn’t come remotely close to the generated image. Only after sending this image to a completely fresh session and asking for a “pixel perfect” implementation did GPT 5.5 start to resemble the input image and rebuild the whole UI in pixel-perfect fidelity!
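    Here is roughly what that two-session trick looks like with the OpenAI Python client. Treat it as a sketch: the model IDs are the names used in this post, not verified API strings, and the key point is the split between the two calls.

    ```python
    from openai import OpenAI

    client = OpenAI()

    # Session 1: generate the UI mock as an image, nothing else.
    mock = client.images.generate(
        model="gpt-image-2",          # assumed model id
        prompt="Dashboard UI for a podcast analytics app, clean and modern",
    )
    image_url = mock.data[0].url

    # Session 2: a completely fresh context, no shared history with session 1.
    # The model only ever sees the image, so it treats it as ground truth.
    build = client.chat.completions.create(
        model="gpt-5.5",              # assumed model id
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Implement this UI pixel-perfect in HTML/CSS."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    print(build.choices[0].message.content)
    ```

    The fresh session seems to matter because a single session anchors on its own earlier text description of the UI rather than on the generated pixels.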
    GPT Image v2 - SOTA thinking image model, finally beating Nano Banana (Blog, Live)

    Like we said, OpenAI is dominating this week, and in both instances these are great models. Though, in an apples-to-apples comparison, GPT-image-2 is a much higher jump — from previous models — than GPT 5.5! According to Artificial Analysis, the jump in how many people prefer GPT-image-2 in blind tests compared to other models is the highest we’ve ever seen, over 250 points. And you can clearly see it in the generations as well. Earlier this week, we did a live streaming session with Peter Gostev (from Arena) and we did a deep dive comparing this new model to GPT Image 1.5, Nano Banana and Grok Imagine, and it’s a clear winner across most categories. Character consistency, high-resolution imagery, and instruction following are all so, so good it’s a bit hard to explain in text.

    Reasoning visual intelligence

    Like with Nano Banana, this model is likely based on a big GPT; it’s no longer just diffusion, as you can see, it reasons! And apparently the more reasoning you give it (if you choose GPT Pro) the better it’ll be. The examples are indeed wild: the model can generate images of code that works, and generate functional QR codes and barcodes! The craziest thing people figured out it can do is functional 360 imagery (equirectangular format): you can just ask the model to create a 360 image of a scene and then drop it into a 360 viewer! Peter showed us on the show how he combined GPT 5.5 and Image v2 to create a sort of “street view” from a bunch of 360 images; it blew our minds. He literally spun up an overnight GPT 5.5 task in Codex that planned out the hanging gardens of Babylon, generated hundreds of equirectangular images, stitched them into a walkable interface, and had it running 8+ hours without babysitting. A street view of a place we don’t actually know what it looked like, hallucinated from latent space. What a time. Day one availability is wide: Figma, Canva, Adobe Firefly, fal.ai, and Microsoft Foundry all have it. Nano Banana dominated for what felt like an eternity in AI time (it was really only a few months 😅), and finally OpenAI has a proper answer.

    OpenAI is dropping models on HF - Privacy Filter, a 1.5B Apache 2.0 PII reduction model (X, HF)

    I’ve told you they’ve been cooking this week! OpenAI open sourced a genuinely useful model called Privacy Filter, which has 1.5B parameters with only 50M active, small enough that it runs fully offline in your browser (check out this incredible web demo by our friend Xenova). This model is specifically built to anonymize and filter out personally identifiable information (PII): things like names and addresses, but more importantly bank accounts and API keys! This, in the era of agentic assistants, is extremely important, and I’m very happy that OpenAI is open sourcing here, specifically because while it’s great generally, this model is great for fine-tuning on your own data! Pair this with something like CrabTrap, a new open source proxy with LLM-as-a-judge for agents like OpenClaw, and you’re hardening your setup so that your private details won’t leak even if someone manages to prompt inject your agent! In any other week, CrabTrap would deserve a segment of its own; it is a really novel solution to the “AI agent can leak your creds” problem, created by the Brex CEO, as they run agents inside Brex, but this week is insane, so... you get a link and we move on 🙂
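    For a feel of what running a small PII redactor locally looks like, here is a generic token-classification sketch with transformers. The model id below is a placeholder, not the official one, and this is the standard pipeline pattern rather than OpenAI’s published usage.

    ```python
    from transformers import pipeline

    redactor = pipeline(
        "token-classification",
        model="openai/privacy-filter",   # placeholder id, not verified
        aggregation_strategy="simple",   # merge sub-tokens into whole entities
    )

    def redact(text: str) -> str:
        # Replace each detected span (name, address, API key...) with its label.
        # Iterate right-to-left so earlier offsets stay valid after edits.
        for ent in sorted(redactor(text), key=lambda e: e["start"], reverse=True):
            text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"] :]
        return text

    print(redact("Ping me at alex@example.com, key sk-live-abc123"))
    ```

    The same shape is what you would fine-tune on your own data: it is just token classification with your own PII labels.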
    Claude Design - Anthropic’s Figma killer? (try it, deep dive)

    This launched on Friday (come on Anthropic, why are you launching things on a Friday?!) and nearly tanked Figma stock (16% down since). It didn’t help that Mike Krieger, who runs product at Anthropic and co-leads Anthropic Labs, quit the Figma board just a few days before this release. Claude Design is a new, separate interface for Claude, with its own usage meter, that exists only on web, and only for Max subs for now. We all know that Claude is great at frontend design, but this is an interface that wraps Claude with some incredible “designer like” tools: knobs to edit font sizes, a point-and-click interface to highlight elements for Claude to fix. The highlight for me, what broke my brain on the live stream, was the “talk to the design” feature, where you turn on the microphone, talk to Claude, and while you point, it “knows” what you’re pointing at! So you can say “here, fix THIS thing” without saying what that thing is, and Claude will just fix it, by looking at where your cursor was at the time. This... this feels like magic. The huge unlock in Claude Design is the initial “brand guidelines” process, in which you ask Claude to create a holistic brand identity (based on your website code, a screenshot, a Figma file etc.), and then every new project can have that brand identity preserved, with the right fonts, colors, logos etc. I dropped in the show notes from this week and asked for an interactive infographic website using the brand guidelines. This really does feel like a “new kind” of product. I’ve worked with designers before, and the interaction model with Claude Design feels very much like working with a designer, showing them what you like and don’t like. And like working with a designer, it’s expensive! Claude Design uses Claude 4.7 and buuurns through tokens! I tapped out of my weekly quota in less than 4 projects! Luckily, Anthropic this week admitted that they dumbed down Claude and reset the quotas, so I was able to show it on the live show.

    This week’s Buzz — W&B LEET TUI gets Workspace mode

    Our W&B LEET TUI went viral a couple weeks back (a local terminal UI for watching run stats, metrics, and system health - built for folks training on remote boxes who don’t want to alt-tab to a browser), and the team shipped a big follow-up this week: workspace mode. Multi-run workspaces live, metadata filtering, system metrics (GPU stats included), console logs, and — my favorite — images rendered directly in the terminal. The whole web workspace experience, now in your SSH session. Demo video and full announcement here. pip install wandb, give it a spin.

    Open Source AI

    Kimi K2.6 - Opus at home (if you have a data center) (X, HF, Live)

    Moonshot AI dropped Kimi K2.6 this week, a 1 trillion parameter MoE with 32B active, 384 experts,

    2h 24m
  5. APR 16

    April 16 - Codex uses your Mac in the background, Opus 4.7 release not quite Mythos + 3 interviews

    Hey y’all, Alex here with your weekly AI news catch up. It’s one of those Thursdays where no matter how well I prep, the big AI labs are hell bent on showing up before each other. Alibaba dropped Qwen 3.6 with Apache 2, confirming their commitment to open source, then Anthropic released Claude Opus 4.7 (not quite Mythos) and OpenAI followed with a huge Codex update that includes Computer Use among other things. The highlight of Computer Use is the background usage, more on that below. This is all just from today! Earlier in the week we had 2 incredible 3D world generators, Lyra 2.0 from Nvidia and HYWorld 2 from Tencent, Windsurf dropping version 2.0 with Devin integration, Google releasing a Gemini TTS with 90+ language support and an incredible emotional range, and Baidu open-sourcing Ernie Image, rivaling Nano Banana. Today on the show we had 3 awesome guests: Theodor from Cognition joined to cover the new Windsurf, Kwindla is back on the show to talk about “the side project that escaped containment,” Gradient-Bang, a multi-agent, voice-based space game, and Trevor from Marimo joined to talk about pairing your agents with a Marimo notebook. Let’s dive in! 👇 ThursdAI - We’re over 16K on YT today, my goal is to get to parity with Substack, please subscribe.

    Codex can now really use your computer: OpenAI updates Codex with CUA, Image Generation, Browser, SSH (X, Blog)

    Codex from OpenAI has been the major focus inside OpenAI for a while now. We’ve reported previously that OpenAI is closing down SORA and other “side-quests” to focus, and that they will join Codex, ChatGPT and the Atlas browser into one “superapp,” and today, it seems, we’ve gotten an early glimpse of what that app will be. The Codex team (which seems to be growing from day to day) has been on a TEAR feature-wise lately, trying to beat Claude Code, and they pushed an update with a LOT of features, among them a new memory system, an internal browser and image generation. The highlight for me, though, was absolutely the polished computer use experience. Computer use is not new: Claude has a computer use feature flag, and many others do too. Hell, we told you about computer use with Open Interpreter back in Sep of 2023. But this... this feels different. You see, OpenAI quietly purchased a company called Software Apps Inc, which almost launched a macOS AI companion a year ago called Sky. This team is obsessed with the Mac, and somehow, they were able to build a magical experience, a huge part of which is the fact that they are controlling the Mac in the background. This is like black magic stuff. You work on one document, Codex clicks buttons and does things in another, without interrupting you. You may ask: Alex, why do you even care so much about computer use, when most of the work happens in the browser anyway, and Claude (and Codex) can control my browser anyway? Well, true, but not ALL work happens there; for example, file system integration. It’s a notoriously failure-prone part of browser automation, when you need to upload/download files. I’ve spent countless cycles trying to get this to work with OpenClaw, and this just does it. This closes the loop between knowledge work in the browser (yes, this thing can use your browser) and the broader OS. It’s so so polished, I truly recommend you try it. It’s as easy as @ tagging any app that you have running and asking Codex to do stuff there. Pro Tip: Enable fast mode for a much smoother experience.
    Anthropic Opus 4.7 is here, not quite Mythos, 64.3% SWE-bench Pro, tuned for long running tasks (X, System Card)

    What is there to say? Is this the model we expected from Anthropic after the news about Claude Mythos last week? No. But hey, we’ll take it. A new Claude Opus, with significantly improved multimodality capabilities and long-horizon coding task improvements? For the same price? Well, not quite! Apparently, this model could be a “from scratch” trained model, given that the tokenizer (the thing that converts words into tokens for the LLM to understand) is a different one. It also uses 1.3x more tokens for the same tasks, which means that the new and default model from Anthropic became effectively more expensive (a note they acknowledged by raising the usage limits, by an unspecified amount, on Anthropic subscription plans, but it’ll still be a token tax on API use). How about performance? Well, hard to judge on evals alone, but they are great. A huge jump in SWE-bench Pro, over 10% improvement, puts this model as the best out there, except Mythos. It’s also the best at real world knowledge via GPQA Diamond (except Mythos). Are you seeing a trend here? Anthropic released a preview of a model, but for the first time, it’s not their “absolute best” model, and in a weird move, they have compared it on evals to an unreleased model (presumably 10x the size?). As far as we’ve tested this, it gave an incredibly detailed response on the Mars question we constantly test on; for both me and Nisten, Opus 4.7 produced an incredibly detailed 3D rendered result, much better than our previous tries. I’ll be keeping an eye on this model and keep you guys up to date on what else we find. Vibe checks are... it’s more expensive, long context is unclear, but it’s a great vibe model.

    Alibaba is back - Qwen 3.6 is Apache 2.0, 35B with 3B active parameters (X, HF, Blog)

    The coolest thing about this release is not the evals (though they claim to outperform the much denser Qwen 3.5-27B on multiple benchmarks); it’s that Alibaba is putting out models with open weights and an Apache 2.0 license! We previously reported on rumors from inside Alibaba that internal restructuring caused many of us to doubt whether they would commit to OSS, and they answered! Another highlight for me in this model is that Alibaba has an OpenClaw bench (which they are promising to release soon), and this model does as well as the dense model and beats Gemma 4 by a wide margin on that task. This model is also natively multimodal, with 262K context extensible to 1M via YaRN.

    MiniMax M2.7 Open Weights - 230B MoE with only 10B active (X, HF)

    Our friends at MiniMax finally dropped M2.7 in open weights (technically not fully Apache, commercial use requires their authorization, but it’s free for research, personal, and coding agents). It’s a 230B parameter MoE with only 10B active parameters, and it’s matching GPT-5.3-Codex on SWE-Pro at 56.22%. On Terminal-Bench 2 it hits 57%. But the real story here, the part that made me stop scrolling, is the self-evolution piece. They let an internal version of M2.7 run its own RL optimization loop for 100+ rounds with zero human intervention. The model analyzed its own failure trajectories, modified its own scaffold code, ran evals, and decided whether to keep or revert changes. It got a 30% performance improvement on internal metrics. The model improved itself. Shoutout to the MiniMax team — longtime friends of the pod and they keep delivering (they promised to release the weights for this one and they did).
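    MiniMax published the description of that loop, not the code, so here is a hedged reconstruction in Python; every name below is a placeholder for something we don’t actually have.

    ```python
    # Hedged sketch of the described self-evolution loop:
    # analyze failures -> edit scaffold -> eval -> keep or revert.
    def self_evolve(model, scaffold, run_evals, rounds: int = 100):
        best_score, failures = run_evals(scaffold)
        for _ in range(rounds):
            # The model inspects its own failed trajectories...
            patch = model.propose_scaffold_patch(scaffold, failures)
            # ...and edits its own harness code, not its weights.
            candidate = scaffold.apply(patch)
            score, new_failures = run_evals(candidate)
            if score > best_score:   # keep only measured improvements...
                scaffold, best_score, failures = candidate, score, new_failures
            # ...otherwise revert: the candidate is simply discarded.
        return scaffold
    ```

    Note what is and isn’t self-modifying here: per their description the weights stay fixed; it is the scaffold (harness code) that evolves, which is what made the 30% internal gain plausible without any retraining.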
    This week’s buzz - news from Weights & Biases and CoreWeave

    This week was a very big one in our corner of the AI world. Our parent company CoreWeave announced not one, not two, but 3 major deals, including one with Anthropic, a renewed commitment from Meta and a renewal from Jane Street. CoreWeave now serves 9 out of the top 10 AI model providers in the world. 🎉 Oh, and a small plug: if you want to get tokens powered by the same infrastructure, our CoreWeave Inference service is open and very cheap, and we’ve recently added both Gemma 4 and GLM 5.1 to our inference service. This week on the pod, I chatted with Trevor, founding engineer at Marimo Notebooks (also part of CW), about their recent highlight of pairing an AI agent with Marimo notebooks; they went quite viral on Hacker News and I wanted to understand why. I understood why, it’s really cool. Check Trevor out on the pod starting around the 01:05:00 timestamp.

    Tools & Agentic Engineering

    Windsurf 2.0 - Agent Command Center + Devin in the IDE - interview with Theodor Marcu (X, Blog)

    The first big post-Cognition-acquisition move for Windsurf dropped this week, and I got to chat with Theodor Marcu from Cognition about it on the show. The headline: Windsurf 2.0 brings an Agent Command Center; think Kanban-style mission control for all your agents, plus native Devin integration baked right into the IDE, and Spaces (persistent project containers that group your agent sessions, PRs, files, and context). The framing Theodor gave me: local agents are pair programmers bounded by your attention (they stop when you close the laptop), while cloud agents are independent hires. Windsurf 2.0 tries to unify both paradigms in one interface. You can plan locally with Cascade using the Socratic method — going back and forth, challenging assumptions, building up context — and then with one click, hand off execution to Devin, which runs in its own cloud VM, opens PRs, runs tests, and even tests its own work using computer use on its own Linux desktop. You can close your laptop and it keeps shipping. One reality check from the community: Devin is great but not cheap. One early tester burned $25 in credits for a 15-20 minute bug fix that produced “okay” results. Something to watch on the Max plan economics. Devin access is rolling out gradually to Windsurf users over 48 hours from launch. Shoutout to Swyx, who helped design Spaces three months ago whilst at Cognition!

    Warp terminal now supports any CLI agent with vertical tabs and mobile control (X, Blog)

    This one is for the terminal enjoyers. Warp, which in my opinion is the best terminal experience out there, just shipped first-class support for any CLI agent — Claude Code, Codex, OpenCode, Gemini CLI, all running side by side in vertical tabs with live status indicators. The killer

    1h 59m
  6. APR 9

    📅 ThursdAI LIVE from London - Claude Mythos, Codex Resets, Muse Spark & More | w/ Swyx and friends from OpenAI, Deepmind, LMArena and OpenClaw

    Hey yall, Alex here, writing this from sunny London, at the first ever AI Engineer conference in Europe! What a show we have for you today! First, let me catch you up on what’s important: Anthropic this week announced a whopping $30B ARR, up from $19B in Feb, while also telling us about Claude Mythos Preview, their next-gen HUGE model that they won’t release to the public (yet?) that finds crazy vulnerabilities in existing code bases. Apparently OpenAI will follow up with a similar non-public model soon. The Meta Superintelligence Lab led by Alex Wang finally showed what they were working on: Muse Spark, the smaller of their upcoming models on a completely new infrastructure (MSL announcement, Simon Willison’s deep dive on the 16 hidden tools). In other news: Z.AI released GLM 5.1 in OSS finally (HF weights), Seedance 2.0 is finally available in the US on Replicate, OpenAI is testing out GPT-image-2 on LM Arena under codenames, HappyHorse from Alibaba takes the video crown, and Milla Jovovich (5th Element, Resident Evil) released an agentic memory plugin called MemPalace (Ben Sigman’s transparent correction thread is worth reading). We had 5 guests today on the show. We kick off with @swyx, the founder of AI Engineer and host of Latent Space. We then chatted with @petergostev from Arena (formerly LMArena) about Mythos and the compute wars, then Vincent Koc, the second most prolific contributor to OpenClaw, then our friends VB from OpenAI and Omar from DeepMind, both previously at HuggingFace. This is a busy busy show, and given the time-zones, I unfortunately don’t have time for a full weekly writeup, but as always, I will share the raw notes and post the video (lightly edited).

    AI Engineer - London

    ThursdAI has come a long way since the first AI Engineer conference, but many who read this don’t know that it was my big break. Swyx invited me to cover the first AIE in San Francisco in 2023, and I remember, I was in an Uber to the airport, the driver asked me what I do, and I, for the first time, said “I host a podcast.” I (and ThursdAI) owe a lot to Swyx and the AIE team, and it’s been incredible to see how big they’ve grown and how many great speakers this event hosts! The term AI Engineer has drifted in those 3 years, but so has the term Software Engineer. Swyx predicted this nearly 3 years ago; what I don’t think he predicted is that all engineers are now AI Engineers, and this includes domains like Agents (OpenClaw), Context and Harness Engineering, Evals and Observability, Voice & Vision, all of which are tracks at this conference. I was really surprised to see how many of the talks/speakers here are native to London (after all, DeepMind is from here, and OAI, Anthropic, Meta have offices here), and the latest boom in agents, OpenClaw, Pi, were all Europe based as well, and they joined the AI Engineer stage. Oh, and there’s also a Giant Inflatable Claw at the entrance, yup, for pictures and vibes, and to show off how quickly OpenClaw took over the mind-share.

    Anthropic announces $30B ARR; Mythos, their next model, will not be released to the public

    The thing that everyone will tell you is that Anthropic is on a roll, and this is obviously connected to their upcoming IPO this year.
    We’ve been covering many issues on their part, but this week we saw them posting about a HUGE increase in ARR, from $19B in February to $30B in April, passing OpenAI at $25B. That last fact, though, is kind of disproven because they report ARR differently; OpenAI apparently only counts their cloud revenue from Microsoft, per The Information. The growth is undeniable though, and so is the most unprecedented release announcement: Claude Mythos Preview, which was rumored for a bit and has now been announced properly. With Project GlassWing, Anthropic has announced that this model is SO good at cyber security and finding bugs in code that they cannot share it with the public, and through GlassWing they will share it with companies like Microsoft, Linux, CrowdStrike and a bunch of others, to harden their security. This is it folks: this is the first time a model was “announced” but deemed too risky to release. Now, is it truly “too risky”? Previously, folks thought that DALL-E was too risky, or that voice cloning tech was too risky, and now it’s everywhere. The capabilities catch up, even in open source. But the facts are, Anthropic says they’ve found a 27-year old bug in OpenBSD (famously very secure), and that this model is very very good at connecting the dots between several seemingly innocuous bugs, stringing them together into one coherent exploit. This is, indeed, scary. Just last week, one of the top security researchers in the world, Nicolas Carlini, now at Anthropic, gave a talk at Black Hat showing off these results, and saying that these models, since December and definitely recently, have passed him as a security engineer. If you haven’t seen this talk, watch it, then try to estimate whether Anthropic did the right thing by only releasing this model to enterprises first. But on the show, Peter Gostev from Arena gave me a take on this that I haven’t been able to shake. Peter pulled up his Compute Wars chart live on the show — and the picture is that OpenAI is way ahead of Anthropic on compute, with Anthropic only recently getting a noticeable bump (which lines up suspiciously well with Mythos being trainable in the first place). His read: “it sounds cooler to say it’s too risky to release than ‘we can’t serve it.’” The official partner pricing is $25 / $125 per million tokens — 5x Opus 4.6 — but if you don’t have the GPUs to serve it broadly, the price doesn’t matter. In the year of the IPO, the company that cannot serve a model says the model is too dangerous to serve. Make of that what you will. This also reframes the whole rate-limit drama with OpenClaw. Anthropic didn’t ban OpenClaw — I want to be very clear about this because the discourse went sideways. What they did is make it significantly more expensive for Max-tier subscribers to use Opus through OpenClaw, which pushed a lot of people over to GPT-5.4 via Codex. Same root cause: they’re out of compute. The freshly announced Anthropic + Google TPU deal (Google already owns ~10% of Anthropic) is them trying to fix this — though as Peter noted, it’s pretty wild that Google is propping up a direct competitor to their own DeepMind team. Same pattern as their original $2B Anthropic investment ending up propping up AWS Bedrock against Google Cloud. Big Google contains multitudes.

    Meta Superintelligence Labs ships Muse Spark — Llama is dead, long live Muse

    Llama is dead, long live Muse.
    This week Meta finally showed what the very expensive Meta Superintelligence Labs under Alexandr Wang has been cooking, and the answer is Muse Spark — the smaller of their new model family, built on a fully rebuilt AI stack from scratch in just 9 months. Nine months is wild for that kind of overhaul, and the headline number people are quoting is that they reach Llama 4 Maverick capability with over 10x less compute. Spark is intentionally small and latency-optimized — it’s not trying to be the biggest, it’s trying to be the first step on Meta’s new scaling ladder. But the benchmarks in certain areas are nuts: 86.4 on CharXiv Reasoning (beats Opus, Gemini, GPT-5.4), and the one that really got me — 42.8 on HealthBench Hard vs Opus at 14.8 and Gemini at 20.6. They trained it with data curated by over 1,000 physicians and it shows. They also shipped a Contemplating mode, which is parallel multi-agent reasoning, hitting 58.4% on Humanity’s Last Exam with tools. Coding is the acknowledged weak point (77.4 on SWE-Bench Verified vs Opus at 80.8), but for a v1 from a brand new stack, this is extremely respectable. Meta is back! The real story isn’t any single benchmark though, it’s distribution. Spark is rolling out across meta.ai, WhatsApp, Instagram, Threads, Messenger, and Ray-Ban Meta glasses — billions of users. Meta went from open Llama to a closed consumer model and they’re clearly playing a different game now (though Wang says future Muse versions might be open-sourced). The deep-dive that’s really worth your time is Simon Willison’s post where he poked at the meta.ai chat UI and got the model to spit out descriptions of 16 hidden tools behind the scenes — a full Code Interpreter with persistent Python 3.9, a visual grounding tool that does pixel-precise object detection (bounding boxes, point coordinates, counting — it located 8 objects including individual whiskers and claws on a generated raccoon), sub-agent spawning, file editing, and semantic search across Instagram/Threads/Facebook posts. It’s basically an entire agentic harness baked into the chat UI. Jack Wu from MSL confirmed the tools are part of a new harness built specifically for Spark’s launch. Meta stock went up 7% on this. They are very much back in the frontier game.

    Guest highlights

    We had an unprecedentedly packed show with 5 guests (also the shortest show we’ve ever had). Swyx kicked us off with vibes from the AI Engineer floor — harness engineering as the dominant theme (gains are coming from the harness, not the weights), the rise of skills (English-as-programming-language) absorbing more of that harness work, and his thesis that supply-chain attacks like the recent LiteLLM and Axios incidents mean you should basically vendor everything — pip fork instead of pip install. We also chatted about how MCP has gone from “the most exciting protocol” to “settled and stable, therefore less interesting,” which is a great problem to have. Peter Gostev from Arena (you saw a lot of him in the Mythos section above) also dropped a bonus on us: Arena just released 3

    1h 59m
  7. APR 3

    📅 ThursdAI - Apr 2 - Gemma 4 is the new Llama, Claude Code Leak, OpenAI raises $122B & more AI news

    Hey y’all, Alex here, let me catch you up. What a week! Anthropic is in the spotlight again, first with #SessionGate, then with the whole Claude Code source code leak, and finally with incredible research into LLMs having feelings!? (more on this below). And while Anthropic continues to burn through developer goodwill faster than their sessions, OpenAI announced a MASSIVE $122B round of funding (the largest in history), Google released Gemma 4 with an Apache 2 license - we had Omar Sanseviero on the show to help us cover what’s new, Microsoft dropped 3 new AI models (not LLMs), and PrismML potentially revolutionized local LLM inference with lossless 1-bit quantization! P.S. - Oh also, something in the X algo changed, I get way more exposure now; 3 out of my best 5 posts ever have been from this week + I got the coveted Elon RT on my Claude Code leak coverage. I’ll try to stay humble 😂 Anyway, let’s dive in, don’t forget to hit like or share with friends, and the TL;DR with links is, as always, at the bottom:

    The Claude Code Source Leak: Half a Million Lines of “Oops”

    So here’s what happened. On March 31st, Anthropic shipped Claude Code version 2.1.88 to npm. Inside that package was a 59.8 megabyte source map file — basically a debugging artifact that contained the entire compiled source code (see the short sketch below for why a shipped source map is effectively the source itself). 512,000 lines of TypeScript across 1,900 files. The entire playbook for how the Claude Code harness works, including a lot of stuff that wasn’t supposed to be public yet. A researcher named Chaofan Shou spotted it at 4 AM ET and posted the download link, Sigrid (who came on the show) posted it on GitHub, and within six hours it had 3 million views and 41,000 GitHub forks. (This repo is the highest starred repo in GitHub history btw, with well over 150K GitHub stars.) Anthropic started filing takedowns, but the internet being the internet, it was already everywhere. The source code is still on tens of thousands of computers right now. (I won’t link directly but there’s a website called Gitlawb, look it up.) The community went absolutely wild digging through the source code btw, and they found some interesting things!

    KAIROS: Claude Code is going to become a Proactive Agent!

    This is the biggest takeaway from this leak IMO: like the OpenClaw/Hermes agentic harnesses, Claude Code is already a fully featured proactive agent, we just don’t have access to it yet. With KAIROS, Claude Code will have its own daemon (it will run independently from the CLI), will have a background ping system (hello Heartbeat.md from OpenClaw) that will make it wake up and do stuff, will do “autodream” memory consolidation, reviewing your daily sessions and fixing memories, subscribe to GitHub, and maintain daily append-only logs to show you what it did while you were asleep. This is by far the biggest thing; I’m excited to see how / when they ship KAIROS. As I said, 2026 is the year of Proactive agents! My Wolfred OpenClaw agent summed it up very nicely:

    Undercover Mode

    For Anthropic employees working on public repos, there’s an Undercover Mode that auto-activates and strips all AI attribution from commits. The system prompt? “Do not blow your cover.” They really said “this is fine” about shipping internal tools to production while hiding from the world that AI wrote the code. Which, honestly, is kind of incredible meta-humor from whoever wrote that.
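    Quick aside on the mechanics, since “shipped a source map” may sound abstract: a Source Map V3 file is plain JSON, and its optional sourcesContent array embeds the full original text of every source file. Recovering the leaked TypeScript is roughly just this (the filename is illustrative):

    ```python
    import json

    # The 59.8 MB .map artifact from the npm package; name assumed here.
    with open("cli.js.map") as f:
        smap = json.load(f)

    # "sources" lists file paths; "sourcesContent" holds their full text.
    for path, content in zip(smap["sources"], smap.get("sourcesContent") or []):
        if content is not None:
            print(f"{path}: {len(content.splitlines())} lines recovered")
    ```

    Which is why the only safe fix was takedowns, not patches: the original files were already inside the artifact, verbatim.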
The Buddy System

My personal favorite discovery: there's a hidden Tamagotchi-style terminal pet called the Buddy System, with 18 obfuscated species, rarity tiers (including a 1% legendary), cosmetic hats, shiny variants, and stats like DEBUGGING, PATIENCE, and CHAOS. If you activate it now with /buddy, you'll have a little companion judging your coding decisions. Anthropic shipped a game inside their CLI tool. Mine is called Vexrind and he's sarcastic as f**k, I'm not sure I like it.

Anti-Distillation Protections

The code also revealed that Claude Code injects fake tool calls into logs to poison training datasets. If you've been backing up your .claw folders to train on the data: stop. Pass your data through something like Qwen, or make sure you're filtering out the noise (a Nisten tip) — see the sketch below.
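If you do want to salvage your transcripts, the filtering idea might look roughly like this. The heuristics and field names here are illustrative guesses (the actual poisoned-entry format isn't public), and I'm assuming one-JSON-object-per-line transcript files:

```python
# Hypothetical sketch: filter suspicious tool-call entries out of saved
# agent transcripts before using them as training data. The "type" and
# "result_id" fields are assumptions, not the real log schema.
import json

def looks_poisoned(entry: dict) -> bool:
    """Flag tool calls that don't match anything the session actually ran."""
    if entry.get("type") != "tool_call":
        return False
    # Example heuristic: a tool call with no matching result is suspect.
    return entry.get("result_id") is None

def clean_transcript(path_in: str, path_out: str) -> None:
    # Assumes JSONL: one JSON entry per line.
    with open(path_in) as f_in, open(path_out, "w") as f_out:
        for line in f_in:
            entry = json.loads(line)
            if not looks_poisoned(entry):
                f_out.write(line)

clean_transcript("session.jsonl", "session.clean.jsonl")
```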
The Models That Don't Exist Yet

Buried in the code are references to Opus 4.7, Sonnet 4.8, and a model called capybara-v2-fast with a 1 million token context window. None of these have been released. This is yet another confirmation of the leaked "Mythos" model that's coming soon from Anthropic.

Which, btw, with Anthropic's very rocky uptime lately, the pile of SessionGate issues, the leaked blog post announcing Mythos, and the Claude Code oopsie, means they are not having their best Q1 in terms of proving to the world that they are the safest lab out there. I hope they protect their weights better than they protect everything else, before the rumored IPO later this year.

SessionGate is still not solved, despite the official response

I told you about SessionGate last week, and since then we finally got an official acknowledgement from Anthropic. But before that, some folks on Reddit reverse-engineered Claude Code (this was before the source code leak, ha) and found a few caching bugs that can cause a 10-20x increase in price, especially if you use --resume a lot. While folks continue to complain about burning through Max account quotas much faster than before, here's the official response from Anthropic after the supposed investigation: turns out, we're using it wrong 🤦‍♂️

My take is simple: Anthropic has one of the best models in the world, maybe the best personality-plus-coding stack in some situations, and they are squandering a chunk of goodwill by not being much more explicit about decreased limits, caching bugs, routing, and usage behavior. Nothing else to add here, really: bad DevEx. People can handle bad news. They hate opaque bad news.

Gemma 4 Is Here, Apache 2.0, and Honestly… This Is a Big One (HF)

This was the hopeful turn in the show. You know we LOVE open source! Right in the middle of all the Anthropic chaos, Google dropped Gemma 4, and Omar Sanseviero from DeepMind joined us live to talk through it. This launch hit a bunch of notes I care a lot about: strong local-friendly sizes, serious open distribution, Apache 2.0 licensing, agentic improvements, and a clear willingness to listen to community feedback.

The headline model for me is the 31B Gemma 4. It's big enough to matter, small enough to actually run in serious local setups, and strong enough that the benchmark chart looks slightly ridiculous. On LM Arena, it is competing far above what you'd intuit from the raw parameter count. When a 31B model starts getting uncomfortably close to models in the several-hundred-billion range, you pay attention. That was really the vibe on the show. It wasn't just "nice, another open model." It felt more like: wait, local models are seriously back.

Gemma is the new LLaMa

When I asked Omar where local models are going, his answer was optimistic: "The open models catch up to proprietary models relatively quickly. If you compare Gemma 3 to Gemma 4, it's matching proprietary capabilities from eight months ago. Being able to run those capabilities directly in the user's hardware — that's the future."

The 31B model downloads as about 18-20GB depending on quantization. With the right setup, you can run it on a single GPU. This is exactly what the open source community has been asking for: frontier-level intelligence that you can actually run yourself.
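As a rough sketch of what that looks like in practice, here's a 4-bit load via transformers + bitsandbytes. The repo name "google/gemma-4-31b-it" is my guess at the model ID (check the actual HF page before running), and the VRAM figure is based on the ~18-20GB download size above:

```python
# Minimal sketch: run the 31B Gemma 4 locally in 4-bit quantization.
# "google/gemma-4-31b-it" is an ASSUMED repo name; verify on Hugging Face.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-4-31b-it"  # assumption, not confirmed
quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant, device_map="auto"  # fits ~20GB VRAM
)

inputs = tokenizer("Why are local models back?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```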
OpenAI's largest-in-history $122B funding round + TBPN acquisition

While OpenAI quietly meme'd around the Anthropic leak and mostly stayed silent on releases, they did announce 2 pretty huge things.

First, OpenAI raised an absolutely bonkers, insane, unreal $122 billion round, the largest in history, 2x bigger than the previous record round (which was also OpenAI's). Amazon put in $50B, Nvidia $30B, SoftBank $30B — all three of whom are also OpenAI's biggest vendors. They're generating $2 billion per month in revenue with 900 million weekly active users, but still burning roughly $150 million per day and projecting a $14 billion loss this year, making the upcoming IPO a financial necessity rather than a choice.

And they're not just spending on compute — today OpenAI acquired TBPN (a tech-focused media company / live show). In a very "surprising" deal, rumored to be in the "low hundreds of millions," OpenAI has purchased a very tech-positive show. Shoutout to Jordi Hays and John Coogan + the TBPN team, proving that the live show format means a lot in the era of fake AI news. This could potentially price TBPN higher than the Washington Post, make the founders multi-millionaires, and give OpenAI a direct-to-consumer media angle. Very interesting purchase.

This week's buzz - W&B corner + WolfBench update

Two quick things. This weekend I flew to San Francisco for one day to host one of the most unique hackathons I've ever seen: AI wrote the code, but humans were punished if they touched their laptops! Yes, with a "lobster of shame." They used Ralph loops and talked to each other instead of hacking. I edited a video of it, hope you enjoy my summary:

The other, and potentially much bigger, news comes from Wolfram and WolfBench.ai. I tasked Wolfram with expanding our findings, and he tested the new Hermes Agent (from Nous Research) against OpenClaw and Claude Code and found that... drum roll... Hermes Agent performs way better on Terminal Bench than either Claude Code or OpenClaw. 😮 Here's the clip of him explaining, and you can find all our findings and methodology here.

PrismML's 1-Bit Bonanza: The Biggest ML Discovery in Half a Decade

My co-host Nisten called it, and I thi
    1h 32m
  8. MAR 27

    AGI is here? Jensen says yes, ARC-AGI-3 says AI scores under 1%

    Hey y'all, Alex here, let me catch you up! Jensen Huang went on Lex and said AGI has been achieved. We'll get to that. The biggest demo moment: Gemini 3.1 Flash Live launched - Google's omni model that sees, hears, and searches the web in real time. We tested it live and I said "what the f**k" on air. It was really impressive! Google Research also dropped TurboQuant (6x KV cache compression), which crashed Samsung and Micron stocks - we had Daniel Han from Unsloth help us make sense of why that's overblown. OpenAI killed Sora - the app, the API, and the $1B Disney deal. Claude felt noticeably dumber this week AND Max account quotas are melting, as 500+ people confirmed on X and Reddit - we have official word from Anthropic as to why. Mistral launched Voxtral TTS (open weight, claims to beat ElevenLabs), Cohere shipped an ASR model, and Google's Lyria 3 Pro now generates full 3-minute music tracks inside Producer AI. This and a lot more in today's episode, let's dive in (as always, show notes and links at the end!)

ThursdAI - Let me catch you up!

Gemini 3.1 Flash Live: The Real-Time AI Companion Is Here

Google dropped breaking news on the show today with Gemini 3.1 Flash - LIVE version. This one is an omni-model: it can take text, audio, and video as input and respond in text and voice. It has Google Search grounding, and it felt... immediate! I was blown away, really. Check out the video: the speed with which it was able to "see" me, respond to my query, and look up something on the web was mind-blowing. I don't often get "mind blown" anymore, there's just too much news, but this one did the trick!

With pricing around 10x cheaper than GPT-real-time, and the Google Search grounding being super fast, I can absolutely see this model being hooked up to... robots (like ReachyMini), smart glasses that can see what you see, and a bunch more! Gemini Live is available in Google AI Studio and has been rolled out globally inside the Google Search app! So now you can just pull up the Google Search app, open it, and point at anything. Truly a remarkable advancement.

Google Research publishes TurboQuant - 6x reduction in KV cache with 0 accuracy loss

Google Research posted work (based on an arXiv paper from almost a year ago) showing that with geometry tricks, combining two other techniques, PolarQuant and QJL, they can compress the KV cache of running LLMs by nearly 6x and get an 8x speedup for model inference with zero accuracy loss. If you ever watched Silicon Valley, the HBO show, this sounds like the fictional middle-out algorithm from Pied Piper.

If this scales (and that's a big if, we don't know if this applies to other, bigger models yet), it means significant decreases in the memory required to run the current crop of LLMs at longer context. The claim is big, so we'll continue to monitor whether it indeed scales. But the most interesting thing about this piece of news is that it broke out of the AI bubble and reached Wall Street, with finance bros deciding that memory will not be needed as much anymore, which tanked Samsung and Micron stocks. I found that particularly ridiculous on the show; did they not hear about Jevons paradox? This is reminiscent of the DeepSeek R1 saga that tanked Nvidia stock over a year ago.

Daniel Han from Unsloth, who joined us on the show, pointed out that the approach is mathematically interesting even if it's not necessarily better than existing open-source techniques like DeepSeek MLA. LDJ noted that the baseline comparison (16-bit KV cache) isn't really fair, since most production systems are already compressing beyond that. Yam implemented it himself and confirmed the speedups are real, but so is the trade-off.
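To put the 6x number in perspective, here's the back-of-the-envelope KV-cache memory math. The model dimensions below are illustrative (roughly a 70B-class dense model with grouped-query attention), not any specific model, and this is just arithmetic, not the TurboQuant algorithm:

```python
# Back-of-the-envelope KV-cache memory calculator. Dimensions are
# illustrative assumptions (roughly 70B-class), not a real model spec.

def kv_cache_gb(seq_len, layers=80, kv_heads=8, head_dim=128, bytes_per_val=2):
    # 2x for keys AND values; one entry per layer/head/position/dim.
    vals = 2 * layers * kv_heads * head_dim * seq_len
    return vals * bytes_per_val / 1e9

baseline = kv_cache_gb(128_000)   # 16-bit baseline at 128K context
compressed = baseline / 6          # the claimed ~6x reduction
print(f"128K context, 16-bit KV cache: {baseline:.1f} GB")   # ~41.9 GB
print(f"Same cache at ~6x compression: {compressed:.1f} GB") # ~7.0 GB
```

That's the difference between a KV cache that spills across GPUs and one that fits comfortably next to the weights, which is why the memory-demand narrative caught Wall Street's attention.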
Anthropic updates: Opus dumber? Quotas lower! Injunction won! Computer... used.

Anthropic folks, especially on the Claude Code side, are shipping like crazy. We won't be able to cover all the updates, but there were a few notable things I have to keep you up to date on.

Claude Opus seems to be getting "dumber". Again, I have to talk about this because it affected my work directly this week, and hundreds of people confirmed the same experience. I use Claude Opus for my standard ThursdAI prep workflow — generating the TL;DR with 10 bullet points and an executive summary for every topic we cover, creating episode pages, etc. The format has not changed for over a year, and yet this week I asked for 10 factoids. I got 4. It says "10" right there in the prompt. Four bullet points. On the website builder, I asked Opus to create a page for last week's episode, and instead of adding it alongside the other episodes, Opus decided to... replace the previous episode with this one. This would be funny if it weren't sad. This is Opus 4.6 we're talking about, not some quantized open source LLM from last year! The reason is unclear, and it's not only me: Wolfram noticed that these things are easier to spot in other languages (for the last week, Opus would forget to add umlauts in German!?), and Yam felt it too.

Pro/Max plan quotas burning up; Anthropic confirmed they are tightening them for "peak hour" usage

This week, so many people started posting that something is wrong with their Claude Code that I ran a survey, and it blew up. Hundreds of people replied and confirmed that over the past week they have been hitting their session quotas on Pro and the 20x $200/mo Max accounts much, much quicker than before. When I say much quicker, I mean some folks hit the quota in as little as 5 minutes, while others had no issues at all. (I personally didn't hit this, btw.) A few days later, Thariq from the Claude Code team, and later an official post, confirmed that Anthropic had been rolling out a "tightening" of the Pro/Max accounts to accommodate growth.

This is, of course, a huge bummer for the folks who pay $200/mo for the 20x Max tier, as they tend to run agents and subagents overnight. But here's the thing: I don't think the folks at Anthropic see what we see. Some users have no issues hitting quota, and some are barely able to use their subscription. I hope they find and resolve these bugs quickly, because some folks are switching to Codex, and the Anthropic IPO is coming up! I will say, I don't envy Thariq's job; he's doing it gracefully, and he's maybe one of the only people at Anthropic doing it at all.

Judge granted Anthropic an injunction against DoW and the whole "supply chain risk" designation!

Just in as I'm writing this: a district judge in CA granted Anthropic an injunction against being designated a supply-chain-risk company. If you haven't been following, the US Department of War, specifically Pete Hegseth, threatened and then designated Anthropic as a supply chain risk company, while US President Trump "fired" Anthropic and banned its use in any government agency. Well, not so fast, says Judge Lin of the CA district court. In the order, she shows that the Dept. of War didn't meet any of the legal requirements for this designation.
It's really a fascinating read, but the highlight is this: when asked why Hegseth made a public statement that had no legal effect and that did not reflect the immediate intent of DoW, counsel stated, "I don't know."

This is just the first court, and it will likely be escalated further up the judicial system. It's still developing: apparently the Pentagon declared Anthropic a supply chain risk under two different statutes, and this ruling only affects one of them. So while it's good news, it's not over yet.

Voice & Audio Explosion: Three Releases in One Hour

I had to hit the breaking news button mid-TL;DR because three major voice releases dropped simultaneously during the show.

Mistral Voxtral TTS — Mistral's first text-to-speech model, 3 billion parameters, open weight. They claim it beats ElevenLabs Flash v2.5 in human preference tests (58% win rate on flagship voices, 68% on zero-shot voice cloning). We tested it live on the show — it's decent, with emotion controls for neutral, happy, and frustrated voices. I was not super impressed tbh; it sits somewhere between the very good big-lab TTS models and the very small open-source 82M-param ones.

Cohere Transcribe — Cohere enters the ASR game with a 2 billion parameter open-source model (Apache 2.0!) that immediately grabbed the #1 spot on HuggingFace's Open ASR Leaderboard with a 5.42% word error rate, beating Whisper Large v3's 7.44%. In human evaluations, it wins 61% of the time on average, and 64% specifically against Whisper. For anyone in regulated industries needing local inference for compliance, this could genuinely replace Whisper as the default.
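For reference, word error rate is just word-level edit distance divided by the reference length. Here's a minimal implementation (illustrative only, not the leaderboard's exact scoring script, which also normalizes casing and punctuation first):

```python
# Minimal word error rate (WER): Levenshtein distance over words,
# divided by the reference word count. A 5.42% WER means roughly 5
# word-level errors per 100 reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat in the mat"))  # ~0.167
```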
Google Lyria 3 Pro — Google's most advanced music model is here. It can now generate full 3-minute tracks with structural control — intros, verses, choruses, bridges. We generated a ThursdAI opening theme live on the show using Producer AI, and it was... honestly not bad? It followed our instructions perfectly: drum and bass, 174 BPM, high-energy podcast opener with vocals and an introduction. The instruction-following was spot on. Nisten said it's the best music generation model right now. It's available to Gemini subscribers and via Producer AI and Gemini, and it can even compose music from images. SynthID watermarked, royalty-free. We might actually use one of the generated tracks as a new show opener. The craziest thing is, since Google acquired Composer, the team has been shipping. I only generated the audio during the live show, but I went back afterwards to download it for you guys, and whoa, it can now generate whole clips using other Google tech. This is really cool!

OpenAI kills SORA (and Atlas?)

Last week we reported on OpenAI's focus shift towards Codex and productivity, and this week we see the first casualty. OpenAI is killing SORA, the app, the
    1h 40m
