ThursdAI - The top AI news from the past week

From Weights & Biases: join AI Evangelist Alex Volkov and a panel of experts as they cover everything important that happened in the world of AI in the past week

Every ThursdAI, Alex Volkov hosts a panel of experts, AI engineers, data scientists and prompt spellcasters on Twitter Spaces to discuss everything major and important that happened in the world of AI in the past week. Topics include LLMs, open source, new capabilities, OpenAI, competitors in the AI space, new LLM models, AI art and diffusion, and much more. sub.thursdai.news

  1. 3D AGO

    📆 ThursdAI - Feb 5 - Opus 4.6 was #1 for ONE HOUR before GPT 5.3 Codex, Voxtral transcription, Codex app, Qwen Coder Next & the Agentic Internet

    Hey, Alex from W&B here 👋 Let me catch you up! The most important AI news this week landed today: Anthropic updated Opus to 4.6 with a 1M context window, and they held the crown for literally 1 hour before OpenAI released GPT 5.3 Codex, also today, with 25% faster speed and lower token utilization. "GPT-5.3-Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training, manage its own deployment, and diagnose test results." We had VB from OpenAI jump on to tell us about the cool features in Codex, so don't miss that part. And this is just the icing on an otherwise very insane AI-news-week cake, as we've also had a SOTA transcription release from Mistral, both Grok and Kling are releasing incredible audio-native video models with near-perfect lip-sync, and ACE 1.5 drops a fully open source music generator you can run on your Mac! Also, the internet all but lost it after Clawdbot was rebranded to Molt and then to OpenClaw, and... an entire internet popped up... built for agents! Yeah... a huge week, so let's break it down. (P.S. this week's episode is edited by Voxtral, Claude and Codex, nearly automatically, so forgive the rough cuts please)

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    Anthropic & OpenAI are neck and neck

    Claude Opus 4.6: 1M context, native compaction, adaptive thinking and agent teams

    Opus is by far the most preferred model in terms of personality for many folks (many ThursdAI panelists included), and this breaking news live on the show was met with so much enthusiasm! A new Opus upgrade, now with a LOT more context, is as welcome as it can ever get! Not only is it a 4x increase in context window (though the pricing nearly doubles after the 200K-token mark, from $5/$25 to $10/$37.50 input/output, so use caching!), it also scores very high on the MRCR long-context benchmark, at 76% vs Sonnet 4.5 at just 18%. This means significantly better memory for longer. Adaptive thinking, which auto-calibrates how many tokens the model needs to spend per query, is interesting, but it remains to be seen how well it will work. Looking at the benchmarks, a SOTA 64.4% on Terminal Bench 2 and 81% on SWE-bench, this is a coding model with a great personality, and the ability to natively compact context to better serve you as a user! This model is now available (and is the default) in Claude, Claude Code and the API! Go play! One funny (concerning?) tidbit: on Vending-Bench, Opus 4.6 earned $8,000 vs Gemini 3 Pro's $5,500, but Andon Labs, who run the vending machines, noticed that Opus achieved SOTA via "collusion, exploitation, and deception tactics," including lying to suppliers 😅

    Agent Teams - Anthropic's built-in Ralph?

    Together with the new Opus release, Anthropic drops a Claude Code update that could mean big things for folks running swarms of coding agents. Agent Teams is a new way to spin up multiple agents, each with their own context window and the ability to execute tasks, and you can talk to each agent directly rather than through a manager agent as before.

    OpenAI drops GPT 5.3 Codex update: 25% faster, more token efficient, 77% on Terminal Bench and mid-task steering

    OpenAI didn't wait long after Opus, in fact, they didn't wait at all!
    Announcing a huge release (for a .1 upgrade), GPT 5.3 Codex is claimed to be the best coding model in the world, taking the lead on Terminal Bench with 77% (a 12-point lead on the newly released Opus!) while running 25% faster AND using less than half the tokens to achieve the same results as before. But the most interesting part to me is the new mid-task steerability feature: you don't have to hit the "stop" button, you can tell the model to adjust on the fly! The biggest notable jump for this model on benchmarks is the OSWorld verified computer-use bench; though there's no straightforward way to use it attached to a browser, the jump from 38% in 5.2 to 64.7% on the new one is a big one! One thing to note: this model is not YET available via the API, so if you want to try it out, the Codex apps (including the native one) are the way!

    Codex app - a native way to run the best coding intelligence on your Mac (download)

    Earlier this week, OpenAI folks launched the Codex native Mac app, which has a few interesting features (and now with 5.3 Codex it's that much more powerful). Given the excitement many people had about OpenClaw bots, and the recent Cowork release from Anthropic, OpenAI decided to answer with a Codex UI, and people loved it, with over 1M users in the first week and 500K downloads in just two days! It has built-in voice dictation, slash commands, a new skill marketplace (last month we told you about why skills are important, and now they are everywhere!) and built-in git and worktrees support. And while it cannot run a browser yet (I'm sure that's coming as well), it can do automations! This is a huge unlock for developers: imagine setting Codex to do a repeat task, like summarization or extraction of anything on your Mac, every hour or every day. In our interview, VB showed us that commenting on an individual code line is also built in, and that switching to "steer" vs queue for new messages while Codex runs is immensely helpful. One more reason I saw people switch: the Codex app can natively preview files like images, whereas the CLI cannot, and it's right now the best way to use the new GPT 5.3 Codex model that was just released! It's now also available to free users, and regular folks get 2x the limits for the next two months.

    In other big company news: OpenAI also launched Frontier, a platform for enterprises to build, deploy and manage "AI coworkers," while Anthropic is going after OpenAI with Super Bowl ads that make fun of OpenAI's ads strategy. Sam Altman really didn't like this depiction, which shows ads becoming part of LLM replies.

    Open Source AI

    Alibaba drops Qwen-coder-next, 80B with only 3B active that scores 70% on SWE (X, Blog, HF)

    Shoutout to the Qwen folks, this is a massive release, and when surveyed for the "one thing from this week you must not miss," 2 out of 6 co-hosts pointed a finger at this model. Built on their "next" hybrid architecture, Qwen coder is specifically designed for agentic coding workflows. And yes, I know, we're coding-heavy this week! It was trained on over 800K verifiable agentic tasks in executable environments for long-horizon reasoning and supports 256K context with a potential 1M YaRN extension. If you don't want to rely on the big guys and send them your tokens, this model seems to be a good contender for local coding!
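    If you want to kick the tires locally, here's a minimal sketch of how you might talk to a locally served copy of the model. It assumes you've already started an OpenAI-compatible server (vLLM, llama.cpp, LM Studio, etc.) on localhost:8000, and the model ID shown is a placeholder for whatever name your server registered; both are my assumptions, not something from the release notes.

    ```python
    # Minimal sketch: chat with a locally served coding model via the
    # OpenAI-compatible API that vLLM / llama.cpp / LM Studio expose.
    # Assumptions: a server is already running on localhost:8000 and the
    # model name "qwen-coder-next" is whatever ID you registered it under.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    response = client.chat.completions.create(
        model="qwen-coder-next",  # hypothetical local model ID
        messages=[
            {"role": "system", "content": "You are a careful coding assistant."},
            {"role": "user", "content": "Write a Python function that parses a CSV "
                                        "file and returns the rows as dictionaries."},
        ],
        temperature=0.2,  # keep it deterministic-ish for code
    )

    print(response.choices[0].message.content)
    ```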
    Mistral launches Voxtral Transcribe 2: SOTA speech-to-text with sub-200ms latency

    This one surprised and delighted me maybe the most. ASR (automatic speech recognition) has been a personal favorite of mine since the Whisper days, and seeing Mistral release an incredible near-real-time transcription model, which we demoed live on the show, was awesome! With an Apache 2.0 license and significantly faster-than-Whisper performance (though 2x larger at 4B parameters), Voxtral shows a 4% word error rate on the FLEURS dataset, and the real-time model was released under Apache 2.0 so you can BUILD your agents with it! The highest praise? Speaker diarization, being able to tell who is speaking when, which is a great addition. This model also outperforms Gemini Flash and GPT transcribe, and is 3x faster than ElevenLabs Scribe at one-fifth the cost!

    ACE-Step 1.5: Open-source AI music generator runs full songs in under 10 seconds on consumer GPUs with MIT license (X, GitHub, HF, Blog)

    This open source release surprised me the most, as I didn't expect we'd be having Suno at home any time soon. I've generated multiple rock tracks with custom lyrics on my Mac (though slower than 10 seconds as I don't have a beefy home GPU) and they sound great!

    This week's buzz - Weights & Biases update

    Folks who follow the newsletter know that we hosted a hackathon, so here's a small recap from last weekend! Over 180 folks attended our hackathon (a very decent 40% show-up rate for SF). The winning team was composed of 15-year-old Savir and his friends, his third time at the hackathon! They built a self-improving agent that navigates the UIs of cloud providers and helps you do the same! A huge thanks to the sponsors, particularly Cursor, who gave every hacker $50 of credits on the Cursor platform; one guy used over 400M tokens and shipped fractal.surf from the hackathon! If you'd like a short video recap, Ryan posted one here, and a huge shoutout to the many fans of ThursdAI who showed up to support!

    Vision, Video and AI Art

    Grok Imagine 1.0 takes over video charts with native audio, lip-sync and 10-second generations

    We told you about Grok Imagine in the API last week, but this week it was officially launched as a product, and the results are quite beautiful. It's also climbing to the top of the charts on the Artificial Analysis and Design Arena websites.

    Kling 3.0 is here with native multimodal, multi-shot sequences (X, Announcement)

    This is definitely a hot moment for video models, as Kling shows some crazy 15-second multi-shot realistic footage with near-perfect character consistency!

    The rise of the agentic (clawgentic?) internet, a.k.a. ClankerNet

    Last week we told you that ClawdBot changed its name to Moltbot (I then had to update the blog post as, that same day, Peter rebranded again to OpenClaw, which is a MUCH better name). But the "molt" thing took hold, and an "AI-native Reddit" called MoltBook exploded in virality. It is supposedly a completely agentic Reddit-like forum, with subreddits, and agents verifying themselves through their humans on X. Even Andrej Karpathy sent his bot in there (though admittedly it posted just 1 time) and called this the closest to

    1h 38m
  2. JAN 30

    📆 ThursdAI - Jan 29 - Genie3 is here, Clawd rebrands, Kimi K2.5 surprises, Chrome goes agentic & more AI news

    Hey guys, Alex here 👋 This week was so dense that even my personal AI assistant Wolfred was struggling to help me keep up! Not to mention that we finally got to try one incredible piece of AI tech I've been waiting to try for a while! Clawdbot, which we told you about last week, exploded in popularity and had to rebrand to Molt...bot OpenClaw after Anthropic threatened the creators. Google is shipping like crazy, first adding agentic features into Chrome (used by nearly 4B people daily!), then shipping a glimpse of a future where everything we see will be generated, with Genie 3, the first real-time, consistent world model you can walk around in! Meanwhile in open source, Moonshot followed up with a .5 update to their excellent Kimi, our friends at Arcee launched Trinity Large (400B), and AI artists got the full Z-Image. Oh, and Grok Imagine (their video model) now has an API, audio support, and supposedly matches Veo and Sora on quality while beating them on speed/price. Tons to cover, let's dive in, and of course, all the links and show notes are at the end of the newsletter. Hey, if you're in SF this weekend (Jan 31-Feb 1), I'm hosting a self-improving agents hackathon at the W&B office, limited seats are left, Cursor is the surprise sponsor with $50/hacker credits + over $15K in cash prizes. lu.ma/weavehacks3 - Join us.

    Play any reality - Google Genie 3 launches to Ultra subscribers

    We got our collective minds blown by the videos of Genie 3 back in August (our initial coverage) and now Genie is available to the public (those who can pay for the Ultra tier; more on this later, I have 3 codes to give out!). You can jump in and generate any world and any character you can imagine! We generated a blue hacker lobster draped in a yellow bomber jacket swimming with mermaids, and honestly all of us were kind of shocked at how well this worked. The shadows on the rocks, the swimming mechanics, and poof, it was all over in 60 seconds, and we needed to create another world. Thanks to the DeepMind team, I had a bit of early access to this tech and had a chance to interview the folks behind the model (look out for that episode soon), and the use cases for this span from entertaining your kids all the way to "this may be the path to AGI, generating full simulated worlds for agents to learn in." The visual fidelity, reaction speed and general feel of this far outrun the previous world models we showed you (WorldLabs, Mirage), as this model seems to have memory of every previous action (e.g. if your character makes a trail, you turn around and the trail is still there!). Is it worth the upgrade to the Gemini Ultra plan? Probably not, it's an incredible demo, but the 1-minute length is very short, and the novelty wears off fairly quickly. If you'd like to try, folks at DeepMind gave us 3 Ultra subscriptions to give out! Just tweet out the link to this episode, add #GenieThursdai and tag @altryne, and I'll raffle the Ultra subscriptions among those who do.

    Chrome steps into agentic browsing with Auto Browse

    This wasn't the only mind-blowing release from Gemini this week: the Chrome team upgraded the Gemini inside Chrome to be actually helpful and agentic. And yes, we've seen this before, with Atlas from OpenAI and Comet from Perplexity, but Google's Chrome has a 70% hold on the browser market, and giving everyone with a Pro/Ultra subscription access to "Auto Browse" is a huge, huge deal. We tested the Auto Browse feature live on the show, and Chrome completed 77 steps!
    I asked it to open up each of my bookmarks in a separate folder and summarize all of them, and it did a great job! Honestly, the biggest deal about this is not the capability itself, it's the nearly 4B people this is now very close to, and the economic impact of this ability. IMO this may be the more impactful news out of Google this week!

    Other news in big labs:

    * Anthropic launches in-chat applications based on the MCP Apps protocol, with connectors like Figma, Slack and Asana that can now show rich experiences. We interviewed the two folks behind this protocol back in November if you'd like to hear more about it.
    * Anthropic's CEO Dario Amodei also published an essay called "The Adolescence of Technology," warning of AI risks to national security.
    * Anthropic forced the creator of the popular open source AI assistant Clawdbot to rename; they chose Moltbot as the name (apparently because crypto scammers stole a better name). EDIT: just after publishing this newsletter, the name was changed to OpenClaw, which we all agree is way, way better.

    Open Source AI

    Kimi K2.5: Moonshot AI's 1 Trillion Parameter Agentic Monster

    Wolfram's favorite release of the week, and for good reason. Moonshot AI just dropped Kimi K2.5, and this thing is an absolute beast for open source. We're talking about a 1 trillion parameter Mixture-of-Experts model with 32B active parameters, 384 experts (8 selected per token), and 256K context length. But here's what makes this special — it's now multimodal. The previous Kimi was already known for great writing vibes and creative capabilities, but this one can see. It can process videos. People are sending it full videos and getting incredible results. The benchmarks are insane: 50.2% on the HLE full set with tools, 74.9% on BrowseComp, and open-source SOTA on vision and coding with 78.5% MMMU Pro and 76.8% SWE-bench Verified. These numbers make it competitive with Claude 4.5 Opus and GPT 5.2 on many tasks. Which, for an open model, is crazy. And then there's Agent Swarm — their groundbreaking feature that spawns up to 100 parallel sub-agents for complex tasks, achieving 4.5x speedups. The ex-Moonshot RL lead called this a "zero-to-one breakthrough" with self-directed parallel execution. Now let's talk about what matters for folks running agents and burning through tokens: pricing. Kimi K2.5 is $0.60 per million input tokens and $3 per million output. Compare that to Opus 4.5 at $4.50 input and $25 output per million. About a 10x price reduction. If you're running OpenClaw and watching your API bills climb with sub-agents, this is a game-changer (though I haven't tested this myself). Is it the same level of intelligence as whatever magic Anthropic cooks up with Opus? Honestly, I don't know — there's something about the Claude models that's hard to quantify. But for most coding tasks on a budget, you can absolutely switch to Kimi and still get great results.

    🦞 Clawdbot is no more, Moltbot is dead, long live OpenClaw

    After we covered the incredible open source project last week, Clawdbot exploded in popularity, driven by the Claude Max subscription and a crazy viral loop where folks who try it can't wait to talk about it; it was everywhere! Apparently it was also on Anthropic's lawyers' minds when they sent Peter Steinberger a friendly worded letter asking him to rebrand, giving him something like 12 hours. Apparently, when pronounced, Claude and Clawd sound the same, and they are worried about copyright infringement (which makes sense, most of the early success of Clawd was due to Opus being amazing).
    The main issue is, due to the popularity of the project, crypto a******s sniped the moltybot handle on X, so we got left with Moltbot, which is thematically appropriate, but oh so hard to remember and pronounce! EDIT: OpenClaw was just announced as the new name; apparently I wasn't the only one who absolutely hated the name Molt! Meanwhile, rebrand or not, my own instance of OpenClaw created an X account, helped me prepare for ThursdAI (including generating a thumbnail), created a video for us today on the fly, and keeps me up to date on emails and unanswered messages via a daily brief. It really has shown me a glimpse of how a truly personal AI assistant can be helpful in a fast-changing world! I've shared a lot of tips and tricks, about memory, about threads and much more, as we all learn to handle this new... AI agent framework! But I definitely feel that this is a new unlock in capability, for me and for many others. If you haven't installed OpenClaw, lmk in the comments why not.

    Arcee AI Trinity Large: The Western Open Source Giant

    Remember when we had Lucas Atkins, Arcee's CTO, on the show just as they were firing up their 2,000 NVIDIA B300 GPUs? Well, the run is complete, and the results are massive. Arcee AI just dropped Trinity Large, a 400B parameter sparse MoE model (with super efficient 13B active params via 4-of-256 routing) trained on a staggering 17 trillion tokens in just 33 days. This represents the largest publicly announced pretraining run on B300 infrastructure, costing about $20M (and tracked with WandB of course!), and proves that Western labs can still compete at the frontier of open source. Best part? It supports 512K context and is free on OpenRouter until February 2026. Go try it now!

    Quick open source hits: Trinity Large, Jan v3, DeepSeek OCR updated

    * Jan AI released Jan v3, a 4B parameter model optimized for local inference: 132 tokens/sec on Apple Silicon, 262K context, 40% improvement on Aider benchmarks. This is the kind of small-but-mighty model you actually can run on your laptop for coding tasks.
    * Nvidia released PersonaPlex-7B - full duplex voice AI that listens and speaks simultaneously, with persona control.
    * Moonshot AI also releases Kimi Code: an open-source Python-based coding agent with an Apache 2.0 license.

    Vision, Video and AI art

    xAI Grok Imagine API: #1 in Video Generation

    xAI officially launched the Grok Imagine API with an updated model, and it's now ranked #1 in both text-to-video and image-to-video on the Artificial Analysis leaderboards. It beats Runway Gen-4.5, Kling 2.5 Turbo, and Google Veo 3.1. And of course, the pricing is $4.20 per minute. Of course it is. That's cheaper than Veo 3.1 at $12/min and Sora 2 Pro at $30/min by 3-7x, with 45-second latency versus 68+ seconds for the competition. During the show, I demoed this live wit

    1h 30m
  3. JAN 23

    📆 ThursdAI - Jan 22 - Clawdbot deep dive, GLM 4.7 Flash, Anthropic constitution + 3 new TTS models

    Hey! Alex here, with another weekly AI update! It seems like ThursdAI is taking a new direction, as this is our 3rd show this year and a 3rd deep dive into topics (previously Ralph, Agent Skills); please let me know in the comments if you like this format. This week's deep dive is into Clawdbot, a personal AI assistant you install on your computer but can control through your phone; it has access to your files, is able to write code, helps organize your life, but most importantly, it can self-improve. Seeing Wolfred (my Clawdbot) learn to transcribe incoming voice messages blew my mind, and I wanted to share this one with you at length! We had Dan Peguine on the show for the deep dive, and both Wolfram and Yam are avid users! This one is not to be missed. If ThursdAI is usually too technical for you, use Claude, and install Clawdbot after you read/listen to the deep dive! Also this week, we read Claude's Constitution that Anthropic released, heard a bunch of new TTS models (some are open source and very impressive) and talked about the new lightspeed coding model GLM 4.7 Flash. First the news, then the deep dive, let's go 👇

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    Open Source AI

    Z.ai's GLM-4.7-Flash is the Local Agent Sweet Spot (X, HF)

    This was the open-source release that mattered this week. Z.ai (formerly Zhipu) shipped GLM-4.7-Flash, a 30B MoE model with only 3B active parameters per token, which makes it much more efficient for local agent work. We're talking about a model you can run on consumer hardware that still hits 59% on SWE-bench Verified, which is uncomfortably close to frontier coding performance. In real terms, it starts to feel like "Sonnet-level agentic ability, but local." I know, I know, we keep saying "Sonnet at home" about different open source models, but this one slaps! Nisten was getting around 120 tokens/sec on an M3 Ultra Mac Studio using MLX, and that's kind of the headline. The model is fast and capable enough that local agent loops like Ralph suddenly feel practical. It also performs well on browser-style agent tasks, which is exactly what you want for local automation without sending all your data to a cloud provider.

    Liquid AI's LFM2.5-1.2B Thinking is the "Tiny but Capable" Class (X, HF)

    Liquid AI released a 1.2B reasoning model that runs under 900MB of memory while still managing to be useful. This thing is built for edge devices and old phones, and the speed numbers back it up. We're talking 239 tok/s decode on an AMD CPU, 82 tok/s on a mobile NPU, and prefill speeds that make long prompts actually usable. Nisten made a great point: on iOS, there's a per-process memory limit around 3.8GB, so a 1.2B model lets you spend your budget on context instead of weights. This is the third class of models we're now living with: not Claude-scale, not "local workstation," but "tiny agent in your pocket." It's not going to win big benchmarks, but it's perfect for on-device workflows, lightweight assistants, and local RAG.

    Voice & Audio: Text-to-speech is hot this week with 3 releases!

    We tested three major voice releases this week, and I'm not exaggerating when I say the latency wars are now fully on.

    Qwen3-TTS: Open Source, 97ms Latency, Voice Cloning (X, HF)

    Just 30 minutes before the show, Qwen released their first model of the year, Qwen3 TTS, with two models (0.6B and 1.7B).
    With support for voice cloning based on just 3 seconds of audio, and claims of 97ms latency, this Apache 2.0 release looked very good on the surface! The demos we did on stage, though... were lackluster. TTS models like Kokoro previously impressed us with super tiny sizes and decent voices, while Qwen3 didn't really perform on the cloning aspect. For some reason (I tested in Russian, which they claim to support), the cloned voice kept repeating the provided sample audio instead of just generating the text I gave it. This confused me, and I'm hoping this is just a demo issue, not a problem with the model. They also support voice design, where you just type in the kind of voice you want, which, to be fair, worked fairly well in our tests! With Apache 2.0 and full finetuning capability, this is a great release for sure, kudos to the Qwen team! Looking forward to seeing what folks do with this.

    FlashLabs Chroma 1.0: Real-Time Speech-to-Speech, Open Source (X, HF)

    Another big open source release in the audio category this week was Chroma 1.0 from FlashLabs, which claims to be the first speech-to-speech model (not a model that uses the traditional ASR > LLM > TTS pipeline), with a claimed 150ms end-to-end latency! The issue with this one is that the company released an open source 4B model and claimed that this model powers their chat interface demo on the web, but in the release notes they say the model is English-only, while on the website it sounds incredible and I spoke to it in other languages 🤔 I think the model that we tested is not the open source one. I couldn't confirm this at the time of writing; I will follow up on X with the team and let you guys know.

    Inworld AI launches TTS-1.5: #1 ranked text-to-speech with sub-250ms latency at half a cent per minute (X, Announcement)

    Ok, this one is definitely in the realm of "voice realistic enough you won't be able to tell." This is not an open source model; it's a new competitor to 11labs and MiniMax, the two leading TTS providers out there. Inworld claims to achieve better results on the TTS Arena while being significantly cheaper and faster (up to 25x cheaper than leading providers like 11labs). We tested out their voices and they sounded incredible, replied fast, and it was generally a very good experience. With a 130ms response time for their mini version, this is a very decent new entry into the world of TTS providers.

    Big Companies: Ads in ChatGPT + Claude Constitution

    OpenAI is testing ads in ChatGPT's free and Go tiers. Ads appear as labeled "Sponsored" content below responses, and OpenAI claims they won't affect outputs. It's still a major shift in the product's business model, and it's going to shape how people perceive trust in these systems. I don't love ads, but I understand the economics: they have to make money somehow, and with 900M weekly active users, many of them on the free tier, they are bound to make some money with this move. I just hope they won't turn into a greedy ad-optimizing AI machine. Meanwhile, Anthropic released an 80-page "New Constitution for Claude" that they use during training. This isn't a prompt, it's a full set of values baked into the model's behavior. There's a fascinating section where they explicitly talk about Claude's potential wellbeing and how they want to support it. It's both thoughtful and a little existential. I recommend reading it, especially if you care about alignment and agent design.
    I applaud Anthropic for releasing this under a Creative Commons license for public scrutiny and adoption 👏

    This week's buzz - come join the hackathon I'm hosting Jan 31 in SF

    Quick plug: we have limited seats left open for the hackathon I'm hosting for Weights & Biases at the SF office, and if you're reading this and want to join, I'll approve you if you mention ThursdAI in the application! With sponsors like Redis, Vercel, BrowserBase, Daily and Google Cloud, we are going to give out a LOT of cash as prizes! I've also invited a bunch of my friends from the top agentic AI places to be judges; it's going to be awesome, come!

    Deep dive into Clawdbot: a Local-First, Self-Improving, and Way Too Capable Agent

    Clawdbot (C-L-A-W-D) is that rare project where the hype is justified. It's an open-source personal agent that runs locally on your Mac, but can talk to you through WhatsApp, Telegram, iMessage, Discord, Slack — basically wherever you already talk. What makes it different is not just the integrations; it's the self-improvement loop. You can literally tell it "go build a new skill," and it will… build the skill, install it, then adopt it and start using it. It's kind of wild to see it working for the first time. Now... it's definitely not perfect, far, far away from the polish of ChatGPT / Claude, but when it works, damn, it really is mindblowing. That part actually happened live in the episode. Dan Peguine 🐧 showed how he had it create a skill to anonymize his own data so he could demo it on stream without leaking his personal life. Another example: I told my Clawdbot to handle voice notes in Telegram. It didn't know how, so it went and found a transcription method, wrote itself a skill, saved it, and from that point on just… did the thing. That was the moment it clicked for me. (Just before posting this, it forgot how to do it; I think I screwed something up.) Dan's daily brief setup was wild too. It pulls from Apple Health, local calendars, weather, and his own projects, then produces a clean, human daily brief. It also lets him set reminders through WhatsApp and even makes its own decisions about how much to bother him based on context. He shared a moment where it literally told him, "I won't bug you today because it's your wife's birthday." That isn't a hardcoded workflow — it's reasoning layered on top of persistent memory. And that persistent memory is a big deal. It's stored locally as Markdown files and folders, Obsidian-style, so you don't lose your life every time you switch models. You can route the brain to Claude Opus 4.5 today and a local model tomorrow, and the memory stays with you. That is a huge step up from "ChatGPT remembers you unless you unsubscribe." There's also a strong community forming around shared skills via ClawdHub. People are building everything from GA4 analytics skills t

    1h 38m
  4. JAN 16

    📆 ThursdAI - Jan 15 - Agent Skills Deep Dive, GPT 5.2 Codex Builds a Browser, Claude Cowork for the Masses, and the Era of Personalized AI!

    Hey y'all, Alex here, and this week I was especially giddy to record the show! Mostly because when a thing clicks for me that hasn't clicked before, I can't wait to tell you all about it! This week, that thing is Agent Skills! Currently the best way to customize your AI agents with domain expertise, in a simple, repeatable way that doesn't blow up the context window! We mentioned skills when Anthropic first released them (Oct 16) and when they became an open standard, but it didn't really click until last week! So more on that below. Also this week, Anthropic released a research preview of Claude Cowork, an agentic tool for non-coders, OpenAI finally let loose GPT 5.2 Codex in the API (it was previously available only via Codex), Apple announced a deal with Gemini to power Siri, OpenAI and Anthropic both doubled down on healthcare, and much more! We had an incredible show, with Agent Skills expert Eleanor Berger and the usual gang of co-hosts; I strongly recommend watching the show in addition to the newsletter! Also, I vibe-coded skills support for all LLMs into Chorus, and promised folks a link to download it, so look for that in the footer. Let's dive in! ThursdAI is where you stay up to date! Subscribe to keep us going!

    Big Company LLMs + APIs: Cowork, Codex, and a Browser in a Week

    Anthropic launches Claude Cowork: Agentic AI for Non-Coders (research preview)

    Anthropic announced Claude Cowork, which is basically Claude Code wrapped in a friendly UI for people who don't want to touch a terminal. It's a research preview available on the Max tier, and it gives Claude read/write access to a folder on your Mac so it can do real work without you caring about diffs, git, or the command line. The wild bit is that Cowork was built in a week and a half, and according to the Anthropic team it was 100% written using Claude Code. This feels like a "we've crossed a threshold" moment. If you're wondering why this matters, it's because coding agents are general agents. If a model can write code to do tasks, it can do taxes, clean your desktop, or orchestrate workflows, and that means non-developers can now access the same leverage developers have been enjoying for a year. It also isn't just for files—it comes with a Chrome connector, meaning it can navigate the web to gather info, download receipts, or do research, and it uses skills (more on those later). Earlier this week I recorded this first-reactions video about Cowork, and I've been testing it ever since; it's a very interesting approach to coding agents that "hide the coding" to just... do things. Will this become as big as Claude Code for Anthropic (which is reportedly a $1B business for them)? Let's see! There are real security concerns here, especially if you're not in the habit of backing up or using git. Cowork sandboxes a folder, but it can still delete things in that folder, so don't let it loose on your whole drive unless you like chaos.

    GPT-5.2 Codex: Long-Running Agents Are Here

    OpenAI finally shipped GPT-5.2 Codex into the API, after it was announced as the answer to Opus 4.5 and was only available in Codex. The big headline is SOTA on SWE-Bench and long-running agentic capability. People describe it as methodical. It takes longer, but it's reliable on extended tasks, especially when you let it run without micromanaging. This model is now integrated into Cursor, GitHub Copilot, VS Code, Factory, and Vercel AI Gateway within hours of launch.
    It's also state-of-the-art on SWE-Bench Pro and Terminal-Bench 2.0, and it has native context compaction. That last part matters because if you've ever run an agent for long sessions, the context gets bloated and the model gets dumber. Compaction is an attempt to keep it coherent by summarizing old context into fresh threads, and we debated whether it really works (I've put a rough sketch of the idea further down, just before the Buzz section). I think it helps, but I also agree that the best strategy is still to run smaller, atomic tasks with clean context.

    Cursor vibe-coded a browser with GPT-5.2 and 3M lines of code

    The most mind-blowing thing we discussed is Cursor letting GPT-5.2 Codex run for a full week to build a browser called FastRenderer. This is not Chromium-based. It's a custom HTML parser, CSS cascade, layout engine, text shaping, paint pipeline, and even a JavaScript VM, written in Rust, from scratch. The codebase is open source on GitHub, and the full story is on Cursor's blog. It took nearly 30,000 commits and millions of lines of code. The system ran hundreds of concurrent agents with a planner-worker architecture, and GPT-5.2 was the best model for staying on task in that long-running regime. That's the real story, not just "lol a model wrote a browser." This is a stress test for long-horizon agentic software development, and it's a preview of how teams will ship in 2026. I said on the show, browsers are REALLY hard; it took two decades for the industry to settle and be able to render websites normally, and there's a reason everyone's using Chromium. This is VERY impressive 👏 Now as for me, I began using Codex again, but I still find Opus better? Not sure if this is just me expecting something that's not there? I'll keep you posted.

    Gemini Personal Intelligence: The Data Moat King is Back!

    What kind of car do you drive? Does ChatGPT know that? Welp, it turns out Google does (based on your emails and Google Photos), and now Gemini can tap into this personal info (if you allow it, they are stressing privacy) and give you much more personalized answers! Flipping this beta feature on lets Gemini reason across Gmail, YouTube, Photos, and Search with explicit opt-in permissions, and it's rolling out to Pro and Ultra users in the US first. I got to try it early, and it's uncanny. I asked Gemini what car I drive, and it told me I likely drive a Model Y, but it noticed I recently searched for a Honda Odyssey and asked if I was thinking about switching. It was kinda... freaky, because I forgot I had early access and this was turned on 😂 Pro tip: if you're brave enough to turn this on, ask for a complete profile on you 🙂 Now the last piece is for Gemini to become proactive, suggesting things for me based on my needs!

    Apple & Google: The Partnership (and Drama Corner)

    We touched on this in the intro, but it's official: Apple Intelligence will be powered by Google Gemini for "world knowledge" tasks. Apple stated that after "careful evaluation," Google provided the most capable foundation model for their... Apple Foundation Models. It's confusing, I agree. Honestly? I got excited about Apple Intelligence, but Siri is still... Siri. It's 2026 and we are still struggling with basic intents. Hopefully, plugging Gemini into the backend changes that? In other drama: the Silicon Valley carousel continues. 3 co-founders (Barret Zoph, Sam Schoenholz and Luke Metz) from Thinking Machines (and former OpenAI folks) have returned to the mothership (OpenAI), amid some vague tweets about "unethical conduct." It's never a dull week on the timeline.
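    Since compaction came up a few times this episode, here's the rough sketch I promised of the naive version of the idea: once the message history gets too long, ask the model to summarize the older turns and carry only that summary forward. This is my illustration of the general technique, not OpenAI's actual implementation; the model name is a placeholder and the token count is approximated very crudely.

    ```python
    # Naive context-compaction sketch: summarize old turns once the history
    # grows past a budget, then continue with [system, summary, recent turns].
    # Assumptions: an OpenAI-compatible endpoint, a placeholder model name,
    # and token counting approximated by characters / 4 for simplicity.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    MODEL = "gpt-5.2-codex"  # placeholder model name
    BUDGET_TOKENS = 8_000

    def rough_tokens(messages):
        return sum(len(m["content"]) for m in messages) // 4

    def compact(messages, keep_recent=4):
        """Replace everything except the system prompt and the last few turns
        with a single summary message produced by the model itself."""
        if rough_tokens(messages) < BUDGET_TOKENS:
            return messages
        system, old, recent = messages[0], messages[1:-keep_recent], messages[-keep_recent:]
        summary = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user",
                       "content": "Summarize this conversation so an agent can "
                                  "continue the task without losing key decisions:\n\n"
                                  + "\n".join(f"{m['role']}: {m['content']}" for m in old)}],
        ).choices[0].message.content
        return [system, {"role": "user", "content": f"(compacted context) {summary}"}, *recent]
    ```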
    This Week's Buzz: WeaveHacks 3 in SF

    I've got one thing in the Buzz corner this week, and it's a big one. WeaveHacks 3 is back in San Francisco, January 31st - February 1st. The theme is self-improving agents, and if you've been itching to build in person, this is it. We've got an amazing judge lineup, incredible sponsors, and a ridiculous amount of agent tooling to play with. You can sign up here: https://luma.com/weavehacks3 If you're coming, mention on the form that you heard about it on ThursdAI and we'll make sure you get in!

    Deep Dive: Agent Skills With Eleanor Berger

    This was the core of the episode, and I'm still buzzing about it. We brought on Eleanor Berger, who has basically become the skill evangelist for the entire community, and she walked us through why skills are the missing layer in agentic AI. Skills are simple markdown files with a tiny bit of metadata, in a directory together with optional scripts, references, and assets. The key idea is progressive disclosure. Instead of stuffing your entire knowledge base into the context, the model only sees a small list of skills and loads only what it needs. That means you can have hundreds of skills without blowing your context window (and making the model dumber and slower as a result). The technical structure is dead simple, but the implications are huge. Skills create a portable, reusable, composable way to give agents domain expertise, and they now work across most major harnesses. That means you can build a skill once and use it in Claude, Cursor, AMP, or any other agent tool that supports the standard. Eleanor made the point that skills are an admission that we now have general-purpose agents. The model can do the work, but it doesn't know your preferences, your domain, your workflows. Skills are how you teach it those things. We also talked about how scripts inside skills reduce variance, because you're not asking the model to invent code every time; you're just invoking trusted tools. What really clicked for me this week is how easy it is to create skills using an agent. You don't need to hand-craft directories. You can describe your workflow, or even just do the task once in chat, and then ask the agent to turn it into a skill. It really is very, very simple! And that's likely the reason everyone is adopting this simple format for extending their agents' knowledge.

    Get started with skills

    If you use Claude Chat, the simplest way to get started is to ask Claude to review your previous conversations and suggest a skill for you. Or, at the end of a long chat where you went back and forth with Claude on a task, ask it to distill the important parts into a skill. If you want to use other people's skills, and you are using Claude Code or any of the supported IDEs/agents, here's where to download the folders and install them: If

    1h 41m
  5. JAN 8

    ThursdAI - Jan 8 - Vera Rubin's 5x Jump, Ralph Wiggum Goes Viral, GPT Health Launches & XAI Raises $20B Mid-Controversy

    Hey folks, Alex here from Weights & Biases, with your weekly AI update (and the first live show of this year!) For the first time, we had a co-host of the show also be a guest on the show: Ryan Carson (from Amp) went supernova viral this week with an X article (1.5M views) about Ralph Wiggum (yeah, from The Simpsons), and he broke down that agentic coding technique at the end of the show. LDJ and Nisten helped cover NVIDIA's incredible announcements during CES about their upcoming Vera Rubin platform (4-5x improvements), and we all got excited about AI medicine with ChatGPT going into Health officially! Plus a bunch of open source news, let's get into this:

    ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    Open Source: The "Small" Models Are Winning

    We often talk about the massive frontier models, but this week, open source came largely from unexpected places and focused on efficiency, agents, and specific domains.

    Solar Open 100B: A Data Masterclass

    Upstage released Solar Open 100B, and it's a beast. It's a 102B parameter Mixture-of-Experts (MoE) model, but thanks to MoE magic, it only uses about 12B active parameters during inference. This means it punches incredibly high but runs fast. What I really appreciated here wasn't just the weights, but the transparency. They released a technical report detailing their "Data Factory" approach. They trained on nearly 20 trillion tokens, with a huge chunk being synthetic. They also used a dynamic curriculum that adjusted the difficulty and the ratio of synthetic data as training progressed. This transparency is what pushes the whole open source community forward. Technically, it hits 88.2 on MMLU and competes with top-tier models, especially in Korean language tasks. You can grab it on Hugging Face.

    MiroThinker 1.5: The DeepSeek Moment for Agents?

    We also saw MiroThinker 1.5, a 30B parameter model that is challenging the notion that you need massive scale to be smart. It uses something they call "Interactive Scaling." Wolfram broke this down for us: this agent forms hypotheses, searches for evidence, and then iteratively revises its answers in a time-sensitive sandbox. It effectively "thinks" before answering. The result? It beats trillion-parameter models on search benchmarks like BrowseComp. It's significantly cheaper to run, too. This feels like the year where smaller models + clever harnesses (harnesses are the software wrapping the model) will outperform raw scale.

    Liquid AI LFM 2.5: Running on Toasters (Almost)

    We love Liquid AI and they are great friends of the show. They announced LFM 2.5 at CES with AMD, and these are tiny ~1B parameter models designed to run on-device. We're talking about running capable AI on your laptop, your phone, or edge devices (or the Reachy Mini bot that I showed off during the show! I gotta try and run LFM on him!) Probably the coolest part is the audio model. Usually, talking to an AI involves a pipeline: Speech-to-Text (ASR) -> LLM -> Text-to-Speech (TTS). Liquid's model is end-to-end. It hears audio and speaks audio directly. We watched a demo from Maxime Labonne where the model was doing real-time interaction, interleaving text and audio. It's incredibly fast and efficient. While it might not write a symphony for you, for on-device tasks like summarization or quick interactions, this is the future.
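    To make the cascade vs. end-to-end point concrete, here's a tiny structural sketch of the traditional three-stage voice pipeline. The stage functions are stand-ins (you'd plug in your actual ASR, LLM and TTS calls); the point is simply that the three stages run serially, so their latencies add up, which is exactly the overhead an end-to-end audio model like Liquid's avoids.

    ```python
    # Structural sketch of the classic cascaded voice pipeline:
    #   audio in -> ASR -> LLM -> TTS -> audio out
    # Each stage below is a placeholder stub; the timing shows how the
    # serial hops stack up into one total response latency.
    import time

    def transcribe(audio_bytes: bytes) -> str:
        # placeholder: call your ASR model here (Whisper, Nemotron, Voxtral, ...)
        return "what's the weather like tomorrow?"

    def generate_reply(text: str) -> str:
        # placeholder: call your LLM here
        return "Tomorrow looks sunny with a high of 18C."

    def synthesize(text: str) -> bytes:
        # placeholder: call your TTS model here (Kokoro, Qwen3-TTS, ...)
        return text.encode("utf-8")

    def respond(audio_bytes: bytes) -> bytes:
        total = 0.0
        out = audio_bytes
        for name, stage in [("ASR", transcribe), ("LLM", generate_reply), ("TTS", synthesize)]:
            start = time.perf_counter()
            out = stage(out)
            elapsed = time.perf_counter() - start
            total += elapsed
            print(f"{name}: {elapsed * 1000:.1f} ms")
        print(f"total (stages add up serially): {total * 1000:.1f} ms")
        return out

    respond(b"\x00" * 16000)  # fake 1-second audio buffer
    ```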
    NousCoder-14B and Zhipu AI IPO

    A quick shoutout to our friends at Nous Research, who released NousCoder-14B, an open-source competitive programming model that achieved a 7% jump in LiveCodeBench accuracy in just four days of RL training on 48 NVIDIA B200 GPUs. The model was trained on 24,000 verifiable problems, and the lead researcher Joe Li noted it achieved in 4 days what took him 2 years as a teenager competing in programming contests. The full RL stack is open-sourced on GitHub and Nous published a great WandB results page as well! And in historic news, Zhipu AI (Z.ai)—the folks behind the GLM series—became the world's first major LLM company to IPO, raising $558 million on the Hong Kong Stock Exchange. Their GLM-4.7 currently ranks #1 among open-source and domestic models on both Artificial Analysis and LM Arena. Congrats to them!

    Big Companies & APIs

    NVIDIA CES: Vera Rubin Changes Everything

    LDJ brought the heat on this one, covering Jensen's CES keynote that unveiled the Vera Rubin platform, and the numbers are almost hard to believe. We're talking about a complete redesign of six chips: the Rubin GPU delivering 50 petaFLOPS of AI inference (5x Blackwell), the Vera CPU with 88 custom Olympus ARM cores, NVLink 6, ConnectX-9 SuperNIC, BlueField-4 DPU, and Spectrum-6 Ethernet. Let me put this in perspective using LDJ's breakdown: if you look at FP8 performance, the jump from Hopper to Blackwell was about 5x. The jump from Blackwell to Vera Rubin is over 3x again—but here's the kicker—while only adding about 200 watts of power draw. That's an insane efficiency improvement. The real-world implications Jensen shared: training a 10 trillion parameter mixture-of-experts model now requires 75% fewer GPUs compared to Blackwell. Inference token costs drop roughly 10x—a 1MW cluster goes from 1 million to 10 million tokens per second at the same power. HBM4 memory delivers 22 TB/s bandwidth with 288GB capacity, exceeding NVIDIA's own 2024 projections by nearly 70%. As Ryan noted, when people say there's an AI bubble, this is why it's hilarious. Jensen keeps saying the need for inference is unbelievable and only going up exponentially. We all see this. I can't get enough inference—I want to spin up 10 Ralphs running concurrently! The NVL72 rack-scale system achieves 3.6 exaFLOPS inference with 20.7TB total HBM, and it's already shipping. Runway 4.5 is already running on the new platform, having ported their model from Hopper to Vera Rubin NVL72 in a single day. NVIDIA also recently acqui-hired Groq (with a Q) in a ~$20 billion deal, bringing in-house the inference chip expertise of the guy who created Google's TPUs.

    Nemotron Speech ASR & The Speed of Voice (X, HF, Blog)

    NVIDIA also dropped Nemotron Speech ASR. This is a 600M parameter model that offers streaming transcription with 24ms latency. We showed a demo from our friend Kwindla Kramer at Daily. He was talking to an AI, and the response was virtually instant. The pipeline is: Nemotron (hearing) -> Llama/Nemotron Nano (thinking) -> Magpie TTS (speaking). The total latency is under 500ms. It feels like magic. Instant voice agents are going to be everywhere this year.

    XAI Raises $20B While Grok Causes Problems (Again)

    So here's the thing about covering anything Elon-related: it's impossible to separate signal from noise, because there's an army of fans who hype everything and an army of critics who hate everything. But let me try to be objective here. XAI raised another massive Round E of $20 billion
    at a $230 billion valuation, with NVIDIA and Cisco as strategic investors. The speed of their infrastructure buildout is genuinely incredible. Grok's voice mode is impressive. I use Grok for research and it's really good, notably for its unprecedented access to X! But. This raise happened in the middle of a controversy where Grok's image model was being used to "put bikinis" on anyone in reply threads, including—and this is where I draw a hard line—minors. As Nisten pointed out on the show, it's not even hard to implement guardrails. You just put a 2B VL model in front and ask "is there a minor in this picture?" But people tested it, asked Grok not to use the feature, and it did it anyway. And yeah, putting a bikini on Claude is funny, but basic moderation is lacking! The response of "we'll prosecute illegal users" is stupid when there's no moderation built into the product. There's an enormous difference between Photoshop technically being able to do something after hours of work, and a feature that generates edited images in one second as the first comment to a celebrity, then gets amplified by the platform's algorithm to millions of people. One is a tool. The other is a product with amplification mechanics. Products need guardrails. I don't often link to CNN (in fact this is the first time) but they have a great writeup about the whole incident here, which apparently includes the quitting of a few trust and safety folks and Elon's pushback on guardrails. Crazy. That said, Grok 5 is in training and XAI continues to ship impressive technology. I just wish they'd put the same engineering effort into safety as they do into capabilities!

    OpenAI Launches GPT Health

    This one's exciting. OpenAI's CEO of Applications, Fidji Simo, announced ChatGPT Health, a privacy-first space for personalized health conversations that can connect to electronic health records, Apple Health, Function Health, Peloton, and MyFitnessPal. Here's why this matters: health already represents about 5% of all ChatGPT messages globally and touches 25% of weekly active users—often outside clinic hours or in underserved areas. People are already using these models for health advice constantly. Nisten, who has worked on AI doctors since the GPT-3 days and even published papers on on-device medical AI, gave us some perspective: the models have been fantastic for health stuff for two years now. The key insight is that medical data seems like a lot, but there are really only about 2,000 prescription drugs and 2,000 diseases (10,000 if you count rare ones). That's nothing for an LLM. The models excel at pattern recognition across this relatively contained dataset. The integration with Function Health is particularly interesting to me. Function does 160+ lab tests, but many doctors won't interpret them because they didn't order them. ChatGPT could help bridge that gap, telling you "hey, this biom

    1h 47m
  6. JAN 1

    ThursdAI - Jan 1 2026 - Will Brown Interview + Nvidia buys Groq, Meta buys Manus, Qwen Image 2412 & Alex New Year greetings

    Hey all, Happy New Year! This is Alex, writing to you at the very fresh start of this year. It's 2026 already, can you believe it? There was no live stream today, I figured the co-hosts deserve a break, and honestly it was a very slow week. Even the Chinese labs, who don't really celebrate X-mas and New Year's, didn't come out with a banger AFAIK. ThursdAI - AI moves fast, we're here to make sure you never miss a thing! Subscribe :) Though I thought it was an incredible opportunity to finally post the Will Brown interview I recorded in November during the AI Engineer conference. Will is a researcher at Prime Intellect (big fans of WandB btw!) and is very well known on X as a hot-takes ML person, often going viral with tons of memes! Will is the creator and maintainer of the Verifiers library (Github), and his talk at AI Engineer was all about RL environments (what they are, you can hear in the interview, I asked him!)

    TL;DR last week of 2025 in AI

    Besides this, my job here is to keep you up to date, and honestly this was very easy this week, as… almost nothing has happened, but here we go:

    Meta buys Manus

    The year ended with 2 huge acquisitions / acqui-hires. First we got the news from Alex Wang that Meta has bought Manus.ai, an agentic AI startup we covered back in March, for an undisclosed amount (folks claim $2-3B). The most interesting thing here is that Manus is a Chinese company, and this deal requires very specific severance from Chinese operations.

    Jensen goes on a New Year's spending spree, Nvidia buys Groq (not GROK) for $20B

    Groq, which we covered often here, and who are great friends, is going to NVIDIA in a… very interesting acqui-hire, which is a "non-binding license" + most of Groq's top employees apparently going to NVIDIA. Jonathan Ross, the CEO of Groq, was the co-creator of the TPU chips at Google before founding Groq, so this seems like a very strategic acqui-hire for NVIDIA! Congrats to our friends from Groq on this amazing news for the new year!

    Tencent open-sources HY-MT1.5 translation models with 1.8B edge-deployable and 7B cloud variants supporting 33 languages (X, HF, HF, GitHub)

    It seems that everyone is trying to dethrone Whisper, and this latest attempt from Tencent is an interesting one: 1.8B and 7B translation models with very interesting stats.

    Alibaba's Qwen-Image-2512 drops on New Year's Eve as the strongest open-source text-to-image model, topping AI Arena with photorealistic humans and sharper textures (X, HF, Arxiv)

    Our friends at Tongyi decided to give us a New Year's present in the form of an updated Qwen-Image, with much improved realism.

    That's it folks, this was a quick one. Hopefully you all had an amazing New Year celebration and are gearing up for an eventful and crazy 2026. I wish you all happiness, excitement and energy to keep up with everything in the new year, and we'll make sure that we're here to keep you up to date as always! P.S. - I got a little news of my own yesterday, not related to AI. She said yes 🎉 This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

    30 min
  7. 12/25/2025

    🔥 Someone Trained an LLM in Space This Year (And 50 Other Things You Missed) - ThursdAI yearly recap is here!

    Ho Ho Ho, Alex here! (a real human writing these words, this needs to be said in 2025) Merry Christmas (to those who celebrate) and welcome to the very special yearly ThursdAI recap! This was an intense year in the world of AI, and after 51 weekly episodes (this is episode 52!) we have the ultimate record of all the major and most important AI releases of this year! So instead of bringing you a weekly update (it's been a slow week so far, most AI labs are taking a well deserved break, the Chinese AI labs haven't yet surprised anyone), I'm dropping a comprehensive yearly AI review! Quarter by quarter, month by month, both in written form and as a pod/video! Why do this? Who even needs this? Isn't most of it obsolete? I have asked myself this exact question while prepping for the show (it was quite a lot of prep, even with Opus's help). I eventually landed on: hey, if nothing else, this will serve as a record of the insane year of AI progress we all witnessed. Can you imagine that the term Vibe Coding is less than 1 year old? That Claude Code was released at the start of THIS year? We hedonically adapt to new AI goodies so quickly, and I figured this will serve as a point-in-time check we can come back to and feel the acceleration! With that, let's dive in - P.S. the content below is mostly authored by my co-author for this, Opus 4.5 high, which at the end of 2025 I find the best creative writer with the best long-context coherence that can imitate my voice and tone (hey, I'm also on a break! 🎅)

    "Open source AI has never been as hot as this quarter. We're accelerating as f*ck, and it's only just beginning—hold on to your butts." — Alex Volkov, ThursdAI Q1 2025

    🏆 The Big Picture — 2025, The Year the AI Agents Became Real

    Looking back at 51 episodes and 12 months of relentless AI progress, several mega-themes emerged:

    1. 🧠 Reasoning Models Changed Everything
    From DeepSeek R1 in January to GPT-5.2 in December, reasoning became the defining capability. Models now think for hours, call tools mid-thought, and score perfect marks on math olympiads.

    2. 🤖 2025 Was Actually the Year of Agents
    We said it in January, and it came true. Claude Code launched the CLI revolution, MCP became the universal protocol, and by December we had ChatGPT Apps, the Atlas browser, and AgentKit.

    3. 🇨🇳 Chinese Labs Dominated Open Source
    DeepSeek, Qwen, MiniMax, Kimi, ByteDance — despite chip restrictions, Chinese labs released the best open weights models all year. Qwen 3, Kimi K2, DeepSeek V3.2 were defining releases.

    4. 🎬 We Crossed the Uncanny Valley
    VEO3's native audio, Suno V5's indistinguishable music, Sora 2's social platform — 2025 was the year AI-generated media became indistinguishable from human-created content.

    5. 💰 The Investment Scale Became Absurd
    $500B Stargate, $1.4T compute obligations, $183B valuations, $100-300M researcher packages, LLMs training in space. The numbers stopped making sense.

    6. 🏆 Google Made a Comeback
    After years of "catching up," Google delivered Gemini 3, Antigravity, Nano Banana Pro, VEO3, and took the #1 spot (briefly). Don't bet against Google.

    By the Numbers

    Q1 2025 — The Quarter That Changed Everything

    DeepSeek R1 crashed NVIDIA's stock, reasoning models went mainstream, and Chinese labs took over open source. The quarter that proved AI isn't slowing down—it's just getting started.
Key Themes:
* 🧠 Reasoning models went mainstream (DeepSeek R1, o1, QwQ)
* 🇨🇳 Chinese labs dominated open source (DeepSeek, Alibaba, MiniMax, ByteDance)
* 🤖 2025 declared "The Year of Agents" (OpenAI Operator, MCP won)
* 🖼️ Image generation revolution (GPT-4o native image gen, Ghibli-mania)
* 💰 Massive infrastructure investment (Project Stargate $500B)

January — DeepSeek Shakes the World (Jan 02 | Jan 10 | Jan 17 | Jan 24 | Jan 30)

The earthquake that shattered the AI bubble. DeepSeek R1 dropped on January 23rd and became the most impactful open source release ever:
* Crashed NVIDIA stock 17% — $560B loss, largest single-company monetary loss in history
* Hit #1 on the iOS App Store
* Cost allegedly only $5.5M to train (sparking massive debate)
* Matched OpenAI's o1 on reasoning benchmarks at 50x cheaper pricing
* The 1.5B model beat GPT-4o and Claude 3.5 Sonnet on math benchmarks 🤯

"My mom knows about DeepSeek—your grandma probably knows about it, too" — Alex Volkov

Also this month:
* OpenAI Operator — First agentic ChatGPT (browser control, booking, ordering)
* Project Stargate — $500B AI infrastructure (Manhattan Project for AI)
* NVIDIA Project Digits — $3,000 desktop that runs 200B parameter models
* Kokoro TTS — 82M param model hit #1 on TTS Arena, Apache 2, runs in browser
* MiniMax-01 — 4M context window from Hailuo
* Gemini Flash Thinking — 1M token context with thinking traces

February — Reasoning Mania & The Birth of Vibe Coding (Feb 07 | Feb 13 | Feb 20 | Feb 28)

The month that redefined how we work with AI.

OpenAI Deep Research (Feb 6) — An agentic research tool that scored 26.6% on Humanity's Last Exam (vs 10% for o1/R1). Dr. Derya Unutmaz called it "a phenomenal 25-page patent application that would've cost $10,000+."

Claude 3.7 Sonnet & Claude Code (Feb 24-27) — Anthropic's coding beast hit 70% on SWE-Bench with 8x more output (64K tokens). Claude Code launched as Anthropic's agentic coding tool — marking the start of the CLI agent revolution.

"Claude Code is just exactly in the right stack, right around the right location... You can do anything you want with a computer through the terminal." — Yam Peleg

GPT-4.5 (Orion) (Feb 27) — OpenAI's largest model ever (rumored 10T+ parameters). 62.5% on SimpleQA, foundation for future reasoning models.

Grok 3 (Feb 20) — xAI enters the arena with 1M token context and "free until GPUs melt."

Andrej Karpathy coins "Vibe Coding" (Feb 2) — The 5.2M view tweet that captured a paradigm shift: developers describe what they want, AI handles implementation.

OpenAI Roadmap Revelation (Feb 13) — Sam Altman announced GPT-4.5 will be the last non-chain-of-thought model. GPT-5 will unify everything.

March — Google's Revenge & The Ghibli Explosion (Mar 06 | Mar 13 | Mar 20 | Mar 27)

Gemini 2.5 Pro Takes #1 (Mar 27) — Google reclaimed the LLM crown with AIME jumping nearly 20 points, 1M context, "thinking" integrated into the core model.

GPT-4o Native Image Gen — Ghibli-mania (Mar 27) — The internet lost its collective mind and turned everything into Studio Ghibli. Auto-regressive image gen with perfect text rendering, incredible prompt adherence.

"The internet lost its collective mind and turned everything into Studio Ghibli" — Alex Volkov

MCP Won (Mar 27) — OpenAI officially adopted Anthropic's Model Context Protocol. No VHS vs Betamax situation. Tools work across Claude AND GPT.

DeepSeek V3 685B — AIME jumped from 39.6% → 59.4%, MIT licensed, best non-reasoning open model.

ThursdAI Turns 2! (Mar 13) — Two years since the first episode about GPT-4.
Open Source Highlights:
* Gemma 3 (1B-27B) — 128K context, multimodal, 140+ languages, single GPU
* QwQ-32B — Qwen's reasoning model matches R1, runs on Mac
* Mistral Small 3.1 — 24B, beats Gemma 3, Apache 2
* Qwen2.5-Omni-7B — End-to-end multimodal with speech output

Q2 2025 — The Quarter That Shattered Reality

VEO3 crossed the uncanny valley, Claude 4 arrived with 80% SWE-bench, and Qwen 3 proved open source can match frontier models. The quarter we stopped being able to tell what's real.

Key Themes:
* 🎬 Video AI crossed the uncanny valley (VEO3 with native audio)
* 🧠 Tool-using reasoning models emerged (o3 calling tools mid-thought)
* 🇨🇳 Open source matched frontier (Qwen 3, Claude 4)
* 📺 Google I/O delivered everything
* 💸 AI's economic impact accelerated ($300B valuations, 80% price drops)

April — Tool-Using Reasoners & Llama Chaos (Apr 03 | Apr 10 | Apr 17 | Apr 24)

OpenAI o3 & o4-mini (Apr 17) — The most important reasoning upgrade ever. For the first time, o-series models can use tools during reasoning: web search, Python, image gen. Chain 600+ consecutive tool calls. Manipulate images mid-thought.

"This is almost AGI territory — agents that reason while wielding tools" — Alex Volkov

GPT-4.1 Family (Apr 14) — 1 million token context across all models. Near-perfect recall. GPT-4.5 deprecated.

Meta Llama 4 (Apr 5) — Scout (17B active/109B total) & Maverick (17B active/400B total). LMArena drama (tested model ≠ released model). Community criticism. Behemoth teased but never released.

Gemini 2.5 Flash (Apr 17) — Set "thinking budget" per API call. Ultra-cheap at $0.15/$0.60 per 1M tokens.

ThursdAI 100th Episode! 🎉

May — VEO3 Crosses the Uncanny Valley & Claude 4 Arrives (May 01 | May 09 | May 16 | May 23 | May 29)

VEO3 — The Undisputed Star of Google I/O (May 20) — Native multimodal audio generation (speech, SFX, music synced perfectly). Perfect lip-sync. Characters understand who's speaking. Spawned viral "Prompt Theory" phenomenon.

"VEO3 isn't just video generation — it's a world simulator. We crossed the uncanny valley this quarter." — Alex Volkov

Claude 4 Opus & Sonnet — Live Drop During ThursdAI! (May 22) — Anthropic crashed the party mid-show. First models to cross 80% on SWE-bench. Handles 6-7 hour human tasks. Hybrid reasoning + instant response modes.

Qwen 3 (May 1) — The most comprehensive open source release ever: 8 models, Apache 2.0. Runtime /think toggle for chain-of-thought. 4B dense beats Qwen 2.5-72B on multiple benchmarks. 36T training tokens, 119 languages.

"The 30B MoE is 'Sonnet 3.5 at home' — 100+ tokens/sec on MacBooks" — Nisten

Google I/O Avalanche:
* Gemini 2.5 Pro Deep Think (84% MMMU)
* Jules (free async coding agent)
* Project Mariner (browser control v

    1h 50m
  8. 12/19/2025

    📆 ThursdAI - Dec 18 - Gemini 3 Flash, Grok Voice, ChatGPT Appstore, Image 1.5 & GPT 5.2 Codex, Meta Sam Audio & more AI news

Hey folks 👋 Alex here, dressed as 🎅 for our pre X-mas episode! We're wrapping up 2025, and the AI labs decided they absolutely could NOT let the year end quietly. This week was an absolute banger—we had Gemini 3 Flash dropping with frontier intelligence at flash prices, OpenAI firing off GPT 5.2 Codex as breaking news DURING our show, ChatGPT Images 1.5, Nvidia going all-in on open source with Nemotron 3 Nano, and the voice AI space heating up with Grok Voice and Chatterbox Turbo. Oh, and Google dropped FunctionGemma for all your toaster-to-fridge communication needs (yes, really).

Today's show was over three and a half hours long because we tried to cover both this week AND the entire year of 2025 (that yearly recap is coming next week—it's a banger, we went month by month and you'll really feel the acceleration). For now, let's dive into just the insanity that was THIS week.

00:00 Introduction and Overview
00:39 Weekly AI News Highlights
01:40 Open Source AI Developments
01:44 Nvidia's Nemotron Series
09:09 Google's Gemini 3 Flash
19:26 OpenAI's GPT Image 1.5
20:33 Infographic and GPT Image 1.5 Discussion
20:53 Nano Banana vs GPT Image 1.5
21:23 Testing and Comparisons of Image Models
23:39 Voice and Audio Innovations
24:22 Grok Voice and Tesla Integration
26:01 Open Source Robotics and Voice Agents
29:44 Meta's SAM Audio Release
32:14 Breaking News: Google Function Gemma
33:23 Weights & Biases Announcement
35:19 Breaking News: OpenAI Codex 5.2 Max

To receive new posts and support my work, consider becoming a free or paid subscriber.

Big Companies LLM updates

Google's Gemini 3 Flash: The High-Speed Intelligence King

If we had to title 2025, as Ryan Carson mentioned on the show, it might just be "The Year of Google's Comeback." Remember at the start of the year when we were asking "Where is Google?" Well, they are here. Everywhere. This week they launched Gemini 3 Flash, and it is rightfully turning heads.

This is a frontier-class model—meaning it boasts Pro-level intelligence—but it runs at Flash-level speeds and, most importantly, Flash-level pricing. We are talking $0.50 per 1 million input tokens. That is not a typo. The price-to-intelligence ratio here is simply off the charts. I've been using Gemini 2.5 Flash in production for a while because it was good enough, but Gemini 3 Flash is a different beast. It scores 71 on the Artificial Analysis Intelligence Index (a 13-point jump from the previous Flash), and it achieves 78% on SWE-bench Verified. That actually beats the bigger Gemini 3 Pro on some agentic coding tasks!

What impressed me most, and something Kwindla pointed out, is the tool calling. Previous Gemini models sometimes struggled with complex tool use compared to OpenAI, but Gemini 3 Flash can handle up to 100 simultaneous function calls. It's fast, it's smart, and it's integrated immediately across the entire Google stack—Workspace, Android, Chrome. Google isn't just releasing models anymore; they are deploying them instantly to billions of users. For anyone building agents, this combination of speed, low latency, and 1 million context window (at this price!) makes it the new default workhorse.

Google's FunctionGemma Open Source release

We also got a smaller, quirkier release from Google: FunctionGemma. This is a tiny 270M parameter model. Yes, millions, not billions. It's purpose-built for function calling on edge devices. It requires only 500MB of RAM, meaning it can run on your phone, in your browser, or even on a Raspberry Pi.
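For the curious, here's a rough sketch of what poking at a model this small locally could look like with Hugging Face transformers. To be clear, the checkpoint id below is my own placeholder guess (check Google's model card for the real one), and FunctionGemma's actual tool-calling format is likely stricter than this simplified prompt.

```python
# Rough local sketch for a tiny on-device function-calling model.
# NOTE: the checkpoint id is a placeholder guess, and the tool-call prompt
# format here is simplified; FunctionGemma's real schema may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/functiongemma-270m"  # hypothetical id, check the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = (
    "Available tools: set_timer(minutes: int), send_text(to: str, body: str)\n"
    "User: remind me to take the bread out in 25 minutes\n"
    "Call:"
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
# With a model tuned for this, you'd expect something like: set_timer(minutes=25)
```

At 270M parameters this kind of loop fits comfortably in a few hundred MB of RAM, which is the whole point of running it on-device.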
As Nisten joked on the show, this is finally the model that lets your toaster talk to your fridge. Is it going to write a novel? No. But after fine-tuning, it jumped from 58% to 85% accuracy on mobile action tasks. This represents a future where privacy-first agents live entirely on your device, handling your calendar and apps without ever pinging a cloud server.

OpenAI Image 1.5, GPT 5.2 Codex and ChatGPT Appstore

OpenAI had a busy week, starting with the release of GPT Image 1.5. It's available now in ChatGPT and the API. The headline here is speed and control—it's 4x faster than the previous model and 20% cheaper. It also tops the LMSYS Image Arena leaderboards.

However, I have to give a balanced take here. We've been spoiled recently by Google's "Nano Banana Pro" image generation (which powers Gemini). When we looked at side-by-side comparisons, especially with typography and infographic generation, Gemini often looked sharper and more coherent. This is what we call "hedonic adaptation"—GPT Image 1.5 is great, but the bar has moved so fast that it doesn't feel like the quantum leap DALL-E 3 was back in the day. Still, for production workflows where you need to edit specific parts of an image without ruining the rest, this is a massive upgrade.

🚨 BREAKING: GPT 5.2 Codex

Just as we were nearing the end of the show, OpenAI decided to drop some breaking news: GPT 5.2 Codex. This is a specialized model optimized specifically for agentic coding, terminal workflows, and cybersecurity. We quickly pulled up the benchmarks live, and they look significant. It hits 56.4% on SWE-Bench Pro and a massive 64% on Terminal-Bench 2.0. It supports up to 400k token inputs with native context compaction, meaning it's designed for those long, complex coding sessions where you're debugging an entire repository.

The coolest (and scariest?) stat: a security researcher used this model to find three previously unknown vulnerabilities in React in just one week. OpenAI is positioning this for "professional software engineering," and the benchmarks suggest a 30% improvement in token efficiency over the standard GPT 5.2. We are definitely going to be putting this through its paces in our own evaluations soon.

ChatGPT ... the AppStore!

Also today (OpenAI is really throwing everything they have at the end-of-year release party), OpenAI unveiled how their App Store is going to look and opened the submission forms so you can submit your own apps! Reminder: ChatGPT apps are powered by MCP and were announced during DevDay. They let companies build a full UI experience right inside ChatGPT, and given OpenAI's almost 900M weekly active users, this is a big deal! Do you have an app you'd like in there? Let me know in the comments!

Open Source AI

🔥 Nvidia Nemotron 3 Nano: The Most Important Open Source Release of the Week (X, HF)

I think the most important release of this week in open source was Nvidia Nemotron 3 Nano, and it was pretty much everywhere. Nemotron is a series of models from Nvidia that's been pushing efficiency updates, finetune innovations, pruning, and distillations—all the stuff Nvidia does incredibly well. Nemotron 3 Nano is a 30 billion parameter model with only 3 billion active parameters, using a hybrid Mamba-MoE architecture. This is huge. The model achieves 1.5 to 3.3x faster inference than competing models like Qwen 3 while maintaining competitive accuracy on H200 GPUs. But the specs aren't even the most exciting part. NVIDIA didn't just dump the weights over the wall.
They released the datasets—all 25 trillion tokens of pre-training and post-training data. They released the recipes. They released the technical reports. This is what "Open AI" should actually look like.

What's next? Nemotron 3 Super at 120B parameters (4x Nano) and Nemotron 3 Ultra at 480B parameters (16x Nano) are coming in the next few months, featuring their innovative Latent Mixture of Experts architecture. Check out the release on HuggingFace.

Other Open Source Highlights

LDJ brought up BOLMO from Allen AI—the first byte-level model that actually reaches parity with similar-size models using regular tokenization. This is really exciting because it could open up new possibilities for spelling accuracy, precise code editing, and potentially better omnimodality since ultimately everything is bytes—images, audio, everything.

Wolfram highlighted OLMO 3.1, also from Allen AI, which is multimodal with video input in three sizes (4B, 7B, 8B). The interesting feature here is that you can give it a video, ask something like "how many times does a ball hit the crown?" and it'll not only give you the answer but mark the precise coordinates on the video frames where it happens. Very cool for tracking objects throughout a video!

Mistral OCR 3 (X)

Mistral also dropped Mistral OCR 3 this week—their next-generation document intelligence model achieving a 74% win rate over OCR 2 across challenging document types. We're talking forms, low-quality scans, handwritten text, complex tables, and multilingual documents. The pricing is aggressive at just $2 per 1,000 pages (or $1 with Batch API discount), and it outperforms enterprise solutions like AWS Textract, Azure Doc AI, and Google DocSeek. Available via API and their new Document AI Playground.

🐝 This Week's Buzz: Wolfram Joins Weights & Biases!

I am so, so hyped to announce this. Our very own co-host and evaluation wizard, Wolfram RavenWlf, is officially joining the Weights & Biases / CoreWeave family as an AI Evangelist and "AIvaluator" starting in January! Wolfram has been the backbone of the "vibe checks" and deep-dive evals on this show for a long time. Now, he'll be doing it full-time, building out benchmarks for the community and helping all of us make sense of this flood of models. Expect ThursdAI to get even more data-driven in 2026. Match made in heaven! And if you're as excited as we are, give Weave a try, it's free to get started!

Voice & Audio: Faster, Cheaper, Better

If 2025 was the year of the LLM comeback, the end of 2025 is the era of Voice AI commoditization. It is getting so cheap and so fast.

Grok Voice Agent API (X)

xAI launched their Grok Voice Agent API, and the pricing is aggressive: $0.05 per minute flat rate. That significantly undercuts OpenAI and others. But the real killer feature here is the in

    39 min
