Latent Space: The AI Engineer Podcast

Latent.Space

The podcast by and for AI Engineers! In 2025, over 10 million readers and listeners came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. We strive to give you everything from the definitive take on the Current Thing to the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from OpenAI, Anthropic, Gemini, Meta (Soumith Chintala), Sierra (Bret Taylor), tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space

  1. 17h ago

    Red-Teaming after Mythos — Zico Kolter & Matt Fredrikson, Gray Swan

    AI Engineer World’s Fair regular bird tix will sell out ~today! Join us next week ahead of the Late Bird price hike and get >$40,000 in sponsor credits for attending! Thanks to the US Government issuing an export control directive on Mythos and Fable, the risks of jailbreaks and (industry term) indirect prompt injection are suddenly the talk of the town, though we have been covering AI security for a few years now, from Hackaprompt to the enigmatic Pliny the Elder. Zico Kolter, member of OpenAI’s board of directors on the Safety & Security Committee, and Matt Fredrikson, CMU professor and CEO of Gray Swan, co-authored the definitive paper on Indirect Prompt Injections, and Gray Swan were cited authorities on the Mythos model card, directly investigating the exact capabilities that are under scrutiny right now: We seized the opportunity to ask them the state of AI Red Teaming, and Shade, the adversarial red teaming tool that Anthropic used to evaluate the robustness of their models against prompt injection attacks in coding environments. Shade is part of their overall toolkit covering Simon Willison’s Lethal Trifecta, including Cygnal, an AI guardrails product, and the world’s largest AI Red Teaming Arena, including AIRT celebrity Wyatt Walls. All of this security tooling, and yet, we’re only staving off the inevitable. The risks of extremely smart AI increasingly feel like gray swan events: an event that everyone can see coming. In this episode, Gray Swan cofounders Zico Kolter and Matt Fredrikson join swyx to explain why AI security is not just “cybersecurity with AI,” why agents introduce a new class of vulnerabilities, and why the next major AI incident may be a gray swan: unlikely, but clearly visible before it happens. We go deep on prompt injection, automated red teaming, model robustness, agent identity, computer-use agents, enterprise guardrails, and the emerging AI insurance/compliance stack. 
Zico and Matt also explain why frontier models are not automatically safer as they scale, why specialized red-teaming models can now beat humans at breaking AI systems, and why the future of AI security may depend on AI systems attacking, defending, and interpreting other AI systems. We discuss: * Why AI systems need a different security mindset from traditional software * How prompt injection creates a new exploit class for agents like Codex and Claude Code * Gray Swan Arena and the rise of community red teaming * Shade: AI that can outperform humans at breaking models * Why LLMs are an alien form of intelligence that fail differently from humans * Human vs browser-agent robustness and why humans ranked fourth * Why eval awareness and capability elicitation matter * Cygnal: Gray Swan’s guardrail model for policy enforcement * Why bigger models do not automatically become more robust * The lethal trifecta: untrusted data, private data, and exfiltration * Why “just prompt it better” is not enough for enterprise AI security * OpenClaw, computer-use agents, and the agent security nightmare * Agent-native identity, permissions, and enterprise deployment * Why AI security may become part of insurance and compliance * Why the first major AI prompt-injection breach may be inevitable Gray Swan * Website: https://www.grayswan.ai/ Zico Kolter * X: https://x.com/zicokolter * Website: https://zicokolter.com/ * LinkedIn: https://www.linkedin.com/in/zico-kolter-560382a4/ Matt Fredrikson * Website: https://www.mattfredrikson.com/ * LinkedIn: https://www.linkedin.com/in/matt-fredrikson-7596349/ Timestamps 00:00:00 Introduction 00:02:31 Why AI Security Is Different 00:06:38 Testing Claude, Codex, and Prompt Injection 00:07:47 Gray Swan Arena and Automated Red Teaming 00:11:14 AI That Breaks Models Better Than Humans 00:14:00 LLMs as Alien Intelligence 00:19:00 Humans vs AI Agents 00:24:35 Red Teaming, Jailbreaks, and Capability Elicitation 00:26:11 Cygnal: Guardrails for AI Agents 
00:34:04 The Lethal Trifecta 00:39:31 Can AI Automate AI Research? 00:45:47 OpenClaw and the Computer-Use Security Problem 00:50:44 Agent Identity, Permissions, and Enterprise AI 00:54:24 The Future of AI Security 01:00:30 AI Insurance and Compliance 01:04:32 The Gray Swan Event Everyone Sees Coming 01:06:04 Closing Thoughts Transcript Introduction: Gray Swan, AI Security, and CMU Swyx [00:00:00]: We’re here in the studio with Gray Swan, Matt and Zico. Welcome. Zico [00:00:08]: Great to be here. Matt [00:00:09]: Thanks for having us. Swyx [00:00:10]: You’re visiting from Pittsburgh? The home of all good computer science. I don’t know if I’m overstating things. A very strong university. Zico [00:00:18]: CMU has been the center of a lot of AI since really the dawn of the field. Swyx [00:00:22]: Especially a lot of self-driving and some language learning. Congrats on your Series A. You’re here because you’re attending Snowflake Summit, and Snowflake is one of your investors. Let’s introduce crisply at the top: what is Gray Swan, and what have you chosen as your startup domain? Matt [00:00:42]: At Gray Swan, our mission is to empower everyone to use AI safely and securely. Large language models are software, and if you want to deploy them or build applications on top of them, you need to understand the vulnerabilities and what can go wrong. That includes everyday mistakes, like an agent making the wrong tool call, but also worst-case scenarios where an attacker has an incentive to make your agent misbehave, leak data, or steal credentials. Gray Swan grew out of our research at Carnegie Mellon, where Zico and I have spent over a decade studying new vulnerabilities and attack surfaces in deep learning systems: how to test for them, understand their severity, and make inference more robust. Adversarial Examples and Why AI Security Is Different Swyx [00:02:05]: Honestly, a very fruitful area of study for any academic. 
Throwback, this is 10 years ago, which is basically the entirety of me. I got a lot of inspiration from Ian Goodfellow, a friend of the pod, and this is one of those initial adversarial settings. Matt [00:02:23]: This paper was directly inspired by Ian’s work. Swyx [00:02:29]: Zico, what about your side of the story? Zico [00:02:31]: Like Matt, I have been faculty at Carnegie Mellon for a while. Fundamentally, we believe in the transformative power of AI. It has already transformed the software ecosystem, and it will transform many other ecosystems going forward. The issue is that these systems behave very differently from the software we are used to. I do not just mean that AI can find vulnerabilities in software, though it can. I mean that AI systems have inherent vulnerabilities of their own. They can be tricked in ways people can be tricked, so you need a different security mindset. Zico [00:03:23]: This matters especially when there is the possibility of correlated failures. It is not just that there are many AI systems out there; it is that everyone is using a few models. If you find vulnerabilities in agents that everyone uses, like Codex and Claude Code, you have a new class of exploit. The labs are doing a lot of work here, but when a new platform emerges, a separate security system often emerges alongside it. That is where we are with AI: there is a need for specifically minded AI safety and security providers, and the demand is only going to grow. Treating Models as Untrusted Systems Swyx [00:04:55]: I want to highlight right at the top that this is not a cyber episode in the traditional sense. A lot of people looking at the title might think that, but you’re actually trying to treat these models inherently as untrusted entities? Zico [00:05:11]: Exactly. This is a common conflation because AI is also good at cybersecurity problems, both solving them and causing them. But AI systems themselves introduce new vulnerabilities. 
Gray Swan is not about using AI to make your cyber infrastructure better; it is about understanding and mitigating the security risks you bring in when you adopt and deploy AI. Matt [00:05:49]: A big part of that is how people are using artificial intelligence. Once you build entire autonomous systems on top of models and integrate them into your larger platform or network, you have a potential cybersecurity risk. The goal is to mitigate the risk posed by the AI as it relates to your broader cybersecurity goals.
Testing Claude, Codex, and Indirect Prompt Injection
Swyx [00:06:17]: Part of this is red teaming. One reason we reached out to you was that you were involved in the Claude Mythos preview, where you were one of the authorities on IPI, or indirect prompt injection. When you receive a model (it does not have to be Mythos, but that is the most prominent one right now), what do you do with it? Matt [00:06:38]: We do a range of things. In the Mythos case, the concern from Anthropic was how robust the model is to indirect prompt injection. If you operate a coding agent and use Mythos as the model, it will fetch untrusted content and read text you do not control. How robust will it be at staying true to its original objective and not getting hijacked? We also help frontier labs test their safeguards for issues like cyber misuse. Broadly, we provide adversarial safety and security evaluations so model builders can assess progress from one iteration to the next. Swyx [00:07:37]: They also do this in-house, and Anthropic is very ideologically inclined to do it. What do they choose to outsource versus keep in-house?
Gray Swan Arena and Automated Red Teaming
Matt [00:07:47]: There are two things that I think we stand out for. One is the Gray Swan Arena. We operate a community of red teamers and provide prize challenges. A lot of these come from the needs of the lab sponsors,
so to an extent we gamify red teaming objectives, put up a prize pool, and pay people when they find ways to circumvent and violate whatever the safety and security objectives of the model developers wer

    1h 6m
  2. 4d ago

    The Professor of Outputmaxxing — Anjney Midha, AMP

    Last 4 days before regular tickets sell out at AI Engineer World’s Fair - this is the single biggest gathering of AI Engineers, Founders, Leaders, and Researchers in the world. Attendees get >$5000 worth of sponsor credits and talk tracks are looking FANTASTIC. Join us! The AI scaling debate always focuses on the question of “how do we get more GPUs?” but the better question may be: how do we make the most of ones we already have. The fact that a frontier lab like xAI could be running at sub-10% MFU (Model FLOPs Utilization) is just a hint at what the real problem may be. For context, older frontier-scale training runs were already much higher than 10%. GPT-3 was around 21% MFU. Gopher was around 32%. Megatron-Turing NLG was around 30%. PaLM reached around 46%. And our guest Anjney says best-in-class MFU today is closer to 60–70%. It’s not necessarily that xAI is uniquely incompetent (it’s clear they have talented folks) but rather the priorities may be flipped in the GPU arms race. While GPU access is a bottleneck, simply increasing CapEx won’t automatically translate to better models as frontier AI is increasingly a systems problem: scheduling, utilization, networking, kernels, frameworks, data pipelines, parallelism, cluster reliability, and the thousand small decisions that determine whether your theoretical FLOPs become real training progress. From building Discord’s developer platform and backing frontier AI companies like Anthropic, Mistral, Black Forest Labs, and Periodic Labs to now building AMP’s independent compute grid, Anjney Midha has spent years close to the real bottlenecks of AI scaling. In this episode, Anjney joins swyx at Periodic Labs to unpack why the AI race is not just about buying more GPUs, why 95% utilization would have been considered an outage at Google, and why the next era of AI infrastructure has to be more aligned, more efficient, and more responsible. 
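To make the MFU percentages above concrete, here is a minimal sketch of how the ratio is typically estimated. It assumes a dense transformer and the common approximation of about 6 FLOPs per parameter per token for training (forward plus backward, ignoring attention). The function names and the cluster numbers are hypothetical and chosen only for illustration; they are not from the episode or from any lab's actual setup.

```python
def model_flops_utilization(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Estimate MFU as achieved model FLOPs/s divided by peak hardware FLOPs/s.

    Assumption: dense transformer training costs roughly 6 FLOPs per
    parameter per token (forward plus backward), ignoring attention FLOPs.
    """
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    peak_flops_per_second = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


def idle_gpu_equivalents(n_gpus, mfu):
    """GPUs' worth of peak compute that is not turning into model FLOPs."""
    return n_gpus * (1.0 - mfu)


# Hypothetical cluster: 1B-parameter model, 50k tokens/s,
# 8 accelerators with 1e14 peak FLOPs/s each (illustrative numbers only).
mfu = model_flops_utilization(1e9, 5e4, 8, 1e14)
print(f"MFU: {mfu:.1%}")
print(f"Idle GPU-equivalents: {idle_gpu_equivalents(8, mfu):.1f}")
```

Read this way, a cluster at 10% MFU is doing the useful work of about a tenth of its hardware, which is why the gap between 10% and the 60–70% range quoted above matters more as clusters grow.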
We go deep on AMP’s vision for a compute grid that makes FLOPs flow like megawatts, the difference between full-stack AI labs and horizontal pooling, why AI data centers need community buy-in, and how compute markets could evolve into something closer to an independent system operator. Anjney also explains why DeepMind’s unpublished research points to a market failure, why end-of-life prediction remains one of the most important AI applications he has thought about for fourteen years, and why “output maxing” may become a new discipline for frontier systems. We also discuss Anthropic’s culture, why “luck favors the prepared mind” in coding models, how Claude cracked coding, why too much capital too early can make AI labs fragile, what Periodic Labs is trying to do with science and superconductors, why great researchers can become great CEOs, and why Silicon Valley is both deeply missionary and deeply mercenary. We discuss: * Why 95% utilization was considered an outage at Google * Why AI infrastructure waste compounds at frontier-lab scale * Why “move fast and break things” does not work for AI data centers * How data center backlash, power grids, and community incentives shape AI scaling * AMP’s vision for making FLOPs flow like megawatts * Why compute needs an independent system operator * How interruptible demand and dynamic prioritization worked inside Google * Why DeepMind research hoarding creates negative externalities * AMP’s 1.2GW base-load ambition and the need for 6GW of spike capacity * Why end-of-life prediction could become one of AI’s most important healthcare applications * Frontier Systems, output maxing, and full-stack alignment * Why APIs and abstraction layers become lossy as organizations scale * Superconductors, standards, and the dream of lossless systems * SF Compute, open protocols, and the future of compute marketplaces * Why non-NVIDIA chips can still benefit from NVIDIA’s reference architecture * Trust boundaries and why chip startups 
need visibility into future model architectures * Why VCs often underestimate researchers as CEOs * Scientists as star athletes of the mind * Why great CEOs need to be confrontational up and down the stack * Why leading the frontier matters more than “winning” * How Anthropic cracked coding * Why culture is fragile, not a permanent moat * Why hardship was a feature, not a bug, for Anthropic * Why Anthropic’s P0 was coding from day one * Periodic Labs, physics as the constraint, and technical reality * Silicon Valley mercenaries, missionary teams, and what happens after a breakthrough Anjney Midha * LinkedIn: https://www.linkedin.com/in/anjney * X: https://x.com/AnjneyMidha AMP PBC * Website: https://amppublic.com/ * X: https://x.com/amppublic Timestamps 00:00:00 Introduction 00:00:09 Why AI Compute Is Being Wasted 00:03:17 Responsible Infrastructure and Data Center Backlash 00:06:07 AMP Grid: Making FLOPs Flow Like Megawatts 00:12:41 Foundry, Frontier Labs, and Research Hoarding 00:14:42 Gigawatt-Scale Compute and End-of-Life Prediction 00:24:08 Frontier Systems, Output Maxing, and Alignment 00:27:38 Compute Markets, SF Compute, and Non-NVIDIA Chips 00:32:57 Trust Boundaries, Co-Design, and Researcher CEOs 00:38:17 AI Coachella and First-Principles Thinking 00:42:43 Leading vs Winning in Frontier AI 00:45:54 How Anthropic Cracked Coding 00:48:25 Culture, Hardship, and Anthropic’s P0 00:54:03 Periodic Labs, Physics, and Silicon Valley Mercenaries 00:56:26 Rishi Valley, Singapore, and Money as a Measure 00:58:47 Closing Thoughts Transcript Introduction: Anjney Midha, AMP, and Compute Waste Swyx [00:00:00]: We’re in Periodic Labs with Anjney Midha, CEO, founder of AMP. Welcome. Compute Utilization: Node Allocation, MFU, and Alignment Anjney [00:00:09]: Thanks for having me. At Google, there are two types of utilization usually, right? That you’re measuring in these clusters. One is node allocation, and then the other’s MFU. 
Node utilization is usually what percentage of cards in the data center are in use, and if it’s not at 95%- Swyx [00:00:29]: There is no excuse. Anjney [00:00:29]: There’s no excuse, right? At Google, which is where my co-founder, Seb, came from (he built the Borg, PBorg/GQM scheduler there), I think 95% was considered an outage, so 96% node utilization should be standard. And most single-tenant clusters are not running at that. So that’s one. And then for MFU, I would say the best in class today is somewhere between 60 and 70%. I think this is a leadership question, right? Fundamentally it’s an alignment question: are the people who are funding the cluster and the people deploying the cluster actually aligned? Sometimes theoretically they are, but in practice there are so many people in the supply chain, from the capital all the way to whoever’s managing the cluster and then whoever’s measuring what the output is, that they are many degrees of separation apart. Have you ever heard the radian metaphor? At the beginning of an arc, if you have two lines that are just off by a few degrees, that- Swyx [00:01:33]: It spreads out. Anjney [00:01:34]: It spreads out, right? At scale. And I think that’s what’s happening with a lot of cluster implementations and infrastructure at a lot of frontier labs and other teams: they initialize the plan, which is kind of like a North Star, with a team that wants to do good, but then they’re required to scale so fast, instead of iteratively, that the wastage just compounds really fast at scale. And so I think we know the answer, which is just do iterative bring-ups. If you spend time with people who’ve been in the semiconductor industry or the DSN industry for a long time, this is not new, and I don’t think AI should be an excuse. Sure, what is new?
We have a lot of new capabilities, but that doesn’t mean we should just abandon common sense. Common sense should always be in fashion. AI scaling doesn’t change that. In fact, if anything, AI scaling should be putting a premium on the value of common sense in infrastructure, because the margin of error now is so much lower and the costs of wastage are so much higher. And the cost of wastage, by the way, is not just economic. Obviously I’m an investor by background, over the last few years. Now we’re running an AI infrastructure business called AMP. And I think that it’s okay to say this time is different on the capabilities front. We are genuinely getting capabilities of a kind we haven’t had before. That doesn’t give you an excuse to say this time is different for everything, especially infrastructure. So look, I love the hacker mindset and the hustler mindset. That’s great for the startup mindset, but you remember this moment where Zuck went from saying, “Move fast, break things” to, move-
Responsible Infrastructure and Data Center Backlash
Swyx [00:03:10]: Fast and stable infrastructure. Anjney [00:03:11]: Move fast with stable infrastructure. I think now we need to move fast with responsible infrastructure. People are going to ask where the impact is. In our class yesterday, Scott Nolan, who’s the founder of General Matter, came by at Stanford to speak about energy bottlenecks. And he had a phenomenal idea. He said, “If you look at the marginal unit economics of compute per hour,” he goes, “let’s call it $4 an hour. If you’re having to bring up a new data center in a new community, why not just say we’re going to charge $4.50 an hour, and that marginal increase, we just literally take that and give it to the local community as cash?” I can tell you as a customer of that compute, I would love that. I’d be happy to pay an additional 50 cents per hour at scale. Swyx [00:03:57]: Wow.
Yeah. Anjney [00:03:58]: Because if that

    59 min
  3. 5d ago

    🔬 The Self-Driving Lab — Joseph Krause, Radical AI

    On the Science pod, we’ve been covering a lot of ground on how AI is revolutionizing STEM, but one of our favorite off-the-record topics since our launch is which field is hardest to accelerate: math, bio, or physics? Today we’re back in Materials Science land with Radical AI. Unlike biological molecules that can be represented (and predicted!) by token strings, the success of a material involves many more complex macro variables like supply chains, microstructures, and manufacturing processes. If you recall the LK99 drama of 2023, the basic ingredients were known, but part of the confusion came from the lack of disclosure around manufacturing, which defeated reproducibility. There is probably no "one-shot" model capable of designing a material that works perfectly at scale.
How Radical is accelerating materials discovery >10x the pace of DARPA/GE MACH
Joseph Krause is a materials scientist through and through. After spending his career watching industries stall out waiting for better materials, he founded Radical AI to do something about it. We recently sat down with Joseph to talk about Radical AI, materials discovery, self-driving labs, and the future of AI science. Joseph did not sugarcoat anything: accelerating the materials discovery pipeline is a hard problem. But it’s one that he strongly believes we need to invest in, for the future of consumer products, aerospace, computing, and defense, and to get new materials into everyday use: “We count it as a discovery when you pick up your phone and there’s a new material sitting inside of it.” How does Joseph plan on accelerating the rate of discovery? To answer this, it’s important to understand why this is such a hard problem in the first place. The first thing to keep in mind is that a manufactured material is far more than the chemical formula going into it. The process of mixing, annealing, growing, or generating the final material can result in wildly different outcomes.
The entire materials discovery process, from early discovery to large-scale manufacturing, needs to be understood and characterized.
The Self-Driving Lab
This philosophy has grown into a key insight at Radical AI: the construction of the self-driving lab (SDL). This lab is not just automated; it uses an “AI scientist” that combines scientific knowledge, computational techniques, and human intuition to generate and test hypotheses. Creating an AI scientist was key to making Radical’s self-driving labs work, since Joseph argues that no single AI model can one-shot materials. “In materials, the ground truth is the material itself. You have to be able to test it and characterize it.” Joseph talked at length about the self-driving labs at Radical, and he argues that experimental data is the true “moat” in this industry. An SDL functions as a closed-loop system where an AI scientist generates hypotheses and automated robotics synthesize and characterize materials, running research campaigns in parallel rather than serially. The successes here were on both the automation side and the science side. On the automation side, Radical has scaled their alloy discovery pipeline up to producing and characterizing 1200 alloys in six months, a nearly 10x speedup over the DARPA/GE MACH program that aimed to create 500 new alloys in a year. Joseph claims they can scale this up even more and estimates they can produce, test, and characterize a hundred new alloys in a day: a truly new paradigm in high-throughput alloy experimentation. On the science side, their AI scientist proposed and tested 300 new materials, ten of which were found to have novel state-of-the-art properties and are already being further developed for commercial applications. The robustness of this first materials campaign reinforces Joseph’s claim that the moat is the lab and data.
“It’s moved into elemental families or alloy families no one has ever published on before.” Interestingly, Radical’s AI scientist has made some novel discoveries, expanding into elements that simply had not been explored before. This is fascinating from a scientific perspective, but it’s also important for helping reduce supply chain bottlenecks for vital industries! Joseph spent a lot of time in D.C. before founding Radical, and he’s clear-eyed about the competitive threat. China’s centralized model lets it stand up manufacturing hubs and immediately scale new materials from lab to production. We can’t replicate that, and Joseph is very clear we shouldn’t try. But we do need an answer. For Joseph, that means transforming the scientific workforce, investing in self-driving lab infrastructure at the national lab level, and leaning hard into public-private partnerships. “Now imagine every scientist in the United States doing 10 times the research output. That’s fundamental. That just changes the trajectory of discovery.” Before we close, we’d like to give a shout-out to Joseph and Radical for publishing and open-sourcing much of their internal tooling pipeline. This includes:
* TorchSim (preprint, blog): an open-source PyTorch-based MD simulation framework, which has been spun off into its own non-profit.
* MATRIX/MATRIX-PT (preprint, blog): an open-source dataset for benchmarking autonomous self-driving labs (MATRIX), along with an open-source model based upon this dataset (MATRIX-PT). We could talk about this extensively, but a fun data point is that improving reasoning in the area of materials also improved reasoning for biological systems! This is a truly unexpected result.
Big shout-out to the Radical team for sharing their work! Materials discovery has been stuck on a 20–30 year timeline for generations. Joseph thinks that’s about to change, and Radical AI is putting that thesis to the test in the lab, one sample at a time.
We had a great time talking with Joseph. We hope you give it a listen! Timestamps * 0:00 Introduction to the challenges of AI in material science * 0:52 Welcome and introduction to Joseph Krause and Radical AI * 1:38 Why Radical AI is different: The focus on experimental data and Self-Driving Labs (SDLs) * 6:19 The process: Candidate generation, synthesis, and characterization * 11:05 The application of exotic alloys in extreme environments (aerospace and defense) * 13:20 Barriers to entry: The slow process of qualification and manufacturing * 16:06 Supply chain constraints in material science * 19:24 Human-in-the-loop: Training the AI using scientific intuition * 20:35 The engineering challenges of automating a laboratory * 23:17 Defining the “Self-Driving Lab”: Research campaigns vs. just automation * 24:39 Mechanical challenges: Handling high-temperature samples * 27:41 Future scaling plans and the “Vertical Integration” strategy * 30:08 Validation timelines for high-tech industries (semiconductors, aerospace) * 31:47 The active learning loop and handling “negative results” * 35:32 AI exploring elemental families beyond human bias * 39:13 Throughput targets and the difference between AI and human exploration * 43:52 Why the dataset size is less critical than the quality of experimental feedback * 46:20 Addressing the lack of an “AlphaFold” for materials * 53:49 War stories from the lab: Building the infrastructure * 58:12 The shift in industry sentiment toward SDLs and tool interfaces * 1:01:14 Geopolitical considerations and the race in material science innovation * 1:06:12 Calls to action for ML and AI engineers: Rethinking the scientific stack * 1:09:53 The Matrix model and using VLM for scientific knowledge extraction * 1:13:10 Why Radical AI is open-sourcing their work This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

    1h 17m
  4. Jun 4

    Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs

    The new AIEWF website is live! Get your tickets booked ASAP as they -will- sell out. Take the AI Engineering Survey and get >$2k in credits and free AIE WF tickets! Most industry benchmarks compress intelligence and reasoning ability into scores. SWE-Bench Pro, MMLU, Humanity’s Last Exam, etc. These metrics are useful, but don’t always represent the full extent of how a model performs in the real world. Some of the most interesting evals today look less like exams and more like operating businesses in the real world. One of which is Vending Bench. In Anthropic’s Mythos Preview System Card, Andon was the only third party eval to get their own section, observing increasingly concerning aggressive behavior: You don’t know what a model is capable of doing in the real world unless you actually give it inventory, a wallet, tools, customers, competitors, humans, & some time. More often than not, it’ll surprise you how much a model is capable of and in doing so, also reveal unexpected behavior: deception, context collapse, emergent coordination, & bizarre negotiation behavior. While an inflection point in personal agents came post-OpenClaw after full file access with bypass permissions became the norm, it is yet to come for agents in the real-world. However Andon Market, an actual in person store fully run and managed by AI, is paving the way for what is possible. Full Video Pod From Claude trying to call the FBI over a $2/day vending machine charge to AI agents forming price cartels, hiring human employees, running physical stores, and writing existential robot musicals, Andon Labs is stress-testing what happens when frontier models stop being chatbots and start acting in the real world. In this episode, Andon Labs cofounders Lukas Petersson and Axel Backlund join swyx and Vibhu to unpack the strange, funny, and genuinely concerning edge cases that emerge when agents run businesses over long horizons. 
We go deep on Vending-Bench, Project Vend, Vending-Bench Arena, Bengt, Butter-Bench, Luna, and Andon’s broader mission of building realistic real-world evals for autonomous AI systems. Lukas and Axel explain why dollar-denominated evals reveal things traditional benchmarks miss, how Claude ended up reporting its vending machine fees as cybercrime, why long context windows can drive agents into meltdown loops, what happens when agents compete with each other, and why the future of AI safety may depend on testing models in messy physical environments instead of clean benchmark sandboxes. We discuss: * Why Andon Labs started with dangerous capability evals and long-running agents * Vending-Bench and why running a vending machine is a deceptively hard AI benchmark * Why money-based evals avoid the saturation problem of traditional benchmarks * How Claude tried to call the FBI over a $2/day fee * Why long-horizon agents can spiral into existential and legalistic breakdowns * Project Vend: putting an AI-run vending machine inside Anthropic * Why real humans are “out of distribution” for simulated agents * Claudius, Seymour Cash, and the chaos of AI CEOs * How a human briefly became CEO of Claudius through a manipulated election * Why multi-agent systems can converge back into “helpful assistant” behavior * Bengt, Andon’s internal office agent with email, spending, terminal, phone, camera, and internet access * How Bengt traded Amazon purchases for face-recognition training data * Claude’s aggressive behavior, lies, refund avoidance, and price-cartel behavior in Arena * Why eval awareness may become the AI version of “are we living in a simulation?” * Blueprint Bench, spatial intelligence, and why models still misunderstand physical rooms * Butter-Bench and testing LLMs as robot orchestrators * Luna, the AI-run physical store with a three-year lease and human employees * The new Andon cafe in Sweden and why real-world geography matters for agent evals * Rotten tomatoes, 
perishable goods, and the hidden difficulty of running a physical business Lukas Petersson * LinkedIn: https://www.linkedin.com/in/lukas-petersson-181a83172/ * X: https://x.com/lukaspet Axel Backlund * LinkedIn: https://www.linkedin.com/in/axelbacklund * X: https://x.com/axelbacklund Andon Labs * Website: https://andonlabs.com * Vending-Bench: https://andonlabs.com/evals/vending-bench * Andon Vending: https://andonlabs.com/vending Timestamps 00:00:00 Introduction 00:01:00 Andon Labs and the Origins of Vending-Bench 00:05:21 Why Money-Based Evals Matter 00:09:51 Agent Harnesses and Self-Modifying Systems 00:13:36 Claude Calls the FBI 00:16:33 Project Vend: Claude Runs a Real Vending Machine 00:21:44 Seymour Cash, AI CEOs, and Election Chaos 00:27:16 Multi-Agent Coordination and Slack Observability 00:30:18 When Will Agents Run Real Businesses? 00:34:56 Bengt: Andon’s Internal Office Agent 00:40:06 Real-World AI Safety and Long-Horizon Traces 00:44:28 Lying, Refunds, and Price Cartels in Arena 00:52:42 Eval Awareness and Simulation Behavior 00:56:06 Blueprint Bench, Butter-Bench, and Robotics 01:04:37 Luna: The AI-Run Physical Store 01:09:29 The Sweden Cafe and Real-World Expansion 01:13:16 What Comes Next for Andon Labs Transcript Introduction: Andon Labs, Long-Running Agents, and Real-World Evals Swyx [00:00:00]: Welcome to Lukas and Axel from Andon Labs, and I’m joined by my favorite guest host for anything security, safety, alignment: Vibhu. Welcome. Lukas [00:00:15]: Thank you for having us. Axel [00:00:16]: Thank you. Swyx [00:00:17]: Let’s match names to voices. Maybe you wanna take turns introducing yourselves. Lukas [00:00:21]: I’m Lukas. Axel [00:00:22]: And I’m Axel. Swyx [00:00:24]: Let’s introduce Andon Labs a bit. How did you guys come together? You have different backgrounds, but you’re both Swedish. Was that a big part of it? Lukas [00:00:33]: So when I went to high school, there was this really cool guy who had a superpower. He could code. 
So he made like the or like the app for the, for the school and stuff, and he was super cool, and I wanted to be like him, and that was that guy. Axel [00:00:47]: I don’t know about this. Swyx [00:00:49]: But you went to different universities, right? Lukas [00:00:51]: But same high school. Swyx [00:00:52]: I see. Lukas [00:00:52]: So we always said, “Oh, once we graduate university, then we should start a company,” and that’s what we did. Swyx [00:00:58]: Wow, there you go. And about a year ago, you kinda burst onto the scene with Vending Bench, but, was there a thing before that was, kind of like the inception? From Dangerous Capability Evals to Vending Bench Axel [00:01:07]: So we did work, yeah, with, Anthropic was one of our, early customers in doing, evals. So we did, dangerous capability evals., nothing we published openly. But then we started thinking about doing some kind of, public benchmark, and one thing that we really started thinking about, was like running agents and specifically agents managing businesses., ‘cause-- and this was, early 2025., and I think the first, mentions of people will be running, person unicorns or even autonomous companies. So we thought, “Let’s make a benchmark of how well can an agent run the probably simplest business, possible,” and, that’s probably, running a vending machine. So that’s the first public one we did. And it was very, like-- there was almost no one that noticed it in the first couple of months, I think., so we released it in February last year, and then I think around Easter last year, we got, the first viral tweet about it, that someone else did. Lukas [00:02:11]: We tweeted a bunch, uh When it came out and, tried our best. Axel [00:02:15]: We tried. Vibhu [00:02:16]: It’s the one at Anthropic, right? Lukas [00:02:18]: So this Swyx [00:02:19]: This is a classic thing we should get out of the way. Lukas [00:02:20]: Exactly. There’s two versions. Swyx [00:02:22]: Everyone does this. Yes. 
Lukas [00:02:23]: There’s Vending Bench, which is the simulated one, which we did, completely independently in February., and then, like Axel said, that was like-- That was the thing that didn’t get any traction in the beginning, but then some random person made a tweet about it, and that Axel [00:02:38]: You have the paper Lukas [00:02:38]: That is the paper. Correct, yeah., and then since we thought this was very fun, we thought, oh, I think this is also, one thing with Andon Labs, the way we kind of like decide what to do next and what projects to do, it’s what is like the heuristic we use is what is fun? Is What would be a fun project? And doing this in real life sounded quite fun for us, and maybe also scientifically useful. So, then we basically had this idea, and then we, like-- But then we needed a place for it and, putting it out in the public would probably not really work., would get vandalized and stuff. So we pitched it to the people we were already working with at Anthropic, and they were “Yeah, you can have space. This sounds fun.” Um Swyx [00:03:21]: It’s like a small fridge, right? It’s like a mini fridge. Axel [00:03:23]: Absolutely. Swyx [00:03:24]: People-- There’s like a stripe thing or like an Vibhu [00:03:27]: Oh, okay. So it was very OG, the early days Lukas [00:03:28]: That’s the OG one. Yeah Vibhu [00:03:29]: IPad on this. We saw it in June, like two months after After it had been there. They upgraded a little bit. There’s a security camera for making sure you actually Venmo the thing. Swyx [00:03:40]: So, my impression, okay, we’re, we’re going straight into project Ven because it’s such a iconic thing. I do want to cover a little bit of that, the origin story even before Project Ven and even into Vending Bench. I think a lot of people are like yourselves, like smart, interested in future of AI, interested in developing evals. But how the hell do you just, walk into Anthropic’s doors and, work with them, right? 
What is What are they looking for? What works? And t

    1h 16m
  5. Jun 3

    🔬Scaling Past Informal AI - Carina Hong, Axiom Math

    In 2025, seven-month-old startup Axiom solved all 12 problems on the Putnam exam, a prestigious undergraduate math competition (8 of the 12 within the time limit). The 12/12 score is better than the top undergraduates (110/120) and the closest AI system that reported a result (DeepSeek 103/120), although it is unclear what the people and other systems would have scored with more time. Nonetheless, the Putnam exam is legendary for its difficulty, with the median score typically being 0 or 1 points. Taken by itself, this seems like a minor feather in the cap of AI: one of a long series of accomplishments by AI systems in elite competitions with humans, starting with Deep Blue beating Kasparov. Fast forward to mid-2026, and Claude Code is eating the world. In 2024 Anthropic’s bet on code and enterprise looked like a more pragmatic niche play vs. OpenAI’s better models and massive consumer scale. Today, Amodei’s all-in bet on acceleration via code (images and video be damned) seems prescient. Despite Anthropic’s growing momentum, however, Axiom CEO Carina Hong sees coding ability as a necessary but not sufficient milestone on the path to AGI. Code arguably pushes the jagged frontier to the point of superintelligence in some domains outside of coding, but there are surprising gaps that Carina believes will bottleneck AI progress. The informal bottleneck “Verified AI” sounds like eating broccoli (footnote: I actually love broccoli, but then again, I also believe strongly in Test Driven Development, so ¯\(ツ)/¯ ) and paying taxes, but to Axiom it means something very different. “Verification to me is about scaling brilliance, compounding brilliance,” Carina told us. It actually took a while for me to understand what she means by this. It sounded like marketing-speak to me, until it clicked. Carina emphasizes a story about legendary mathematician Srinivasa Ramanujan to illustrate the point. When G.H. 
Hardy finally persuaded Ramanujan to formally prove theorems instead of relying on his (formidable) intuition, it reportedly improved his own capabilities. This is presumably because formally proving things forced Ramanujan to articulate the details in a way that opened up new lines of thinking. This is one part of “compounding.” But formally proving things also allowed others to benefit from his intuition: the proofs are a way of communicating an intuition and persuading others that the intuition is correct. This is scaling (more people use the result) and compounding (people can learn from and build on his work). This is the analogy that Carina wants us to focus on. Verified Generation There are two ways that Verified AI shows up: in training and in inference. But a quick detour: to a first approximation, “Formal Verification” means using type checkers (like those for TypeScript, C++ or Rust, but more capable) to verify mathematical proofs that are meticulously specified using a language like Lean (footnote: Formal verification also includes model checking (TLA+, SPIN), SMT-based tools (Dafny, F*, Why3), and refinement-type systems (Liquid Haskell), many of which don’t look much like “type checking a proof” from the user’s perspective even when there’s a similar logical core underneath. It also gets applied to software and hardware correctness, not only pure mathematics.). It takes a lot of work to translate an “informal” proof (albeit one that most people would not remotely call “informal”) into a Lean proof (footnote: This is an understatement. Most theorems remain informal because formalization is so hard to do. There has been a great deal of effort to formalize the most important proofs, with mixed results). You can imagine how this would be (very) useful during Reinforcement Learning: instead of relying on best guesses based on statistics (GRPO, RLHF, etc.), you can just verify the proof is correct using a Lean verifier. 
This is obviously a much stronger reward signal, akin to compiling code and testing it (which is what is typically done with RL on coding). The catch: LLMs are not (currently) very good at proving things with Lean. Enter Axiom: While they have not officially reported benchmark numbers besides the 12/12 Putnam result, Carina reports that they have achieved a very impressive 99% (187/189) ProofGen on the Verina benchmark. This benchmark asks a system to generate code and a proof of correctness for a series of problems. For context, OpenAI o3 (the last known OpenAI run) achieved 4.9% on this benchmark. Based on the sparse benchmarking, it’s hard to say what the frontier labs are currently doing, but Carina suggests that they still are not training to generate Lean proofs directly, instead relying on informal proofs. Time will tell if the frontier labs’ current approaches will close this gap. Scaling and compounding Carina’s Ramanujan analogy is pretty direct. Better proofs → better Lean generation → better RL. A stronger signal means higher sample efficiency and higher maximum performance. Great! Scaling is pretty clear too: once I have proved something in Lean, the quality of the output is basically (footnote: one might argue that it’s a bit lower because the proof is in distribution for the LLM) as high as if it came from a human, so my high-quality training set has grown in a way that an informal rollout corpus cannot. I can trust my Lean proofs. Compounding is also clear: now all future inference and training can build upon those proofs. On the other hand, a model trained only using statistical signals like GRPO during RL lacks the sample efficiency, maximum performance and compounding corpus that a system using formal verification benefits from. All roads lead to verification Broccoli and taxes notwithstanding, “verification” has shown up in a lot of conversations recently. 
In physical system control: “I think [verifiability] is probably the hardest problem right now, because as the models get better, it can be harder and harder to find the faults on the system. And so the problem of doing proper eval to find those faults, that problem also keeps getting harder as the models get better.” - In theoretical physics: “…now that we’re in this regime where you can just get ChatGPT to tackle thousands of questions at the same time, it will return proofs for a significant fraction of them. Now actually the onus is back on the humans to verify all the outputs. And so, yeah, as that becomes a bottleneck, I think formalizing math and automating verification will become more valuable.” - Verification is, in fact, the key difference between AI for science and AI for computation: in science you have to actually test (verify) your hypothesis by performing physical experiments. Lab-in-the-loop systems like Radical AI and Lila build around exactly this premise (we have recorded episodes with both of these teams and will release them soon!) And yes, formally verifying critical systems such as flight control, nuclear power plants and pacemakers is a growing focus as the software and hardware that run them become more complex. Carina believes so strongly that AGI requires verified generation that she makes the unqualified claim that “We do not believe there is any other possible future.” Expensive to produce, cheap to verify Lean proofs are hard to generate, but they can be easily shown to be correct or incorrect. But how do you know that the proof you created maps correctly to the problem you care about? As Carina puts it: “Anything that can be specified can be proven. Humans are bad at specifying everything we want.” Are we now in the specification business? 
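To make the "cheap to verify, hard to specify" point concrete, here is a minimal Lean 4 sketch. It is my own illustration, not something from Axiom or the episode, and the names `add_comm_nat`, `badMax`, and `badMax_ge_left` are hypothetical. The first theorem shows what "cheap to verify" looks like: Lean's kernel mechanically checks the proof. The second shows the specification gap: a deliberately wrong "max" function still satisfies a property that is too weak, and Lean happily accepts the proof.

```lean
-- Cheap to verify: Lean's kernel checks this proof mechanically.
theorem add_comm_nat (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- An intentionally wrong "max" (hypothetical, for illustration only).
def badMax (a b : Nat) : Nat := a + b

-- An under-specified property: it only asks that the result be at least `a`.
-- The proof checks, yet `badMax` is clearly not a maximum.
theorem badMax_ge_left (a b : Nat) : a ≤ badMax a b := by
  unfold badMax
  exact Nat.le_add_right a b
```

The verifier only guarantees that the proof matches the statement. Whether the statement captures what we actually wanted (here, that the result also equals `a` or `b`) is still a human problem.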
Check out the episode to hear Carina’s take, as well as: * Why hardware verification is a killer app * Details on the AXLE open API and recently released Discovery toolkit * The Erdos debacle * The OpenAI GPT-f diaspora This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

    1h 33m
  6. Jun 3

    ⚡️Satya Nadella: No Priors x Latent Space Crossover Special at Microsoft Build

    We’ve informally heard that Satya has been a listener of LS for a couple of years now, but it was still absolutely surreal to meet him and do a live pod at Build, together with our friends at No Priors, the leading VC AI Podcast that we also greatly admire! We covered the MAI model technical takeaways on yesterday’s AINews, so I will focus our recap of Satya’s main messages around three elements: * Satya’s adaptation of the Bill Gates Line for positioning Microsoft as the Frontier Intelligence Platform: customers must gain much more value from the Microsoft ecosystem than Microsoft itself, by building on multi-model harnesses like OpenClaw and Scout, drawing on the full enterprise context exposed by context layers like Work IQ (heavily dogfooded by his C-suite), and building up private evals and traces as a new form of Token IP * AI ROI: On one hand, enterprises are having difficult conversations around Tokenmaxxing and Layoffs, and on the other hand, there are serious re-evaluations of the End of SaaS since the Build vs Buy equation has changed so much. Our previous SemiAnalysis guest had… interesting comments on Microsoft’s position on this as the ur-SaaS titan, and Satya had great answers * Making the Impossible Possible: Kevin Scott’s inspiring framing of the most ambitious version of applying AI, and technology at large, to business and social problems such as education and social impact. Enjoy! Transcript Voiceover: Welcome swyx, Sarah Guo, Elad Gil, and Chairman and Chief Executive Officer of Microsoft, Satya Nadella. Sarah Guo: Welcome to a crossover episode of No Priors and Latent Space with Satya Nadella. Um, congratulations on an amazing Build. No, thank you so much, and it’s great to be with both of you. I listen to both of you or b- both the podcasts all the time. It’s great to be on it. Thank you so much. 
[00:01:00] So you’re just talking about, um, these amazing, uh, announcements from across the Microsoft estate all morning for, I think, three hours. What is the, uh, what’s the most important reflection or takeaway you have? AI as an Ecosystem Platform Sarah Guo: I, I’d say there are, uh, perhaps the, the biggest one for me is let’s sort of conceptualize this more as an ecosystem play as opposed to a single model or even a single platform, right? Satya Nadella: I mean, you know, whatever I... At least for me, having grown up at Microsoft, having seen, whatever, four major platform shifts, uh, I sort of fall into that, um, uh, camp where a platform is defined by fundamentally its ability to create more value about the platform versus what’s captured in the platform. And so if you, you view what’s happening right now, I think this morning’s keynote was how can any company, whether it’s an AI native company or a traditional enterprise company, participate as a first-class participant where they can point to AI they created, [00:02:00] right? It’s not that they don’t use other people’s AI. Of course they will. But to me, what’s the path? What’s the recipe? How do I do it? What does a stack look like? What does the tooling look like? What is valuable? How do you do that? That’s it. That’s sort of our job to do. Yeah. Ecosystem strategy is, uh, very complicated, right? Sarah Guo: Because you end up building certain components, partnering for certain components, supporting them. You just announced this big suite of models. Like, tell us a little bit about the, uh, training strategy for Microsoft now. Yeah. MAI Models & Training Strategy Sarah Guo: So, so the thing that we wanted to do with the MAI models was to build, and as Mustafa talked about, first of all, a great lineage, right? 
Satya Nadella: Starting with pre-training, uh, with very good data quality, uh, doing all the ablations, making sure because in, in some sense it’s becoming even harder to build a clean lineage model just because there’s so much stuff out there, uh, that you truly need to ablate out to be able to have a fantastic [00:03:00] pre-trained model. In fact, that’s one of the challenges of a lot of the open weight models is they look great on one benchmark or two, but they’re not great on practice. So that’s why, in fact, even in the RFDEs are, they, they are pretty gone really excited about these MAI models because how the heck can a small five B model hill climb? Uh, and it goes back a little bit to what I think is ultimately the key thing to do, which is try to pursue finding that cognitive core. Uh, so to me, starting with a clean lineage- Then creating that ability for companies to be able to use this, right? Not just as a generalist, but to create their own specialist by building this hill climbing scaffold around it, right? So it’s not just the model, but you have a hill climb scaffold around it, then you will start building your RLE. You will start collecting the traces. Most importantly, you’ll have private evals because we know all the evals out there are good, interesting, [00:04:00] but they’re not really that critical- They’re work, yeah Swyx: at this point because they all can be maxed. And so the point is each company will have its own private eval. And so that end-to-end platform story around our models is sort of, uh, what I think is interesting. And then the one other thing, Sarah, since you brought that up, is I do feel there’s a new frontier. Satya Nadella: Like people talk about the frontier and are you operating at the frontier. Um, interestingly enough, if you add a little temporality to it, you can use, let’s say, in, in, in fact, the, the Lando Lakes demo we showed was pretty cool. We used, whatever, GPT-55, right? 
Then you collected a bunch of traces, and then you took a 5B reasoning model and achieved higher. Sarah Guo: Uh, so that is another aspect of what it means to appear... uh, you know, operate at the frontier. Yeah. I, I think, uh, I first of all have to congratulate you on basically building a frontier neo lab inside of Microsoft in two years. Um, I’m wondering, you know, you have all this AI strategy that you’re rolling out. Lessons from Two Years of AI Development Swyx: I’m wondering, what do you know now that you wish you would tell yourself two years ago where- or two or [00:05:00] three years ago? Three years for the Jensen partnership, two years for, uh, MAI. Yeah, I mean, I think the, the thing when, that I reflect quite a bit, right, which is sort of obviously I got into all this when I got excited by the, the scaling laws paper and, you know, when, you know, even the OpenAI partnership came about when those folks said, “Hey, we’re gonna really throw a lot of compute at transformers.” Satya Nadella: Uh, and they’ve helped. I- the thing that I always look back and say, “Wow, these things, uh, do have capability that they’re climbing up.” W- I mean, this, you know, this crude way of saying it is intelligence is log of compute kind of works. Now what I think we underestimated perhaps is the real-world complexity of deploying these so that they actually deliver the value in the real world, right? So the outcomes as measured by any benchmark is interestingly important, but the true eval is when people out there are able to do unique things that they only can value, and it’s very [00:06:00] measurable, right? That I wish we had sort of even, like, had more in our consciousness, right? Which is as an industry. Sarah Guo: Because right now I think when people say, “Wow, I don’t want to tokenmax,” it’s an artifact of us not having thought ourselves as an industry that we are using tokens to create value every step of the way. 
So I think that’s kind of what I wish we had gotten there, but I’m glad we are here. Real-World Value & Use Cases Sarah Guo: What are some of the use cases that you’ve seen that have created the most value for your customers? Because I know that people talk a lot about code, and I think it’s pretty clear that that’s something that’s having very large scale impact. Are there other areas that you find in common that your customers are really benefiting from? Yeah. I think, yeah, to your point, obviously coding is now got... But it’s interesting, by the way, Elijah, to even talk about the coding, right? Satya Nadella: Which is coding has worked so well that we now have to rebuild the IDE, right? I mean, it’s kind of nuts to see what we sh- launched is like, oh my God, I have these hundred agent sessions. I... The cognitive load it transfers back to me as a human is so [00:07:00] excessive that now I need a new UI. Uh, oh, by the way, I, like the, the chat as the only artifact was also impossible, so that’s why we need a canvas. So it’s kind of interesting for all the things about where is software needed or where is UI needed, uh, you kind of need that even for code, right? In a fully agentic world. But that said, one of the things that we are starting to see, we started seeing with co-work, but even some of the work we, we showed with auto com- uh, um, autopilot Right on what you see with claws is a good one because if you sort of think about a lot of human capital is doing the glue work, right? If you now can augment that with tokens/agents that are long-running, durable, right, then your ability to scale even what is still judgment and glue work gets amplified like coding does. Uh, so you can... Like, I’m positive that six months from now we’ll all be saying, “Oh, wow,” like, all through ni- the night there was a bunch of stuff that [00:08:00] all these autopilots that I have working on my behalf with my delegated authority, so to speak, right? I can... 
Sort of given even my identity, did a bunch of work, then of course I’ll need my new ADE to say, “Well, what did you do?” Like, I might... “Did I do this work?” And so on. So I think that that’s where compressing of workflows, uh, completing of tasks, uh, that’s where I think

    39 min
  7. Jun 2

    GitHub's plan for Agents — Kyle Daigle, GitHub

    I’m excited to work with Microsoft once again as the presenting sponsor of the AI Engineer World’s Fair! We’ll be streaming live from MS Build today for a special crossover pod with our friends at No Priors and the one and only Satya Nadella. However, we did not hold back with this interview: we asked all the burning questions about uptime and Copilot that we know you have in your minds. Let’s go! For almost two decades, GitHub has been the home of software, where both open and closed source flow through commits, pull requests, reviews, actions, etc. This ecosystem flourished as open-source maintainers and contributors continued shipping code for the benefit of the community. However, as coding agents began to ship mass quantities of code (growing 1400% in 2026), it marked a new era that was both extremely exciting and challenging for GitHub. While these agents help more people ship more projects, they also significantly increase the floor of how much code is shipped, how often it is shipped, and how many people commit code, with order-of-magnitude multiples in basically every dimension of GitHub infrastructure. GitHub now inevitably experiences more pressure on its infrastructure, which was originally designed around human developers moving at human speed. This has resulted in a very publicly notable uptime story. So it raises the question of whether current systems around code can absorb what AI produces. Can CI/CD keep up when every idea becomes a build? Can open source maintainers survive floods of AI-generated slop contributions? Can GitHub preserve the human social contract of software while becoming the operating layer for agents? Which brings us to the perfect person to answer these questions: GitHub COO Kyle Daigle. In this episode, he joins swyx to unpack what happens when AI doesn’t just autocomplete code, but starts changing how companies operate, how open source works, how pull requests get reviewed, and how GitHub itself has to scale. 
We go deep on GitHub’s internal AI workflows: micro-skills, WorkIQ, MCP, Slack, Teams, email, Copilot workflows, the new Copilot desktop app, CLI, cloud agents, and how Kyle uses agents to look backwards across company context before deciding what to do next. Kyle also reflects on GitHub’s history building webhooks, APIs, Actions, npm, Dependabot, and Semmle, why the AI era is breaking GitHub in new ways, how Actions became a general-purpose compute layer, and what Copilot becomes after code completion. Full Video Pod We discuss: * Kyle’s expanded role across GitHub * How AI got Kyle coding again after years in leadership * Why GitHub rolls out AI through existing workflows instead of forcing new tools * WorkIQ, MCP, Slack, Teams, email, and GitHub as company context * Why massive “mega-skills” are giving way to small, atomic micro-skills * How AI changes summarization, communications, marketing, and analyst work * Why former developers in leadership may have a unique advantage in the AI era * Kyle’s “15 agents on Saturday” workflow * How Kyle built an AI-generated executive presentation for CRO/CFO teams * Why AI changes the chief of staff role without removing the human work * GitHub Actions, webhooks, arbitrary code execution, and secure agent compute * The npm acquisition, supply-chain security, 2FA, and token invalidation * Slop forks, vendoring, and whether AI agents change dependency management * What pull requests become when most PRs come from agents * Prompt requests, vouching, AI review, and trust in open source * What counts as a “developer” when AI lowers the barrier to building * GitHub Spark, low-code, and why GitHub refuses to hide the code * 14x commit growth, Actions load, databases, monorepos, and availability * Copilot’s evolution from completion to CLI, desktop app, cloud agents, and SDK * Context, memory, rules, and making GitHub “act like Kyle wants it to act” * Ambient AI, OpenClaw, enterprise security, and the new operating system for 
agents * What swyx should ask Satya Nadella about Microsoft’s AI future Kyle Daigle * LinkedIn: https://www.linkedin.com/in/kyledaigle * X: https://x.com/kdaigle Timestamps 00:00:00 Introduction 00:03:36 Why AI Got Kyle Coding Again 00:07:04 Running GitHub with AI: WorkIQ, MCP, Slack, Teams, and Skills 00:15:39 The Golden Age for Former Developers in Leadership 00:17:31 15 Agents on Saturday and AI-Generated Executive Work 00:20:20 How AI Changes the Chief of Staff Role 00:21:45 GitHub’s History: Actions, npm, Webhooks, and Open Source 00:28:45 Slop Forks, Vendoring, and AI Dependency Management 00:33:57 Pull Requests, Prompt Requests, and Trust in Agent-Generated Code 00:41:21 GitHub Stars, 200M+ Developers, and the New AI Builder Wave 00:45:15 GitHub Spark, Low-Code, and Why GitHub Still Shows the Code 00:47:38 GitHub’s Hardest Era: 14x Growth, Reliability, and Scale 00:59:21 Actions as the Compute Layer for CI/CD and Automation 01:02:04 The State and Future of GitHub Copilot 01:08:24 Ambient AI, Background Agents, and the Future of the SDLC 01:13:09 OpenClaw, Enterprise Security, and the New OS for Agents 01:18:03 Build Announcements, WorkIQ, FoundryIQ, and Microsoft Context 01:21:41 What Should swyx Ask Satya? Transcript Introduction: Kyle Daigle’s Expanded Role at GitHub and Microsoft Swyx [00:00:00]: We’re here with Kyle Daigle, COO of GitHub. Welcome. Kyle [00:00:07]: Hey, thanks for having me. Swyx [00:00:08]: You’re not just COO of GitHub. People know you as that. You have a new role. Kyle [00:00:11]: So I have an expanded role now. I’ve been working at GitHub for thirteen years and doing all things developer. Joined as a developer myself. And now, I’m also responsible as the CMO of Developer for Microsoft. 
And so all the kind of learnings and passion for developers and how we work with them and how we communicate and how we bring our products to market, we’re also bringing that expertise to the broader Microsoft ecosystem and helping every developer that uses a Microsoft product or would like to have a sort of similar experience that they’ve had with GitHub over the years. So it’s a different role in some ways, but it’s also just building on the experience that I’ve had at GitHub of just sort of tell the truth, be authentic, show people how to use it and then let the products speak for themselves. Now just doing that with, all of Microsoft. Swyx [00:01:09]: We’ll be releasing this in conjunction with Build. You got lots of stuff planned, and we can sort of touch on that whenever it’s appropriate. I think one of the interesting things is I rarely meet a COO who’s also a CMO. I think you’re a very outward facing and you’re very confident publicly. That’s rare. Do you actually view yourself as COO? What’s What is your thing? From GitHub Developer to COO/CMO: Building the Platform and Operating GitHub Kyle [00:01:33]: I think for me, it’s been funny. The titles have always been, a— have always felt a little strange to me. I joined GitHub as a developer? I wrote so much of the Swyx [00:01:46]: Let’s bring that up. You wrote the back ends? Kyle [00:01:48]: I was going through, I was going through, some old photos, when folks were talking about how things were being built or how there was a build GitHub. I built, webhooks and worked with teams building the API, built the platform layer. Anything that integrated with GitHub, up until really twenty eighteen, I built or ran the engineering teams. And that’s kind of where my the beginning of my passion always was helping people build things, deliver them to, their customers. And so being a developer, building for developers was always super unique. 
In a— I think as my role expanded, it became my ability to talk to not just developers, but also enterprise customers or business leaders and have this translation layer. And then through all those years, GitHub has always operated pretty uniquely. Post-pandemic, working remotely was not as novel as it was when GitHub started in two thousand and eight. But all that expertise of running remote teams, doing it well, became this sort of bigger role, ultimately turning into the COO role of how do we operate GitHub in the way that GitHub’s always operated after the Microsoft acquisition. And kind of so on from there. So like for me, I think the— I’ve, I still code. I love coding but the problem has always been, people. It’s a much harder problem to both support our own employees, a harder problem to communicate to developers and enterprise buyers what we’re building why it matters, ‘cause those are two very different messages. And so getting to work in the mix of COO, CMO, also just being a dev, I think is what’s kept me at GitHub for so long. AI Workflows for Leadership: Commits, Retrospectives, and Context Swyx [00:03:40]: Apparently, you have— your commits have gone up. What’s this? What’s going on? Kyle [00:03:45]: Rui’s called me out pretty aggressively. So I think— as you can imagine, right, you can see my normal era of being a dev In the twenty thirteen, twenty fourteen era, and then moving into management, and then ultimately the COO role. I think what you see there is me, really getting back to coding thanks to AI. I— similar to, attaching problems between how to market and how to operate a business and how to code, I find, building agents and workflows that are connecting very disparate problems to be what’s driving this. So that’s, some of it’s writing software. A lot of it is, connecting a ton of a different data sources to, help me out. 
But that is completely me really diving in on the AI side and trying out our tools, trying out everyone’s tools, but building for me, building for the non-technical leader, though I’m technical, and how we’re able to use these tools for more than just the simple call and response that I think a lot of the non-technical, your employers, you have to get-- y

    1h 23m
  8. Jun 1

    Why Video Agent models are next — Ethan He, xAI Grok Imagine

    We’re announcing AIEWF speakers this week! Take the AI Engineering Survey! Today’s guest Ethan first joined us for the LS Paper Club as the lead on NVIDIA Cosmos World Model, but then joined xAI and built Grok Imagine in 3 months: He comes back on Latent Space with some nuclear hot takes: that Video Models primarily get their intelligence from LLMs, not from training on video data, and that the next frontier for truly interactive, realtime, long-horizon world models is to work on LLMs (perhaps Interaction Models as well…) Put it this way: In the near term, the next Sora won’t be a better video model, but a video agent. Generative Media may more closely follow the evolution of AI coding which went from focusing on one-shot output performance and cost, to multiturn reasoning and planning models for agents and systems that can plan, edit, test, debug, and submit PRs. At a certain point, coding models got so good that the only significant next step to improve performance was handling the orchestration of these models. Now as the performance of video models increases significantly across realism, consistency, & prompt adherence while becoming more cost efficient, the next evolution of video generation may also be systems that can plan, generate, edit, critique, and iterate across an entire creative task. In this episode, Ethan joins swyx and Vibhu to unpack what it actually takes to build frontier image and video systems: data, VAEs, diffusion transformers, audio-video alignment, inference speedups, and the hidden cost of storing and moving massive video datasets. From building NVIDIA’s Cosmos world model to joining xAI as Grok Imagine was being built from zero to one, Ethan He has been at the center of some of the most important work in video generation, multimodal models, and real-time world models. 
We go deep on Grok Imagine: how a small xAI team shipped its first multimodal video model in three months, why iteration speed matters more than almost anything in model development, and why many of the biggest gains come from fixing tiny bugs in data and training pipelines.

Flipbook: The future of Videomaxxing

Video agents are almost a sure bet to be the trend in the coming year. We end with a glance at what’s beyond video agents: Flipbook caused a minor sensation this year when it was released, but most treat it as a fun demo. Ethan takes it very seriously: with the speed and cost of inference coming down every year, the future of custom video JIT UI is closer than you think. We talked about why videogen models may become the front end of AI, how generative UI could replace traditional HTML/CSS, why world models need to be real-time, interactive, and long-horizon, and why the future of video generation may depend more on language models and agents than on diffusion alone.

We discuss:
* Why fast iteration mattered more than meetings
* Why small training bugs can drive huge model quality gains
* Why coding models may make compute the bottleneck again
* How image and video models are trained with synthetic captions
* The role of VAEs and latent space in frontier video models
* Why image models are the foundation for video models
* The tradeoff between temporal compression and real-time interactivity
* Flipbook, Neural OS, and the future of generative UI
* Why future interfaces may go from user intent to pixels
* The hidden cost of training video models: storage, egress, and GPU hours
* How step distillation and consistency models (like OpenAI sCM) make video inference orders of magnitude faster
* Grok Imagine 0.9 and large-scale audio-video generation
* Why audio-video alignment is harder than text-video alignment
* Ethan’s definition of world models
* Reference-to-video, video extension, and long-context video generation
* Why xAI’s research communication undersells Grok Imagine
* How xAI culture shaped the speed of development
* AI watermarking, SynthID, and detecting generated media
* Why prompt rewriting matters for video models
* Grok Imagine Agent and the rise of video agents
* Why language models may unlock better video generation
* Robotics, physical AI, and embodied world models
* Why Ethan left xAI and shifted focus toward LLMs
* Self-managed context, memory, and the next frontier for language models

Ethan He
* LinkedIn: https://www.linkedin.com/in/ethanhe42
* X: https://x.com/EthanHe_42

Timestamps
00:00:00 Introduction
00:01:25 From NVIDIA Cosmos to xAI
00:03:24 Building Grok Imagine from Zero to One
00:10:07 How Image and Video Models Are Trained
00:18:53 Video Compression, VAEs, and Real-Time Tradeoffs
00:22:10 Generative UI, Flipbook, and Neural OS
00:32:10 The Cost of Training Large Video Models
00:37:04 Distillation, GANs, and Fast Video Inference
00:41:21 Audio-Video Generation and Grok Imagine 0.9
00:48:34 What Makes a World Model?
00:55:51 Reference Videos, Long Context, and Video Memory
01:00:11 xAI Culture, Research, and First-Principles Building
01:09:45 AI Safety, Watermarking, and Prompt Rewriting
01:13:10 Video Agents and AI-Assisted Creation
01:27:32 Why Language Models Unlock Better Video
01:31:15 Robotics, Physical AI, and Embodied World Models
01:32:38 Why Ethan Left xAI
01:34:16 Self-Managed Context and the Future of LLMs
01:38:43 Ethan’s Career Path and Closing Thoughts

Transcript

Introduction: Ethan He, Latent Space, and the Path to xAI

Swyx [00:00:00]: We’re here in the studio with Ethan He, most recently of xAI. Welcome.

Ethan [00:00:10]: Thank you. Glad to be here.

Swyx [00:00:11]: We’re also here with Vibhu. You first joined the Latent Space world because you were working on Cosmos at NVIDIA, and you did a paper. We loved it. You presented it as well, so thank you for doing that.

Ethan [00:00:23]: Actually, I also presented the MoEs twice at Latent Space.
Swyx [00:00:29]: How did you actually hear about us? Did we reach out to you? Is that how it worked?

Ethan [00:00:33]: No, actually, the community. I realized, oh, there is this online community where people talk about AI and also learn from each other through papers every week at the Paper Club. It’s very nice.

Ethan [00:00:49]: I learned a lot.

Swyx [00:00:49]: I think three years nonstop. We haven’t stopped even on Christmas and New Year’s. Many weeks I want to stop, but it keeps going.

Vibhu [00:00:58]: No, that was good. I think you had posted that you worked on a paper, and I was like, “Oh, very cool. We have Paper Club. Present then.”

Vibhu [00:01:04]: But I might have reached out to you after.

Swyx [00:01:05]: Because it’s an amateur club, right?

Swyx [00:01:08]: So it’s very unusual, but we sometimes have paper authors come by and actually explain the paper. Today we just did the poolside paper, which was apparently very good.

Vibhu [00:01:18]: Came out yesterday.

Vibhu [00:01:19]: Pretty interesting, right? Fully open. They talk about everything, systems. So it’s a good one. We’ll recommend people read it.

Swyx [00:01:25]: Bring us up to speed on your transition to xAI, ‘cause I actually don’t even know when you joined. Just tell the story of the transition.

From NVIDIA Cosmos to xAI: Scaling Video and World Models

Ethan [00:01:34]: Before xAI, I was working on the Cosmos world model at NVIDIA. Cosmos is a giant video foundation model that aims to simulate the world, and it serves as a foundation for all of the roboticists to build on top of. There, once I built the Cosmos one, I realized this thing also has a scaling law similar to language models, so we need to scale up the video models further. That’s why I realized I need to move to somewhere with much more compute resources. That’s how I--

Swyx [00:02:13]: Than NVIDIA?

Vibhu [00:02:14]: The GPU rich came themselves.
Vibhu [00:02:19]: And timeline-wise, when was Cosmos? It was pretty early, right? It was open world model, open paper, everything.

Ethan [00:02:25]: It was end of 2024.

Vibhu [00:02:28]: End of 2024.

Ethan [00:02:30]: Then in mid-2025, I moved to xAI. I joined about the time when xAI was about to build video models and multimodal models. There was no infra, no data, and no model, and as just a few engineers, we built it in three months and released the first model, Grok Imagine 0.9.

Ethan [00:02:55]: And since then, I kept working on video models and moved more from training to post-training of the video models. For example, reference-to-video, kind of like the cameo feature, and video extensions. And before I left, I worked on a world model, leading a small team focused on real-time, long-horizon video generation.

Building Grok Imagine From Scratch in Three Months

Swyx [00:03:24]: Can you give a rough roadmap? Okay, you’re on a brand-new team. Grok previously was only text, or they partnered with BFL for their image gen stuff. What are the building blocks, right? You have compute, data you can procure somewhere. What is the sequence of things that people should think about when setting up a new team?

Vibhu [00:03:43]: Actually even deeper, not just data you can procure. You guys had to go through getting the data too, right? So you shipped it pretty fast, but yeah--

Swyx [00:03:51]: Three months is like--

Vibhu [00:03:52]: From everything--

Swyx [00:03:52]: Actually very surprisingly fast.

Ethan [00:03:55]: One thing, I’d say, is thanks to my experience at NVIDIA, ‘cause the first time, when we were building Cosmos together, we built it for about a year. So this is the second time I’ve done it. I roughly had an idea what to do. I’d say the most important thing is the talent.
Everyone was very strong and clever, very close with each other towards a common goal. So that sped things up a lot. You reduce the communication bandwidth among people, and everyone can work

    1h 43m
4.6
out of 5
102 Ratings

