Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and al

Alessio + swyx

The podcast by and for AI Engineers! In 2023, over 1 million visitors came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. We strive to give you everything from the definitive take on the Current Thing to the first introduction to the tech you'll be using in the next 3 months! We break news and run exclusive interviews with OpenAI, tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space

  1. 2 DAYS AGO

    Generative Video WorldSim, Diffusion, Vision, Reinforcement Learning and Robotics — ICML 2024 Part 1

     Regular tickets are now sold out for Latent Space LIVE! at NeurIPS! We have just announced our last speaker and newest track, friend of the pod Nathan Lambert, who will be recapping 2024 in Reasoning Models like o1! We opened up a handful of late bird tickets for those who are deciding now — use code DISCORDGANG if you need it. See you in Vancouver!

     We’ve been sitting on our ICML recordings for a while (from today’s first-ever SOLO guest cohost, Brittany Walker), and in light of Sora Turbo’s launch (blogpost, tutorials) today, we figured it would be a good time to drop part one, which had been gearing up to be a deep dive into the state of generative video worldsim, with a seamless transition to vision (the opposite modality), and finally robots (their ultimate application).

     Sora, Genie, and the field of Generative Video World Simulators

     Bill Peebles, author of Diffusion Transformers, gave his most recent Sora talk at ICML, which begins our episode:
     * William (Bill) Peebles - SORA (slides)

     Something that is often asked about Sora is how much inductive bias was introduced to achieve these results. Bill references the same principles brought up by Hyung Won Chung from the o1 team - “sooner or later those biases come back to bite you”.

     We also recommend these reads from throughout 2024 on Sora:
     * Lilian Weng’s literature review of Video Diffusion Models
     * Sora API leak
     * Estimates of 100k-700k H100s needed to serve Sora (not Turbo)
     * Artist guides on using Sora for professional storytelling

     Google DeepMind had a remarkably strong presence at ICML on Video Generation Models, winning TWO Best Paper awards for:
     * Genie: Generative Interactive Environments (covered in oral, poster, and workshop)
     * VideoPoet: A Large Language Model for Zero-Shot Video Generation (see website)

     We end this part by taking in Tali Dekel’s talk on The Future of Video Generation: Beyond Data and Scale.

     Part 2: Generative Modeling and Diffusion

     Since 2023, Sander Dieleman’s perspectives (blogpost, tweet) on diffusion as “spectral autoregression in the frequency domain” while working on Imagen and Veo have caught the public imagination, so we highlight his talk:
     * Wading through the noise: an intuitive look at diffusion models

     Then we go to Ben Poole for his talk on Inferring 3D Structure with 2D Priors, including his work on NeRFs and DreamFusion.

     Then we investigate two flow matching papers - one from the Flow Matching co-authors, Ricky T. Q. Chen (FAIR, Meta) - and how the idea is implemented in Stable Diffusion 3 with Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.

     Our last hit on Diffusion is a couple of oral presentations on speech, which we leave you to explore via our audio podcast:
     * NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
     * Speech Self-Supervised Learning Using Diffusion Model Synthetic Data

     Part 3: Vision

     The ICML Test of Time winner was DeCAF, which Trevor Darrell notably called “the OG vision foundation model”. Lucas Beyer’s talk on “Vision in the age of LLMs — a data-centric perspective” was also well received online, and he talked about his journey from Vision Transformers to PaliGemma. We give special honorable mention to MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark.
     Part 4: Reinforcement Learning and Robotics

     We segue from vision into robotics with the help of Ashley Edwards, whose work on both the Gato and the Genie teams at DeepMind is summarized in Learning actions, policies, rewards, and environments from videos alone.

     Brittany highlighted two poster session papers:
     * Behavior Generation with Latent Actions
     * PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
     We also recommend Lerrel Pinto’s On Building General-Purpose Robots.

     However, we must give the lion’s share of space to Chelsea Finn, now founder of Physical Intelligence, who gave FOUR talks on:
     * "What robots have taught me about machine learning"
     * developing robot generalists
     * robots that adapt autonomously
     * how to give feedback to your language model
     * special mention to PI colleague Sergey Levine on Robotic Foundation Models

     We end the podcast with a position paper that links generative environments and RL/robotics: Automatic Environment Shaping is the Next Frontier in RL.

     Timestamps
     * [00:00:00] Intros
     * [00:02:43] Sora - Bill Peebles
     * [00:44:52] Genie: Generative Interactive Environments
     * [01:00:17] Genie interview
     * [01:12:33] VideoPoet: A Large Language Model for Zero-Shot Video Generation
     * [01:30:51] VideoPoet interview - Dan Kondratyuk
     * [01:42:00] Tali Dekel - The Future of Video Generation: Beyond Data and Scale.
     * [02:27:07] Sander Dieleman - Wading through the noise: an intuitive look at diffusion models
     * [03:06:20] Ben Poole - Inferring 3D Structure with 2D Priors
     * [03:30:30] Ricky Chen - Flow Matching
     * [04:00:03] Patrick Esser - Stable Diffusion 3
     * [04:14:30] NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
     * [04:27:00] Speech Self-Supervised Learning Using Diffusion Model Synthetic Data
     * [04:39:00] ICML Test of Time winner: DeCAF
     * [05:03:40] Lucas Beyer: “Vision in the age of LLMs — a data-centric perspective”
     * [05:42:00] Ashley Edwards: Learning actions, policies, rewards, and environments from videos alone.
     * [06:03:30] Behavior Generation with Latent Actions interview
     * [06:09:52] Chelsea Finn: "What robots have taught me about machine learning"
     * [06:56:00] Position: Automatic Environment Shaping is the Next Frontier in RL

     Get full access to Latent Space at www.latent.space/subscribe
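     For readers who want a concrete anchor for the flow matching and rectified flow talks recapped in Part 2 above, here is a minimal toy sketch of the linear-path conditional flow matching objective (our own PyTorch illustration, not code from any of the talks or papers; the tiny MLP and 2-D toy data are purely for demonstration):

```python
# Toy sketch of (linear-path) conditional flow matching, the objective behind
# rectified-flow models. Illustration only, under the assumptions stated above.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Tiny MLP that predicts the velocity field v_theta(x_t, t)."""
    def __init__(self, dim: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, t], dim=-1))

def flow_matching_loss(model: nn.Module, x1: torch.Tensor) -> torch.Tensor:
    """Straight path x_t = (1 - t) * x0 + t * x1; target velocity is x1 - x0."""
    x0 = torch.randn_like(x1)        # noise endpoint at t = 0
    t = torch.rand(x1.shape[0], 1)   # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1      # point on the straight path
    target_v = x1 - x0               # constant velocity along that path
    return ((model(x_t, t) - target_v) ** 2).mean()

model = VelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    # toy "data": a shifted 2-D Gaussian standing in for real samples
    x1 = torch.randn(256, 2) + torch.tensor([4.0, 0.0])
    loss = flow_matching_loss(model, x1)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

     Sampling then amounts to integrating the learned velocity field from noise at t=0 to data at t=1, e.g. with a simple Euler loop.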

  2. 2 DEC

    Bolt.new, Flow Engineering for Code Agents, and >$8m ARR in 2 months as a Claude Wrapper

     The full schedule for Latent Space LIVE! at NeurIPS has been announced, featuring Best of 2024 overview talks for the AI Startup Landscape, Computer Vision, Open Models, Transformers Killers, Synthetic Data, Agents, and Scaling, and speakers from Sarah Guo of Conviction, Roboflow, AI2/Meta, Recursal/Together, HuggingFace, OpenHands and SemiAnalysis. Join us for the IRL event/Livestream!

     Alessio will also be holding a meetup at AWS Re:Invent in Las Vegas this Wednesday. See our new Events page for dates of AI Engineer Summit, Singapore, and World’s Fair in 2025.

     LAST CALL for questions for our big 2024 recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!

     When we first observed that GPT Wrappers are Good, Actually, we did not even have Bolt on our radar. Since we recorded our Anthropic episode discussing building Agents with the new Claude 3.5 Sonnet, Bolt.new (by Stackblitz) has easily cleared the $8m ARR bar, repeating and accelerating its initial $4m feat.

     There are very many AI code generators and VS Code forks out there, but Bolt probably broke through initially because of its incredible zero shot low effort app generation. But as we explain in the pod, Bolt also emphasized deploy (Netlify)/ backend (Supabase)/ fullstack capabilities on top of Stackblitz’s existing WebContainer full-WASM-powered-developer-environment-in-the-browser tech.

     Since then, the team has been shipping like mad (with weekly office hours), with bugfixing, full screen, multi-device, long context, diff based edits (using speculative decoding like we covered in Inference, Fast and Slow). All of this has captured the imagination of low/no code builders like Greg Isenberg and many others on YouTube/TikTok/Reddit/X/Linkedin etc.

     Just as with Fireworks, our relationship with Bolt/Stackblitz goes a bit deeper than normal - swyx advised the launch and got a front row seat to this epic journey, as well as demoed it with Realtime Voice at the recent OpenAI Dev Day. So we are very proud to be the first/closest to tell the full open story of Bolt/Stackblitz!

     Flow Engineering + Qodo/AlphaCodium Update

     In year 2 of the pod we have been on a roll getting former guests to return as guest cohosts (Harrison Chase, Aman Sanger, Jon Frankle), and it was a pleasure to catch Itamar Friedman back on the pod, giving us an update on all things Qodo and Testing Agents from our last catchup a year and a half ago.

     Qodo (they renamed in September) went viral in early January this year with AlphaCodium (paper here, code here) beating DeepMind’s AlphaCode with high efficiency. With a simple problem solving code agent:
     * The first step is to have the model reason about the problem. They describe it using bullet points and focus on the goal, inputs, outputs, rules, constraints, and any other relevant details.
     * Then, they make the model reason about the public tests and come up with an explanation of why the input leads to that particular output.
     * The model generates two to three potential solutions in text and ranks them in terms of correctness, simplicity, and robustness.
     * Then, it generates more diverse tests for the problem, covering cases not part of the original public tests.
     * Iteratively, pick a solution, generate the code, and run it on a few test cases.
     * If the tests fail, improve the code and repeat the process until the code passes every test.
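     As a rough illustration of that flow (a hedged sketch only, not Qodo’s actual implementation: the prompts and helper names below are made up, and the model call is abstracted behind a `generate` callable that any LLM client could satisfy):

```python
from typing import Callable, List

def run_candidate(code: str, test: str) -> bool:
    """Very naive test runner: exec the candidate plus one test; real systems use a sandbox."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(test, namespace)
        return True
    except Exception:
        return False

def alphacodium_style_flow(problem: str, public_tests: List[str],
                           generate: Callable[[str], str], max_iters: int = 5) -> str:
    # 1. Have the model reason about the problem (goal, inputs, outputs, rules, constraints).
    reflection = generate(f"Summarize this problem as bullet points:\n{problem}")
    # 2. Reason about the public tests: why does each input lead to that output?
    test_notes = generate(f"Explain these public tests given the notes:\n{public_tests}\n{reflection}")
    # 3. Draft two to three candidate solutions in prose and rank them.
    ranked = generate(f"Propose and rank 3 solution approaches:\n{reflection}\n{test_notes}")
    # 4. Generate extra, more diverse tests beyond the public ones.
    extra = [t for t in generate(
        f"Write additional edge-case tests (one per line) for:\n{reflection}").splitlines() if t.strip()]
    all_tests = public_tests + extra
    # 5-6. Pick the top solution, generate code, run the tests, and repair until everything passes.
    code = generate(f"Implement the top-ranked approach as Python code:\n{ranked}")
    for _ in range(max_iters):
        failures = [t for t in all_tests if not run_candidate(code, t)]
        if not failures:
            break
        code = generate(f"These tests fail:\n{failures}\nFix the code:\n{code}")
    return code
```

     The point of the structure is that every model call has one narrow job, and the test loop gives the agent an objective signal to iterate against.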
swyx has previously written similar thoughts on types vs tests for putting bounds on program behavior, but AlphaCodium extends this to AI generated tests and code. More recently, Itamar has also shown that AlphaCodium’s techniques also extend well to the o1 models: Making Flow Engineering a useful technique to improve code model performance on every model. This is something we see AI Engineers uniquely well positioned to do compared to ML Engineers/Researchers. Full Video Podcast Like and subscribe! Show Notes * Itamar * Qodo * First episode * Eric * Bolt * StackBlitz * Thinkster * AlphaCodium * WebContainers Chapters * 00:00:00 Introductions & Updates * 00:06:01 Generic vs. Specific AI Agents * 00:07:40 Maintaining vs Creating with AI * 00:17:46 Human vs Agent Computer Interfaces * 00:20:15 Why Docker doesn't work for Bolt * 00:24:23 Creating Testing and Code Review Loops * 00:28:07 Bolt's Task Breakdown Flow * 00:31:04 AI in Complex Enterprise Environments * 00:41:43 AlphaCodium * 00:44:39 Strategies for Breaking Down Complex Tasks * 00:45:22 Building in Open Source * 00:50:35 Choosing a product as a founder * 00:59:03 Reflections on Bolt Success * 01:06:07 Building a B2C GTM * 01:18:11 AI Capabilities and Pricing Tiers * 01:20:28 What makes Bolt unique * 01:23:07 Future Growth and Product Development * 01:29:06 Competitive Landscape in AI Engineering * 01:30:01 Advice to Founders and Embracing AI * 01:32:20 Having a baby and completing an Iron Man Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. Swyx [00:00:12]: Hey, and today we're still in our sort of makeshift in-between studio, but we're very delighted to have a former returning guest host, Itamar. Welcome back. Itamar [00:00:21]: Great to be here after a year or more. Yeah, a year and a half. Swyx [00:00:24]: You're one of our earliest guests on Agents. Now you're CEO co-founder of Kodo. Right. Which has just been renamed. You also raised a $40 million Series A, and we can get caught up on everything, but we're also delighted to have our new guest, Eric. Welcome. Eric [00:00:42]: Thank you. Excited to be here. Should I say Bolt or StackBlitz? Swyx [00:00:45]: Like, is it like its own company now or? Eric [00:00:47]: Yeah. Bolt's definitely bolt.new. That's the thing that we're probably the most known for, I imagine, at this point. Swyx [00:00:54]: Which is ridiculous to say because you were working at StackBlitz for so long. Eric [00:00:57]: Yeah. I mean, within a week, we were doing like double the amount of traffic. And StackBlitz had been online for seven years, and we were like, what? But anyways, yeah. So we're StackBlitz, the company behind bolt.new. If you've heard of bolt.new, that's our stuff. Yeah. Swyx [00:01:12]: Yeah. Itamar [00:01:13]: Excellent. I see, by the way, that the founder mode, you need to know to capture opportunities. So kudos on doing that, right? You're working on some technology, and then suddenly you can exploit that to a new world. Yeah. Eric [00:01:24]: Totally. And I think, well, not to jump, but 100%, I mean, a couple of months ago, we had the idea for Bolt earlier this year, but we haven't really shared this too much publicly. 
But we actually had tried to build it with some of those state-of-the-art models back in January, February, you can kind of imagine which, and they just weren't good enough to actually do the code generation where the code was accurate and it was fast and whatever have you without a ton of like rag, but then there was like issues with that. So we put it on the shelf and then we got kind of a sneak peek of some of the new models that have come out in the past couple of months now. And so once we saw that, once we actually saw the code gen from it, we were like, oh my God, like, okay, we can build a product around this. And so that was really the impetus of us building the thing. But with that, it was StackBlitz, the core StackBlitz product the past seven years has been an IDE for developers. So the entire user experience flow we've built up just didn't make sense. And so when we kind of went out to build Bolt, we just thought, you know, if we were inventing our product today, what would the interface look like given what is now possible with the AI code gen? And so there's definitely a lot of conversations we had internally, but you know, just kind of when we logically laid it out, we were like, yeah, I think it makes sense to just greenfield a new thing and let's see what happens. If it works great, then we'll figure it out. If it doesn't work great, then it'll get deleted at some point. So that's kind of how it actually came to be. Swyx [00:02:49]: I'll mention your background a little bit. You were also founder of Thinkster before you started StackBlitz. So both of you are second time founders. Both of you have sort of re-founded your company recently. Yours was more of a rename. I think a slightly different direction as well. And then we can talk about both. Maybe just chronologically, should we get caught up on where Kodo is first and then you know, just like what people should know since the last pod? Sure. Itamar [00:03:12]: The last pod was two months after we launched and we basically had the vision that we talked about. The idea that software development is about specification, test and code, etc. We are more on the testing part as in essence, we think that if you solve testing, you solve software development. The beautiful chart that we'll put up on screen. And testing is a really big field, like there are many dimensions, unit testing, the level of the component, how big it is, how large it is. And then there is like different type of testing, is it regression or smoke or whatever. So back then we only had like one ID extension with unit tests as in focus. One and a half year later, first ID extension supports more type of testing as context aware. We index local, local repos, but also 10,000s of repos for Fortune 500 companies. We have another agent, another tool that is called, the pure agent is the open source and the commercial one is CodoMerge. And then we have another open source called CoverAgent, which is not yet a commercial product coming very soon. It's very impressive. It could be that already people are approving automated pull requests that they don't even aware in really big open sources. So once we have enough of these, we will also launch another agent. So for the fir

    1h 39m
  3. 28 NOV

    The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic

     We have announced our first speaker, friend of the show Dylan Patel, and topic slates for Latent Space LIVE! at NeurIPS. Sign up for IRL/Livestream and to debate!

     We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!

     The vibe shift we observed in July - in favor of Claude 3.5 Sonnet, first introduced in June — has been remarkably long lived and persistent, surviving multiple subsequent updates of 4o, o1 and Gemini versions. Anthropic’s Claude ends 2024 as the preferred model for AI Engineers and even the exclusive choice for new code agents like bolt.new (our next guest on the pod!), which unlocked so much performance from Claude Sonnet that it went from $0 to $4m ARR in 4 weeks when it launched last month.

     Anthropic has now raised an additional $4b from Amazon and made an incredibly well received update of Claude 3.5 Sonnet (and Haiku), making significant improvements in performance over its predecessors.

     Solving SWE-Bench

     As part of the October Sonnet release, Anthropic teased a blink-and-you’ll-miss-it result:

     The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. It also improves performance on TAU-bench, an agentic tool use task, from 62.6% to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain. The new Claude 3.5 Sonnet offers these advancements at the same price and speed as its predecessor.

     This was followed up by a blogpost a week later from today’s guest, Erik Schluntz, the engineer who implemented and scored this SOTA result using a simple, non-overengineered version of the SWE-Agent framework (you can see the submissions here). We have previously covered the SWE-Bench story extensively:
     * Speaking with SWEBench/SWEAgent authors at ICLR
     * Speaking with Cosine Genie, the previous SOTA (43.8%) on SWEBench Verified (with brief update at DevDay 2024)
     * Speaking with Shunyu Yao on SWEBench and the ReAct paradigm driving SWE-Agent

     Among the notable inclusions in this blogpost are the tools that Erik decided to give Claude, e.g. the “Edit Tool”. The tools teased in the SWEBench submission/blogpost were then polished up and released with Computer Use… And you can also see even more computer use tools given in the new Model Context Protocol servers.

     Claude Computer Use

     Because it is one of the best received AI releases of the year, we recommend watching the 2-minute Computer Use intro (and related demos) in its entirety. Erik also worked on Claude’s function calling, tool use, and computer use APIs, so we discuss that in the episode.

     Erik [00:53:39]: With computer use, just give the thing a browser that's logged into what you want to integrate with, and it's going to work immediately. And I see that reduction in friction as being incredibly exciting. Imagine a customer support team where, okay, hey, you got this customer support bot, but you need to go integrate it with all these things. And you don't have any engineers on your customer support team.
     But if you can just give the thing a browser that's logged into your systems that you need it to have access to, now, suddenly, in one day, you could be up and rolling with a fully integrated customer service bot that could go do all the actions you care about. So I think that's the most exciting thing for me about computer use, is reducing that friction of integrations to almost zero.

     As you’ll see, this is very top of mind for Erik as a former robotics founder whose company basically used robots to interface with human physical systems like elevators.

     Full Video episode

     Please like and subscribe!

     Show Notes
     * Erik Schluntz
     * “Raising the bar on SWE-Bench Verified”
     * Cobalt Robotics
     * SWE-Bench
     * SWE-Bench Verified
     * Human Eval & other benchmarks
     * Anthropic Workbench
     * Aider
     * Cursor
     * Fireworks AI
     * E2B
     * Amanda Askell
     * Toyota Research
     * Physical Intelligence (Pi)
     * Chelsea Finn
     * Josh Albrecht
     * Eric Jang
     * 1X
     * Dust
     * Cosine Episode
     * Bolt
     * Adept Episode
     * TauBench
     * LMSys Episode

     Timestamps
     * [00:00:00] Introductions
     * [00:03:39] What is SWE-Bench?
     * [00:12:22] SWE-Bench vs HumanEval vs others
     * [00:15:21] SWE-Agent architecture and runtime
     * [00:21:18] Do you need code indexing?
     * [00:24:50] Giving the agent tools
     * [00:27:47] Sandboxing for coding agents
     * [00:29:16] Why not write tests?
     * [00:30:31] Redesigning engineering tools for LLMs
     * [00:35:53] Multi-agent systems
     * [00:37:52] Why XML so good?
     * [00:42:57] Thoughts on agent frameworks
     * [00:45:12] How many turns can an agent do?
     * [00:47:12] Using multiple model types
     * [00:51:40] Computer use and agent use cases
     * [00:59:04] State of AI robotics
     * [01:04:24] Robotics in manufacturing
     * [01:05:01] Hardware challenges in robotics
     * [01:09:21] Is self-driving a good business?

     Transcript

     Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners. And today we're in the new studio with my usual co-host, Shawn from Smol AI. Swyx [00:00:14]: Hey, and today we're very blessed to have Erik Schluntz from Anthropic with us. Welcome. Erik [00:00:19]: Hi, thanks very much. I'm Erik Schluntz. I'm a member of technical staff at Anthropic, working on tool use, computer use, and Swebench. Swyx [00:00:27]: Yeah. Well, how did you get into just the whole AI journey? I think you spent some time at SpaceX as well? Yeah. And robotics. Yeah. There's a lot of overlap between like the robotics people and the AI people, and maybe like there's some interlap or interest between language models for robots right now. Maybe just a little bit of background on how you got to where you are. Yeah, sure. Erik [00:00:50]: I was at SpaceX a long time ago, but before joining Anthropic, I was the CTO and co-founder of Cobalt Robotics. We built security and inspection robots. These are sort of five foot tall robots that would patrol through an office building or a warehouse looking for anything out of the ordinary. Very friendly, no tasers or anything. We would just sort of call a remote operator if we saw anything. We have about 100 of those out in the world, and had a team of about 100. We actually got acquired about six months ago, but I had left Cobalt about a year ago now, because I was starting to get a lot more excited about AI. I had been writing a lot of my code with things like Copilot, and I was like, wow, this is actually really cool. If you had told me 10 years ago that AI would be writing a lot of my code, I would say, hey, I think that's AGI.
And so I kind of realized that we had passed this level, like, wow, this is actually really useful for engineering work. That got me a lot more excited about AI and learning about large language models. So I ended up taking a sabbatical and then doing a lot of reading and research myself and decided, hey, I want to go be at the core of this and joined Anthropic. Alessio [00:01:53]: And why Anthropic? Did you consider other labs? Did you consider maybe some of the robotics companies? Erik [00:02:00]: So I think at the time I was a little burnt out of robotics, and so also for the rest of this, any sort of negative things I say about robotics or hardware is coming from a place of burnout, and I reserve my right to change my opinion in a few years. Yeah, I looked around, but ultimately I knew a lot of people that I really trusted and I thought were incredibly smart at Anthropic, and I think that was the big deciding factor to come there. I was like, hey, this team's amazing. They're not just brilliant, but sort of like the most nice and kind people that I know, and so I just felt like I could be a really good culture fit. And ultimately, I do care a lot about AI safety and making sure that I don't want to build something that's used for bad purposes, and I felt like the best chance of that was joining Anthropic. Alessio [00:02:39]: And from the outside, these labs kind of look like huge organizations that have these obscure Swyx [00:02:44]: ways to organize. Alessio [00:02:45]: How did you get, you joined Anthropic, did you already know you were going to work on of the stuff you publish or you kind of join and then you figure out where you land? I think people are always curious to learn more. Erik [00:02:57]: Yeah, I've been very happy that Anthropic is very bottoms up and sort of very sort of receptive to whatever your interests are. And so I joined sort of being very transparent of like, hey, I'm most excited about code generation and AI that can actually go out and sort of touch the world or sort of help people build things. And, you know, those weren't my initial initial projects. I also came in and said, hey, I want to do the most valuable possible thing for this company and help Anthropic succeed. And, you know, like, let me find the balance of those. So I was working on lots of things at the beginning, you know, function calling, tool use. And then sort of as it became more and more relevant, I was like, oh, hey, like, let's it's time to go work on encoding agents and sort of started looking at SWE-Bench as sort of a really good benchmark for that. Swyx [00:03:39]: So let's get right into SWE-Bench. That's one of the many claims to fame. I feel like there's just been a series of releases related with Cloud 3.5 Sonnet around about two or three months ago, 3.5 Sonnet came out and it was it was a step ahead in terms of a lot of people immediately fell in love with it for coding. And then last month you released a new updated version of Cloud Sonnet. We're not going to talk about the training for that because that's s

    1h 11m
  4. 25 NOV

    Why Compound AI + Open Source will beat Closed AI

    We have a full slate of upcoming events: AI Engineer London, AWS Re:Invent in Las Vegas, and now Latent Space LIVE! at NeurIPS in Vancouver and online. Sign up to join and speak! We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show! We try to stay close to the inference providers as part of our coverage, as our podcasts with Together AI and Replicate will attest: However one of the most notable pull quotes from our very well received Braintrust episode was his opinion that open source model adoption has NOT gone very well and is actually declining in relative market share terms (it is of course increasing in absolute terms): Today’s guest, Lin Qiao, would wholly disagree. Her team of Pytorch/GPU experts are wholly dedicated toward helping you serve and finetune the full stack of open source models from Meta and others, across all modalities (Text, Audio, Image, Embedding, Vision-understanding), helping customers like Cursor and Hubspot scale up open source model inference both rapidly and affordably. Fireworks has emerged after its successive funding rounds with top tier VCs as one of the leaders of the Compound AI movement, a term first coined by the Databricks/Mosaic gang at Berkeley AI and adapted as “Composite AI” by Gartner: Replicating o1 We are the first podcast to discuss Fireworks’ f1, their proprietary replication of OpenAI’s o1. This has become a surprisingly hot area of competition in the past week as both Nous Forge and Deepseek r1 have launched competitive models. Full Video Podcast Like and subscribe! Timestamps * 00:00:00 Introductions * 00:02:08 Pre-history of Fireworks and PyTorch at Meta * 00:09:49 Product Strategy: From Framework to Model Library * 00:13:01 Compound AI Concept and Industry Dynamics * 00:20:07 Fireworks' Distributed Inference Engine * 00:22:58 OSS Model Support and Competitive Strategy * 00:29:46 Declarative System Approach in AI * 00:31:00 Can OSS replicate o1? * 00:36:51 Fireworks f1 * 00:41:03 Collaboration with Cursor and Speculative Decoding * 00:46:44 Fireworks quantization (and drama around it) * 00:49:38 Pricing Strategy * 00:51:51 Underrated Features of Fireworks Platform * 00:55:17 Hiring Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner at CTO at Danceable Partners, and I'm joined by my co-host, Swyx founder, Osmalayar. Swyx [00:00:11]: Hey, and today we're in a very special studio inside the Fireworks office with Lin Qiang, CEO of Fireworks. Welcome. Yeah. Lin [00:00:20]: Oh, you should welcome us. Swyx [00:00:21]: Yeah, welcome. Yeah, thanks for having us. It's unusual to be in the home of a startup, but it's also, I think our relationship is a bit unusual compared to all our normal guests. Definitely. Lin [00:00:34]: Yeah. I'm super excited to talk about very interesting topics in that space with both of you. Swyx [00:00:41]: You just celebrated your two-year anniversary yesterday. Lin [00:00:43]: Yeah, it's quite a crazy journey. We circle around and share all the crazy stories across these two years, and it has been super fun. All the way from we experienced Silicon Valley bank run to we delete some data that shouldn't be deleted operationally. We went through a massive scale where we actually are busy getting capacity to, yeah, we learned to kind of work with it as a team with a lot of brilliant people across different places to join a company. It has really been a fun journey. 
Alessio [00:01:24]: When you started, did you think the technical stuff will be harder or the bank run and then the people side? I think there's a lot of amazing researchers that want to do companies and it's like the hardest thing is going to be building the product and then you have all these different other things. So, were you surprised by what has been your experience the most? Lin [00:01:42]: Yeah, to be honest with you, my focus has always been on the product side and then after the product goes to market. And I didn't realize the rest has been so complicated, operating a company and so on. But because I don't think about it, I just kind of manage it. So it's done. I think I just somehow don't think about it too much and solve whatever problem coming our way and it worked. Swyx [00:02:08]: So let's, I guess, let's start at the pre-history, the initial history of Fireworks. You ran the PyTorch team at Meta for a number of years and we previously had Sumit Chintal on and I think we were just all very interested in the history of GenEI. Maybe not that many people know how deeply involved Faire and Meta were prior to the current GenEI revolution. Lin [00:02:35]: My background is deep in distributed system, database management system. And I joined Meta from the data side and I saw this tremendous amount of data growth, which cost a lot of money and we're analyzing what's going on. And it's clear that AI is driving all this data generation. So it's a very interesting time because when I joined Meta, Meta is going through ramping down mobile-first, finishing the mobile-first transition and then starting AI-first. And there's a fundamental reason about that sequence because mobile-first gave a full range of user engagement that has never existed before. And all this user engagement generated a lot of data and this data power AI. So then the whole entire industry is also going through, falling through this same transition. When I see, oh, okay, this AI is powering all this data generation and look at where's our AI stack. There's no software, there's no hardware, there's no people, there's no team. I want to dive up there and help this movement. So when I started, it's very interesting industry landscape. There are a lot of AI frameworks. It's a kind of proliferation of AI frameworks happening in the industry. But all the AI frameworks focus on production and they use a very certain way of defining the graph of neural network and then use that to drive the model iteration and productionization. And PyTorch is completely different. So they could also assume that he was the user of his product. And he basically says, researchers face so much pain using existing AI frameworks, this is really hard to use and I'm going to do something different for myself. And that's the origin story of PyTorch. PyTorch actually started as the framework for researchers. They don't care about production at all. And as they grow in terms of adoption, so the interesting part of AI is research is the top of our normal production. There are so many researchers across academic, across industry, they innovate and they put their results out there in open source and that power the downstream productionization. So it's brilliant for MATA to establish PyTorch as a strategy to drive massive adoption in open source because MATA internally is a PyTorch shop. So it creates a flying wheel effect. So that's kind of a strategy behind PyTorch. 
But when I took on PyTorch, it's kind of at Caspo, MATA established PyTorch as the framework for both research and production. So no one has done that before. And we have to kind of rethink how to architect PyTorch so we can really sustain production workload, the stability, reliability, low latency, all this production concern was never a concern before. Now it's a concern. And we actually have to adjust its design and make it work for both sides. And that took us five years because MATA has so many AI use cases, all the way from ranking recommendation as powering the business top line or as ranking newsfeed, video ranking to site integrity detect bad content automatically using AI to all kinds of effects, translation, image classification, object detection, all this. And also across AI running on the server side, on mobile phones, on AI VR devices, the wide spectrum. So by the time we actually basically managed to support AI across ubiquitous everywhere across MATA. But interestingly, through open source engagement, we work with a lot of companies. It is clear to us like this industry is starting to take on AI first transition. And of course, MATA's hyperscale always go ahead of industry. And it feels like when we start this AI journey at MATA, there's no software, no hardware, no team. For many companies we engage with through PyTorch, we feel the pain. That's the genesis why we feel like, hey, if we create fireworks and support industry going through this transition, it will be a huge amount of impact. Of course, the problem that the industry is facing will not be the same as MATA. MATA is so big, right? So it's kind of skewed towards extreme scale and extreme optimization in the industry will be different. But we feel like we have the technical chop and we've seen a lot. We'll look to kind of drive that. So yeah, so that's how we started. Swyx [00:06:58]: When you and I chatted about the origins of fireworks, it was originally envisioned more as a PyTorch platform, and then later became much more focused on generative AI. Is that fair to say? What was the customer discovery here? Lin [00:07:13]: Right. So I would say our initial blueprint is we should build a PyTorch cloud because a PyTorch library and there's no SaaS platform to enable AI workloads. Swyx [00:07:26]: Even in 2022, it's interesting. Lin [00:07:28]: I would not say absolutely no, but cloud providers have some of those, but it's not first class citizen, right? At 2022, there's still like TensorFlow is massively in production. And this is all pre-gen AI, and PyTorch is kind of getting more and more adoption. But there's no PyTorch-first SaaS platform existing. At the same time, we are also a very pragmatic set of people. We really want to make sure from the get-go, we get really, really close to customers. We understand their use case, we understand their pain points, we understand the value we deliver to them. So we want to take a different approach in

    58 min
  5. 15 NOV

    Agents @ Work: Lindy.ai

     Alessio will be at AWS re:Invent next week and hosting a casual coffee meetup on Wednesday, RSVP here! And subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups!

     We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!

     If you've been following the AI agents space, you have heard of Lindy AI; while founder Flo Crivello is hesitant to call it "blowing up," when folks like Andrew Wilkinson start obsessing over your product, you're definitely onto something. In our latest episode, Flo walked us through Lindy's evolution from late 2022 to now, revealing some design choices about agent platform design that go against conventional wisdom in the space.

     The Great Reset: From Text Fields to Rails

     Remember late 2022? Everyone was "LLM-pilled," believing that if you just gave a language model enough context and tools, it could do anything. Lindy 1.0 followed this pattern:
     * Big prompt field ✅
     * Bunch of tools ✅
     * Prayer to the LLM gods ✅

     Fast forward to today, and Lindy 2.0 looks radically different. As Flo put it (~17:00 in the episode): "The more you can put your agent on rails, one, the more reliable it's going to be, obviously, but two, it's also going to be easier to use for the user."

     Instead of a giant, intimidating text field, users now build workflows visually (see the illustrative sketch just before the takeaways below):
     * Trigger (e.g., "Zendesk ticket received")
     * Required actions (e.g., "Check knowledge base")
     * Response generation

     This isn't just a UI change - it's a fundamental rethinking of how to make AI agents reliable. As Swyx noted during our discussion: "Put Shoggoth in a box and make it a very small, minimal viable box. Everything else should be traditional if-this-then-that software."

     The Surprising Truth About Model Limitations

     Here's something that might shock folks building in the space: with Claude 3.5 Sonnet, the model is no longer the bottleneck. Flo's exact words (~31:00): "It is actually shocking the extent to which the model is no longer the limit. It was the limit a year ago. It was too expensive. The context window was too small."

     Some context: Lindy started when context windows were 4K tokens. Today, their system prompt alone is larger than that. But what's really interesting is what this means for platform builders:
     * Raw capabilities aren't the constraint anymore
     * Integration quality matters more than model performance
     * User experience and workflow design are the new bottlenecks

     The Search Engine Parallel: Why Horizontal Platforms Might Win

     One of the spiciest takes from our conversation was Flo's thesis on horizontal vs. vertical agent platforms. He draws a fascinating parallel to search engines (~56:00): "I find it surprising the extent to which a horizontal search engine has won... You go through Google to search Reddit. You go through Google to search Wikipedia... search in each vertical has more in common with search than it does with each vertical."

     His argument: agent platforms might follow the same pattern because:
     * Agents across verticals share more commonalities than differences
     * There's value in having agents that can work together under one roof
     * The R&D cost of getting agents right is better amortized across use cases

     This might explain why we're seeing early vertical AI companies starting to expand horizontally. The core agent capabilities - reliability, context management, tool integration - are universal needs.
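     To make the "rails" idea concrete, here is a purely hypothetical sketch of a trigger-plus-steps workflow of the kind described above. Lindy itself is a visual no-code product, so none of these names or APIs are theirs; the point is only that the trigger and required actions are plain deterministic code, and the model is confined to one narrow step:

```python
# Hypothetical "agent on rails" workflow: deterministic trigger and steps,
# with the LLM confined to a single, narrowly scoped step. Illustration only.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Step:
    name: str
    run: Callable[[Dict], Dict]   # each step reads and extends a shared context dict

@dataclass
class Workflow:
    trigger: str                  # e.g. "zendesk.ticket_received"
    steps: List[Step] = field(default_factory=list)

    def handle(self, event: Dict) -> Dict:
        ctx: Dict = {"event": event}
        for step in self.steps:   # rails: the steps always run, in this order
            ctx.update(step.run(ctx))
        return ctx

def check_knowledge_base(ctx: Dict) -> Dict:
    # Deterministic retrieval step (a real system would do a RAG lookup here).
    return {"kb_snippets": ["Refunds are processed within 5 business days."]}

def draft_reply(ctx: Dict) -> Dict:
    # The only LLM step: a narrow job with the retrieved context already attached.
    prompt = f"Ticket: {ctx['event']['body']}\nKB notes: {ctx['kb_snippets']}\nDraft a short reply."
    return {"draft": f"[model call stubbed out; prompt was: {prompt[:70]}...]"}

support_bot = Workflow(
    trigger="zendesk.ticket_received",
    steps=[Step("check_knowledge_base", check_knowledge_base),
           Step("draft_reply", draft_reply)],
)
print(support_bot.handle({"body": "Where is my refund?"})["draft"])
```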
     What This Means for Builders

     If you're building in the AI agents space, here are the key takeaways:
     * Constrain First: Rather than maximizing capabilities, focus on reliable execution within narrow bounds
     * Integration Quality Matters: With model capabilities plateauing, your competitive advantage lies in how well you integrate with existing tools
     * Memory Management is Key: Flo revealed they actively prune agent memories - even with larger context windows, not all memories are useful
     * Design for Discovery: Lindy's visual workflow builder shows how important interface design is for adoption

     The Meta Layer

     There's a broader lesson here about AI product development. Just as Lindy evolved from "give the LLM everything" to "constrain intelligently," we might see similar evolution across the AI tooling space. The winners might not be those with the most powerful models, but those who best understand how to package AI capabilities in ways that solve real problems reliably.

     Full Video Podcast

     Flo’s talk at AI Engineer Summit

     Chapters
     * 00:00:00 Introductions
     * 00:04:05 AI engineering and deterministic software
     * 00:08:36 Lindys demo
     * 00:13:21 Memory management in AI agents
     * 00:18:48 Hierarchy and collaboration between Lindys
     * 00:21:19 Vertical vs. horizontal AI tools
     * 00:24:03 Community and user engagement strategies
     * 00:26:16 Rickrolling incident with Lindy
     * 00:28:12 Evals and quality control in AI systems
     * 00:31:52 Model capabilities and their impact on Lindy
     * 00:39:27 Competition and market positioning
     * 00:42:40 Relationship between Factorio and business strategy
     * 00:44:05 Remote work vs. in-person collaboration
     * 00:49:03 Europe vs US Tech
     * 00:58:59 Testing the Overton window and free speech
     * 01:04:20 Balancing AI safety concerns with business innovation

     Show Notes
     * Lindy.ai
     * Rick Rolling
     * Flo on X
     * TeamFlow
     * Andrew Wilkinson
     * Dust
     * Poolside.ai
     * SB1047
     * Gathertown
     * Sid Sijbrandij
     * Matt Mullenweg
     * Factorio
     * Seeing Like a State

     Transcript

     Alessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. Swyx [00:00:12]: Hey, and today we're joined in the studio by Florent Crivello. Welcome. Flo [00:00:15]: Hey, yeah, thanks for having me. Swyx [00:00:17]: Also known as Altimore. I always wanted to ask, what is Altimore? Flo [00:00:21]: It was the name of my character when I was playing Dungeons & Dragons. Always. I was like 11 years old. Swyx [00:00:26]: What was your classes? Flo [00:00:27]: I was an elf. I was a magician elf. Swyx [00:00:30]: Well, you're still spinning magic. Right now, you're a solo founder and CEO of Lindy.ai. What is Lindy? Flo [00:00:36]: Yeah, we are a no-code platform letting you build your own AI agents easily. So you can think of we are to LangChain as Airtable is to MySQL. Like you can just pin up AI agents super easily by clicking around and no code required. You don't have to be an engineer and you can automate business workflows that you simply could not automate before in a few minutes. Swyx [00:00:55]: You've been in our orbit a few times. I think you spoke at our Latent Space anniversary. You spoke at my summit, the first summit, which was a really good keynote. And most recently, like we actually already scheduled this podcast before this happened. But Andrew Wilkinson was like, I'm obsessed by Lindy. He's just created a whole bunch of agents. So basically, why are you blowing up? Flo [00:01:16]: Well, thank you.
I think we are having a little bit of a moment. I think it's a bit premature to say we're blowing up. But why are things going well? We revamped the product majorly. We called it Lindy 2.0. I would say we started working on that six months ago. We've actually not really announced it yet. It's just, I guess, I guess that's what we're doing now. And so we've basically been cooking for the last six months, like really rebuilding the product from scratch. I think I'll list you, actually, the last time you tried the product, it was still Lindy 1.0. Oh, yeah. If you log in now, the platform looks very different. There's like a ton more features. And I think one realization that we made, and I think a lot of folks in the agent space made the same realization, is that there is such a thing as too much of a good thing. I think many people, when they started working on agents, they were very LLM peeled and chat GPT peeled, right? They got ahead of themselves in a way, and us included, and they thought that agents were actually, and LLMs were actually more advanced than they actually were. And so the first version of Lindy was like just a giant prompt and a bunch of tools. And then the realization we had was like, hey, actually, the more you can put your agent on Rails, one, the more reliable it's going to be, obviously, but two, it's also going to be easier to use for the user, because you can really, as a user, you get, instead of just getting this big, giant, intimidating text field, and you type words in there, and you have no idea if you're typing the right word or not, here you can really click and select step by step, and tell your agent what to do, and really give as narrow or as wide a guardrail as you want for your agent. We started working on that. We called it Lindy on Rails about six months ago, and we started putting it into the hands of users over the last, I would say, two months or so, and I think things really started going pretty well at that point. The agent is way more reliable, way easier to set up, and we're already seeing a ton of new use cases pop up. Swyx [00:03:00]: Yeah, just a quick follow-up on that. You launched the first Lindy in November last year, and you were already talking about having a DSL, right? I remember having this discussion with you, and you were like, it's just much more reliable. Is this still the DSL under the hood? Is this a UI-level change, or is it a bigger rewrite? Flo [00:03:17]: No, it is a much bigger rewrite. I'll give you a concrete example. Suppose you want to have an agent that observes your Zendesk tickets, and it's like, hey, every time you receive a Zendesk ticket, I want you to check my knowledge base, so it's like a RAG module and whatnot, and then answer the ticket. The way it used to work with Lindy before was, you would type the prompt asking it to do that. You check my knowledge base, and so on and so forth. The problem with doing

    1h 10m
  6. 11 NOV

    Agents @ Work: Dust.tt

     We are recording our next big recap episode and taking questions! Submit questions and messages on Speakpipe here for a chance to appear on the show! Also subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups!

     In our first ever episode with Logan Kilpatrick we called out the two hottest LLM frameworks at the time: LangChain and Dust. We’ve had Harrison from LangChain on twice (as a guest and as a co-host), and we’ve now finally come full circle as Stanislas from Dust joined us in the studio.

     After stints at Oracle and Stripe, Stan had joined OpenAI to work on mathematical reasoning capabilities. He describes his time at OpenAI as "the PhD I always wanted to do" while acknowledging the challenges of research work: "You're digging into a field all day long for weeks and weeks, and you find something, you get super excited for 12 seconds. And at the 13 seconds, you're like, 'oh, yeah, that was obvious.' And you go back to digging." This experience, combined with early access to GPT-4's capabilities, shaped his decision to start Dust: "If we believe in AGI and if we believe the timelines might not be too long, it's actually the last train leaving the station to start a company. After that, it's going to be computers all the way down."

     The History of Dust

     Dust's journey can be broken down into three phases:
     * Developer Framework (2022): Initially positioned as a competitor to LangChain, Dust started as a developer tooling platform. While both were open source, their approaches differed – LangChain focused on broad community adoption and integration as a pure developer experience, while Dust emphasized UI-driven development and better observability that wasn’t just `print` statements.
     * Browser Extension (Early 2023): The company pivoted to building XP1, a browser extension that could interact with web content. This experiment helped validate user interaction patterns with AI, even while using less capable models than GPT-4.
     * Enterprise Platform (Current): Today, Dust has evolved into an infrastructure platform for deploying AI agents within companies, with impressive metrics like 88% daily active users in some deployments.

     The Case for Being Horizontal

     The big discussion for early stage companies today is whether to be horizontal or vertical. Since models are so good at general tasks, a lot of companies are building vertical products that take care of a workflow end-to-end in order to offer more value and become more of “Services as Software”. Dust on the other hand is a platform for the users to build their own experiences, which has had a few advantages:
     * Maximum Penetration: Dust reports 60-70% weekly active users across entire companies, demonstrating the potential reach of horizontal solutions rather than selling into a single team.
     * Emergent Use Cases: By allowing non-technical users to create agents, Dust enables use cases to emerge organically from actual business needs rather than prescribed solutions.
     * Infrastructure Value: The platform approach creates lasting value through maintained integrations and connections, similar to how Stripe's value lies in maintaining payment infrastructure. Rather than relying on third-party integration providers, Dust maintains its own connections to ensure proper handling of different data types and structures.

     The Vertical Challenge

     However, this approach comes with trade-offs:
     * Harder Go-to-Market: As Stan talked about: "We spike at penetration... but it makes our go-to-market much harder.
Vertical solutions have a go-to-market that is much easier because they're like, 'oh, I'm going to solve the lawyer stuff.'" * Complex Infrastructure: Building a horizontal platform requires maintaining numerous integrations and handling diverse data types appropriately – from structured Salesforce data to unstructured Notion pages. As you scale integrations, the cost of maintaining them also scales. * Product Surface Complexity: Creating an interface that's both powerful and accessible to non-technical users requires careful design decisions, down to avoiding technical terms like "system prompt" in favor of "instructions." The Future of AI Platforms Stan initially predicted we'd see the first billion-dollar single-person company in 2023 (a prediction later echoed by Sam Altman), but he's now more focused on a different milestone: billion-dollar companies with engineering teams of just 20 people, enabled by AI assistance. This vision aligns with Dust's horizontal platform approach – building the infrastructure that allows small teams to achieve outsized impact through AI augmentation. Rather than replacing entire job functions (the vertical approach), they're betting on augmenting existing workflows across organizations. Full YouTube Episode Chapters * 00:00:00 Introductions * 00:04:33 Joining OpenAI from Paris * 00:09:54 Research evolution and compute allocation at OpenAI * 00:13:12 Working with Ilya Sutskever and OpenAI's vision * 00:15:51 Leaving OpenAI to start Dust * 00:18:15 Early focus on browser extension and WebGPT-like functionality * 00:20:20 Dust as the infrastructure for agents * 00:24:03 Challenges of building with early AI models * 00:28:17 LLMs and Workflow Automation * 00:35:28 Building dependency graphs of agents * 00:37:34 Simulating API endpoints * 00:40:41 State of AI models * 00:43:19 Running evals * 00:46:36 Challenges in building AI agents infra * 00:49:21 Buy vs. build decisions for infrastructure components * 00:51:02 Future of SaaS and AI's Impact on Software * 00:53:07 The single employee $1B company race * 00:56:32 Horizontal vs. vertical approaches to AI agents Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. Swyx [00:00:11]: Hey, and today we're in a studio with Stanislas, welcome. Stan [00:00:14]: Thank you very much for having me. Swyx [00:00:16]: Visiting from Paris. Stan [00:00:17]: Paris. Swyx [00:00:18]: And you have had a very distinguished career. It's very hard to summarize, but you went to college in both Ecopolytechnique and Stanford, and then you worked in a number of places, Oracle, Totems, Stripe, and then OpenAI pre-ChatGPT. We'll talk, we'll spend a little bit of time about that. About two years ago, you left OpenAI to start Dust. I think you were one of the first OpenAI alum founders. Stan [00:00:40]: Yeah, I think it was about at the same time as the Adept guys, so that first wave. Swyx [00:00:46]: Yeah, and people really loved our David episode. We love a few sort of OpenAI stories, you know, for back in the day, like we're talking about pre-recording. Probably the statute of limitations on some of those stories has expired, so you can talk a little bit more freely without them coming after you. But maybe we'll just talk about, like, what was your journey into AI? You know, you were at Stripe for almost five years, there are a lot of Stripe alums going into OpenAI. 
I think the Stripe culture has come into OpenAI quite a bit. Stan [00:01:11]: Yeah, so I think the buses of Stripe people really started flowing in, I guess, after ChatGPT. But, yeah, my journey into AI is a... I mean, Greg Brockman. Yeah, yeah. From Greg, of course. And Daniela, actually, back in the days, Daniela Amodei. Swyx [00:01:27]: Yes, she was COO, I mean, she is COO, yeah. She had a pretty high job at OpenAI at the time, yeah, for sure. Stan [00:01:34]: My journey started as anybody else, you're fascinated with computer science and you want to make them think, it's awesome, but it doesn't work. I mean, it was a long time ago, it was like maybe 16, so it was 25 years ago. Then the first big exposure to AI would be at Stanford, and I'm going to, like, disclose a whole lamb, because at the time it was a class taught by Andrew Ng, and there was no deep learning. It was half features for vision and a star algorithm. So it was fun. But it was the early days of deep learning. At the time, I think a few years after, it was the first project at Google. But you know, that cat face or the human face trained from many images. I went to, hesitated doing a PhD, more in systems, eventually decided to go into getting a job. Went at Oracle, started a company, did a gazillion mistakes, got acquired by Stripe, worked with Greg Buckman there. And at the end of Stripe, I started interesting myself in AI again, felt like it was the time, you had the Atari games, you had the self-driving craziness at the time. And I started exploring projects, it felt like the Atari games were incredible, but there were still games. And I was looking into exploring projects that would have an impact on the world. And so I decided to explore three things, self-driving cars, cybersecurity and AI, and math and AI. It's like I sing it by a decreasing order of impact on the world, I guess. Swyx [00:03:01]: Discovering new math would be very foundational. Stan [00:03:03]: It is extremely foundational, but it's not as direct as driving people around. Swyx [00:03:07]: Sorry, you're doing this at Stripe, you're like thinking about your next move. Stan [00:03:09]: No, it was at Stripe, kind of a bit of time where I started exploring. I did a bunch of work with friends on trying to get RC cars to drive autonomously. Almost started a company in France or Europe about self-driving trucks. We decided to not go for it because it was probably very operational. And I think the idea of the company, of the team wasn't there. And also I realized that if I wake up a day and because of a bug I wrote, I killed a family, it would be a bad experience. And so I just decided like, no, that's just too crazy. And then I explored cybersecurity with a friend. We're trying to apply transformers to cut fuzzing. So cut fuzzing, you have kind of an algorithm that goes really fast and tries to mutate the inputs of a library to find bugs.

    1 hr
  7. 1 NOV

    In the Arena: How LMSys changed LLM Benchmarking Forever

Apologies for lower audio quality; we lost recordings and had to use backup tracks. Our guests today are Anastasios Angelopoulos and Wei-Lin Chiang, leads of Chatbot Arena, fka LMSYS, the crowdsourced AI evaluation platform developed by the LMSys student club at Berkeley, which became the de facto standard for comparing language models. For many folks, Arena Elo is now cited more often than MMLU scores, and the platform has attracted >1,000,000 people to cast votes since its launch, leading top model trainers to cite it over their own formal academic benchmarks: The Limits of Static Benchmarks We’ve done two benchmarks episodes: Benchmarks 101 and Benchmarks 201. One issue we’ve always brought up with static benchmarks is that 1) many are getting saturated, with models scoring almost perfectly on them, and 2) they often don’t reflect production use cases, making it hard for developers and users to use them as guidance. The fundamental challenge in AI evaluation isn't technical - it's philosophical. How do you measure something that increasingly resembles human intelligence? Rather than trying to define intelligence upfront, Arena lets users interact naturally with models and collects comparative feedback. It's messy and subjective, but that's precisely the point - it captures the full spectrum of what people actually care about when using AI. The Pareto Frontier of Cost vs Intelligence Because the Elo scores are remarkably stable over time, we can put all the chat models on a map against their respective cost to gain a view of at least 3 orders of magnitude of model sizes/costs and observe the remarkable shift in intelligence per dollar over the past year: This frontier stood remarkably firm through the recent releases of o1-preview and price cuts of Gemini 1.5: The Statistics of Subjectivity In our Benchmarks 201 episode, Clémentine Fourrier from HuggingFace thought this design choice was one of the shortcomings of arenas: they aren’t reproducible. You don’t know who ranked what and what exactly the outcome was at the time of ranking. That same person might rank the same pair of outputs differently on a different day, or might ask harder questions to better models compared to smaller ones, making it imbalanced. Another argument that people have brought up is confirmation bias. We know humans prefer longer responses and are swayed by formatting - Rob Mulla from Dreadnode had found some interesting data on this in May: The approach LMArena is taking is to use logistic regression to decompose human preferences into constituent factors. As Anastasios explains: "We can say what components of style contribute to human preference and how they contribute." By adding these style components as parameters, they can mathematically "suck out" their influence and isolate the core model capabilities. This extends beyond just style - they can control for any measurable factor: "What if I want to look at the cost adjusted performance? Parameter count? We can ex post facto measure that." This is one of the most interesting things about Arena: you have a data generation engine which you can clean and turn into leaderboards later. If you wanted to create a leaderboard for poetry writing, you could get existing data from Arena and normalize it by identifying these style components. Whether or not it’s possible to really understand WHAT bias the voters have, that’s a different question. 
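Conceptually, the style control described above is a Bradley-Terry model fit as a logistic regression with extra covariates for style. Here is a minimal sketch with made-up battle data and two illustrative style features (a length delta and a markdown delta); it shows the shape of the idea rather than Arena's actual pipeline:

```python
# Minimal sketch of "style control": Bradley-Terry as logistic regression,
# with extra style covariates so per-model coefficients are estimated net of
# style effects. Battles, feature names, and numbers are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["model-a", "model-b", "model-c"]
idx = {m: i for i, m in enumerate(models)}

# Each battle: (left model, right model, [length delta, markdown delta], 1 if left wins).
battles = [
    ("model-a", "model-b", [+0.8, +1.0], 1),
    ("model-b", "model-a", [-0.2, 0.0], 0),
    ("model-a", "model-c", [+0.1, -1.0], 1),
    ("model-c", "model-b", [+0.5, +2.0], 1),
    ("model-b", "model-c", [0.0, 0.0], 0),
]

n_style = 2
X, y = [], []
for left, right, style_delta, left_wins in battles:
    row = np.zeros(len(models) + n_style)
    row[idx[left]] += 1.0   # +1 for the left model...
    row[idx[right]] -= 1.0  # ...-1 for the right model (Bradley-Terry difference)
    row[len(models):] = style_delta
    X.append(row)
    y.append(left_wins)

clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
strength = clf.coef_[0][: len(models)]      # per-model ability, net of style
style_effects = clf.coef_[0][len(models):]  # how much length/markdown sway votes

# Rescale abilities to an Elo-like range purely for readability.
elo_like = 1000 + 400 * (strength - strength.mean())
print(dict(zip(models, elo_like.round(1))), style_effects.round(2))
```

The key property is that the per-model coefficients are now estimated while holding the style terms fixed, which is the "suck out their influence" step Anastasios describes.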
Private Evals One of the most delicate challenges LMSYS faces is maintaining trust while collaborating with AI labs. The concern is that labs could game the system by testing multiple variants privately and only releasing the best performer. This was brought up when GPT-4o-mini was released and ranked as the second best model on the leaderboard: But this fear misunderstands how Arena works. Unlike static benchmarks where selection bias is a major issue, Arena's live nature means any initial bias gets washed out by ongoing evaluation. As Anastasios explains: "In the long run, there's way more fresh data than there is data that was used to compare these five models." The other big question is WHAT model is actually being tested; as people often discuss on X / Discord, the same endpoint will randomly feel “nerfed”, as happened with “Claude European summer” and the corresponding conspiracy theories: It’s hard to keep track of these performance changes in Arena as these changes (if real…?) are not observable. The Future of Evaluation The team's latest work on RouteLLM points to an interesting future where evaluation becomes more granular and task-specific. But they maintain that even simple routing strategies can be powerful - like directing complex queries to larger models while handling simple tasks with smaller ones (see the sketch below). Arena is now going to expand beyond text into multimodal evaluation and specialized domains like code execution and red teaming. But their core insight remains: the best way to evaluate intelligence isn't to simplify it into metrics, but to embrace its complexity and find rigorous ways to analyze it. To go after this vision, they are spinning out Arena from LMSys, which will stay as an academia-driven group at Berkeley. 
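To make the "simple routing strategy" concrete, here is an illustrative sketch. The model names, the complexity heuristic, and the threshold are placeholders, and RouteLLM itself learns its router from preference data rather than hard-coding one:

```python
# Illustrative sketch of query routing: send hard-looking queries to a large
# model and easy ones to a small, cheap one. Names, heuristic, and threshold
# are made up; this is not RouteLLM's learned router.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Route:
    name: str
    cost_per_1k_tokens: float
    call: Callable[[str], str]  # stand-in for a real API client

def fake_small_model(q: str) -> str:
    return f"[small model answer to: {q[:40]}...]"

def fake_large_model(q: str) -> str:
    return f"[large model answer to: {q[:40]}...]"

SMALL = Route("small-cheap-model", 0.15, fake_small_model)
LARGE = Route("large-frontier-model", 5.00, fake_large_model)

def complexity_score(query: str) -> float:
    """Crude proxy for difficulty: length plus a few 'reasoning-y' keywords."""
    keywords = ("prove", "derive", "step by step", "debug", "optimize")
    score = min(len(query) / 400, 1.0)
    score += 0.3 * sum(k in query.lower() for k in keywords)
    return min(score, 1.0)

def route(query: str, threshold: float = 0.5) -> str:
    target = LARGE if complexity_score(query) >= threshold else SMALL
    return f"{target.name} (${target.cost_per_1k_tokens}/1k tok): {target.call(query)}"

if __name__ == "__main__":
    print(route("What's the capital of France?"))
    print(route("Debug this race condition and prove the fix is correct, step by step."))
```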
Full Video Podcast Chapters * 00:00:00 - Introductions * 00:01:16 - Origin and development of Chatbot Arena * 00:05:41 - Static benchmarks vs. Arenas * 00:09:03 - Community building * 00:13:32 - Biases in human preference evaluation * 00:18:27 - Style Control and Model Categories * 00:26:06 - Impact of o1 * 00:29:15 - Collaborating with AI labs * 00:34:51 - RouteLLM and router models * 00:38:09 - Future of LMSys / Arena Show Notes * Anastasios Angelopoulos * Anastasios' NeurIPS Paper Conformal Risk Control * Wei-Lin Chiang * Chatbot Arena * LMSys * MTBench * ShareGPT dataset * Stanford's Alpaca project * LLMRouter * E2B * Dreadnode Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. Swyx [00:00:14]: Hey, and today we're very happy and excited to welcome Anastasios and Wei Lin from LMSys. Welcome guys. Wei Lin [00:00:21]: Hey, how's it going? Nice to see you. Anastasios [00:00:23]: Thanks for having us. Swyx [00:00:24]: Anastasios, I actually saw you, I think at last year's NeurIPS. You were presenting a paper, which I don't really super understand, but it was some theory paper about how your method was very dominating over other sort of search methods. I don't remember what it was, but I remember that you were a very confident speaker. Anastasios [00:00:40]: Oh, I totally remember you. Didn't ever connect that, but yes, that's definitely true. Yeah. Nice to see you again. Swyx [00:00:46]: Yeah. I was frantically looking for the name of your paper and I couldn't find it. Basically I had to cut it because I didn't understand it. Anastasios [00:00:51]: Is this conformal PID control or was this the online control? Wei Lin [00:00:55]: Blast from the past, man. Swyx [00:00:57]: Blast from the past. It's always interesting how NeurIPS and all these academic conferences are sort of six months behind what people are actually doing, but conformal risk control, I would recommend people check it out. I have the recording. I just never published it just because I was like, I don't understand this enough to explain it. Anastasios [00:01:14]: People won't be interested. Wei Lin [00:01:15]: It's all good. Swyx [00:01:16]: But ELO scores, ELO scores are very easy to understand. You guys are responsible for the biggest revolution in language model benchmarking in the last few years. Maybe you guys want to introduce yourselves and maybe tell a little bit of the brief history of LMSys. Wei Lin [00:01:32]: Hey, I'm Wei Lin. I'm a fifth year PhD student at UC Berkeley, working on Chatbot Arena these days, doing crowdsourced AI benchmarking. Anastasios [00:01:43]: I'm Anastasios. I'm a sixth year PhD student here at Berkeley. I did most of my PhD on like theoretical statistics and sort of foundations of model evaluation and testing. And now I'm working 150% on this Chatbot Arena stuff. It's great. Alessio [00:02:00]: And what was the origin of it? How did you come up with the idea? How did you get people to buy in? And then maybe what were one or two of the pivotal moments early on that kind of made it the standard for these things? Wei Lin [00:02:12]: Yeah, yeah. The Chatbot Arena project was started last year in April, May, around then. Before that, we were basically experimenting in the lab with how to fine tune an open source chatbot based on the Llama 1 model that Meta released. At that time, Llama 1 was like a base model and people didn't really know how to fine tune it. So we were doing some explorations. We were inspired by Stanford's Alpaca project. So we basically, yeah, grew a data set from the internet, which is called the ShareGPT data set, which is like a dialogue data set of user and ChatGPT conversations. It turns out to be like pretty high quality dialogue data. So we fine tuned on it and then we trained it and released the model called Vicuna. And people were very excited about it because it kind of demonstrated that an open weight model can reach conversation capability similar to ChatGPT. And then we basically released the model and also built a demo website for the model. People were very excited about it. But during the development, the biggest challenge to us at the time was like, how do we even evaluate it? How do we even argue this model we trained is better than others? And then what's the gap between this open source model and other proprietary offerings? At that time, it was like GPT-4 was just announced and there was Claude 1. What's the difference between them? And then after that, like every week, there's a new model being fine tuned, released. So even until still now, right? And then we have that demo website for Vicuna now. And then we thought like, okay, maybe we ca
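For a concrete picture of the fine-tuning data step Wei Lin describes, here is a minimal sketch that flattens ShareGPT-style conversations into prompt/target pairs for supervised fine-tuning. The "from"/"value" fields follow the common ShareGPT JSON layout, but the chat template is a placeholder rather than Vicuna's exact training recipe:

```python
# Illustrative sketch: turn ShareGPT-style conversations into (prompt, target)
# pairs for supervised fine-tuning. The chat template below is a placeholder,
# not the exact Vicuna training format.
import json

SYSTEM = "A chat between a curious user and a helpful AI assistant."

def to_sft_pairs(conversation: list[dict]) -> list[tuple[str, str]]:
    """conversation: [{"from": "human"|"gpt", "value": "..."}, ...]"""
    pairs, context = [], SYSTEM
    for turn in conversation:
        if turn["from"] == "human":
            context += f"\nUSER: {turn['value']}\nASSISTANT:"
        elif turn["from"] == "gpt":
            # The assistant turn is the training target; everything before is the prompt.
            pairs.append((context, " " + turn["value"]))
            context += " " + turn["value"]
    return pairs

if __name__ == "__main__":
    sample = [
        {"from": "human", "value": "Explain Elo ratings in one sentence."},
        {"from": "gpt", "value": "Elo ranks players from pairwise game outcomes."},
        {"from": "human", "value": "Now in the context of chatbots?"},
        {"from": "gpt", "value": "Chatbot Arena applies the same idea to votes between two model answers."},
    ]
    for prompt, target in to_sft_pairs(sample):
        print(json.dumps({"prompt": prompt, "target": target})[:120], "...")
```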

    41 min
  8. 25 OCT

    How NotebookLM Was Made

If you’ve listened to the podcast for a while, you might have heard our ElevenLabs-powered AI co-host Charlie a few times. Text-to-speech has made amazing progress in the last 18 months, with OpenAI’s Advanced Voice Mode (aka “Her”) as a sneak peek of the future of AI interactions (see our “Building AGI in Real Time” recap). Yet, we had yet to see a real killer app for AI voice (not counting music). Today’s guests, Raiza Martin and Usama Bin Shafqat, are the lead PM and AI engineer behind the NotebookLM feature flag that gave us the first viral AI voice experience, the “Deep Dive” podcast: The idea behind the “Audio Overviews” feature is simple: take a bunch of documents, websites, YouTube videos, etc, and generate a podcast out of them (a rough sketch of this flow appears after the PM rules below). This was one of the first demos that people built with voice models + RAG + GPT models, but it was always glorified text-to-speech. Raiza and Usama took a very different approach: * Make it conversational: when you listen to a NotebookLM audio there are a ton of micro-interjections (Steven Johnson calls them disfluencies) like “Oh really?” or “Totally”, as well as pauses and “uh…”, like you would expect in a real conversation. These are not generated by the LLM in the transcript, but they are built into the audio model. See ~28:00 in the pod for more details. * Listeners love tension: if two people are always in agreement on everything, it’s not super interesting. They tuned the model to generate flowing conversations that mirror the tone and rhythm of human speech. They did not confirm this, but many suspect the year-old SoundStorm paper is related to this model. * Generating new insights: because the hosts’ goal is not to summarize, but to entertain, the model comes up with funny metaphors and comparisons that actually help expand on the content rather than just paraphrasing like most models do. We have had listeners make podcasts out of our podcasts, like this one. This is different than your average SOTA-chasing, MMLU-driven model buildooor. Putting product and AI engineering in the same room, having them build evals together, and understanding what the goal is lets you get these unique results. The 5 rules for AI PMs We always focus on AI Engineers, but this episode had a ton of AI PM nuggets as well, which we wanted to collect as NotebookLM is one of the most successful products in the AI space: 1. Less is more: the first version of the product had 0 customization options. All you could do was give it source documents, and then press a button to generate. Most users don’t know what “temperature” or “top-k” are, so you’re often taking the magic away by adding more options in the UI. Since recording they added a few, like a system prompt, but those were features that users were “hacking in”, as Simon Willison highlighted in his blog post. 2. Use Real-Time Feedback: they built a community of 65,000 users on Discord that is constantly reporting issues and giving feedback; sometimes they noticed server downtime even before the Google internal monitoring did. Getting real time pings > aggregating user data when doing initial iterations. 3. Embrace Non-Determinism: AI output variability is a feature, not a bug. Rather than limiting the outputs from the get-go, build toggles that you can turn on/off with feature flags as the feedback starts to roll in. 4. Curate with Taste: if you try your product and it sucks, you don’t need more data to confirm it. Just scrap that and iterate again. This is even easier for a product like this; if you start listening to one of the podcasts and turn it off after 10 seconds, it’s never a good sign. 5. Stay Hands-On: It’s hard to build taste if you don’t experiment. Trying out all your competitors’ products as well as unrelated tools really helps you understand what users are seeing in market, and how to improve on it. 
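As promised above, here is a rough, hypothetical sketch of the documents-to-script-to-audio flow. None of this is the team's actual implementation: the prompt, function names, and voices are placeholders, and as noted, the real disfluencies are added by the audio model rather than written into the transcript by the LLM.

```python
# Hypothetical sketch of an "Audio Overviews"-style pipeline: turn source
# documents into a two-host script with an LLM, then synthesize each turn with
# a per-host TTS voice. llm() and tts() are stand-ins, not any real API.
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str  # "HOST_A" or "HOST_B"
    text: str

def llm(prompt: str) -> str:
    """Stand-in for a chat model call; returns a canned two-line script."""
    return ("HOST_A: So what did you make of these documents?\n"
            "HOST_B: Honestly? They read less like notes and more like a treasure map.")

def tts(text: str, voice: str) -> bytes:
    """Stand-in for a text-to-speech call; returns fake audio bytes."""
    return f"<{voice} audio: {text}>".encode()

SCRIPT_PROMPT = """Write a lively two-host podcast discussing the sources below.
The hosts should occasionally disagree and reach for fresh metaphors instead of
summarizing. Prefix every line with 'HOST_A:' or 'HOST_B:'.

Sources:
{sources}
"""

def make_audio_overview(sources: list[str]) -> list[bytes]:
    script = llm(SCRIPT_PROMPT.format(sources="\n\n".join(sources)))
    turns = [
        Turn(*line.split(":", 1))
        for line in script.splitlines()
        if line.startswith(("HOST_A:", "HOST_B:"))
    ]
    voices = {"HOST_A": "voice-1", "HOST_B": "voice-2"}
    # In NotebookLM the interjections and pauses come from the audio model itself.
    return [tts(turn.text.strip(), voices[turn.speaker]) for turn in turns]

if __name__ == "__main__":
    for clip in make_audio_overview(["An example source document."]):
        print(clip)
```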
Chapters * 00:00 Introductions * 01:39 From Project Tailwind to NotebookLM * 09:25 Learning from 65,000 Discord members * 12:15 How NotebookLM works * 18:00 Working with Steven Johnson * 23:00 How to prioritize features * 25:13 Structuring the data pipelines * 29:50 How to eval * 34:34 Steering the podcast outputs * 37:51 Defining speakers personalities * 39:04 How do you make audio engaging? * 45:47 Humor is AGI * 51:38 Designing for non-determinism * 53:35 API when? * 55:05 Multilingual support and dialect considerations * 57:50 Managing system prompts and feature requests * 01:00:58 Future of NotebookLM * 01:04:59 Podcasts for your codebase * 01:07:16 Plans for real-time chat * 01:08:27 Wrap up Show Notes * Notebook LM * AI Test Kitchen * Nicholas Carlini * Steven Johnson * Wealth of Nations * Histories of Mysteries by Andrej Karpathy * chicken.pdf Threads * Area 120 * Raiza Martin * Usama Bin Shafqat Transcript NotebookLM [00:00:00]: Hey everyone, we're here today as guests on Latent Space. It's great to be here, I'm a long time listener and fan, they've had some great guests on this show before. Yeah, what an honor to have us, the hosts of another podcast, join as guests. I mean a huge thank you to Swyx and Alessio for the invite, thanks for having us on the show. Yeah really, it seems like they brought us here to talk a little bit about our show, our podcast. Yeah, I mean we've had lots of listeners ourselves, listeners at Deep Dive. Oh yeah, we've made a ton of audio overviews since we launched and we're learning a lot. There's probably a lot we can share around what we're building next, huh? Yeah, we'll share a little bit at least. The short version is we'll keep learning and getting better for you. We're glad you're along for the ride. So yeah, keep listening. Keep listening and stay curious. We promise to keep diving deep and bringing you even better options in the future. Stay curious. Alessio [00:00:52]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners. And I'm joined by my co-host, Swyx, founder of Smol.ai. Swyx [00:01:01]: Hey, and today we're back in the studio with our special guest, Raiza Martin. And Raiza, I forgot to get your last name, Shafqat. Raiza [00:01:10]: Yes. Swyx [00:01:10]: Okay, welcome. Raiza [00:01:12]: Hello, thank you for having us. Swyx [00:01:14]: So AI podcasters meet human podcasters, always fun. Congrats on the success of Notebook LM. I mean, how does it feel? Raiza [00:01:22]: It's been a lot of fun. A lot of it, honestly, was unexpected. But my favorite part is really listening to the audio overviews that people have been making. Swyx [00:01:29]: Maybe we should do a little bit of intros and tell the story. You know, what is your path into the sort of Google AI org? Or maybe, actually, I don't even know what org you guys are in. Raiza [00:01:39]: I can start. My name is Raiza. I lead the Notebook LM team inside of Google Labs. So specifically, that's the org that we're in. It's called Google Labs. It's only about two years old. And our whole mandate is really to build AI products. That's it. We work super closely with DeepMind. 
Our entire thing is just, like, try a bunch of things and see what's landing with users. And the background that I have is, really, I worked in payments before this, and I worked in ads right before, and then startups. I tell people, like, at every time that I changed orgs, I actually almost quit Google. Like, specifically, like, in between ads and payments, I was like, all right, I can't do this. Like, this is, like, super hard. I was like, it's not for me. I'm, like, a very zero-to-one person. But then I was like, okay, I'll try. I'll interview with other teams. And when I interviewed in payments, I was like, oh, these people are really cool. I don't know if I'm, like, a super good fit with this space, but I'll try it because the people are cool. And then I really enjoyed that, and then I worked on, like, zero-to-one features inside of payments, and I had a lot of fun. But then the time came again where I was like, oh, I don't know. It's like, it's time to leave. It's time to start my own thing. But then I interviewed inside of Google Labs, and I was like, oh, darn. Like, there's definitely, like— Alessio [00:02:48]: They got you again. Raiza [00:02:49]: They got me again. And so now I've been here for two years, and I'm happy that I stayed because especially with, you know, the recent success of Notebook LM, I'm like, dang, we did it. I actually got to do it. So that was really cool. Usama [00:03:02]: Kind of similar, honestly. I was at a big team at Google. We do sort of the data center supply chain planning stuff. Google has, like, the largest sort of footprint. Obviously, there's a lot of management stuff to do there. But then there was this thing called Area 120 at Google, which does not exist anymore. But I sort of wanted to do, like, more zero-to-one building and landed a role there. We were trying to build, like, a creator commerce platform called Kaya. It launched briefly a couple years ago. But then Area 120 sort of transitioned and morphed into Labs. And, like, over the last few years, like, the focus just got a lot clearer. Like, we were trying to build new AI products and do it in the wild and sort of co-create and all of that. So, you know, we've just been trying a bunch of different things. And this one really landed, which has felt pretty phenomenal. Really, really landed. Swyx [00:03:53]: Let's talk about the brief history of Notebook LM. You had a tweet, which is very helpful for doing research. May 2023, during Google I.O., you announced Project Tailwind. Raiza [00:04:03]: Yeah. Swyx [00:04:03]: So today is October 2024. So you joined October 2022? Raiza [00:04:09]: Actually, I used to lead AI Test Kitchen. And this was actually, I think, not I.O. 2023. I.O. 2022 is when we launched AI Test Kitchen, or announced it. And I don't know if you remember it. Swyx [00:04:23]: That's how you, like, had the basic prototype for Gemini. Raiza [00:04:26]: Yes, yes, exactly. La

    1h 14m
