Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and al

Alessio + swyx
Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and al

The podcast by and for AI Engineers! In 2023, over 1 million visitors came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. Striving to give you both the definitive take on the Current Thing down to the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from OpenAI, tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space www.latent.space

  1. 2 DAYS AGO

    Agents @ Work: Dust.tt

    We are recording our next big recap episode and taking questions! Submit questions and messages on Speakpipe here for a chance to appear on the show! Also subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups! In our first ever episode with Logan Kilpatrick we called out the two hottest LLM frameworks at the time: LangChain and Dust. We’ve had Harrison from LangChain on twice (as a guest and as a co-host), and we’ve now finally come full circle as Stanislas from Dust joined us in the studio. After stints at Oracle and Stripe, Stan had joined OpenAI to work on mathematical reasoning capabilities. He describes his time at OpenAI as "the PhD I always wanted to do" while acknowledging the challenges of research work: "You're digging into a field all day long for weeks and weeks, and you find something, you get super excited for 12 seconds. And at the 13 seconds, you're like, 'oh, yeah, that was obvious.' And you go back to digging." This experience, combined with early access to GPT-4's capabilities, shaped his decision to start Dust: "If we believe in AGI and if we believe the timelines might not be too long, it's actually the last train leaving the station to start a company. After that, it's going to be computers all the way down." The History of Dust Dust's journey can be broken down into three phases: * Developer Framework (2022): Initially positioned as a competitor to LangChain, Dust started as a developer tooling platform. While both were open source, their approaches differed – LangChain focused on broad community adoption and integration as a pure developer experience, while Dust emphasized UI-driven development and better observability that wasn’t just `print` statements. * Browser Extension (Early 2023): The company pivoted to building XP1, a browser extension that could interact with web content. This experiment helped validate user interaction patterns with AI, even while using less capable models than GPT-4. * Enterprise Platform (Current): Today, Dust has evolved into an infrastructure platform for deploying AI agents within companies, with impressive metrics like 88% daily active users in some deployments. The Case for Being Horizontal The big discussion for early stage companies today is whether or not to be horizontal or vertical. Since models are so good at general tasks, a lot of companies are building vertical products that take care of a workflow end-to-end in order to offer more value and becoming more of “Services as Software”. Dust on the other hand is a platform for the users to build their own experiences, which has had a few advantages: * Maximum Penetration: Dust reports 60-70% weekly active users across entire companies, demonstrating the potential reach of horizontal solutions rather than selling into a single team. * Emergent Use Cases: By allowing non-technical users to create agents, Dust enables use cases to emerge organically from actual business needs rather than prescribed solutions. * Infrastructure Value: The platform approach creates lasting value through maintained integrations and connections, similar to how Stripe's value lies in maintaining payment infrastructure. Rather than relying on third-party integration providers, Dust maintains its own connections to ensure proper handling of different data types and structures. The Vertical Challenge However, this approach comes with trade-offs: * Harder Go-to-Market: As Stan talked about: "We spike at penetration... but it makes our go-to-market much harder. Vertical solutions have a go-to-market that is much easier because they're like, 'oh, I'm going to solve the lawyer stuff.'" * Complex Infrastructure: Building a horizontal platform requires maintaining numerous integrations and handling diverse data types appropriately – from structured Salesforce data to unstructured Notion pages. As you scale integrations, the cost of maintaining them also scales. * Product Surface Complexity: Creating an in

    1 hr
  2. 1 NOV

    In the Arena: How LMSys changed LLM Benchmarking Forever

    Apologies for lower audio quality; we lost recordings and had to use backup tracks. Our guests today are Anastasios Angelopoulos and Wei-Lin Chiang, leads of Chatbot Arena, fka LMSYS, the crowdsourced AI evaluation platform developed by the LMSys student club at Berkeley, which became the de facto standard for comparing language models. Arena Elo is often more cited than MMLU scores to many folks, and they have attracted >1,000,000 people to cast votes since its launch, leading top model trainers to cite them over their own formal academic benchmarks: The Limits of Static Benchmarks We’ve done two benchmarks episodes: Benchmarks 101 and Benchmarks 201. One issue we’ve always brought up with static benchmarks is that 1) many are getting saturated, with models scoring almost perfectly on them 2) they often don’t reflect production use cases, making it hard for developers and users to use them as guidance. The fundamental challenge in AI evaluation isn't technical - it's philosophical. How do you measure something that increasingly resembles human intelligence? Rather than trying to define intelligence upfront, Arena let users interact naturally with models and collect comparative feedback. It's messy and subjective, but that's precisely the point - it captures the full spectrum of what people actually care about when using AI. The Pareto Frontier of Cost vs Intelligence Because the Elo scores are remarkably stable over time, we can put all the chat models on a map against their respective cost to gain a view of at least 3 orders of magnitude of model sizes/costs and observe the remarkable shift in intelligence per dollar over the past year: This frontier stood remarkably firm through the recent releases of o1-preview and price cuts of Gemini 1.5: The Statistics of Subjectivity In our Benchmarks 201 episode, Clémentine Fourrier from HuggingFace thought this design choice was one of shortcomings of arenas: they aren’t reproducible. You don’t know who ranked what and what exactly the outcome was at the time of ranking. That same person might rank the same pair of outputs differently on a different day, or might ask harder questions to better models compared to smaller ones, making it imbalanced. Another argument that people have brought up is confirmation bias. We know humans prefer longer responses and are swayed by formatting - Rob Mulla from Dreadnode had found some interesting data on this in May: The approach LMArena is taking is to use logistic regression to decompose human preferences into constituent factors. As Anastasios explains: "We can say what components of style contribute to human preference and how they contribute." By adding these style components as parameters, they can mathematically "suck out" their influence and isolate the core model capabilities. This extends beyond just style - they can control for any measurable factor: "What if I want to look at the cost adjusted performance? Parameter count? We can ex post facto measure that." This is one of the most interesting things about Arena: You have a data generation engine which you can clean and turn into leaderboards later. If you wanted to create a leaderboard for poetry writing, you could get existing data from Arena, normalize it by identifying these style components. Whether or not it’s possible to really understand WHAT bias the voters have, that’s a different question. Private Evals One of the most delicate challenges LMSYS faces is maintaining trust while collaborating with AI labs. The concern is that labs could game the system by testing multiple variants privately and only releasing the best performer. This was brought up when 4o-mini released and it ranked as the second best model on the leaderboard: But this fear misunderstands how Arena works. Unlike static benchmarks where selection bias is a major issue, Arena's live nature means any initial bias gets washed out by ongoing evaluation. As Anastasios explains: "In the lon

    41 min
  3. 25 OCT

    How NotebookLM Was Made

    If you’ve listened to the podcast for a while, you might have heard our ElevenLabs-powered AI co-host Charlie a few times. Text-to-speech has made amazing progress in the last 18 months, with OpenAI’s Advanced Voice Mode (aka “Her”) as a sneak peek of the future of AI interactions (see our “Building AGI in Real Time” recap). Yet, we had yet to see a real killer app for AI voice (not counting music). Today’s guests, Raiza Martin and Usama Bin Shafqat, are the lead PM and AI engineer behind the NotebookLM feature flag that gave us the first viral AI voice experience, the “Deep Dive” podcast: The idea behind the “Audio Overviews” feature is simple: take a bunch of documents, websites, YouTube videos, etc, and generate a podcast out of them. This was one of the first demos that people built with voice models + RAG + GPT models, but it was always a glorified speech-to-text. Raiza and Usama took a very different approach: * Make it conversational: when you listen to a NotebookLM audio there are a ton of micro-interjections (Steven Johnson calls them disfluencies) like “Oh really?” or “Totally”, as well as pauses and “uh…”, like you would expect in a real conversation. These are not generated by the LLM in the transcript, but they are built into the the audio model. See ~28:00 in the pod for more details. * Listeners love tension: if two people are always in agreement on everything, it’s not super interesting. They tuned the model to generate flowing conversations that mirror the tone and rhythm of human speech. They did not confirm this, but many suspect the 2 year old SoundStorm paper is related to this model. * Generating new insights: because the hosts’ goal is not to summarize, but to entertain, it comes up with funny metaphors and comparisons that actually help expand on the content rather than just paraphrasing like most models do. We have had listeners make podcasts out of our podcasts, like this one. This is different than your average SOTA-chasing, MMLU-driven model buildooor. Putting product and AI engineering in the same room, having them build evals together, and understanding what the goal is lets you get these unique results. The 5 rules for AI PMs We always focus on AI Engineers, but this episode had a ton of AI PM nuggets as well, which we wanted to collect as NotebookLM is one of the most successful products in the AI space: 1. Less is more: the first version of the product had 0 customization options. All you could do is give it source documents, and then press a button to generate. Most users don’t know what “temperature” or “top-k” are, so you’re often taking the magic away by adding more options in the UI. Since recording they added a few, like a system prompt, but those were features that users were “hacking in”, as Simon Willison highlighted in his blog post. 2. Use Real-Time Feedback: they built a community of 65,000 users on Discord that is constantly reporting issues and giving feedback; sometimes they noticed server downtime even before the Google internal monitoring did. Getting real time pings > aggregating user data when doing initial iterations. 3. Embrace Non-Determinism: AI outputs variability is a feature, not a bug. Rather than limiting the outputs from the get-go, build toggles that you can turn on/off with feature flags as the feedback starts to roll in. 4. Curate with Taste: if you try your product and it sucks, you don’t need more data to confirm it. Just scrap that and iterate again. This is even easier for a product like this; if you start listening to one of the podcasts and turn it off after 10 seconds, it’s never a good sign. 5. Stay Hands-On: It’s hard to build taste if you don’t experiment. Trying out all your competitors products as well as unrelated tools really helps you understand what users are seeing in market, and how to improve on it. Chapters 00:00 Introductions01:39 From Project Tailwind to NotebookLM09:25

    1h 14m
  4. 19 OCT

    Building the AI Engineer Nation — with Josephine Teo, Minister of Digital Development and Information, Singapore

    Singapore's GovTech is hosting an AI CTF challenge with ~$15,000 in prizes, starting October 26th, open to both local and virtual hackers. It will be hosted on Dreadnode's Crucible platform; signup here! It is common to say if you want to work in AI, you should come to San Francisco. Not everyone can. Not everyone should. If you can only do meaningful AI work in one city, then AI has failed to generalize meaningfully. As non-Americans working in the US, we know what it’s like to see AI progress so rapidly here, and yet be at a loss for what our home countries can do. Through Latent Space we’ve tried to tell the story of AI outside of the Bay Area bubble; we talked to Notion in New York and Humanloop and Wondercraft in London and HuggingFace in Paris and ICLR in Vienna, and the Reka, RWKV, and Winds of AI Winter episodes were taped in Singapore (the World’s Fair also had Latin America representation and we intend to at least add China, Japan, and India next year). The Role of Government with AI As an intentionally technical resource, we’ve mostly steered clear of regulation and safety debates on the podcast; whether it is safety bills or technoalarmism, often at the cost of our engagement numbers or ability to book big name guests with a political agenda. When SOTA shifts 3x faster than it takes to pass a law, when nobody agrees on definitions of important things, when you can elicit never-before-seen behavior by slightly different prompting or sampling, it is hard enough to simply keep up to speed, so we are happy limiting our role to that. The story of AI progress has more often been achieved in the private sector, usually in spite of, rather than with thanks to, government intervention. But industrial policy is inextricably linked to the business of AI, which we do very much care about, has an explicitly accelerationist intent if not impact, and has a track record of success in correcting for legitimate market failures in private sector investment, particularly outside of the US. It is with this lens we approach today’s episode and special guest, our first with a sitting Cabinet member. Singapore’s National AI Strategy It is well understood that much of Singapore’s economic success is attributable to industrial policy, from direct efforts like the Jurong Town Corporation industrialization to indirect ones like going all in on English as national first language. Singapore’s National AI Strategy grew out of its 2014 Smart Nation initiative, first launched in 2019 and then refreshed in 2023 by Minister Josephine Teo, our guest today. While Singapore is not often thought of as an AI leader, the National University ranks in the top 10 in publications (above Oxford/Harvard!), and many overseas Singaporeans work at the leading AI companies and institutions in the US (and some of us even run leading AI Substacks?). OpenAI has often publicly named the Singapore government as their model example of government collaborator and is opening an office in Singapore in time for DevDay 2024. AI Engineer Nations Swyx first pitched the AI Engineer Nation concept at a private Sovereign AI summit featuring Dr. He Ruimin, Chief AI Officer of Singapore, which eventually led to an invitation to discuss the concept with Minister Teo, the country’s de-facto minister for tech (she calls it Digital Development, for good reasons she explains in the pod). This chat happened (with thanks to Jing Long, Joyce, and other folks from MDDI)! The central pitch for any country, not just Singapore, to emphasize and concentrate bets on AI Engineers, compared with other valuable efforts like training more researchers, releasing more government-approved data, or offering more AI funding, is a calculated one, based on the fact that: * GPU clusters and researchers have massive returns to scale and colocation, mostly concentrated in the US, that are irresponsibly expensive to replicate * Even if research stopped today and there was no progress f

    57 min
  5. 18 OCT

    Building the Silicon Brain - with Drew Houston of Dropbox

    CEOs of publicly traded companies are often in the news talking about their new AI initiatives, but few of them have built anything with it. Drew Houston from Dropbox is different; he has spent over 400 hours coding with LLMs in the last year and is now refocusing his 2,500+ employees around this new way of working, 17 years after founding the company. Timestamps 00:00 Introductions 00:43 Drew's AI journey 04:14 Revalidating expectations of AI 08:23 Simulation in self-driving vs. knowledge work 12:14 Drew's AI Engineering setup 15:24 RAG vs. long context in AI models 18:06 From "FileGPT" to Dropbox AI 23:20 Is storage solved?26:30 Products vs Features 30:48 Building trust for data access 33:42 Dropbox Dash and universal search 38:05 The evolution of Dropbox 42:39 Building a "silicon brain" for knowledge work 48:45 Open source AI and its impact 51:30 "Rent, Don't Buy" for AI 54:50 Staying relevant 58:57 Founder Mode 01:03:10 Advice for founders navigating AI 01:07:36 Building and managing teams in a growing company Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and there's no Swyx today, but I'm joined by Drew Houston of Dropbox. Welcome, Drew. Drew [00:00:14]: Thanks for having me. Alessio [00:00:15]: So we're not going to talk about the Dropbox story. We're not going to talk about the Chinatown bus and the flash drive and all that. I think you've talked enough about it. Where I want to start is you as an AI engineer. So as you know, most of our audience is engineering folks, kind of like technology leaders. You obviously run Dropbox, which is a huge company, but you also do a lot of coding. I think that's how you spend almost 400 hours, just like coding. So let's start there. What was the first interaction you had with an LLM API and when did the journey start for you? Drew [00:00:43]: Yeah. Well, I think probably all AI engineers or whatever you call an AI engineer, those people started out as engineers before that. So engineering is my first love. I mean, I grew up as a little kid. I was that kid. My first line of code was at five years old. I just really loved, I wanted to make computer games, like this whole path. That also led me into startups and eventually starting Dropbox. And then with AI specifically, I studied computer science, I got my, I did my undergrad, but I didn't do like grad level computer science. I didn't, I sort of got distracted by all the startup things, so I didn't do grad level work. But about several years ago, I made a couple of things. So one is I sort of, I knew I wanted to go from being an engineer to a founder. And then, but sort of the becoming a CEO part was sort of backed into the job. And so a couple of realizations. One is that, I mean, there's a lot of like repetitive and like manual work you have to do as an executive that is actually lends itself pretty well to automation, both for like my own convenience. And then out of interest in learning, I guess what we call like classical machine learning these days, I started really trying to wrap my head around understanding machine learning and informational retrieval more, more formally. So I'd say maybe 2016, 2017 started me writing these more successively, more elaborate scripts to like understand basic like classifiers and regression and, and again, like basic information retrieval and NLP back in those days. And there's sort of like two things that came out of that. One is techniques are super powerful. And even just like studying like old school machine learning was a pretty big inversion of the way I had learned engineering, right? You know, I started programming when everyone starts programming and you're, you're sort of the human, you're giving an algorithm to the, and spelling out to the computer how it should run it. And then machine learning, here's machine learning where it's like actually flip that, like give it sort of the answer you

    1h 12m
  6. 11 OCT

    Production AI Engineering starts with Evals — with Ankur Goyal of Braintrust

    We are in 🗽 NYC this Monday! Join the AI Eng NYC meetup, bring demos and vibes! It is a bit of a meme that the first thing developer tooling founders think to build in AI is all the non-AI operational stuff outside the AI. There are well over 60 funded LLM Ops startups all with hoping to solve the new observability, cost tracking, security, and reliability problems that come with putting LLMs in production, not to mention new LLM oriented products from incumbent, established ops/o11y players like Datadog and Weights & Biases. 2 years in to the current hype cycle, the early winners have tended to be people with practical/research AI backgrounds rather than MLOps heavyweights or SWE tourists: * LangSmith: We covered how Harrison Chase worked on AI at Robust Intelligence and Kensho, the alma maters of many great AI founders * HumanLoop: We covered how Raza Habib worked at Google AI during his PhD * BrainTrust: Today’s guest Ankur Goyal founded Impira pre-Transformers and was acquihired to run Figma AI before realizing how to solve the Ops problem. There have been many VC think pieces and market maps describing what people thought were the essential pieces of the AI Engineering stack, but what was true for 2022-2023 has aged poorly. The basic insight that Ankur had is the same thesis that Hamel Husain is pushing in his World’s Fair talk and podcast with Raza and swyx: Evals are the centerpiece of systematic AI Engineering. REALLY believing in this is harder than it looks with the benefit of hindsight. It’s not like people didn’t know evals were important. Basically every LLM Ops feature list has them. It’s an obvious next step AFTER managing your prompts and logging your LLM calls. In fact, up til we met Braintrust, we were working on an expanded version of the Impossible Triangle Theory of the LLM Ops War that we first articulated in the Humanloop writeup: The single biggest criticism of the Rise of the AI Engineer piece is that we neglected to split out the role of product evals (as opposed to model evals) in the now infamous “API line” chart: With hindsight, we were very focused on the differentiating 0 to 1 phase that AI Engineers can bring to an existing team of ML engineers. As swyx says on the Day 2 keynote of AI Engineer, 2024 added a whole new set of concerns as AI Engineering grew up: A closer examination of Hamel’s product-oriented virtuous cycle and this infra-oriented SDLC would have eventually revealed that Evals, even more than logging, was the first point where teams start to get really serious about shipping to production, and therefore a great place to make an entry into the marketplace, which is exactly what Braintrust did. Also notice what’s NOT on this chart: shifting to shadow open source models, and finetuning them… per Ankur, Fine-tuning is not a viable standalone product: “The thing I would say is not debatable is whether or not fine-tuning is a business outcome or not. So let's think about the other components of your triangle. Ops/observability, that is a business… Frameworks, evals, databases [are a business, but] Fine-tuning is a very compelling method that achieves an outcome. The outcome is not fine-tuning, it is can I automatically optimize my use case to perform better if I throw data at the problem? And fine-tuning is one of multiple ways to achieve that.” OpenAI vs Open AI Market Share We last speculated about the market shifts in the End of OpenAI Hegemony and the Winds of AI Winter, and Ankur’s perspective is super valuable given his customer list: Some surprises based on what he is seeing: * Prior to Claude 3, OpenAI had near 100% market share. This tracks with what Harrison told us last year. * Claude 3.5 Sonnet and also notably Haiku have made serious dents * Open source model adoption is . Contra to Eugene Cheah’s ideal marketing pitch, virtually none of Braintrust’s customers are really finetuning open source models for cost, control, or privacy.

    1h 57m
  7. 3 OCT

    Building AGI in Real Time (OpenAI Dev Day 2024)

    We all have fond memories of the first Dev Day in 2023: and the blip that followed soon after. As Ben Thompson has noted, this year’s DevDay took a quieter, more intimate tone. No Satya, no livestream, (slightly fewer people?). Instead of putting ChatGPT announcements in DevDay as in 2023, o1 was announced 2 weeks prior, and DevDay 2024 was reserved purely for developer-facing API announcements, primarily the Realtime API, Vision Finetuning, Prompt Caching, and Model Distillation. However the larger venue and more spread out schedule did allow a lot more hallway conversations with attendees as well as more community presentations including our recent guest Alistair Pullen of Cosine as well as deeper dives from OpenAI including our recent guest Michelle Pokrass of the API Team. Thanks to OpenAI’s warm collaboration (we particularly want to thank Lindsay McCallum Rémy!), we managed to record exclusive interviews with many of the main presenters of both the keynotes and breakout sessions. We present them in full in today’s episode, together with a full lightly edited Q&A with Sam Altman. Show notes and related resources Some of these used in the final audio episode below * Simon Willison Live Blog * swyx live tweets and videos * Greg Kamradt coverage of Structured Output session, Scaling LLM Apps session * Fireside Chat Q&A with Sam Altman Timestamps * [00:00:00] Intro by Suno.ai * [00:01:23] NotebookLM Recap of DevDay * [00:09:25] Ilan's Strawberry Demo with Realtime Voice Function Calling * [00:19:16] Olivier Godement, Head of Product, OpenAI * [00:36:57] Romain Huet, Head of DX, OpenAI * [00:47:08] Michelle Pokrass, API Tech Lead at OpenAI ft. Simon Willison * [01:04:45] Alistair Pullen, CEO, Cosine (Genie) * [01:18:31] Sam Altman + Kevin Weill Q&A * [02:03:07] Notebook LM Recap of Podcast Transcript [00:00:00] Suno AI: Under dev daylights, code ignites. Real time voice streams reach new heights. O1 and GPT, 4. 0 in flight. Fine tune the future, data in sight. Schema sync up, outputs precise. Distill the models, efficiency splice. [00:00:33] AI Charlie: Happy October. This is your AI co host, Charlie. One of our longest standing traditions is covering major AI and ML conferences in podcast format. Delving, yes delving, into the vibes of what it is like to be there stitched in with short samples of conversations with key players, just to help you feel like you were there. [00:00:54] AI Charlie: Covering this year's Dev Day was significantly more challenging because we were all requested not to record the opening keynotes. So, in place of the opening keynotes, we had the viral notebook LM Deep Dive crew, my new AI podcast nemesis, Give you a seven minute recap of everything that was announced. [00:01:15] AI Charlie: Of course, you can also check the show notes for details. I'll then come back with an explainer of all the interviews we have for you today. Watch out and take care. [00:01:23] NotebookLM Recap of DevDay [00:01:23] NotebookLM: All right, so we've got a pretty hefty stack of articles and blog posts here all about open ais. Dev day 2024. [00:01:32] NotebookLM 2: Yeah, lots to dig into there. [00:01:34] NotebookLM 2: Seems [00:01:34] NotebookLM: like you're really interested in what's new with AI. [00:01:36] NotebookLM 2: Definitely. And it seems like OpenAI had a lot to announce. New tools, changes to the company. It's a lot. [00:01:43] NotebookLM: It is. And especially since you're interested in how AI can be used in the real world, you know, practical applications, we'll focus on that. [00:01:51] NotebookLM: Perfect. Like, for example, this Real time API, they announced that, right? That seems like a big deal if we want AI to sound, well, less like a robot. [00:01:59] NotebookLM 2: It could be huge. The real time API could completely change how we, like, interact with AI. Like, imagine if your voice assistant could actually handle it if you interrupted it. [00:02:08] NotebookLM: Or, like, have an actu

    2h 9m
  8. 27 SEPT

    Language Agents: From Reasoning to Acting

    OpenAI DevDay is almost here! Per tradition, we are hosting a DevDay pregame event for everyone coming to town! Join us with demos and gossip! Also sign up for related events across San Francisco: the AI DevTools Night, the xAI open house, the Replicate art show, the DevDay Watch Party (for non-attendees), Hack Night with OpenAI at Cloudflare. For everyone else, join the Latent Space Discord for our online watch party and find fellow AI Engineers in your city. OpenAI’s recent o1 release (and Reflection 70b debacle) has reignited broad interest in agentic general reasoning and tree search methods. While we have covered some of the self-taught reasoning literature on the Latent Space Paper Club, it is notable that the Eric Zelikman ended up at xAI, whereas OpenAI’s hiring of Noam Brown and now Shunyu suggests more interest in tool-using chain of thought/tree of thought/generator-verifier architectures for Level 3 Agents. We were more than delighted to learn that Shunyu is a fellow Latent Space enjoyer, and invited him back (after his first appearance on our NeurIPS 2023 pod) for a look through his academic career with Harrison Chase (one year after his first LS show). ReAct: Synergizing Reasoning and Acting in Language Models paper link Following seminal Chain of Thought papers from Wei et al and Kojima et al, and reflecting on lessons from building the WebShop human ecommerce trajectory benchmark, Shunyu’s first big hit, the ReAct paper showed that using LLMs to “generate both reasoning traces and task-specific actions in an interleaved manner” achieved remarkably greater performance (less hallucination/error propagation, higher ALFWorld/WebShop benchmark success) than CoT alone. In even better news, ReAct scales fabulously with finetuning: As a member of the elite Princeton NLP group, Shunyu was also a coauthor of the Reflexion paper, which we discuss in this pod. Tree of Thoughts paper link here Shunyu’s next major improvement on the CoT literature was Tree of Thoughts: Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role… ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. The beauty of ToT is it doesnt require pretraining with exotic methods like backspace tokens or other MCTS architectures. You can listen to Shunyu explain ToT in his own words on our NeurIPS pod, but also the ineffable Yannic Kilcher: Other Work We don’t have the space to summarize the rest of Shunyu’s work, you can listen to our pod with him now, and recommend the CoALA paper and his initial hit webinar with Harrison, today’s guest cohost: as well as Shunyu’s PhD Defense Lecture: as well as Shunyu’s latest lecture covering a Brief History of LLM Agents: As usual, we are live on YouTube! Show Notes * Harrison Chase * LangChain, LangSmith, LangGraph * Shunyu Yao * Alec Radford * ReAct Paper * Hotpot QA * Tau Bench * WebShop * SWE-Agent * SWE-Bench * Trees of Thought * CoALA Paper * Related Episodes * Our Thomas Scialom (Meta) episode * Shunyu on our NeurIPS 2023 Best Papers episode * Harrison on our LangChain episode * Mentions * Sierra * Voyager * Jason Wei * Tavily * SERP API * Exa Timestamps * [00:00:00] Opening Song by Suno * [00:03:00] Introductions * [00:06:16] The ReAct paper * [00:12:09] Early applications of ReAct in LangChain * [00:17:15] Discussion of the Reflection paper * [00:22:35] Tree of Thoughts paper and search algorithms in language models * [00:27:21] SWE-Agent and SWE-Bench for coding benchmarks * [00:39:21] CoALA: Cognitive Architect

    1h 30m

About

The podcast by and for AI Engineers! In 2023, over 1 million visitors came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. Striving to give you both the definitive take on the Current Thing down to the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from OpenAI, tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space www.latent.space

You Might Also Like

To listen to explicit episodes, sign in.

Stay up to date with this show

Sign in or sign up to follow shows, save episodes and get the latest updates.

Select a country or region

Africa, Middle East, and India

Asia Pacific

Europe

Latin America and the Caribbean

The United States and Canada