Latent Space: The AI Engineer Podcast

Latent.Space

The podcast by and for AI Engineers! In 2025, over 10 million readers and listeners came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers pushing the cutting edge. We strive to give you everything from the definitive take on the Current Thing to your first introduction to the tech you'll be using in the next 3 months. We break news and exclusive interviews from OpenAI, Anthropic, Gemini, Meta (Soumith Chintala), Sierra (Bret Taylor), tiny corp (George Hotz), Databricks/MosaicML (Jonathan Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space

  1. 5D AGO

    Owning the AI Pareto Frontier — Jeff Dean

    From rewriting Google’s search stack in the early 2000s to reviving sparse trillion-parameter models and co-designing TPUs with frontier ML research, Jeff Dean has quietly shaped nearly every layer of the modern AI stack. As Chief AI Scientist at Google and a driving force behind Gemini, Jeff has lived through multiple scaling revolutions from CPUs and sharded indices to multimodal models that reason across text, video, and code. Jeff joins us to unpack what it really means to “own the Pareto frontier,” why distillation is the engine behind every Flash model breakthrough, how energy (in picojoules) not FLOPs is becoming the true bottleneck, what it was like leading the charge to unify all of Google’s AI teams, and why the next leap won’t come from bigger context windows alone, but from systems that give the illusion of attending to trillions of tokens. We discuss: * Jeff’s early neural net thesis in 1990: parallel training before it was cool, why he believed scaling would win decades early, and the “bigger model, more data, better results” mantra that held for 15 years * The evolution of Google Search: sharding, moving the entire index into memory in 2001, softening query semantics pre-LLMs, and why retrieval pipelines already resemble modern LLM systems * Pareto frontier strategy: why you need both frontier “Pro” models and low-latency “Flash” models, and how distillation lets smaller models surpass prior generations * Distillation deep dive: ensembles → compression → logits as soft supervision, and why you need the biggest model to make the smallest one good * Latency as a first-class objective: why 10–50x lower latency changes UX entirely, and how future reasoning workloads will demand 10,000 tokens/sec * Energy-based thinking: picojoules per bit, why moving data costs 1000x more than a multiply, batching through the lens of energy, and speculative decoding as amortization * TPU co-design: predicting ML workloads 2–6 years out, speculative hardware features, precision reduction, sparsity, and the constant feedback loop between model architecture and silicon * Sparse models and “outrageously large” networks: trillions of parameters with 1–5% activation, and why sparsity was always the right abstraction * Unified vs. 
specialized models: abandoning symbolic systems, why general multimodal models tend to dominate vertical silos, and when vertical fine-tuning still makes sense * Long context and the illusion of scale: beyond needle-in-a-haystack benchmarks toward systems that narrow trillions of tokens to 117 relevant documents * Personalized AI: attending to your emails, photos, and documents (with permission), and why retrieval + reasoning will unlock deeply personal assistants * Coding agents: 50 AI interns, crisp specifications as a new core skill, and how ultra-low latency will reshape human–agent collaboration * Why ideas still matter: transformers, sparsity, RL, hardware, systems — scaling wasn’t blind; the pieces had to multiply together Show Notes: * Gemma 3 Paper * Gemma 3 * Gemini 2.5 Report * Jeff Dean’s “Software Engineering Advice from Building Large-Scale Distributed Systems” Presentation (with Back of the Envelope Calculations) * Latency Numbers Every Programmer Should Know by Jeff Dean * The Jeff Dean Facts * Jeff Dean Google Bio * Jeff Dean on “Important AI Trends” @Stanford AI Club * Jeff Dean & Noam Shazeer — 25 years at Google (Dwarkesh) — Jeff Dean * LinkedIn: https://www.linkedin.com/in/jeff-dean-8b212555 * X: https://x.com/jeffdean Google * https://google.com * https://deepmind.google Full Video Episode Timestamps * 00:00:04 Introduction: Alessio & Swyx welcome Jeff Dean, chief AI scientist at Google, to the Latent Space podcast * 00:00:30 Owning the Pareto Frontier & balancing frontier vs low-latency models * 00:01:31 Frontier models vs Flash models + role of distillation * 00:03:52 History of distillation and its original motivation * 00:05:09 Distillation’s role in modern model scaling * 00:07:02 Model hierarchy (Flash, Pro, Ultra) and distillation sources * 00:07:46 Flash model economics & wide deployment * 00:08:10 Latency importance for complex tasks * 00:09:19 Saturation of some tasks and future frontier tasks * 00:11:26 On benchmarks, public vs internal * 00:12:53 Example long-context benchmarks & limitations * 00:15:01 Long-context goals: attending to trillions of tokens * 00:16:26 Realistic use cases beyond pure language * 00:18:04 Multimodal reasoning and non-text modalities * 00:19:05 Importance of vision & motion modalities * 00:20:11 Video understanding example (extracting structured info) * 00:20:47 Search ranking analogy for LLM retrieval * 00:23:08 LLM representations vs keyword search * 00:24:06 Early Google search evolution & in-memory index * 00:26:47 Design principles for scalable systems * 00:28:55 Real-time index updates & recrawl strategies * 00:30:06 Classic “Latency numbers every programmer should know” * 00:32:09 Cost of memory vs compute and energy emphasis * 00:34:33 TPUs & hardware trade-offs for serving models * 00:35:57 TPU design decisions & co-design with ML * 00:38:06 Adapting model architecture to hardware * 00:39:50 Alternatives: energy-based models, speculative decoding * 00:42:21 Open research directions: complex workflows, RL * 00:44:56 Non-verifiable RL domains & model evaluation * 00:46:13 Transition away from symbolic systems toward unified LLMs * 00:47:59 Unified models vs specialized ones * 00:50:38 Knowledge vs reasoning & retrieval + reasoning * 00:52:24 Vertical model specialization & modules * 00:55:21 Token count considerations for vertical domains * 00:56:09 Low resource languages & contextual learning * 00:59:22 Origins: Dean’s early neural network work * 01:10:07 AI for coding & human–model interaction styles * 01:15:52 Importance of crisp specification for coding agents
* 01:19:23 Prediction: personalized models & state retrieval * 01:22:36 Token-per-second targets (10k+) and reasoning throughput * 01:23:20 Episode conclusion and thanks Transcript Alessio Fanelli [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I’m joined by Swyx, editor of Latent Space. Shawn Wang [00:00:11]: Hello, hello. We’re here in the studio with Jeff Dean, chief AI scientist at Google. Welcome. Thanks for having me. It’s a bit surreal to have you in the studio. I’ve watched so many of your talks, and obviously your career has been super legendary. So, I mean, congrats. I think the first thing must be said, congrats on owning the Pareto Frontier. Jeff Dean [00:00:30]: Thank you, thank you. Pareto Frontiers are good. It’s good to be out there. Shawn Wang [00:00:34]: Yeah, I mean, I think it’s a combination of both. You have to own the Pareto Frontier. You have to have like frontier capability, but also efficiency, and then offer that range of models that people like to use. And, you know, some part of this was started because of your hardware work. Some part of that is your model work, and I’m sure there’s lots of secret sauce that you guys have worked on cumulatively. But, like, it’s really impressive to see it all come together like this. Jeff Dean [00:01:04]: Yeah, yeah. I mean, I think, as you say, it’s not just one thing. It’s like a whole bunch of things up and down the stack. And, you know, all of those really combine to help make us able to make highly capable large models, as well as, you know, software techniques to get those large model capabilities into much smaller, lighter weight models that are, you know, much more cost effective and lower latency, but still, you know, quite capable for their size. Yeah. Alessio Fanelli [00:01:31]: How much pressure do you have on, like, having the lower bound of the Pareto Frontier, too? I think, like, the new labs are always trying to push the top performance frontier because they need to raise more money and all of that. And you guys have billions of users. And I think initially when you worked on the TPU, you were thinking about, you know, if everybody that used Google used the voice model for, like, three minutes a day, you’d need to double your CPU count. Like, what’s that discussion today at Google? Like, how do you prioritize frontier versus, like, we have to do this? How do we actually need to deploy it if we build it? Jeff Dean [00:02:03]: Yeah, I mean, I think we always want to have models that are at the frontier or pushing the frontier because I think that’s where you see what capabilities now exist that didn’t exist at the sort of slightly less capable last year’s version or last six months ago version. At the same time, you know, we know those are going to be really useful for a bunch of use cases, but they’re going to be a bit slower and a bit more expensive than people might like for a bunch of other broader models. So I think what we want to do is always have kind of a highly capable sort of affordable model that enables a whole bunch of, you know, lower latency use cases. People can use them for agentic coding much more readily and then have the high-end, you know, frontier model that is really useful for, you know, deep reasoning, you know, solving really complicated math problems, those kinds of things. And it’s not that one or the other is useful. They’re both useful. So I think we’d like to do both.
And also, you know, through distillation, which is a key technique for making the smaller models more capable, you know, you have to have the frontier model in order to then distill it into your smaller model. So it’s not like an either or choice. You sort of need that in order to actually get a highly capable, more modest size model. Yeah. Alessio Fanelli [00:03:24]: I mean, you and Geoffrey came up with distillation in 2014. Jeff Dean [00:03:
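Since distillation is the through-line of this episode, here is a minimal sketch of the idea in its original Hinton, Vinyals & Dean form: the frontier teacher's logits act as soft supervision for the smaller Flash-style student. PyTorch is assumed and every name here is illustrative, not Gemini's actual training code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL against the teacher with ordinary hard-label CE."""
    # Soft targets: the teacher's full distribution at temperature T carries far
    # more signal per example than the one-hot label alone.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        soft_targets,
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Usage sketch: the teacher is frozen, only the student trains.
# with torch.no_grad():
#     teacher_logits = teacher(inputs)
# loss = distillation_loss(student(inputs), teacher_logits, labels)
```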
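And a back-of-envelope for the energy framing in the show notes above: with illustrative per-operation costs (the picojoule figures below are assumptions chosen to reflect the roughly 1000x gap Jeff cites between a multiply and a DRAM access; real numbers vary by chip and process node), the weight fetch dominates the arithmetic, and batching amortizes it.

```python
# Illustrative energy costs in picojoules; real values depend on chip and node.
PJ_PER_MAC = 1.0           # one on-chip multiply-accumulate (assumed)
PJ_PER_DRAM_BYTE = 1000.0  # moving one byte in from DRAM (assumed ~1000x worse)

def energy_per_token_joules(params: float, batch: int) -> float:
    """Generate one token for each of `batch` requests, streaming `params`
    one-byte weights from DRAM once and sharing that fetch across the batch."""
    macs = params * batch   # arithmetic grows with batch size
    dram = params           # the weight fetch is paid once per decode step
    total_pj = macs * PJ_PER_MAC + dram * PJ_PER_DRAM_BYTE
    return total_pj / batch / 1e12  # picojoules -> joules, per request

for batch in (1, 32):
    print(f"batch {batch}: ~{energy_per_token_joules(1e9, batch):.3f} J per token")
# batch 1: ~1.001 J, batch 32: ~0.032 J -- data movement, not math, is the
# bottleneck, which is also the logic behind batching and speculative decoding.
```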

    1h 24m
  2. 6D AGO

    🔬Beyond AlphaFold: How Boltz is Open-Sourcing the Future of Drug Discovery

    This podcast features Gabriele Corso and Jeremy Wohlwend, co-founders of Boltz and authors of the Boltz Manifesto, discussing the rapid evolution of structural biology models from AlphaFold to their own open-source suite, Boltz-1 and Boltz-2. The central thesis is that while single-chain protein structure prediction is largely “solved” through evolutionary hints, the next frontier lies in modeling complex interactions (protein-ligand, protein-protein) and generative protein design, which Boltz aims to democratize via open-source foundations and scalable infrastructure. Full Video Pod On YouTube! Timestamps * 00:00 Introduction to Benchmarking and the “Solved” Protein Problem * 06:48 Evolutionary Hints and Co-evolution in Structure Prediction * 10:00 The Importance of Protein Function and Disease States * 15:31 Transitioning from AlphaFold 2 to AlphaFold 3 Capabilities * 19:48 Generative Modeling vs. Regression in Structural Biology * 25:00 The “Bitter Lesson” and Specialized AI Architectures * 29:14 Development Anecdotes: Training Boltz-1 on a Budget * 32:00 Validation Strategies and the Protein Data Bank (PDB) * 37:26 The Mission of Boltz: Democratizing Access and Open Source * 41:43 Building a Self-Sustaining Research Community * 44:40 Boltz-2 Advancements: Affinity Prediction and Design * 51:03 BoltzGen: Merging Structure and Sequence Prediction * 55:18 Large-Scale Wet Lab Validation Results * 01:02:44 Boltz Lab Product Launch: Agents and Infrastructure * 01:13:06 Future Directions: Developpability and the “Virtual Cell” * 01:17:35 Interacting with Skeptical Medicinal Chemists Key Summary Evolution of Structure Prediction & Evolutionary Hints * Co-evolutionary Landscapes: The speakers explain that breakthrough progress in single-chain protein prediction relied on decoding evolutionary correlations where mutations in one position necessitate mutations in another to conserve 3D structure. * Structure vs. Folding: They differentiate between structure prediction (getting the final answer) and folding (the kinetic process of reaching that state), noting that the field is still quite poor at modeling the latter. * Physics vs. Statistics: RJ posits that while models use evolutionary statistics to find the right “valley” in the energy landscape, they likely possess a “light understanding” of physics to refine the local minimum. The Shift to Generative Architectures * Generative Modeling: A key leap in AlphaFold 3 and Boltz-1 was moving from regression (predicting one static coordinate) to a generative diffusion approach that samples from a posterior distribution. * Handling Uncertainty: This shift allows models to represent multiple conformational states and avoid the “averaging” effect seen in regression models when the ground truth is ambiguous. * Specialized Architectures: Despite the “bitter lesson” of general-purpose transformers, the speakers argue that equivariant architectures remain vastly superior for biological data due to the inherent 3D geometric constraints of molecules. Boltz-2 and Generative Protein Design * Unified Encoding: Boltz-2 (and BoltzGen) treats structure and sequence prediction as a single task by encoding amino acid identities into the atomic composition of the predicted structure. * Design Specifics: Instead of a sequence, users feed the model blank tokens and a high-level “spec” (e.g., an antibody framework), and the model decodes both the 3D structure and the corresponding amino acids. 
* Affinity Prediction: While model confidence is a common metric, Boltz-2 focuses on affinity prediction—quantifying exactly how tightly a designed binder will stick to its target. Real-World Validation and Productization * Generalized Validation: To prove the model isn’t just “regurgitating” known data, Boltz tested its designs on 9 targets with zero known interactions in the PDB, achieving nanomolar binders for two-thirds of them. * Boltz Lab Infrastructure: The newly launched Boltz Lab platform provides “agents” for protein and small molecule design, optimized to run 10x faster than open-source versions through proprietary GPU kernels. * Human-in-the-Loop: The platform is designed to convert skeptical medicinal chemists by allowing them to run parallel screens and use their intuition to filter model outputs. Transcript RJ [00:05:35]: But the goal remains to, like, you know, really challenge the models, like, how well do these models generalize? And, you know, we’ve seen in some of the latest CASP competitions, like, while we’ve become really, really good at proteins, especially monomeric proteins, you know, other modalities still remain pretty difficult. So it’s really essential, you know, in the field that there are, like, these efforts to gather, you know, benchmarks that are challenging. So it keeps us in line, you know, about what the models can do or not. Gabriel [00:06:26]: Yeah, it’s interesting you say that, like, in some sense, CASP, you know, at CASP 14, a problem was solved and, like, pretty comprehensively, right? But at the same time, it was really only the beginning. So you can say, like, what was the specific problem you would argue was solved? And then, like, you know, what is remaining, which is probably quite open. RJ [00:06:48]: I think we’ll steer away from the term solved, because we have many friends in the community who get pretty upset at that word. And I think, you know, fairly so. But the problem that was, you know, that a lot of progress was made on was the ability to predict the structure of single chain proteins. So proteins can, like, be composed of many chains. And single chain proteins are, you know, just a single sequence of amino acids. And one of the reasons that we’ve been able to make such progress is also because we take a lot of hints from evolution. So the way the models work is that, you know, they sort of decode a lot of hints. That comes from evolutionary landscapes. So if you have, like, you know, some protein in an animal, and you go find the similar protein across, like, you know, different organisms, you might find different mutations in them. And as it turns out, if you take a lot of the sequences together, and you analyze them, you see that some positions in the sequence tend to evolve at the same time as other positions in the sequence, sort of this, like, correlation between different positions. And it turns out that that is typically a hint that these two positions are close in three dimension. So part of the, you know, part of the breakthrough has been, like, our ability to also decode that very, very effectively. But what it implies also is that in absence of that co-evolutionary landscape, the models don’t quite perform as well. And so, you know, I think when that information is available, maybe one could say, you know, the problem is, like, somewhat solved. From the perspective of structure prediction, when it isn’t, it’s much more challenging. 
And I think it’s also worth also differentiating the, sometimes we confound a little bit, structure prediction and folding. Folding is the more complex process of actually understanding, like, how it goes from, like, this disordered state into, like, a structured, like, state. And that I don’t think we’ve made that much progress on. But the idea of, like, yeah, going straight to the answer, we’ve become pretty good at. Brandon [00:08:49]: So there’s this protein that is, like, just a long chain and it folds up. Yeah. And so we’re good at getting from that long chain in whatever form it was originally to the thing. But we don’t know how it necessarily gets to that state. And there might be intermediate states that it’s in sometimes that we’re not aware of. RJ [00:09:10]: That’s right. And that relates also to, like, you know, our general ability to model, like, the different, you know, proteins are not static. They move, they take different shapes based on their energy states. And I think we are, also not that good at understanding the different states that the protein can be in and at what frequency, what probability. So I think the two problems are quite related in some ways. Still a lot to solve. But I think it was very surprising at the time, you know, that even with these evolutionary hints that we were able to, you know, to make such dramatic progress. Brandon [00:09:45]: So I want to ask, why does the intermediate states matter? But first, I kind of want to understand, why do we care? What proteins are shaped like? Gabriel [00:09:54]: Yeah, I mean, the proteins are kind of the machines of our body. You know, the way that all the processes that we have in our cells, you know, work is typically through proteins, sometimes other molecules, sort of intermediate interactions. And through that interactions, we have all sorts of cell functions. And so when we try to understand, you know, a lot of biology, how our body works, how disease work. So we often try to boil it down to, okay, what is going right in case of, you know, our normal biological function and what is going wrong in case of the disease state. And we boil it down to kind of, you know, proteins and kind of other molecules and their interaction. And so when we try predicting the structure of proteins, it’s critical to, you know, have an understanding of kind of those interactions. It’s a bit like seeing the difference between... Having kind of a list of parts that you would put it in a car and seeing kind of the car in its final form, you know, seeing the car really helps you understand what it does. On the other hand, kind of going to your question of, you know, why do we care about, you know, how the protein falls or, you know, how the car is made to some extent is that, you know, sometimes when something goes wrong, you know, there are, you know, cases of, you know, proteins misfolding. In some diseases and so on, if we don’t understand this fo
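To make the co-evolution signal described above concrete: columns of a multiple sequence alignment that mutate together score high on co-variation statistics such as mutual information, and those pairs are the hint that two residues sit close together in 3D. The toy alignment and scoring below are illustrative only; real pipelines use corrected couplings (APC, Potts models) or, as in AlphaFold-class models, learn the signal implicitly.

```python
import math
from collections import Counter

def mutual_information(msa, i, j):
    """Co-variation between columns i and j of a multiple sequence alignment."""
    n = len(msa)
    pi = Counter(seq[i] for seq in msa)
    pj = Counter(seq[j] for seq in msa)
    pij = Counter((seq[i], seq[j]) for seq in msa)
    return sum(
        (c / n) * math.log((c / n) / ((pi[a] / n) * (pj[b] / n)))
        for (a, b), c in pij.items()
    )

# Toy alignment: columns 1 and 3 always mutate together; columns 0 and 2 are conserved.
msa = ["MKAL", "MRAV", "MKAL", "MRAV", "MKAL"]
scores = [(i, j, mutual_information(msa, i, j)) for i in range(4) for j in range(i + 1, 4)]
print(max(scores, key=lambda t: t[2]))  # (1, 3, ...) -> a likely 3D contact
```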

    1h 21m
  3. FEB 6

    The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI

    From Palantir and Two Sigma to building Goodfire into the poster-child for actionable mechanistic interpretability, Mark Bissell (Member of Technical Staff) and Myra Deng (Head of Product) are trying to turn “peeking inside the model” into a repeatable production workflow by shipping APIs, landing real enterprise deployments, and now scaling the bet with a recent $150M Series B funding round at a $1.25B valuation. In this episode, we go far beyond the usual “SAEs are cool” take. We talk about Goodfire’s core bet: that the AI lifecycle is still fundamentally broken because the only reliable control we have is data and we post-train, RLHF, and fine-tune by “slurping supervision through a straw,” hoping the model picks up the right behaviors while quietly absorbing the wrong ones. Goodfire’s answer is to build a bi-directional interface between humans and models: read what’s happening inside, edit it surgically, and eventually use interpretability during training so customization isn’t just brute-force guesswork. Mark and Myra walk through what that looks like when you stop treating interpretability like a lab demo and start treating it like infrastructure: lightweight probes that add near-zero latency, token-level safety filters that can run at inference time, and interpretability workflows that survive messy constraints (multilingual inputs, synthetic→real transfer, regulated domains, no access to sensitive data). We also get a live window into what “frontier-scale interp” means operationally (i.e. steering a trillion-parameter model in real time by targeting internal features) plus why the same tooling generalizes cleanly from language models to genomics, medical imaging, and “pixel-space” world models. We discuss: * Myra + Mark’s path: Palantir (health systems, forward-deployed engineering) → Goodfire early team; Two Sigma → Head of Product, translating frontier interpretability research into a platform and real-world deployments * What “interpretability” actually means in practice: not just post-hoc poking, but a broader “science of deep learning” approach across the full AI lifecycle (data curation → post-training → internal representations → model design) * Why post-training is the first big wedge: “surgical edits” for unintended behaviors likereward hacking, sycophancy, noise learned during customization plus the dream of targeted unlearning and bias removal without wrecking capabilities * SAEs vs probes in the real world: why SAE feature spaces sometimes underperform classifiers trained on raw activations for downstream detection tasks (hallucination, harmful intent, PII), and what that implies about “clean concept spaces” * Rakuten in production: deploying interpretability-based token-level PII detection at inference time to prevent routing private data to downstream providers plus the gnarly constraints: no training on real customer PII, synthetic→real transfer, English + Japanese, and tokenization quirks * Why interp can be operationally cheaper than LLM-judge guardrails: probes are lightweight, low-latency, and don’t require hosting a second large model in the loop * Real-time steering at frontier scale: a demo of steering Kimi K2 (~1T params) live and finding features via SAE pipelines, auto-labeling via LLMs, and toggling a “Gen-Z slang” feature across multiple layers without breaking tool use * Hallucinations as an internal signal: the case that models have latent uncertainty / “user-pleasing” circuitry you can detect and potentially mitigate more directly than black-box 
methods * Steering vs prompting: the emerging view that activation steering and in-context learning are more closely connected than people think, including work mapping between the two (even for jailbreak-style behaviors) * Interpretability for science: using the same tooling across domains (genomics, medical imaging, materials) to debug spurious correlations and extract new knowledge up to and including early biomarker discovery work with major partners * World models + “pixel-space” interpretability: why vision/video models make concepts easier to see, how that accelerates the feedback loop, and why robotics/world-model partners are especially interesting design partners * The north star: moving from “data in, weights out” to intentional model design where experts can impart goals and constraints directly, not just via reward signals and brute-force post-training — Goodfire AI * Website: https://goodfire.ai * LinkedIn: https://www.linkedin.com/company/goodfire-ai/ * X: https://x.com/GoodfireAI Myra Deng * Website: https://myradeng.com/ * LinkedIn: https://www.linkedin.com/in/myra-deng/ * X: https://x.com/myra_deng Mark Bissell * LinkedIn: https://www.linkedin.com/in/mark-bissell/ * X: https://x.com/MarkMBissell Full Video Episode Timestamps 00:00:00 Introduction 00:00:05 Introduction to the Latent Space Podcast and Guests from Goodfire 00:00:29 What is Goodfire? Mission and Focus on Interpretability 00:01:01 Goodfire’s Practical Approach to Interpretability 00:01:37 Goodfire’s Series B Fundraise Announcement 00:02:04 Backgrounds of Mark and Myra from Goodfire 00:02:51 Team Structure and Roles at Goodfire 00:05:13 What is Interpretability? Definitions and Techniques 00:05:30 Understanding Errors 00:07:29 Post-training vs. Pre-training Interpretability Applications 00:08:51 Using Interpretability to Remove Unwanted Behaviors 00:10:09 Grokking, Double Descent, and Generalization in Models 00:10:15 404 Not Found Explained 00:12:06 Subliminal Learning and Hidden Biases in Models 00:14:07 How Goodfire Chooses Research Directions and Projects 00:15:00 Troubleshooting Errors 00:16:04 Limitations of SAEs and Probes in Interpretability 00:18:14 Rakuten Case Study: Production Deployment of Interpretability 00:20:45 Conclusion 00:21:12 Efficiency Benefits of Interpretability Techniques 00:21:26 Live Demo: Real-Time Steering in a Trillion Parameter Model 00:25:15 How Steering Features are Identified and Labeled 00:26:51 Detecting and Mitigating Hallucinations Using Interpretability 00:31:20 Equivalence of Activation Steering and Prompting 00:34:06 Comparing Steering with Fine-Tuning and LoRA Techniques 00:36:04 Model Design and the Future of Intentional AI Development 00:38:09 Getting Started in Mechinterp: Resources, Programs, and Open Problems 00:40:51 Industry Applications and the Rise of Mechinterp in Practice 00:41:39 Interpretability for Code Models and Real-World Usage 00:43:07 Making Steering Useful for More Than Stylistic Edits 00:46:17 Applying Interpretability to Healthcare and Scientific Discovery 00:49:15 Why Interpretability is Crucial in High-Stakes Domains like Healthcare 00:52:03 Call for Design Partners Across Domains 00:54:18 Interest in World Models and Visual Interpretability 00:57:22 Sci-Fi Inspiration: Ted Chiang and Interpretability 01:00:14 Interpretability, Safety, and Alignment Perspectives 01:04:27 Weak-to-Strong Generalization and Future Alignment Challenges 01:05:38 Final Thoughts and Hiring/Collaboration Opportunities at Goodfire Transcript Shawn Wang [00:00:05]: So 
welcome to the Latent Space pod. We’re back in the studio with our special MechInterp co-host, Vibhu. Welcome. Mochi, Mochi’s special co-host. And Mochi, the mechanistic interpretability doggo. We have with us Mark and Myra from Goodfire. Welcome. Thanks for having us on. Maybe we can sort of introduce Goodfire and then introduce you guys. How do you introduce Goodfire today? Myra Deng [00:00:29]: Yeah, it’s a great question. So Goodfire, we like to say, is an AI research lab that focuses on using interpretability to understand, learn from, and design AI models. And we really believe that interpretability will unlock the new generation, next frontier of safe and powerful AI models. That’s our description right now, and I’m excited to dive more into the work we’re doing to make that happen. Shawn Wang [00:00:55]: Yeah. And there’s always like the official description. Is there an understatement? Is there an unofficial one that sort of resonates more with a different audience? Mark Bissell [00:01:01]: Well, being an AI research lab that’s focused on interpretability, there’s obviously a lot of people have a lot that they think about when they think of interpretability. And I think we have a pretty broad definition of what that means and the types of places that can be applied. And in particular, applying it in production scenarios, in high stakes industries, and really taking it sort of from the research world into the real world. Which, you know. It’s a new field, so that hasn’t been done all that much. And we’re excited about actually seeing that sort of put into practice. Shawn Wang [00:01:37]: Yeah, I would say it wasn’t too long ago that Anthropic was like still putting out like toy models of superposition and that kind of stuff. And I wouldn’t have pegged it to be this far along. When you and I talked at NeurIPS, you were talking a little bit about your production use cases and your customers. And then not to bury the lead, today we’re also announcing the fundraise, your Series B. $150 million. $150 million at a 1.25B valuation. Congrats, Unicorn. Mark Bissell [00:02:02]: Thank you. Yeah, no, things move fast. Shawn Wang [00:02:04]: We were talking to you in December and already some big updates since then. Let’s dive, I guess, into a bit of your backgrounds as well. Mark, you were at Palantir working on health stuff, which is really interesting because Goodfire has some interesting like health use cases. I don’t know how related they are in practice. Mark Bissell [00:02:22]: Yeah, not super related, but I don’t know. It was helpful context to know what it’s like. Just to work. Just to work with health systems and generally in that domain. Yeah. Shawn Wang [00:02:32]: And Myra, you were at Two Sigma, which actually I w
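As a concrete picture of the "probes on raw activations" idea Mark and Myra contrast with SAE features, here is a minimal sketch: a logistic-regression probe over cached hidden states that can then score tokens at inference time for roughly the cost of a dot product. The activations and labels below are synthetic placeholders, not Goodfire's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: cached residual-stream activations for N tokens at one layer,
# plus binary labels (e.g. 1 = token is part of PII). In practice the activations
# come from the model and the labels from a synthetic or human labeling pass.
rng = np.random.default_rng(0)
acts = rng.normal(size=(5000, 1024))                                 # [tokens, hidden]
labels = (acts[:, 0] + 0.1 * rng.normal(size=5000) > 0).astype(int)  # synthetic signal

probe = LogisticRegression(max_iter=1000).fit(acts[:4000], labels[:4000])
print("held-out accuracy:", probe.score(acts[4000:], labels[4000:]))

# At inference time, scoring a token is one dot product with probe.coef_, which is
# why probes add near-zero latency compared with running a second LLM as a judge.
```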

    1h 8m
  4. JAN 28

    🔬 Automating Science: World Models, Scientific Taste, Agent Loops — Andrew White

    Editor’s note: Welcome to our new AI for Science pod, with your new hosts RJ and Brandon! See the writeup on Latent.Space (https://Latent.Space) for more details on why we’re launching 2 new pods this year. RJ Honicky is a co-founder and CTO at MiraOmics (https://miraomics.bio/), building AI models and services for single cell, spatial transcriptomics and pathology slide analysis. Brandon Anderson builds AI systems for RNA drug discovery at Atomic AI (https://atomic.ai). Anything said on this podcast is his personal take — not Atomic’s. From building molecular dynamics simulations at the University of Washington to red-teaming GPT-4 for chemistry applications and co-founding Future House (a Focused Research Organization) and Edison Scientific (a venture-backed startup automating science at scale), Andrew White has spent the last five years living through the full arc of AI’s transformation of scientific discovery, from ChemCrow (the first Chemistry LLM agent) triggering White House briefings and three-letter agency meetings, to shipping Kosmos, an end-to-end autonomous research system that generates hypotheses, runs experiments, analyzes data, and updates its world model to accelerate the scientific method itself. * The ChemCrow story: GPT-4 + ReAct + cloud lab automation, released March 2023, set off a storm of anxiety about AI-accelerated bioweapons/chemical weapons, led to a White House briefing (Jake Sullivan presented the paper to the president in a 30-minute block), and meetings with three-letter agencies asking “how does this change breakout time for nuclear weapons research?” * Why scientific taste is the frontier: RLHF on hypotheses didn’t work (humans pay attention to tone, actionability, and specific facts, not “if this hypothesis is true/false, how does it change the world?”), so they shifted to end-to-end feedback loops where humans click/download discoveries and that signal rolls up to hypothesis quality * Kosmos: the full scientific agent with a world model (distilled memory system, like a Git repo for scientific knowledge) that iterates on hypotheses via literature search, data analysis, and experiment design—built by Ludo after weeks of failed attempts, the breakthrough was putting data analysis in the loop (literature alone didn’t work) * Why molecular dynamics and DFT are overrated: “MD and DFT have consumed an enormous number of PhDs at the altar of beautiful simulation, but they don’t model the world correctly—you simulate water at 330 Kelvin to get room temperature, you overfit to validation data with GGA/B3LYP functionals, and real catalysts (grain boundaries, dopants) are too complicated for DFT”
* The AlphaFold vs. DE Shaw Research counterfactual: DE Shaw built custom silicon, taped out chips with MD algorithms burned in, ran MD at massive scale in a special room in Times Square, and David Shaw flew in by helicopter to present—Andrew thought protein folding would require special machines to fold one protein per day, then AlphaFold solved it in Google Colab on a desktop GPU * The ether0 reward hacking saga: trained a model to generate molecules with specific atom counts (verifiable reward), but it kept exploiting loopholes, then a Nature paper came out that year proving six-nitrogen compounds are possible under extreme conditions, then it started adding nitrogen gas (purchasable, doesn’t participate in reactions), then acid-base chemistry to move one atom, and Andrew ended up “building a ridiculous catalog of purchasable compounds in a Bloom filter” to close the loop Andrew White * FutureHouse: http://futurehouse.org/ * Edison Scientific: http://edisonscientific.com/ * X: https://x.com/andrewwhite01 * Kosmos paper: https://futurediscovery.org/cosmos Full Video Episode Timestamps * 00:00:00 Introduction: Andrew White on Automating Science with Future House and Edison Scientific * 00:02:22 The Academic to Startup Journey: Red Teaming GPT-4 and the ChemCrow Paper * 00:11:35 Future House Origins: The FRO Model and Mission to Automate Science * 00:12:32 Resigning Tenure: Why Leave Academia for AI Science * 00:15:54 What Does ‘Automating Science’ Actually Mean? * 00:17:30 The Lab-in-the-Loop Bottleneck: Why Intelligence Isn’t Enough * 00:18:39 Scientific Taste and Human Preferences: The 52% Agreement Problem * 00:20:05 Paper QA, Robin, and the Road to Kosmos * 00:21:57 World Models as Scientific Memory: The GitHub Analogy * 00:40:20 The Bitter Lesson for Biology: Why Molecular Dynamics and DFT Are Overrated * 00:43:22 AlphaFold’s Shock: When First Principles Lost to Machine Learning * 00:46:25 Enumeration and Filtration: How AI Scientists Generate Hypotheses * 00:48:15 CBRN Safety and Dual-Use AI: Lessons from Red Teaming * 01:00:40 The Future of Chemistry is Language: Multimodal Debate * 01:08:15 ether0: The Hilarious Reward Hacking Adventures * 01:10:12 Will Scientists Be Displaced? Jevons Paradox and Infinite Discovery * 01:13:46 Kosmos in Practice: Open Access and Enterprise Partnerships Get full access to Latent.Space at www.latent.space/subscribe
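For the "catalog of purchasable compounds in a Bloom filter" fix mentioned in the show notes, here is a generic Bloom filter sketch (the sizing formulas are standard; the hashing scheme and SMILES examples are illustrative, not Future House's implementation): lookups are constant-time and memory-light, with no false negatives and a tunable false-positive rate.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, n_items: int, fp_rate: float = 0.01):
        # Standard sizing: m bits and k hash functions for a target error rate.
        self.m = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Toy usage with SMILES strings standing in for a purchasable-compound catalog.
catalog = BloomFilter(n_items=1_000_000)
catalog.add("N#N")             # nitrogen gas
print("N#N" in catalog)        # True
print("C1CCCCC1" in catalog)   # False with high probability (no false negatives)
```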

    1h 14m
  5. JAN 23

    Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay

    From shipping Gemini Deep Think and IMO Gold to launching the Reasoning and AGI team in Singapore, Yi Tay has spent the last 18 months living through the full arc of Google DeepMind’s pivot from architecture research to RL-driven reasoning—watching his team go from a dozen researchers to 300+, training models that solve International Math Olympiad problems in a live competition, and building the infrastructure to scale deep thinking across every domain, and driving Gemini to the top of the leaderboards across every category. Yi Returns to dig into the inside story of the IMO effort and more! We discuss: * Yi’s path: Brain → Reka → Google DeepMind → Reasoning and AGI team Singapore, leading model training for Gemini Deep Think and IMO Gold * The IMO Gold story: four co-captains (Yi in Singapore, Jonathan in London, Jordan in Mountain View, and Tong leading the overall effort), training the checkpoint in ~1 week, live competition in Australia with professors punching in problems as they came out, and the tension of not knowing if they’d hit Gold until the human scores came in (because the Gold threshold is a percentile, not a fixed number) * Why they threw away AlphaProof: “If one model can’t do it, can we get to AGI?” The decision to abandon symbolic systems and bet on end-to-end Gemini with RL was bold and non-consensus * On-policy vs. off-policy RL: off-policy is imitation learning (copying someone else’s trajectory), on-policy is the model generating its own outputs, getting rewarded, and training on its own experience—”humans learn by making mistakes, not by copying” * Why self-consistency and parallel thinking are fundamental: sampling multiple times, majority voting, LM judges, and internal verification are all forms of self-consistency that unlock reasoning beyond single-shot inference * The data efficiency frontier: humans learn from 8 orders of magnitude less data than models, so where’s the bug? Is it the architecture, the learning algorithm, backprop, off-policyness, or something else? * Three schools of thought on world models: (1) Genie/spatial intelligence (video-based world models), (2) Yann LeCun’s JEPA + FAIR’s code world models (modeling internal execution state), (3) the amorphous “resolution of possible worlds” paradigm (curve-fitting to find the world model that best explains the data) * Why AI coding crossed the threshold: Yi now runs a job, gets a bug, pastes it into Gemini, and relaunches without even reading the fix—”the model is better than me at this” * The Pokémon benchmark: can models complete Pokédex by searching the web, synthesizing guides, and applying knowledge in a visual game state? 
“Efficient search of novel idea space is interesting, but we’re not even at the point where models can consistently apply knowledge they look up” * DSI and generative retrieval: re-imagining search as predicting document identifiers with semantic tokens, now deployed at YouTube (semantic IDs for RecSys) and Spotify * Why RecSys and IR feel like a different universe: “modeling dynamics are strange, like gravity is different—you hit the shuttlecock and hear glass shatter, cause and effect are too far apart” * The closed lab advantage is increasing: the gap between frontier labs and open source is growing because ideas compound over time, and researchers keep finding new tricks that play well with everything built before * Why ideas still matter: “the last five years weren’t just blind scaling—transformers, pre-training, RL, self-consistency, all had to play well together to get us here” * Gemini Singapore: hiring for RL and reasoning researchers, looking for track record in RL or exceptional achievement in coding competitions, and building a small, talent-dense team close to the frontier — Yi Tay * Google DeepMind: https://deepmind.google * X: https://x.com/YiTayML Full Video Episode Timestamps * 00:00:00 Introduction: Returning to Google DeepMind and the Singapore AGI Team * 00:04:52 The Philosophy of On-Policy RL: Learning from Your Own Mistakes * 00:12:00 IMO Gold Medal: The Journey from AlphaProof to End-to-End Gemini * 00:21:33 Training IMO Cat: Four Captains Across Three Time Zones * 00:26:19 Pokemon and Long-Horizon Reasoning: Beyond Academic Benchmarks * 00:36:29 AI Coding Assistants: From Lazy to Actually Useful * 00:32:59 Reasoning, Chain of Thought, and Latent Thinking * 00:44:46 Is Attention All You Need? Architecture, Learning, and the Local Minima * 00:55:04 Data Efficiency and World Models: The Next Frontier * 01:08:12 DSI and Generative Retrieval: Reimagining Search with Semantic IDs * 01:17:59 Building GDM Singapore: Geography, Talent, and the Symposium * 01:24:18 Hiring Philosophy: High Stats, Research Taste, and Student Budgets * 01:28:49 Health, HRV, and Research Performance: The 23kg Journey Get full access to Latent.Space at www.latent.space/subscribe
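A minimal sketch of the self-consistency idea Yi describes (sample several reasoning paths, then take a majority vote on the final answer); `sample_answer` is a stand-in for one temperature>0 model call and the simulated answers are illustrative.

```python
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Stand-in for one temperature > 0 model call that returns a final answer."""
    return random.choice(["42", "42", "42", "41"])  # simulated noisy reasoner

def self_consistency(question: str, k: int = 16) -> str:
    """Sample k reasoning paths independently and majority-vote the answers."""
    votes = Counter(sample_answer(question) for _ in range(k))
    return votes.most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # almost always "42"
```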

    1h 32m
  6. JAN 17

    Brex’s AI Hail Mary — With CTO James Reggio

    From building internal AI labs to becoming CTO of Brex, James Reggio has helped lead one of the most disciplined AI transformations inside a real financial institution where compliance, auditability, and customer trust actually matter. We sat down with Reggio to unpack Brex’s three-pillar AI strategy (corporate, operational, and product AI) [https://www.brex.com/journal/brex-ai-native-operations], how SOP-driven agents beat overengineered RL in ops, why Brex lets employees “build their own AI stack” instead of picking winners [https://www.conductorone.com/customers/brex/], and how a small, founder-heavy AI team is shipping production agents to 40,000+ companies. Reggio also goes deep on Brex’s multi-agent “network” architecture, evals for multi-turn systems, agentic coding’s second-order effects on codebase understanding, and why the future of finance software looks less like dashboards and more like executive assistants coordinating specialist agents behind the scenes. We discuss: * Brex’s three-pillar AI strategy: corporate AI for 10x employee workflows, operational AI for cost and compliance leverage, and product AI that lets customers justify Brex as part of their AI strategy to the board * Why SOP-driven agents beat overengineered RL in finance ops, and how breaking work into auditable, repeatable steps unlocked faster automation in KYC, underwriting, fraud, and disputes * Building an internal AI platform early: LLM gateways, prompt/version management, evals, cost observability, and why platform work quietly became the force multiplier behind everything else * Multi-agent “networks” vs single-agent tools: why Brex’s EA-style assistant coordinates specialist agents (policy, travel, reimbursements) through multi-turn conversations instead of one-shot tool calls * The audit agent pattern: separating detection, judgment, and follow-up into different agents to reduce false negatives without overwhelming finance teams * Centralized AI teams without resentment: how Brex avoided “AI envy” by tying work to business impact and letting anyone transfer in if they cared deeply enough * Letting employees build their own AI stack: ChatGPT vs Claude vs Gemini, Cursor vs Windsurf, and why Brex refuses to pick winners in fast-moving tool races * Measuring adoption without vanity metrics: why “% of code written by AI” is the wrong KPI and what second-order effects (slop, drift, code ownership) actually matter * Evals in the real world: regression tests from ops QA, LLM-as-judge for multi-turn agents, and why integration-style evals break faster than you expect * Teaching AI fluency at scale: the user → advocate → builder → native framework, ops-led training, spot bonuses, and avoiding fear-based adoption * Re-interviewing the entire engineering org: using agentic coding interviews internally to force hands-on skill upgrades without formal performance scoring * Headcount in the age of agents: why Brex grew the business without growing engineering, and why AI amplifies bad architecture as fast as good decisions * The future of finance software: why dashboards fade, assistants take over, and agent-to-agent collaboration becomes the real UI — James Reggio * X: https://x.com/jamesreggio * LinkedIn: https://www.linkedin.com/in/jamesreggio/ Where to find Latent Space * X: https://x.com/latentspacepod
Full Video Episode Timestamps * 00:00:00 Introduction * 00:01:24 From Mobile Engineer to CTO: The Founder's Path * 00:03:00 Quitters Welcome: Building a Founder-Friendly Culture * 00:05:13 The AI Team Structure: 10-Person Startup Within Brex * 00:11:55 Building the Brex Agent Platform: Multi-Agent Networks * 00:13:45 Tech Stack Decisions: TypeScript, Mastra, and MCP * 00:24:32 Operational AI: Automating Underwriting, KYC, and Fraud * 00:16:40 The Brex Assistant: Executive Assistant for Every Employee * 00:40:26 Evaluation Strategy: From Simple SOPs to Multi-Turn Evals * 00:37:11 Agentic Coding Adoption: Cursor, Windsurf, and the Engineering Interview * 00:58:51 AI Fluency Levels: From User to Native * 01:09:14 The Audit Agent Network: Finance Team Agents in Action * 01:03:33 The Future of Engineering Headcount and AI Leverage Get full access to Latent.Space at www.latent.space/subscribe
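As a sketch of the LLM-as-judge pattern for multi-turn agent evals discussed above (everything here is hypothetical and generic, not Brex's platform): replay recorded conversations, have a judge model grade each against a rubric, and gate releases on the aggregate scores like any other regression suite.

```python
import json

RUBRIC = (
    "Score 1-5 for each dimension: followed the SOP, cited policy correctly, "
    "asked for missing info instead of guessing. "
    'Reply with JSON like {"sop": 5, "policy": 4, "clarify": 5}.'
)

def call_judge_model(prompt: str) -> str:
    """Placeholder for a call to whichever judge LLM you use."""
    raise NotImplementedError

def grade_transcript(transcript: list[dict]) -> dict:
    convo = "\n".join(f"{turn['role']}: {turn['content']}" for turn in transcript)
    return json.loads(call_judge_model(f"{RUBRIC}\n\nConversation:\n{convo}"))

def regression_suite(cases: list[list[dict]], threshold: float = 4.0) -> bool:
    """Fail the suite if any rubric dimension's average drops below the threshold."""
    scores = [grade_transcript(case) for case in cases]
    return all(
        sum(s[dim] for s in scores) / len(scores) >= threshold
        for dim in ("sop", "policy", "clarify")
    )
```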

    1h 13m
  7. JAN 8

    Artificial Analysis: Independent LLM Evals as a Service — with George Cameron and Micah Hill-Smith

    Happy New Year! You may have noticed that in 2025 we moved toward YouTube as our primary podcasting platform. As we’ll explain in the next State of Latent Space post, we’ll be doubling down on Substack again and improving the experience for the over 100,000 of you who look out for our emails and website updates! We first mentioned Artificial Analysis in 2024, when it was still a side project in a Sydney basement. They then became one of the few AI Grant companies to raise a full seed round from Nat Friedman and Daniel Gross themselves, and have now become the independent gold standard for AI benchmarking—trusted by developers, enterprises, and every major lab to navigate the exploding landscape of models, providers, and capabilities. We have chatted with both Clementine Fourrier of HuggingFace’s OpenLLM Leaderboard and Anastasios Angelopoulos of LMArena (freshly valued at $1.7B) on their approaches to LLM evals and trendspotting, but Artificial Analysis has staked out an enduring and important place in the toolkit of the modern AI Engineer by doing the best job of independently running the most comprehensive set of evals across the widest range of open and closed models, and charting their progress for broad industry analyst use. George Cameron and Micah Hill-Smith have spent two years building Artificial Analysis into the platform that answers the questions no one else will: Which model is actually best for your use case? What are the real speed-cost trade-offs? And how open is “open” really? We discuss: * The origin story: built as a side project in 2023 while Micah was building a legal AI assistant, launched publicly in January 2024, and went viral after Swyx’s retweet * Why they run evals themselves: labs prompt models differently, cherry-pick chain-of-thought examples (Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU), and self-report inflated numbers * The mystery shopper policy: they register accounts not on their own domain and run intelligence + performance benchmarks incognito to prevent labs from serving different models on private endpoints * How they make money: enterprise benchmarking insights subscription (standardized reports on model deployment, serverless vs. managed vs. leasing chips) and private custom benchmarking for AI companies (no one pays to be on the public leaderboard)
* The Intelligence Index (V3): synthesizes 10 eval datasets (MMLU, GPQA, agentic benchmarks, long-context reasoning) into a single score, with 95% confidence intervals via repeated runs * Omniscience Index (hallucination rate): scores models from -100 to +100 (penalizing incorrect answers, rewarding “I don’t know”), and Claude models lead with the lowest hallucination rates despite not always being the smartest * GDP Val AA: their version of OpenAI’s GDPval (44 white-collar tasks with spreadsheets, PDFs, PowerPoints), run through their Stirrup agent harness (up to 100 turns, code execution, web search, file system), graded by Gemini 3 Pro as an LLM judge (tested extensively, no self-preference bias) * The Openness Index: scores models 0-18 on transparency of pre-training data, post-training data, methodology, training code, and licensing (AI2 OLMo 2 leads, followed by Nous Hermes and NVIDIA Nemotron) * The smiling curve of AI costs: GPT-4-level intelligence is 100-1000x cheaper than at launch (thanks to smaller models like Amazon Nova), but frontier reasoning models in agentic workflows cost more than ever (sparsity, long context, multi-turn agents) * Why sparsity might go way lower than 5%: GPT-4.5 is ~5% active, Gemini models might be ~3%, and Omniscience Index accuracy correlates with total parameters (not active), suggesting massive sparse models are the future * Token efficiency vs. turn efficiency: GPT-5 costs more per token but solves Tau-bench in fewer turns (cheaper overall), and models are getting better at using more tokens only when needed (5.1 Codex has tighter token distributions) * V4 of the Intelligence Index coming soon: adding GDP Val AA, Critical Point, hallucination rate, and dropping some saturated benchmarks (HumanEval-style coding is now trivial for small models) Links to Artificial Analysis * Website: https://artificialanalysis.ai * George Cameron on X: https://x.com/georgecameron * Micah Hill-Smith on X: https://x.com/micahhsmith Full Episode on YouTube Timestamps * 00:00 Introduction: Full Circle Moment and Artificial Analysis Origins * 01:19 Business Model: Independence and Revenue Streams * 04:33 Origin Story: From Legal AI to Benchmarking Need * 16:22 AI Grant and Moving to San Francisco * 19:21 Intelligence Index Evolution: From V1 to V3 * 11:47 Benchmarking Challenges: Variance, Contamination, and Methodology * 13:52 Mystery Shopper Policy and Maintaining Independence * 28:01 New Benchmarks: Omniscience Index for Hallucination Detection * 33:36 Critical Point: Hard Physics Problems and Research-Level Reasoning * 23:01 GDP Val AA: Agentic Benchmark for Real Work Tasks * 50:19 Stirrup Agent Harness: Open Source Agentic Framework * 52:43 Openness Index: Measuring Model Transparency Beyond Licenses * 58:25 The Smiling Curve: Cost Falling While Spend Rising * 1:02:32 Hardware Efficiency: Blackwell Gains and Sparsity Limits * 1:06:23 Reasoning Models and Token Efficiency: The Spectrum Emerges * 1:11:00 Multimodal Benchmarking: Image, Video, and Speech Arenas * 1:15:05 Looking Ahead: Intelligence Index V4 and Future Directions * 1:16:50 Closing: The Insatiable Demand for Intelligence Transcript Micah [00:00:06]: This is kind of a full circle moment for us in a way, because the first time artificial analysis got mentioned on a podcast was you and Alessio on Latent Space. Amazing. swyx [00:00:17]: Which was January 2024.
I don’t even remember doing that, but yeah, it was very influential to me. Yeah, I’m looking at AI News for Jan 17, or Jan 16, 2024. I said, this gem of a models and host comparison site was just launched. And then I put in a few screenshots, and I said, it’s an independent third party. It clearly outlines the quality versus throughput trade-off, and it breaks out by model and hosting provider. I did give you s**t for missing fireworks, and how do you have a model benchmarking thing without fireworks? But you had together, you had perplexity, and I think we just started chatting there. Welcome, George and Micah, to Latent Space. I’ve been following your progress. Congrats on... It’s been an amazing year. You guys have really come together to be the presumptive new gardener of AI, right? Which is something that... George [00:01:09]: Yeah, but you can’t pay us for better results. swyx [00:01:12]: Yes, exactly. George [00:01:13]: Very important. Micah [00:01:14]: Start off with a spicy take. swyx [00:01:18]: Okay, how do I pay you? Micah [00:01:20]: Let’s get right into that. swyx [00:01:21]: How do you make money? Micah [00:01:24]: Well, very happy to talk about that. So it’s been a big journey the last couple of years. Artificial analysis is going to be two years old in January 2026. Which is pretty soon now. We first run the website for free, obviously, and give away a ton of data to help developers and companies navigate AI and make decisions about models, providers, technologies across the AI stack for building stuff. We’re very committed to doing that and tend to keep doing that. We have, along the way, built a business that is working out pretty sustainably. We’ve got just over 20 people now and two main customer groups. So we want to be... We want to be who enterprise look to for data and insights on AI, so we want to help them with their decisions about models and technologies for building stuff. And then on the other side, we do private benchmarking for companies throughout the AI stack who build AI stuff. So no one pays to be on the website. We’ve been very clear about that from the very start because there’s no use doing what we do unless it’s independent AI benchmarking. Yeah. But turns out a bunch of our stuff can be pretty useful to companies building AI stuff. swyx [00:02:38]: And is it like, I am a Fortune 500, I need advisors on objective analysis, and I call you guys and you pull up a custom report for me, you come into my office and give me a workshop? What kind of engagement is that? George [00:02:53]: So we have a benchmarking and insight subscription, which looks like standardized reports that cover key topics or key challenges enterprises face when looking to understand AI and choose between all the technologies. And so, for instance, one of the report is a model deployment report, how to think about choosing between serverless inference, managed deployment solutions, or leasing chips. And running inference yourself is an example kind of decision that big enterprises face, and it’s hard to reason through, like this AI stuff is really new to everybody. And so we try and help with our reports and insight subscription. Companies navigate that. We also do custom private benchmarking. And so that’s very different from the public benchmarking that we publicize, and there’s no commercial model around that. For private benchmarking, we’ll at times create benchmarks, run benchmarks to specs that enterprises want. 
And we’ll also do that sometimes for AI companies who have built things, and we help them understand what they’ve built with private benchmarking. Yeah. So that’s a piece mainly that we’ve developed through trying to support everybody publicly with our public benchmarks. Yeah. swyx [00:04:09]: Let’s talk about TechStack behind that. But okay, I’m going to rewind all the way to when you guys started this project. You were all the way in Sydney? Yeah. Well, Sydney, Australia for me. Micah [00:04:19]: George was an SF, but he’s Australian, but he moved here already. Yeah. swyx [00:04:22]: And I remember I had the Zoom call with you. What was the impetus for starting
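On the "95% confidence intervals via repeated runs" detail in the show notes above: here is a minimal sketch of one way repeated eval runs become an interval, using a plain normal-approximation CI over run-level scores (Artificial Analysis's exact methodology may differ, and the numbers are made up).

```python
import statistics

def confidence_interval(run_scores, z=1.96):
    """95% CI for the mean score across repeated runs of the same eval."""
    mean = statistics.fmean(run_scores)
    sem = statistics.stdev(run_scores) / len(run_scores) ** 0.5  # standard error
    return mean - z * sem, mean + z * sem

# e.g. eight repeated runs of one benchmark on one model (made-up scores)
print(confidence_interval([71.2, 70.8, 71.9, 70.5, 71.4, 71.1, 70.9, 71.6]))
```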

    1h 18m
  8. JAN 6

    [State of Evals] LMArena's $1.7B Vision — Anastasios Angelopoulos, LMArena

    We are reupping this episode after LMArena announced their fresh Series A (https://www.theinformation.com/articles/ai-evaluation-startup-lmarena-valued-1-7-billion-new-funding-round?rc=luxwz4), raising $150m at a $1.7B valuation, with $30M annualized consumption revenue (aka $2.5m MRR) after their September evals product launch. From building LMArena in a Berkeley basement to raising $100M and becoming the de facto leaderboard for frontier AI, Anastasios Angelopoulos returns to Latent Space to recap 2025 at one of the most influential platforms in AI—trusted by millions of users, every major lab, and the entire industry to answer one question: which model is actually best for real-world use cases? We caught up with Anastasios live at NeurIPS 2025 to dig into the origin story (spoiler: it started as an academic project incubated by Anjney Midha at a16z, who formed an entity and gave grants before they even committed to starting a company), why they decided to spin out instead of staying academic or nonprofit (the only way to scale was to build a company), how they’re spending that $100M (inference costs, React migration off Gradio, and hiring world-class talent across ML, product, and go-to-market), the Leaderboard Illusion controversy and why their response demolished the paper’s claims (factual errors, misrepresentation of open vs. closed source sampling, and ignoring the transparency of preview testing that the community loves), why platform integrity comes first (the public leaderboard is a charity, not a pay-to-play system—models can’t pay to get on, can’t pay to get off, and scores reflect millions of real votes), how they’re expanding into occupational verticals (medicine, legal, finance, creative marketing) and multimodal arenas (video coming soon), why consumer retention is earned every single day (sign-in and persistent history were the unlock, but users are fickle and can leave at any moment), and his vision for Arena as the central evaluation platform that provides the North Star for the industry—constantly fresh, immune to overfitting, and grounded in millions of real-world conversations from real users. We discuss: * The $100M raise: use of funds is primarily inference costs (funding free usage for tens of millions of monthly conversations), React migration off Gradio (custom loading icons, better developer hiring, more flexibility), and hiring world-class talent * The scale: 250M+ conversations on the platform, tens of millions per month, 25% of users do software for a living, and half of users are now logged in * The Leaderboard Illusion controversy: Cohere researchers claimed undisclosed private testing created inequities, but Arena’s response demolished the paper’s factual errors (misrepresented open vs. closed source sampling, ignored transparency of preview testing that the community loves)
* Why preview testing is loved by the community: secret codenames (Gemini Nano Banana, named after PM Naina’s nickname), early access to unreleased models, and the thrill of being first to vote on frontier capabilities * The Nano Banana moment: changed Google’s market share overnight, billions of dollars in stock movement, and validated that multimodal models (image generation, video) are economically critical for marketing, design, and AI-for-science * New categories: occupational and expert arenas (medicine, legal, finance, creative marketing), Code Arena, and video arena coming soon Full Video Episode Timestamps * 00:00:00 Introduction: Anastasios from Arena and the LM Arena Journey * 00:01:36 The Anjney Midha Incubation: From Berkeley Basement to Startup * 00:02:47 The Decision to Start a Company: Scaling Beyond Academia * 00:03:38 The $100M Raise: Use of Funds and Platform Economics * 00:05:10 Arena's User Base: 5M+ Users and Diverse Demographics * 00:06:02 The Competitive Landscape: Artificial Analysis, AI.xyz, and Arena's Differentiation * 00:08:12 Educational Value and Learning from the Community * 00:08:41 Technical Migration: From Gradio to React and Platform Evolution * 00:10:18 Leaderboard Illusion Paper: Addressing Critiques and Maintaining Integrity * 00:12:29 Nano Banana Moment: How Preview Models Create Market Impact * 00:13:41 Multimodal AI and Image Generation: From Skepticism to Economic Value * 00:15:37 Core Principles: Platform Integrity and the Public Leaderboard as Charity * 00:18:29 Future Roadmap: Expert Categories, Multimodal, Video, and Occupational Verticals * 00:19:10 API Strategy and Focus: Doing One Thing Well * 00:19:51 Community Management and Retention: Sign-In, History, and Daily Value * 00:22:21 Partnerships and Agent Evaluation: From Devin to Full-Featured Harnesses * 00:21:49 Hiring and Building a High-Performance Team Get full access to Latent.Space at www.latent.space/subscribe
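For readers curious how millions of pairwise votes turn into a single leaderboard number, here is a generic Bradley-Terry fit of the kind commonly used for arena-style rankings (an illustrative sketch with toy data, not LMArena's production pipeline).

```python
import math
from collections import defaultdict

def bradley_terry(votes, iters=100):
    """votes: list of (winner, loser) pairs. Returns a strength per model."""
    wins = defaultdict(lambda: defaultdict(int))
    models = set()
    for winner, loser in votes:
        wins[winner][loser] += 1
        models.update((winner, loser))
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        for m in models:
            total_wins = sum(wins[m].values())
            denom = sum(
                (wins[m][o] + wins[o][m]) / (strength[m] + strength[o])
                for o in models if o != m
            )
            if denom:
                strength[m] = total_wins / denom
        # Normalize by the geometric mean so scores stay comparable across runs.
        gm = math.prod(strength.values()) ** (1 / len(strength))
        strength = {m: s / gm for m, s in strength.items()}
    return strength

votes = [("model-a", "model-b")] * 70 + [("model-b", "model-a")] * 30
print(bradley_terry(votes))  # model-a ends up noticeably stronger
```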

    24 min
