75 episodes

The podcast by and for AI Engineers! In 2023, over 1 million visitors came to Latent Space to hear about news, papers and interviews in Software 3.0.

We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. Striving to give you both the definitive take on the Current Thing down to the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from OpenAI, tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al.

Full show notes always on https://latent.space

www.latent.space

Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and more, with Alessio + swyx

    • Technology


    State of the Art: Training >70B LLMs on 10,000 H100 clusters

    It’s return guest season here at Latent Space! We last talked to Kanjun in October and Jonathan in May (and December, post Databricks acquisition).



    Imbue and Databricks are back for a rare treat: a double-header interview talking about DBRX from Databricks and Imbue 70B, a new internal LLM that “outperforms GPT-4o” zero-shot on a range of reasoning and coding-related benchmarks and datasets, while using 7x less data than Llama 3 70B.
    While Imbue, being an agents company rather than a model provider, is not releasing its models today, they are releasing almost everything else:
    * Cleaned-up and extended versions of 11 of the most popular NLP reasoning benchmarks
    * An entirely new code-focused reasoning benchmark
    * A fine-tuned 70B model, built with Meta Llama 3, to identify ambiguity
    * A new dataset of 450,000 human judgments about ambiguity
    * Infrastructure scripts for bringing a cluster from bare metal to robust, high performance training
    * Our cost-aware hyperparameter optimizer, CARBS, which automatically and systematically fine-tunes all hyperparameters to derive optimum performance for models of any size (a toy sketch of the cost-aware idea follows below)
    They are also publishing EXTREMELY detailed posts on the infrastructure needs, the hyperparameter search, and the sorry state of industry-standard benchmarks (along with their cleaned-up versions). This means for the FIRST TIME (perhaps since Meta’s OPT-175B in 2022?) you have this level of educational detail into the hardware and ML nitty-gritty of training extremely large LLMs, and if you are in fact training LLMs of this scale you now have evals, optimizers, scripts, and human data/benchmarks you can use to move the industry forward together with Imbue.
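    For a flavor of what “cost-aware” means, here is a deliberately toy sketch: a random search that keeps a cost/performance Pareto frontier instead of a single best run. This is NOT the CARBS optimizer Imbue released (which uses a Bayesian model); the hyperparameter names here are invented for illustration.

```python
import random

def toy_cost_aware_search(train_and_eval, n_trials: int = 50):
    """train_and_eval(cfg) -> (cost, loss); lower is better for both."""
    trials = []
    for _ in range(n_trials):
        # Hypothetical search space, not the real CARBS parameterization.
        cfg = {
            "lr": 10 ** random.uniform(-4.5, -2.5),
            "batch_size": random.choice([256, 512, 1024]),
            "tokens_billions": random.uniform(1, 50),
        }
        cost, loss = train_and_eval(cfg)
        trials.append((cost, loss, cfg))
    # Keep only non-dominated points: nothing else is both cheaper and better.
    frontier = [p for p in trials
                if not any(q[0] <= p[0] and q[1] < p[1]
                           for q in trials if q is not p)]
    return sorted(frontier, key=lambda p: p[0])
```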
    We are busy running the sold-out AI Engineer World’s Fair today, and so are unable to do our usual quality writeup; however, please enjoy our show notes and the excellent conversation! Thanks also to Kanjun, Ashley, Tom and the rest of team Imbue for setting up this interview behind the scenes.
    Video pod

    Timestamps
    * [00:00:00] Introduction and catch up with guests
    * [00:01:55] Databricks' text to image model release
    * [00:03:46] Details about the DBRX model
    * [00:05:26] Imbue's infrastructure, evaluation, and hyperparameter optimizer releases
    * [00:09:18] Challenges of training foundation models and getting infrastructure to work
    * [00:12:03] Details of Imbue's cluster setup
    * [00:18:53] Process of bringing machines online and common failures
    * [00:22:52] Health checks and monitoring for the cluster
    * [00:25:06] Typical timelines and team composition for setting up a cluster
    * [00:27:24] Monitoring GPU utilization and performance
    * [00:29:39] Open source tools and libraries used
    * [00:32:33] Reproducibility and portability of cluster setup
    * [00:35:57] Infrastructure changes needed for different model architectures
    * [00:40:49] Imbue's focus on text-only models for coding and reasoning
    * [00:42:26] CARBS hyperparameter tuner and cost-aware optimization
    * [00:51:01] Emergence and CARBS
    * [00:53:18] Evaluation datasets and reproducing them with high quality
    * [00:58:40] Challenges of evaluating on more realistic tasks
    * [01:06:01] Abstract reasoning benchmarks like ARC
    * [01:10:13] Long context evaluation and needle-in-a-haystack tasks
    * [01:13:50] Function calling and tool use evaluation
    * [01:19:19] Imbue's future plans for coding and reasoning applications
    * [01:20:14] Databricks' future plans for useful applications and upcoming blog posts


    Transcript
    SWYX [00:00:00]: Welcome to the Latent Space Podcast, another super special edition. Today, we have sort of like a two-header. Jonathan Frankle from Mosaic Databricks, or Databricks Mosaic, and Josh Albrecht from Imbue. Welcome.
    JOSH [00:00:12]: Hey, glad to be here.
    SWYX [00:00:14]: Thank you for having us. Hey, so both of you are kind of past guests. Jonathan, you were actually one of the most popular episodes from last year talking about MPT7B. Remember the days when we trained large models and there was 7B?
    JONATHAN [00:00:30]: Yeah, back

    • 1 hr 21 min
    [High Agency] AI Engineer World's Fair Preview

    The World’s Fair is officially sold out! Thanks for all the support and stay tuned for recaps of all the great goings on in this very special celebration of the AI Engineer!
    Longtime listeners will remember the fan favorite Raza Habib, CEO of HumanLoop, on the pod:
    Well, he’s caught the podcasting bug and is now turning the tables on swyx!
    Subscribe to High Agency wherever the finest Artificial Intelligence podcasts are sold.

    High Agency Pod Description
    In this episode, I chatted with Shawn Wang about his upcoming AI engineering conference and what an AI engineer really is. It's been a year since he penned the viral essay "Rise of the AI Engineer" and we discuss if this new role will be enduring, the makeup of the optimal AI team and trends in machine learning.
    Timestamps
    00:00 - Introduction and background on Shawn Wang (Swyx)
    03:45 - Reflecting on the "Rise of the AI Engineer" essay
    07:30 - Skills and characteristics of AI Engineers
    12:15 - Team composition for AI products
    16:30 - Vertical vs. horizontal AI startups
    23:00 - Advice for AI product creators and leaders
    28:15 - Tools and buying vs. building for AI products
    33:30 - Key trends in AI research and development
    41:00 - Closing thoughts and information on the AI Engineer World Fair Summit
    Video


    Get full access to Latent Space at www.latent.space/subscribe

    • 49 min
    How To Hire AI Engineers — with James Brady & Adam Wiggins of Elicit

    Editor’s note: One of the top reasons we have hundreds of companies and thousands of AI Engineers joining the World’s Fair next week is, apart from discussing technology and being present for the big launches planned, to hire and be hired!
    Listeners loved our previous Elicit episode and were so glad to welcome 2 more members of Elicit back for a guest post (and bonus podcast) on how they think through hiring. Don’t miss their AI engineer job description and template, which you can use to create your own hiring plan!
    How to Hire AI Engineers
    James Brady, Head of Engineering @ Elicit (ex Spring, Square, Trigger.io, IBM)
    Adam Wiggins, Internal Journalist @ Elicit (Cofounder Ink & Switch and Heroku)
    If you’re leading a team that uses AI in your product in some way, you probably need to hire AI engineers. As defined in this article, that’s someone with conventional engineering skills in addition to knowledge of language models and prompt engineering, without being a full-fledged Machine Learning expert.
    But how do you hire someone with this skillset? At Elicit we’ve been applying machine learning to reasoning tools since 2018, and our technical team is a mix of ML experts and what we can now call AI engineers. This article will cover our process from job description through interviewing. (You can also flip the perspectives here and use it just as easily for how to get hired as an AI engineer!)
    My own journey
    Before getting into the brass tacks, I want to share my journey to becoming an AI engineer.
    Up until a few years ago, I was happily working my job as an engineering manager of a big team at a late-stage startup. Like many, I was tracking the rapid increase in AI capabilities stemming from the deep learning revolution, but it was the release of GPT-3 in 2020 which was the watershed moment. At the time, we were all blown away by how the model could string together coherent sentences on demand. (Oh how far we’ve come since then!)
    I’d been a professional software engineer for nearly 15 years—enough to have experienced one or two technology cycles—but I could see this was something categorically new. I found this simultaneously exciting and somewhat disconcerting. I knew I wanted to dive into this world, but it seemed like the only path was going back to school for a master’s degree in Machine Learning. I started talking with my boss about options for taking a sabbatical or doing a part-time distance learning degree.
    In 2021, I instead decided to launch a startup focused on productizing new research ideas on ML interpretability. It was through that process that I reached out to Andreas—a leading ML researcher and founder of Elicit—to see if he would be an advisor. Over the next few months, I learned more about Elicit: that they were trying to apply these fascinating technologies to the real-world problems of science, and with a business model that aligned it with safety goals. I realized that I was way more excited about Elicit than I was about my own startup ideas, and wrote about my motivations at the time.
    Three years later, it’s clear this was a seismic shift in my career on the scale of when I chose to leave my comfy engineering job at IBM to go through the Y Combinator program back in 2008. Working with this new breed of technology has been more intellectually stimulating, challenging, and rewarding than I could have imagined.
    Deep ML expertise not required
    It’s important to note that AI engineers are not ML experts, nor is that their best contribution to a tech team.
    In our article Living documents as an AI UX pattern, we wrote:
    It’s easy to think that AI advancements are all about training and applying new models, and certainly this is a huge part of our work in the ML team at Elicit. But those of us working in the UX part of the team believe that we have a big contribution to make in how AI is applied to end-user problems.
    We think of LLMs as a new medium to work with, one that we’ve b

    • 1 hr 3 min
    How AI is eating Finance — with Mike Conover of Brightwave

    In April 2023 we released an episode named “Mapping the future of *truly* open source models” to talk about Dolly, the first open, commercial LLM.
    Mike was leading the OSS models team at Databricks at the time. Today, Mike is back on the podcast to give us the “one year later” update on the evolution of large language models and how he’s been using them to build Brightwave, an AI research assistant for investment professionals.
    Today they are announcing a $6M seed round (led by Alessio and Decibel!), and sharing some of the learnings from serving customers with >$120B of assets under management in production over the 4 months since launch.
    Losing faith in long context windows
    In our recent “Llama3 1M context window” episode we talked about the amazing progress made in context window sizes, but it’s good to remember that Dolly’s original context size was 1,024 tokens, and this was only 14 months ago.
    But while input context length has increased, models are still not able to generate very long answers. Mike’s empirical intuition (which matches ours while building smol-podcaster) is that most commercial LLMs, as well as Llama, tend to generate fairly short responses most of the time, regardless of context size. While Needle in a Haystack tests pass with flying colors at most context sizes, the granularity of a summary decreases as the context increases: the model tries to fit the answer into the same rough token range rather than returning something close to the 4,096-token max_output, for example.
    Recently Rob Mulla from Dreadnode highlighted how LMSys Arena results prefer longer responses by a large margin, so both LLMs and humans have a well-documented length bias which doesn’t necessarily track the quality of the answer.
    The way Mike and team solved this is by breaking the task down into multiple subtasks, and then merging them back together. For example, have a book summarized chapter by chapter to preserve more details, and then put those summaries together. In Brightwave’s case, it’s creating multiple subsystems that accomplish different tasks on a large corpus of text separately, and then bringing them all together in a report. For example: understanding the intent of the question, extracting relations between companies, figuring out whether sentiment is positive or negative, etc.
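    As a rough sketch of that split-then-merge pattern (the llm callable and prompts are placeholders, not Brightwave’s actual pipeline):

```python
def summarize_long(text: str, llm, chunk_chars: int = 12_000) -> str:
    """Map-reduce summarization; llm(prompt) -> str is any completion function."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    # Map: summarize each chunk independently, preserving local detail.
    partials = [llm(f"Summarize this section in detail:\n\n{c}") for c in chunks]
    # Reduce: merge the partial summaries into one coherent report.
    return llm("Merge these section summaries into one report:\n\n"
               + "\n\n".join(partials))
```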
    Mike’s question is whether or not we’ll be able to imbue better synthesis capabilities in the models: can you have synthesis-oriented demonstrations at training time rather than single token prediction?
    “LLMs as Judges” Strategies
    In our David Luan episode he mentioned they don’t use any benchmarks for their models, because the benchmarks don’t reflect their customer needs. Brightwave shared some tips on leveraging LLMs as Judges:
    * Human vs LLM reviews: while they work with human annotators to create high quality datasets, that data isn’t just used to fine-tune models but also as a reference basis for future LLM reviews. Having a set of trusted data to use as calibration helps you trust the LLM judgment even more.
    * Ensemble consistency checking: rather than using an LLM as judge for one output, you use different LLMs to generate a result for the same task, and then use another LLM to highlight where those generations differ. Do the two outputs differ meaningfully? Do they have different beliefs about the implications of something? If there are a lot of discrepancies between generations coming from different models, you then do additional passes to try and resolve them (see the sketch after this list).
    * Entailment verification: for each unique insight that they generate, they take the output and separately ask LLMs to verify factuality of information based on the original sources. In the actual product, user can then highlight any piece of text and ask it to 1) “Tell Me More” 2) “Show Sources”. Since there’s no way to guarantee factuality of 100% of outputs, and humans have good intuition for things that look out of the ordinary, giving the user access to the review tool helps th
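    To make the ensemble consistency checking idea concrete, here is a minimal sketch; model_a, model_b, and judge are placeholder completion functions, not Brightwave’s code.

```python
def consistency_check(task: str, model_a, model_b, judge) -> str:
    """Generate the same analysis with two models, then ask a third to diff them."""
    answer_a = model_a(task)
    answer_b = model_b(task)
    return judge(
        "Two analysts answered the same question. List any meaningful "
        "disagreements in facts or implications between their answers.\n\n"
        f"Question: {task}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
    )

# Large discrepancies would then trigger additional resolution passes.
```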

    • 54 min
    ICLR 2024 — Best Papers & Talks (Benchmarks, Reasoning & Agents) — ft. Graham Neubig, Aman Sanger, Moritz Hardt

    Our second wave of speakers for the AI Engineer World’s Fair was announced! The conference sold out of Platinum/Gold/Silver sponsors and Early Bird tickets! See our Microsoft episode for more info and buy now with code LATENTSPACE.
    This episode is straightforwardly a part 2 to our ICLR 2024 Part 1 episode, so without further ado, we’ll just get right on with it!

    Timestamps
    [00:03:43] Section A: Code Edits and Sandboxes, OpenDevin, and Academia vs Industry — ft. Graham Neubig and Aman Sanger
    * [00:07:44] WebArena
    * [00:18:45] Sotopia
    * [00:24:00] Performance Improving Code Edits
    * [00:29:39] OpenDevin
    * [00:47:40] Industry and Academia
    [01:05:29] Section B: Benchmarks
    * [01:05:52] SWEBench
    * [01:17:05] SWEBench/SWEAgent Interview
    * [01:27:40] Dataset Contamination Detection
    * [01:39:20] GAIA Benchmark
    * [01:49:18] Moritz Hardt - Science of Benchmarks
    [02:36:32] Section C: Reasoning and Post-Training
    * [02:37:41] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
    * [02:51:00] Let’s Verify Step By Step
    * [02:57:04] Noam Brown
    * [03:07:43] Lilian Weng - Towards Safe AGI
    * [03:36:56] A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
    * [03:48:43] MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
    [04:00:51] Bonus: Notable Related Papers on LLM Capabilities

    Section A: Code Edits and Sandboxes, OpenDevin, and Academia vs Industry — ft. Graham Neubig and Aman Sanger
    * Guests
    * Graham Neubig
    * Aman Sanger - Previous guest and NeurIPS friend of the pod!
    * WebArena
    * Sotopia (spotlight paper, website)
    * Learning Performance-Improving Code Edits
    * OpenDevin
    * Junyang Opendevin
    * Morph Labs, Jesse Han
    * SWE-Bench
    * SWE-Agent
    * Aman tweet on swebench
    * LiteLLM
    * Livecodebench
    * the role of code in reasoning
    * Language Models of Code are Few-Shot Commonsense Learners
    * Industry vs academia
    * the matryoshka embeddings incident
    * other directions
    * Unlimiformer
    Section A timestamps
    * [00:00:00] Introduction to Guests and the Impromptu Nature of the Podcast
    * [00:00:45] Graham's Experience in Japan and Transition into Teaching NLP
    * [00:01:25] Discussion on What Constitutes a Good Experience for Students in NLP Courses
    * [00:02:22] The Relevance and Teaching of Older NLP Techniques Like Ngram Language Models
    * [00:03:38] Speculative Decoding and the Comeback of Ngram Models
    * [00:04:16] Introduction to WebArena and Sotopia Projects
    * [00:05:19] Deep Dive into the WebArena Project and Benchmarking
    * [00:08:17] Performance Improvements in WebArena Using GPT-4
    * [00:09:39] Human Performance on WebArena Tasks and Challenges in Evaluation
    * [00:11:04] Follow-up Work from WebArena and Focus on Web Browsing as a Benchmark
    * [00:12:11] Direct Interaction vs. Using APIs in Web-Based Tasks
    * [00:13:29] Challenges in Base Models for WebArena and the Potential of Visual Models
    * [00:15:33] Introduction to Sotopia and Exploring Social Interactions with Language Models
    * [00:16:29] Different Types of Social Situations Modeled in Sotopia
    * [00:17:34] Evaluation of Language Models in Social Simulations
    * [00:20:41] Introduction to Performance-Improving Code Edits Project
    * [00:26:28] Discussion on Devin and the Future of Coding Agents
    * [00:32:01] Planning in Coding Agents and the Development of OpenDevin
    * [00:38:34] The Changing Role of Academia in the Context of Large Language Models
    * [00:44:44] The Changing Nature of Industry and Academia Collaboration
    * [00:54:07] Update on NLP Course Syllabus and Teaching about Large Language Models
    * [01:00:40] Call to Action: Contributions to OpenDevin and Open Source AI Projects
    * [01:01:56] Hiring at Cursor for Roles in Code Generation and Assistive Coding
    * [01:02:12] Promotion of the AI Engineer Conference

    Section B: Benchmarks
    * Carlos Jimenez & John Yang (Princeton) et al: SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (ICLR Oral, Paper, website)
    * “We introduce SWE-bench, an evaluation fram

    • 4 hrs 29 min
    How to train a Million Context LLM — with Mark Huang of Gradient.ai

    AI Engineer World’s Fair in SF! Prices go up soon.
    Note that there are 4 tracks per day and dozens of workshops/expo sessions; the livestream will air the most stacked speaker list/AI expo floor of 2024.
    Apply for free/discounted Diversity Program and Scholarship tickets here. We hope to make this the definitive technical conference for ALL AI engineers.
    Exactly a year ago, we declared the Beginning of Context=Infinity when Mosaic made their breakthrough training an 84k token context MPT-7B.

    A Brief History of Long Context
    Of course, right when we released that episode, Anthropic fired the starting gun proper with the first 100k context window model from a frontier lab, spawning smol-developer and other explorations. In the last 6 months, the fight (and context lengths) has intensified another order of magnitude, kicking off the "Context Extension Campaigns" chapter of the Four Wars:
    * In October 2023, Claude's 100,000 token window was still SOTA (we still use it for Latent Space’s show notes to this day).
    * On November 6th, OpenAI launched GPT-4 Turbo with 128k context.
    * On November 21st, Anthropic fired back extending Claude 2.1 to 200k tokens.
    * Feb 15 (the day everyone launched everything) was Gemini's turn, announcing the first LLM with a 1 million token context window.
    * In May 2024 at Google I/O, Google announced a 2M token context window for Gemini 1.5 Pro.
    In parallel, open source/academia had to fight its own battle to keep up with the industrial cutting edge. Nous Research famously turned a reddit comment into YaRN, extending Llama 2 models to 128k context. So when Llama 3 dropped, the community was ready, and just weeks later, we had Llama3 with 4M+ context!
    A year ago we didn’t really have an industry standard way of measuring context utilization either: it’s all well and good to technically make an LLM generate non-garbage text at 1m tokens, but can you prove that the LLM actually retrieves and attends to information inside that long context? Greg Kamradt popularized the Needle In A Haystack chart which is now a necessary (if insufficient) benchmark — and it turns out we’ve solved that too in open source.
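    To make the test concrete, here is a hypothetical minimal version of the probe; the needle, filler, and question are invented for illustration.

```python
needle = "The secret ingredient in the sauce is star anise."
filler = "The quick brown fox jumps over the lazy dog. " * 20_000  # long distractor text

def build_probe(depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    cut = int(len(filler) * depth)
    return (filler[:cut] + needle + " " + filler[cut:]
            + "\n\nWhat is the secret ingredient in the sauce?")

# Sweeping depth x context length and grading each answer yields the familiar
# green/red heatmap from Kamradt's version of the test.
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_probe(depth)
    # answer = llm(prompt)  # call the model under test here
```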
    Today's guest, Mark Huang, is the co-founder of Gradient, where they are building a full stack AI platform to power enterprise workflows and automations. They are also the team behind the first 1M+ and 4M+ context window finetunes of Llama 3.
    Long Context Algorithms: RoPE, ALiBi, and Ring Attention
    Positional encodings allow the model to understand the relative position of tokens in the input sequence, present in what (upcoming guest!) Yi Tay affectionately calls the OG “Noam architecture”. But if we want to increase a model’s context length, these encodings need to gracefully extrapolate to longer sequences.
    ALiBi, used in models like MPT (see our "Context=Infinity" episode with the MPT leads, Jonathan Frankle and Abhinav), was one of the early approaches to this space. It lets the context window stretch as it grows, applying a penalty to attention scores that grows linearly with the distance between positions: the further apart two tokens are, the higher the penalty. Of course, this isn’t going to work for use cases that actually require global attention across a long context.
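    In code, ALiBi’s bias is just a head-specific slope times token distance, added to the attention logits; here is a minimal NumPy sketch of that idea (illustrative, not the MPT implementation):

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Per-head linear distance penalties, shape (num_heads, seq_len, seq_len)."""
    # Head-specific slopes form a geometric sequence, as in the ALiBi paper.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    distance = pos[None, :] - pos[:, None]  # (j - i): negative for earlier keys
    # The further apart query i and key j are, the more negative the bias.
    return slopes[:, None, None] * distance[None, :, :]

# Usage: attention_logits = q @ k.T / sqrt(d) + alibi_bias(seq, heads)[h],
# followed by the causal mask and softmax.
```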
    In more recent architectures and finetunes, RoPE (Rotary Position Embedding) encoding is more commonly used and is also what Llama3 was based on. RoPE uses a rotational matrix to encode positions, which empirically performs better for longer sequences.
    The main innovation from Gradient was to focus on tuning the theta hyperparameter that governs the frequency of the rotational encoding.
    Audio note: If you want the details, jump to 15:55 in the podcast (or scroll down to the transcript!)
    By carefully increasing theta as context length grew, they were able to scale Llama3 up to 1 million tokens and potentially beyond.
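    Here is a minimal NumPy sketch of RoPE with a tunable theta; this shows the general recipe rather than Gradient’s exact code, and the base values below are just commonly cited defaults.

```python
import numpy as np

def rope_rotate(x: np.ndarray, positions: np.ndarray, theta: float) -> np.ndarray:
    """Apply rotary position embeddings to x of shape (seq_len, dim)."""
    dim = x.shape[-1]
    # Each pair of dims (2i, 2i+1) rotates at frequency theta^(-2i/dim).
    freqs = 1.0 / (theta ** (np.arange(0, dim, 2) / dim))
    angles = np.outer(positions, freqs)          # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Raising theta stretches the rotation wavelengths, so distant positions still
# map to distinguishable angles: the intuition behind increasing theta as the
# target context window grows.
x = np.random.randn(8, 64)
base_ctx = rope_rotate(x, np.arange(8), theta=10_000.0)   # common default base
long_ctx = rope_rotate(x, np.arange(8), theta=500_000.0)  # Llama-3-style larger base
```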
    Once you've scaled positional embeddings, there's still the issue of attention's quadratic complexity, and how longer and longer

    • 57 min

Top Podcasts In Technology

Acquired
Ben Gilbert and David Rosenthal
The Vergecast
The Verge
Lex Fridman Podcast
Lex Fridman
FT Tech Tonic
Financial Times
Macworld Podcast
IDG
The Neuron: AI Explained
The Neuron

You Might Also Like

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis
Erik Torenberg, Nathan Labenz
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
Sam Charrington
Practical AI: Machine Learning, Data Science
Changelog Media
Machine Learning Street Talk (MLST)
Machine Learning Street Talk (MLST)
No Priors: Artificial Intelligence | Technology | Startups
Conviction | Pod People
Dwarkesh Podcast
Dwarkesh Patel