78 episodes

The podcast by and for AI Engineers! In 2023, over 1 million visitors came to Latent Space to hear about news, papers and interviews in Software 3.0.

We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers pushing the cutting edge. We strive to give you everything from the definitive take on the Current Thing to a first introduction to the tech you'll be using in the next 3 months! We break news and land exclusive interviews with OpenAI, tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al.

Full show notes always on https://latent.space

www.latent.space

Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and more — Alessio + swyx

    • Technology


    Llama 2, 3 & 4: Synthetic Data, RLHF, Agents on the path to Open Source AGI

    If you see this in time, join our emergency LLM paper club on the Llama 3 paper!
    For everyone else, join our special AI in Action club on the Latent Space Discord for a special feature with the Cursor cofounders on Composer, their newest coding agent!
    Today, Meta is officially releasing the largest and most capable open model to date, Llama3-405B, a dense transformer trained on 15T tokens that beats GPT-4 on all major benchmarks.
    The 8B and 70B models from the April Llama 3 release have also received serious spec bumps, warranting the new label of Llama 3.1.
    If you are curious about the infra / hardware side, go check out our episode with Soumith Chintala, one of the AI infra leads at Meta. Today we have Thomas Scialom, who led Llama2 and now Llama3 post-training, so we spent most of our time on pre-training (synthetic data, data pipelines, scaling laws, etc) and post-training (RLHF vs instruction tuning, evals, tool calling).
    Synthetic data is all you need
    Llama3 was trained on 15T tokens, 7x more than Llama2 and with 4 times as much code and 30 different languages represented. But as Thomas beautifully put it:
    “My intuition is that the web is full of s**t in terms of text, and training on those tokens is a waste of compute.”
    “Llama 3 post-training doesn't have any human written answers there basically… It's just leveraging pure synthetic data from Llama 2.”
    While it is widely speculated that the 8B and 70B were "offline distillations" of the 405B, there are many more synthetic data elements to Llama 3.1 than expected. The paper explicitly calls out:
    * SFT for Code: 3 approaches to synthetic data, with the 405B bootstrapping itself via code execution feedback (a minimal sketch of this loop appears after this list), programming language translation, and docs backtranslation.
    * SFT for Math: The Llama 3 paper credits the Let’s Verify Step By Step authors, whom we interviewed at ICLR.
    * SFT for Multilinguality: "To collect higher quality human annotations in non-English languages, we train a multilingual expert by branching off the pre-training run and continuing to pre-train on a data mix that consists of 90% multilingual tokens."
    * SFT for Long Context: "It is largely impractical to get humans to annotate such examples due to the tedious and time-consuming nature of reading lengthy contexts, so we predominantly rely on synthetic data to fill this gap. We use earlier versions of Llama 3 to generate synthetic data based on the key long-context use-cases: (possibly multi-turn) question-answering, summarization for long documents, and reasoning over code repositories, and describe them in greater detail below"
    * SFT for Tool Use: trained for Brave Search, Wolfram Alpha, and a Python Interpreter (a special new ipython role) for single, nested, parallel, and multiturn function calling.
    * RLHF: DPO preference data was used extensively on Llama 2 generations. This is something we partially covered in RLHF 201: humans are often better at judging between two options (e.g. which of two poems they prefer) than at creating one from scratch. Similarly, models might not be great at creating text, but they can be good at classifying its quality.
    Last but not least, Llama 3.1 received a license update explicitly allowing its use for synthetic data generation.
    Llama2 was also used as a classifier for all the pre-training data that went into the model: it labelled data by quality, so that low-quality tokens could be removed, and by type (e.g. science, law, politics), to achieve a balanced data mix.
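    To make the code-execution-feedback idea concrete, here is a minimal sketch of such a bootstrapping loop, assuming a stubbed-out `model.generate` call and hypothetical `problem.prompt` / `problem.tests` fields; it illustrates the technique, not Meta's actual pipeline:

    ```python
    import subprocess
    import sys
    import tempfile

    def passes_tests(code: str, test_code: str, timeout: int = 10) -> bool:
        """Run a candidate solution together with its unit tests; True if all pass."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code + "\n\n" + test_code)
            path = f.name
        try:
            proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
            return proc.returncode == 0
        except subprocess.TimeoutExpired:
            return False

    def bootstrap_code_sft(model, problems, samples_per_problem: int = 8):
        """Keep only (prompt, solution) pairs whose code actually executes correctly."""
        sft_pairs = []
        for problem in problems:
            for _ in range(samples_per_problem):
                solution = model.generate(problem.prompt)  # hypothetical model API
                if passes_tests(solution, problem.tests):
                    sft_pairs.append((problem.prompt, solution))
                    break  # one verified solution per problem is enough here
        return sft_pairs
    ```

    The execution filter is the whole trick: only generations that pass their tests enter the SFT mix, so the model is supervised by executable ground truth rather than by its own unverified output.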
    Tokenizer size matters
    The token vocabulary of a model is the collection of all tokens the model can use. Llama2 had a 32,000-token vocab, GPT-4 has 100,000, and 4o went up to 200,000. Llama3 went up 4x to 128,000 tokens. You can find the GPT-4 vocab list on GitHub.
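    As a quick way to see these numbers yourself, OpenAI's tiktoken library exposes its encodings (the Llama tokenizers ship with the model weights instead, so they are omitted from this sketch):

    ```python
    import tiktoken  # pip install tiktoken

    text = "Latent Space is the podcast by and for AI Engineers."
    # cl100k_base is GPT-4's encoding; o200k_base is GPT-4o's.
    for name in ["cl100k_base", "o200k_base"]:
        enc = tiktoken.get_encoding(name)
        print(name, "vocab size:", enc.n_vocab, "tokens for text:", len(enc.encode(text)))
    ```

    Note how the larger vocab tends to encode the same text in fewer tokens.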
    This is something that people gloss over, but there are many reasons why a large vocab matters:
    * More tokens allow the model to represent more concepts and better capture nuance.
    * The larger the token

    • 1 hr 5 min
    Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge

    The first AI Engineer World’s Fair talks from OpenAI and Cognition are up!
    In our Benchmarks 101 episode back in April 2023 we covered the history of AI benchmarks, their shortcomings, and our hopes for better ones.
    Fast forward 1.5 years, and the pace of model development has far exceeded the speed at which benchmarks are updated. Frontier labs still use MMLU and HumanEval for model marketing, even though most models are reaching a natural plateau at ~90% success rates (any higher and they’re probably just memorizing/overfitting).
    From Benchmarks to Leaderboards
    Beyond being stale, lab-reported benchmarks also suffer from non-reproducibility. The models served through the API change over time, so the same benchmark can return different scores at different points in time.
    Today’s guest, Clémentine Fourrier, is the lead maintainer of HuggingFace’s OpenLLM Leaderboard. Their goal is to standardize how models are evaluated by curating a set of high-quality benchmarks, and then publishing the results in a reproducible way with tools like EleutherAI’s Harness.
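    For a flavor of what a reproducible run looks like, here is a hedged sketch using the harness's Python entry point (argument names vary across harness versions, and the model here is just a placeholder):

    ```python
    # pip install lm-eval
    import lm_eval

    # Score a Hugging Face model on a couple of benchmark tasks.
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=meta-llama/Meta-Llama-3-8B",
        tasks=["mmlu", "hellaswag"],
        batch_size=8,
    )
    print(results["results"])  # per-task scores, reproducible and publishable
    ```

    Because the task definitions, few-shot formatting, and metrics are pinned by the harness, anyone can re-run the same evaluation and compare numbers, which is exactly what lab-reported benchmarks lack.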
    The leaderboard first launched in summer 2023 and quickly became the de facto standard for open source LLM performance. To give you a sense of the scale:
    * Over 2 million unique visitors
    * 300,000 active community members
    * Over 7,500 models evaluated
    Last week they announced the second version of the leaderboard. Why? Because models were getting too good!
    The new version of the leaderboard is based on 6 benchmarks:
    * 📚 MMLU-Pro (Massive Multitask Language Understanding - Pro version, paper)
    * 📚 GPQA (Google-Proof Q&A Benchmark, paper)
    * 💭MuSR (Multistep Soft Reasoning, paper)
    * 🧮 MATH (Mathematics Aptitude Test of Heuristics, Level 5 subset, paper)
    * 🤝 IFEval (Instruction Following Evaluation, paper)
    * 🧮 🤝 BBH (Big Bench Hard, paper)
    You can read the reasoning behind each of them on their announcement blog post. These updates had some clear winners and losers, with models jumping up or down by as many as 50 spots at once; the most likely explanation is that those models were overfit to the old benchmarks, or had some contamination in their training dataset.
    But the most important change is in the absolute scores. All models score much lower on v2 than they do on v1, which now creates a lot more room for models to show improved performance.
    On Arenas
    Another high-signal platform for AI Engineers is the LMSys Arena, which asks users to rank the outputs of two different models on the same prompt, then assigns the models Elo scores based on the outcomes.
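    The rating mechanics are easy to sketch (a minimal illustration of a pairwise Elo update; in practice the Arena team computes ratings with the closely related Bradley-Terry model):

    ```python
    def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
        """One pairwise Elo update: bigger upsets move ratings more."""
        expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
        delta = k * (1.0 - expected)  # the winner gains what the loser gives up
        return r_winner + delta, r_loser - delta

    # A lower-rated model beating a higher-rated one gains ~20 points...
    print(elo_update(1000.0, 1100.0))  # ≈ (1020.5, 1079.5)
    # ...while beating an equal opponent gains only k/2 = 16.
    print(elo_update(1000.0, 1000.0))  # (1016.0, 984.0)
    ```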
    Clémentine called arenas “sociological experiments”: they tell you a lot about user preferences, but not always much about model capabilities. She pointed to Anthropic’s sycophancy paper as early research in this space:
    We find that when a response matches a user’s views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time.
    The other issue is that Arena rankings aren’t reproducible: you don’t know who ranked what, or exactly what the outputs were at the time of ranking. They are still quite helpful as tools, but they aren’t a rigorous way to rank model capabilities.
    Her advice for both arena and leaderboard is to use these tools as ranges; find 3-4 models that fit your needs (speed, cost, capabilities, etc) and then do vibe checks to figure out which one is best for your specific task.
    LLMs aren’t good judges
    In the last ~6 months, there has been an increased interest in using LLMs as Judges: rather than asking a person to evaluate the outcome of a model, you can ask a more powerful LLM to score it. We covered this a bit in our Brightwave episode last month as well. HuggingFace also has a cookbook on it, but Clémentine was actually not a fan of this approach:
    * Mode collapse: if you are asking a model to choose which ou

    • 58 min
    The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka

    Livestreams for the AI Engineer World’s Fair (Multimodality ft. the new GPT-4o demo, GPUs and Inference (ft. Cognition/Devin), CodeGen, Open Models tracks) are now live! Subscribe to @aidotEngineer to get notifications of the other workshops and tracks!
    It’s easy to get desensitized to new models topping leaderboards every other week — however, the top of the LMsys leaderboard has typically been the exclusive domain of very large, very well funded model labs like OpenAI, Anthropic, Google, and Meta. OpenAI had about 600 people at the time of GPT-4, and Google Gemini had 950 co-authors. This is why Reka Core made waves in May - not only debuting at #7 on the leaderboard, but doing so with all-new GPU infrastructure, 20 employees, and a relatively puny $60m in funding.
    Shortly after the release of GPT-3, Sam Altman speculated on the qualities of “10,000x researchers”:
    * “They spend a lot of time reflecting on some version of the Hamming question—"what are the most important problems in your field, and why aren’t you working on them?” In general, no one reflects on this question enough, but the best people do it the most, and have the best ‘problem taste’, which is some combination of learning to think independently, reason about the future, and identify attack vectors.” — sama
    * Taste is something both John Schulman and Yi Tay emphasize greatly
    * “They have a laser focus on the next step in front of them combined with long-term vision.” — sama
    * “They are extremely persistent and willing to work hard… They have a bias towards action and trying things, and they’re clear-eyed and honest about what is working and what isn’t” — sama
    “There's a certain level of sacrifice to be an AI researcher, especially if you're training LLMs, because you cannot really be detached… your jobs could die on a Saturday at 4am, and there are people who will just leave it dead until Monday morning, or there will be people who will crawl out of bed at 4am to restart the job, or check the TensorBoard” – Yi Tay (at 28 mins)

    “I think the productivity hack that I have is, I didn't have a boundary between my life and my work for a long time. So I think I just cared a lot about working most of the time. Actually, during my PhD, Google and everything [else], I'll be just working all the time. It's not like the most healthy thing, like ever, but I think that that was actually like one of the biggest, like, productivity, like and I spent, like, I like to spend a lot of time, like, writing code and I just enjoy running experiments, writing code” — Yi Tay (at 90 mins)
    * See @YiTayML example for honest alpha on what is/is not working
    and so on.
    More recently, Yi’s frequent co-author, Jason Wei, wrote about the existence of Yolo researchers he witnessed at OpenAI.
    Given the very aggressive timeline — Yi left Google in April 2023, was GPU constrained until December 2023, and then Reka Flash (21B) was released in Feb 2024, and Reka Core (??B) was released in April 2024 — Reka’s 3-5 person pretraining team had no other choice but to do Yolo runs. Per Yi:
    “Scaling models systematically generally requires one to go from small to large in a principled way, i.e., run experiments in multiple phases (1B->8B->64B->300B etc) and pick the winners and continuously scale them up. In a startup, we had way less compute to perform these massive sweeps to check hparams. In the end, we had to work with many Yolo runs (that fortunately turned out well).
    In the end it took us only a very small number of smaller scale & shorter ablation runs to get to the strong 21B Reka Flash and 7B edge model (and also our upcoming largest core model). Finding a solid recipe with a very limited number of runs is challenging and requires changing many variables at once given the ridiculously enormous search space. In order to do this, one has to abandon the systematicity of Bigtech and rely a lot on “Yolo”, gut

    • 1 hr 44 min
    State of the Art: Training >70B LLMs on 10,000 H100 clusters

    It’s return guest season here at Latent Space! We last talked to Kanjun in October and Jonathan in May (and December post Databricks acquisition):



    Imbue and Databricks are back for a rare treat: a double-header interview talking about DBRX from Databricks and Imbue 70B, a new internal LLM that “outperforms GPT-4o” zero-shot on a range of reasoning and coding-related benchmarks and datasets, while using 7x less data than Llama 3 70B.
    While Imbue, being an agents company rather than a model provider, is not releasing its models today, it is releasing almost everything else:
    * Cleaned-up and extended versions of 11 of the most popular NLP reasoning benchmarks
    * An entirely new code-focused reasoning benchmark
    * A fine-tuned 70B model, built with Meta Llama 3, to identify ambiguity
    * A new dataset of 450,000 human judgments about ambiguity
    * Infrastructure scripts for bringing a cluster from bare metal to robust, high performance training
    * Their cost-aware hyperparameter optimizer, CARBS, which automatically and systematically tunes all hyperparameters to derive optimum performance for models of any size (an illustrative sketch of the cost-aware idea follows below)
    As well as EXTREMELY detailed posts on their infrastructure needs and hyperparameter search, plus cleaned-up versions of industry-standard benchmarks (currently in a sorry state). This means that for the FIRST TIME (perhaps since Meta’s OPT-175B in 2022?) you have this level of educational detail into the hardware and ML nitty-gritty of training extremely large LLMs, and if you are in fact training LLMs at this scale you now have evals, optimizers, scripts, and human data/benchmarks you can use to move the industry forward together with Imbue.
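    To give a flavor of what "cost-aware" means here, below is an illustrative sketch of the core idea rather than the actual CARBS API: every candidate configuration is scored on both validation loss and training cost, and the search keeps only the Pareto frontier, so cheap-but-decent configs survive alongside expensive-but-strong ones.

    ```python
    import random

    def train_and_measure(config: dict) -> tuple[float, float]:
        """Placeholder for a real (short) training run: returns (val_loss, cost)."""
        cost = config["params_b"] * config["tokens_b"]          # crude compute proxy
        loss = 3.0 / cost ** 0.1 + random.uniform(0.0, 0.05)    # toy scaling curve
        return loss, cost

    def pareto_frontier(trials):
        """Keep trials that no other trial beats on BOTH loss and cost."""
        frontier = []
        for cfg, loss, cost in trials:
            if any(l <= loss and c <= cost for _, l, c in frontier):
                continue  # dominated by an existing frontier point
            frontier = [t for t in frontier if not (loss <= t[1] and cost <= t[2])]
            frontier.append((cfg, loss, cost))
        return frontier

    trials = []
    for _ in range(50):
        cfg = {"params_b": random.choice([1, 8, 70]),
               "tokens_b": random.choice([100, 1_000, 15_000])}
        trials.append((cfg, *train_and_measure(cfg)))

    for cfg, loss, cost in pareto_frontier(trials):
        print(f"{cfg} -> loss={loss:.3f}, cost={cost:,}")
    ```

    CARBS itself is considerably more principled (it models the loss/cost tradeoff and resamples near the frontier), but the Pareto bookkeeping above is the basic shape of the problem.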
    We are busy running the sold-out AI Engineer World’s Fair today, and so are unable to do our usual quality writeup, however, please enjoy our show notes and the excellent conversation! Thanks also to Kanjun, Ashley, Tom and the rest of team Imbue for setting up this interview behind the scenes.
    Video pod

    Timestamps
    * [00:00:00] Introduction and catch up with guests
    * [00:01:55] Databricks' text to image model release
    * [00:03:46] Details about the DBRX model
    * [00:05:26] Imbue's infrastructure, evaluation, and hyperparameter optimizer releases
    * [00:09:18] Challenges of training foundation models and getting infrastructure to work
    * [00:12:03] Details of Imbue's cluster setup
    * [00:18:53] Process of bringing machines online and common failures
    * [00:22:52] Health checks and monitoring for the cluster
    * [00:25:06] Typical timelines and team composition for setting up a cluster
    * [00:27:24] Monitoring GPU utilization and performance
    * [00:29:39] Open source tools and libraries used
    * [00:32:33] Reproducibility and portability of cluster setup
    * [00:35:57] Infrastructure changes needed for different model architectures
    * [00:40:49] Imbue's focus on text-only models for coding and reasoning
    * [00:42:26] CARBS hyperparameter tuner and cost-aware optimization
    * [00:51:01] Emergence and CARBS
    * [00:53:18] Evaluation datasets and reproducing them with high quality
    * [00:58:40] Challenges of evaluating on more realistic tasks
    * [01:06:01] Abstract reasoning benchmarks like ARC
    * [01:10:13] Long context evaluation and needle-in-a-haystack tasks
    * [01:13:50] Function calling and tool use evaluation
    * [01:19:19] Imbue's future plans for coding and reasoning applications
    * [01:20:14] Databricks' future plans for useful applications and upcoming blog posts


    Transcript
    SWYX [00:00:00]: Welcome to the Latent Space Podcast, another super special edition. Today, we have sort of like a two-header. Jonathan Frankle from Mosaic Databricks, or Databricks Mosaic, and Josh Albrecht from Imbue. Welcome.
    JOSH [00:00:12]: Hey, glad to be here.
    SWYX [00:00:14]: Thank you for having us. Hey, so both of you are kind of past guests. Jonathan, you were actually one of the most popular episodes from last year talking about MPT7B. Remember the days when we trained large models and there was 7B?
    JONATHAN [00:00:30]: Yeah, back

    • 1 hr 21 min
    [High Agency] AI Engineer World's Fair Preview

    The World’s Fair is officially sold out! Thanks for all the support and stay tuned for recaps of all the great goings on in this very special celebration of the AI Engineer!
    Longtime listeners will remember the fan favorite Raza Habib, CEO of HumanLoop, on the pod.
    Well, he’s caught the podcasting bug and is now flipping the tables on swyx!
    Subscribe to High Agency wherever the finest Artificial Intelligence podcasts are sold.

    High Agency Pod Description
    In this episode, I chatted with Shawn Wang about his upcoming AI engineering conference and what an AI engineer really is. It's been a year since he penned the viral essay "Rise of the AI Engineer" and we discuss whether this new role will be enduring, the makeup of the optimal AI team, and trends in machine learning.
    Timestamps
    00:00 - Introduction and background on Shawn Wang (Swyx)
    03:45 - Reflecting on the "Rise of the AI Engineer" essay
    07:30 - Skills and characteristics of AI Engineers
    12:15 - Team composition for AI products
    16:30 - Vertical vs. horizontal AI startups
    23:00 - Advice for AI product creators and leaders
    28:15 - Tools and buying vs. building for AI products
    33:30 - Key trends in AI research and development
    41:00 - Closing thoughts and information on the AI Engineer World's Fair Summit
    Video


    Get full access to Latent Space at www.latent.space/subscribe

    • 49 min
    How To Hire AI Engineers — with James Brady & Adam Wiggins of Elicit

    Editor’s note: One of the top reasons we have hundreds of companies and thousands of AI Engineers joining the World’s Fair next week is, apart from discussing technology and being present for the big launches planned, to hire and be hired!
    Listeners loved our previous Elicit episode and were so glad to welcome 2 more members of Elicit back for a guest post (and bonus podcast) on how they think through hiring. Don’t miss their AI engineer job description and template, which you can use to create your own hiring plan!
    How to Hire AI Engineers
    James Brady, Head of Engineering @ Elicit (ex Spring, Square, Trigger.io, IBM)
    Adam Wiggins, Internal Journalist @ Elicit (Cofounder Ink & Switch and Heroku)
    If you’re leading a team that uses AI in your product in some way, you probably need to hire AI engineers. As defined in this article, that’s someone with conventional engineering skills in addition to knowledge of language models and prompt engineering, without being a full-fledged Machine Learning expert.
    But how do you hire someone with this skillset? At Elicit we’ve been applying machine learning to reasoning tools since 2018, and our technical team is a mix of ML experts and what we can now call AI engineers. This article will cover our process from job description through interviewing. (You can also flip the perspectives here and use it just as easily for how to get hired as an AI engineer!)
    My own journey
    Before getting into the brass tacks, I want to share my journey to becoming an AI engineer.
    Up until a few years ago, I was happily working my job as an engineering manager of a big team at a late-stage startup. Like many, I was tracking the rapid increase in AI capabilities stemming from the deep learning revolution, but it was the release of GPT-3 in 2020 which was the watershed moment. At the time, we were all blown away by how the model could string together coherent sentences on demand. (Oh how far we’ve come since then!)
    I’d been a professional software engineer for nearly 15 years—enough to have experienced one or two technology cycles—but I could see this was something categorically new. I found this simultaneously exciting and somewhat disconcerting. I knew I wanted to dive into this world, but it seemed like the only path was going back to school for a master’s degree in Machine Learning. I started talking with my boss about options for taking a sabbatical or doing a part-time distance learning degree.
    In 2021, I instead decided to launch a startup focused on productizing new research ideas on ML interpretability. It was through that process that I reached out to Andreas—a leading ML researcher and founder of Elicit—to see if he would be an advisor. Over the next few months, I learned more about Elicit: that they were trying to apply these fascinating technologies to the real-world problems of science, and with a business model that aligned it with safety goals. I realized that I was way more excited about Elicit than I was about my own startup ideas, and wrote about my motivations at the time.
    Three years later, it’s clear this was a seismic shift in my career on the scale of when I chose to leave my comfy engineering job at IBM to go through the Y Combinator program back in 2008. Working with this new breed of technology has been more intellectually stimulating, challenging, and rewarding than I could have imagined.
    Deep ML expertise not required
    It’s important to note that AI engineers are not ML experts, nor is that their best contribution to a tech team.
    In our article Living documents as an AI UX pattern, we wrote:
    It’s easy to think that AI advancements are all about training and applying new models, and certainly this is a huge part of our work in the ML team at Elicit. But those of us working in the UX part of the team believe that we have a big contribution to make in how AI is applied to end-user problems.
    We think of LLMs as a new medium to work with, one that we’ve b

    • 1 hr 3 min
