Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

Alessio + swyx

The podcast by and for AI Engineers! In 2023, over 1 million visitors came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. Striving to give you everything from the definitive take on the Current Thing to your first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from OpenAI, tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space

  1. 3 DAYS AGO

    The Ultimate Guide to Prompting

    Noah Hein from Latent Space University is finally launching with a free lightning course this Sunday for those new to AI Engineering. Tell a friend! Did you know there are >1,600 papers on arXiv just about prompting? Between shots, trees, chains, self-criticism, planning strategies, and all sorts of other weird names, it's hard to keep up. Luckily for us, Sander Schulhoff and team read them all and put together The Prompt Report as the ultimate prompt engineering reference, which we'll break down step-by-step in today's episode. In 2022 swyx wrote "Why "Prompt Engineering" and "Generative AI" are overhyped"; the TLDR being that if you're relying on prompts alone to build successful products, you're ngmi. Prompt engineering has since moved from being a stand-alone job to a core skill for AI Engineers. We won't repeat everything that is written in the paper, but this diagram encapsulates the state of prompting today: confusing. There are many similar terms, esoteric approaches with doubtful impact on results, and lots of people just trying to spin a single prompt into a full paper to get more publications out. Luckily, some of the best prompting techniques are being tuned back into the models themselves, as we've seen with o1 and Chain-of-Thought (see our OpenAI episode). Similarly, OpenAI recently announced 100% guaranteed JSON schema adherence, and Anthropic, Cohere, and Gemini all have JSON Mode (not sure if 100% guaranteed yet). No more "return JSON or my grandma is going to die" required. The next debate is human-crafted prompts vs automated approaches using frameworks like DSPy, which Sander recommended: "I spent 20 hours prompt engineering for a task and DSPy beat me in 10 minutes." It's much more complex than simply writing a prompt (and I'm not sure how many people usually spend >20 hours prompt engineering one task), but if you're hitting a roadblock it might be worth checking out (a minimal DSPy sketch follows at the end of this entry). Prompt Injection and Jailbreaks Sander and team also worked on HackAPrompt, a paper that was the outcome of an online challenge on prompt hacking techniques. They similarly created a taxonomy of prompt attacks, which is very handy if you're building products with user-facing LLM interfaces that you'd like to test: In this episode we basically break down every category and highlight the overrated and underrated techniques in each of them. If you haven't spent time following the prompting meta, this is a great episode to catch up! Full Video Episode Like and subscribe on YouTube! Timestamps * [00:00:00] Introductions - Intro music by Suno AI * [00:07:32] Navigating arXiv for paper evaluation * [00:12:23] Taxonomy of prompting techniques * [00:15:46] Zero-shot prompting and role prompting * [00:21:35] Few-shot prompting design advice * [00:28:55] Chain of thought and thought generation techniques * [00:34:41] Decomposition techniques in prompting * [00:37:40] Ensembling techniques in prompting * [00:44:49] Automatic prompt engineering and DSPy * [00:49:13] Prompt Injection vs Jailbreaking * [00:57:08] Multimodal prompting (audio, video) * [00:59:46] Structured output prompting * [01:04:23] Upcoming Hack-a-Prompt 2.0 project Show Notes * Sander Schulhoff * Learn Prompting * The Prompt Report * HackAPrompt * MineRL Competition * EMNLP Conference * Noam Brown * Jordan Boyd-Graber * Denis Peskov * Simon Willison * Riley Goodside * David Ha * Jeremy Nixon * Shunyu Yao * Nicholas Carlini * Dreadnode Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. 
This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:13]: Hey, and today we're in the remote studio with Sander Schulhoff, author of the Prompt Report. Sander [00:00:18]: Welcome. Thank you. Very excited to be here. Swyx [00:00:21]: Sander, I think I first chatted with you like over a year ago. What's your brief history? I wen
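    To make the DSPy comparison above concrete, here is a minimal sketch of the workflow Sander describes: declare a signature, wrap it in a module, and let an optimizer bootstrap few-shot demos against a metric instead of hand-tuning the prompt. The model name, toy dataset, and metric are placeholders, and the calls follow DSPy's documented interface rather than anything specific from the episode.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Placeholder model; any chat model supported by DSPy works here.
lm = dspy.OpenAI(model="gpt-4o-mini", max_tokens=256)
dspy.settings.configure(lm=lm)

class SupportTriage(dspy.Signature):
    """Classify a customer email as one of: billing, bug, feature_request."""
    email = dspy.InputField()
    label = dspy.OutputField(desc="one of: billing, bug, feature_request")

triage = dspy.ChainOfThought(SupportTriage)

# A tiny toy trainset and metric; the optimizer bootstraps few-shot
# demonstrations instead of you hand-writing the prompt.
trainset = [
    dspy.Example(email="I was charged twice this month", label="billing").with_inputs("email"),
    dspy.Example(email="The app crashes when I upload a file", label="bug").with_inputs("email"),
]

def exact_match(example, pred, trace=None):
    return example.label == pred.label

compiled_triage = BootstrapFewShot(metric=exact_match).compile(triage, trainset=trainset)
print(compiled_triage(email="Please add dark mode").label)
```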

    1h 9m
  2. 13 SEPT

    From API to AGI: Structured Outputs, OpenAI API platform and O1 Q&A — with Michelle Pokrass & OpenAI Devrel + Strawberry team

    Congrats to Damien on successfully running AI Engineer London! See our community page and the Latent Space Discord for all upcoming events. This podcast came together in a far more convoluted way than usual, but happened to result in a tight 2 hours covering the ENTIRE OpenAI product suite across ChatGPT-latest, GPT-4o and the new o1 models, and how they are delivered to AI Engineers in the API via the new Structured Output mode, Assistants API, client SDKs, upcoming Voice Mode API, Finetuning/Vision/Whisper/Batch/Admin/Audit APIs, and everything else you need to know to be up to speed in September 2024. This podcast has two parts: the first hour is a regular, well-edited podcast on 4o, Structured Outputs, and the rest of the OpenAI API platform. The second was a rushed, noisy, hastily cobbled together recap of the top takeaways from the o1 model release from yesterday and today. Building AGI with Structured Outputs — Michelle Pokrass of OpenAI API team Michelle Pokrass built massively scalable platforms at Google, Stripe, Coinbase and Clubhouse, and now leads the API Platform at OpenAI. She joins us today to talk about why structured output is such an important modality for AI Engineers that OpenAI has now trained and engineered a Structured Output mode with 100% reliable JSON schema adherence (a minimal usage sketch follows the timestamps below). To understand why this matters, a bit of history helps: * June 2023: OpenAI first added a "function calling" capability to GPT-4-0613 and GPT 3.5 Turbo 0613 (our podcast/writeup here) * November 2023: OpenAI Dev Day (our podcast/writeup here), where the team shipped JSON Mode, a simpler schema-less JSON output mode that nevertheless became more popular because function calling often failed to match the JSON schema given by developers. * Meanwhile, in open source, many solutions arose, including * Instructor (our pod with Jason here) * LangChain (our pod with Harrison here, and he is returning next as a guest co-host) * Outlines (Remi Louf’s talk at AI Engineer here) * Llama.cpp’s constrained grammar sampling using GGML-BNF * April 2024: OpenAI started implementing constrained sampling with a new `tool_choice: required` parameter in the API * August 2024: the new Structured Output mode, co-led by Michelle * Sept 2024: Gemini shipped Structured Outputs as well We sat down with Michelle to talk through every part of the process, as well as quizzing her for updates on everything else the API team has shipped in the past year, from the Assistants API to Prompt Caching, GPT4 Vision, Whisper, the upcoming Advanced Voice Mode API, OpenAI Enterprise features, and why every Waterloo grad seems to be a cracked engineer. Part 1 Timestamps and Transcript Transcript here. 
* [00:00:42] Episode Intro from Suno * [00:03:34] Michelle's Path to OpenAI * [00:12:20] Scaling ChatGPT * [00:13:20] Releasing Structured Output * [00:16:17] Structured Outputs vs Function Calling * [00:19:42] JSON Schema and Constrained Grammar * [00:20:45] OpenAI API team * [00:21:32] Structured Output Refusal Field * [00:24:23] ChatML issues * [00:26:20] Function Calling Evals * [00:28:34] Parallel Function Calling * [00:29:30] Increased Latency * [00:30:28] Prompt/Schema Caching * [00:30:50] Building Agents with Structured Outputs: from API to AGI * [00:31:52] Assistants API * [00:34:00] Use cases for Structured Output * [00:37:45] Prompting Structured Output * [00:39:44] Benchmarking Prompting for Structured Outputs * [00:41:50] Structured Outputs Roadmap * [00:43:37] Model Selection vs GPT4 Finetuning * [00:46:56] Is Prompt Engineering Dead? * [00:47:29] 2 models: ChatGPT Latest vs GPT 4o August * [00:50:24] Why API => AGI * [00:52:40] Dev Day * [00:54:20] Assistants API Roadmap * [00:56:14] Model Reproducibility/Determinism issues * [00:57:53] Tiering and Rate Limiting * [00:59:26] OpenAI vs Ops Startups * [01:01:06] Batch API * [01:02:54] Vision * [01:04:42] Whisper * [01:07:21] Voice Mode API * [01:08:10] Enterprise: Admin/Audit Log APIs
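    As promised above, here is a minimal usage sketch of the Structured Output mode, based on OpenAI's published Python SDK interface for the `gpt-4o-2024-08-06` release; the schema and prompt are placeholders, not anything from the episode.

```python
from openai import OpenAI
from pydantic import BaseModel

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",            # first model trained for Structured Outputs
    messages=[
        {"role": "system", "content": "Extract the event details."},
        {"role": "user", "content": "Alice and Bob are getting coffee on Friday."},
    ],
    response_format=CalendarEvent,        # Pydantic model is compiled to a strict JSON schema
)

message = completion.choices[0].message
if message.refusal:                        # the refusal field discussed in the episode
    print(message.refusal)
else:
    event = message.parsed                 # already a CalendarEvent instance
    print(event.name, event.participants)
```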

    2h 4m
  3. 3 SEPT

    Efficiency is Coming: 3000x Faster, Cheaper, Better AI Inference from Hardware Improvements, Quantization, and Synthetic Data Distillation

    AI Engineering is expanding! Join the first 🇬🇧 AI Engineer London meetup in Sept and get in touch about sponsoring the second 🗽 AI Engineer Summit in NYC this Dec! The commoditization of intelligence takes on a few dimensions: * Time to Open Model Equivalent: 15 months between GPT-4 and Llama 3.1 405B * 10-100x CHEAPER/year: from $30/mtok for Claude 3 Opus to $3/mtok for L3-405B, and a 400x reduction in the frontier OpenAI model from 2022-2024. Notably, for personal use cases, both Gemini Flash and now Cerebras Inference offer 1m tokens/day of free inference, causing the Open Model Red Wedding. * Alternatively you can observe the frontiers of various small/medium/large sizes of intelligence per dollar shift in realtime. 2024 has been particularly aggressive, with almost two orders of magnitude of improvement in $/Elo points in the last 8 months. * 4-8x FASTER/year: The new Cerebras Inference platform runs 70B models at 450 tok/s, almost twice as fast as the Groq Cloud example that went viral earlier this year (and at $0.60/mtok to boot). James Wang says they have room to ”~8x throughput in the next few months”, which remains to be seen in reality and at scale, but is very exciting for downstream latency/throughput-sensitive use cases. Today’s guest, Nyla Worker, a senior PM at Nvidia, Convai, and now Google, and recently host of the GPUs & Inference track at the World’s Fair, was the first to point out to us that the kind of efficiency improvements that have become a predominant theme in LLMs in 2024 have been seen before in her computer vision career. From her start at eBay optimizing V100 inference for a ResNet-50 image search model, she has watched many improvements like Multi-Instance GPU (allowing multiple instances with perfect hardware parallelism), Quantization Aware Training (most recently highlighted by Noam Shazeer before his Character AI departure) and Model Distillation (most recently highlighted by the Llama 3.1 paper) stack with baseline hardware improvements (from V100s to A100s to H100s to GH200s) to produce theoretically 3000x faster inference now than 6 years ago (a toy int8 quantization sketch follows at the end of this entry). What Nyla saw over the last 6 years of her career is happening to LLMs today (not exactly repeating, but surely rhyming), specifically with LoRAs, native Int8 and even Ternary models, and teacher model distillation. We were excited to delve into all things efficiency in this episode and even come out the other side with bonus discussions on what generative AI can do for gaming, fanmade TV shows, character AI conversations, and even podcasting! Show Notes: * Nyla Linkedin, Twitter * Related Nvidia research * Improving INT8 Accuracy Using Quantization Aware Training and the NVIDIA TAO Toolkit * Nvidia Jetson Nano: Bringing the power of modern AI to millions of devices. 
* Synthetic Data with Nvidia Omniverse Replicator: Accelerate AI Training Faster Than Ever with New NVIDIA Omniverse Replicator Capabilities Timestamps * [00:00:00] Intro from Suno * [00:03:17] Nyla's path from Astrophysics to LLMs * [00:05:45] Efficiency Curves in Computer Vision at Nvidia * [00:09:51] Optimizing for today's hardware vs tomorrow's inference * [00:16:33] Quantization vs Precision tradeoff * [00:20:42] Hitting the Data Wall: The need for Synthetic Data at Nvidia * [00:26:20] Sora, text to 3D models, and Synthetic Data from Game Engines * [00:30:55] ResNet 50 keeps coming back * [00:35:40] Gaming Benchmarks * [00:38:00] FineWeb * [00:39:43] Traditional ML vs LLMs path to general intelligence * [00:42:33] ConvAI - AI NPCs * [00:45:32] Jensen and Lisa at Computex Taiwan * [00:52:51] NPCs need to take Actions and have Context * [00:54:29] Simulating different roles for training * [00:58:37] AI Generated Fan Content - Podcasts, TV Show, Einstein Transcripts [00:00:29] AI Charlie: Happy September. This is your AI co host, Charlie. [00:00:34] AI Charlie: One topic we've developed on LatentSpace is the importance of efficiency in all forms, from sample effici
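    As promised above, here is a toy illustration of the quantization theme (nothing Nvidia-specific): naive post-training symmetric int8 quantization gives 4x less memory per weight, at the cost of rounding error that techniques like Quantization Aware Training are designed to absorb during training.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: 4 bytes/weight -> 1 byte/weight."""
    scale = np.abs(w).max() / 127.0                      # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)      # a toy weight matrix
q, scale = quantize_int8(w)

print(f"fp32: {w.nbytes / 1e6:.0f} MB, int8: {q.nbytes / 1e6:.0f} MB")
print(f"max abs rounding error: {np.abs(w - dequantize(q, scale)).max():.4f}")
```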

    1h 5m
  4. 29 AUG

    Why you should write your own LLM benchmarks — with Nicholas Carlini, Google DeepMind

    Today's guest, Nicholas Carlini, a research scientist at DeepMind, argues that we should be focusing more on what AI can do for us individually, rather than trying to have an answer for everyone. "How I Use AI" - A Pragmatic Approach Carlini's blog post "How I Use AI" went viral for good reason. Instead of giving a personal opinion about AI's potential, he simply laid out how he, as a security researcher, uses AI tools in his daily work. He divided it into 12 sections: * To make applications * As a tutor * To get started * To simplify code * For boring tasks * To automate tasks * As an API reference * As a search engine * To solve one-offs * To teach me * Solving solved problems * To fix errors Each of the sections has specific examples, so we recommend going through it. It also includes all prompts used for it; in the "make applications" case, it's 30,000 words total! My personal takeaway is that the majority of the work AI can do successfully is what humans dislike doing. Writing boilerplate code, looking up docs, taking repetitive actions, etc. These are usually boring tasks with little creativity, but with a lot of structure. This is one of the strongest arguments for why LLMs, especially for code, are more beneficial to senior employees: if you can get the boring stuff out of the way, there's a lot more value you can generate. This is less and less true as you move toward entry-level jobs, which are mostly boring and repetitive tasks. Nicholas argues both sides ~21:34 in the pod. A New Approach to LLM Benchmarks We recently did a Benchmarks 201 episode, a follow up to our original Benchmarks 101, and some of the issues have stayed the same. Notably, there's a big discrepancy between what benchmarks like MMLU test and what the models are used for. Carlini created his own domain-specific language for writing personalized LLM benchmarks. The idea is simple but powerful: * Take tasks you've actually needed AI for in the past. * Turn them into benchmark tests. * Use these to evaluate new models based on your specific needs. It can represent very complex tasks, from a single code generation to drawing a US flag using C: "Write hello world in python" >> LLMRun() >> PythonRun() >> SubstringEvaluator("hello world") "Write a C program that draws an american flag to stdout." >> LLMRun() >> CRun() >> VisionLLMRun("What flag is shown in this image?") >> (SubstringEvaluator("United States") | SubstringEvaluator("USA")) (a toy sketch of how this pipeline style can be wired up in Python follows at the end of this entry). This approach solves a few problems: * It measures what's actually useful to you, not abstract capabilities. * It's harder for model creators to "game" your specific benchmark, a problem that has plagued standardized tests. * It gives you a concrete way to decide if a new model is worth switching to, similar to how developers might run benchmarks before adopting a new library or framework. Carlini argues that if even a small percentage of AI users created personal benchmarks, we'd have a much better picture of model capabilities in practice. AI Security While much of the AI security discussion focuses on either jailbreaks or existential risks, Carlini's research targets the space in between. Some highlights from his recent work: * LAION 400M data poisoning: By buying expired domains referenced in the dataset, Carlini's team could inject arbitrary images into the dataset, poisoning any model trained on LAION 400M. You can read the paper "Poisoning Web-Scale Training Datasets is Practical" for all the details. 
    This is a great example of expanding the scope beyond the model itself and looking at the whole system and how it can become vulnerable. * Stealing model weights: They demonstrated how to extract parts of production language models (like OpenAI's) through careful API queries. This research, "Stealing Part of a Production Language Model", shows that even black-box access can leak sensitive information. * Extracting training data: In some cases, they found ways to make models regurgitate verbatim snippets from their training data (see "Extracting Training Data from Large Language Models"). He and Mila
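    As promised above, here is a toy sketch of how a `>>` pipeline like Carlini's can be wired up in Python with operator overloading. This illustrates the mechanics only, not his actual implementation: `LLMRun` is stubbed out, and a real version would call a model and sandbox the generated code.

```python
class Stage:
    """One step in a toy benchmark pipeline."""
    def __init__(self, fn):
        self.fn = fn

    def __rrshift__(self, data):      # "prompt or intermediate output" >> stage
        return self.fn(data)

    def __or__(self, other):          # evaluator_a | evaluator_b: pass if either passes
        return Stage(lambda x: self.fn(x) or other.fn(x))

def LLMRun():
    # Placeholder: swap in a real model call; here we just return canned code.
    return Stage(lambda prompt: "print('hello world')")

def PythonRun():
    def run(code):
        import io, contextlib
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, {})             # a real harness would sandbox this
        return buf.getvalue()
    return Stage(run)

def SubstringEvaluator(needle):
    return Stage(lambda output: needle in output)

result = "Write hello world in python" >> LLMRun() >> PythonRun() >> SubstringEvaluator("hello world")
print(result)   # True with the stubbed model above
```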

    1h 10m
  5. 22 AUG

    Is finetuning GPT4o worth it? — with Alistair Pullen, Cosine (Genie)

    Betteridge's law says no: with seemingly infinite flavors of RAG, and >2 million token context + prompt caching from Anthropic/DeepMind/DeepSeek, it's reasonable to believe that "in context learning is all you need". But then there's Cosine Genie, the first to make a huge bet on OpenAI's new GPT-4o fine-tuning for code, at the largest scale it has ever been used externally, resulting in what is now the #1 coding agent in the world according to SWE-Bench Full, Lite, and Verified: SWE-Bench has been the most successful agent benchmark of the year, receiving honors at ICLR (our interview here) and recently being verified by OpenAI. Cognition (Devin) was valued at $2b after reaching 14% on it. So it is very, very big news when a new agent appears to beat all other solutions, by a lot: While this number is self-reported, it seems to be corroborated by OpenAI, who also awarded it the clear highest marks on SWE-Bench Verified: The secret is GPT-4o finetuning on billions of tokens of synthetic data. * Finetuning: As OpenAI says: Genie is powered by a fine-tuned GPT-4o model trained on examples of real software engineers at work, enabling the model to learn to respond in a specific way. The model was also trained to be able to output in specific formats, such as patches that could be committed easily to codebases. Due to the scale of Cosine’s finetuning, OpenAI worked closely with them to figure out the size of the LoRA: “They have to decide how big your LoRA adapter is going to be… because if you had a really sparse, large adapter, you’re not going to get any signal in that at all. So they have to dynamically size these things.” * Synthetic data: we need to finetune on the process of making code work instead of only training on working code. “…we synthetically generated runtime errors. Where we would intentionally mess with the AST to make stuff not work, or index out of bounds, or refer to a variable that doesn't exist, or errors that the foundational models just make sometimes that you can't really avoid, you can't expect it to be perfect.” (a toy sketch of this AST-mutation idea follows at the end of this entry) Genie also has a 4-stage workflow with the standard LLM OS tooling stack that lets it solve problems iteratively: Full Video Pod: like and subscribe etc! Show Notes * Alistair Pullen - Twitter, Linkedin * Cosine Genie launch, technical report * OpenAI GPT-4o finetuning GA * Llama 3 backtranslation * Cursor episode and Aman + SWEBench at ICLR episode Timestamps * [00:00:00] Suno Intro * [00:05:01] Alistair and Cosine intro * [00:16:34] GPT4o finetuning * [00:20:18] Genie Data Mix * [00:23:09] Customizing for Customers * [00:25:37] Genie Workflow * [00:27:41] Code Retrieval * [00:35:20] Planning * [00:42:29] Language Mix * [00:43:46] Running Code * [00:46:19] Finetuning with OpenAI * [00:49:32] Synthetic Code Data * [00:51:54] SynData in Llama 3 * [00:52:33] SWE-Bench Submission Process * [00:58:20] Future Plans * [00:59:36] Ecosystem Trends * [01:00:55] Founder Lessons * [01:01:58] CTA: Hiring & Customers Descript Transcript [00:01:52] AI Charlie: Welcome back. This is Charlie, your AI cohost. As AI engineers, we have a special focus on coding agents, fine tuning, and synthetic data. And this week, it all comes together with the launch of Cosine's Genie, which reached 50 percent on SWE Bench Lite, 30 percent on the full SWE Bench, and 44 percent on OpenAI's new SWE Bench Verified. [00:02:17] All state of the art results by the widest ever margin recorded, compared to former leaders Amazon Q, AutoCodeRover, and Factory Code Droid. 
As a reminder, Cognition's Devin went viral with a 14 percent score just five months ago. Cosine did this by working closely with OpenAI to fine-tune GPT-4o, now generally available to you and me, on billions of tokens of code, much of which was synthetically generated. [00:02:47] Alistair Pullen: Hi, I'm Ali. Co-founder and CEO of Cosine, a human reasoning lab. And I'd like to show you Genie, our state of the art, fully autonomous software eng
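    As promised above, here is a toy illustration of the AST-mutation idea quoted in the synthetic data section (not Cosine's actual pipeline): corrupt a working snippet so that the resulting (broken code, error) pair can be used as a synthetic training example.

```python
import ast, random

def break_variable_reference(source: str, seed: int = 0) -> str:
    """Rename one variable *use* to an undefined name, producing a runtime error."""
    random.seed(seed)
    tree = ast.parse(source)
    loads = [n for n in ast.walk(tree) if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)]
    victim = random.choice(loads)
    victim.id = victim.id + "_undefined"   # now refers to a variable that doesn't exist
    return ast.unparse(tree)

working = """
total = 0
for x in [1, 2, 3]:
    total += x
print(total)
"""

broken = break_variable_reference(working)
print(broken)
try:
    exec(broken, {})
except Exception as e:                      # e.g. NameError: name '..._undefined' is not defined
    print(f"{type(e).__name__}: {e}")       # (broken code, error) becomes a training pair
```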

    1h 5m
  6. 16 AUG

    AI Magic: Shipping 1000s of successful products with no managers and a team of 12 — Jeremy Howard of Answer.ai

    Disclaimer: We recorded this episode ~1.5 months ago, timed for the FastHTML release. It then got bottlenecked by the Llama 3.1, Winds of AI Winter, and SAM 2 episodes, so we’re a little late. Since then FastHTML has been released, swyx is building an app in it for AINews, and Anthropic has also released their prompt caching API. Remember when Dylan Patel of SemiAnalysis coined the GPU Rich vs GPU Poor war? (If not, see our pod with him.) The idea was that if you’re GPU poor you shouldn’t waste your time trying to solve GPU rich problems (i.e. pre-training large models) and are better off working on fine-tuning, optimized inference, etc. Jeremy Howard (see our “End of Finetuning” episode to catch up on his background) and Eric Ries founded Answer.AI to do exactly that: “Practical AI R&D”, which is very in line with GPU-poor needs. For example, one of their first releases was a system based on FSDP + QLoRA that lets anyone train a 70B model on two NVIDIA 4090s (a toy sketch of the underlying QLoRA recipe follows at the end of this entry). Since then, they have come out with a long list of super useful projects (in no particular order, and non-exhaustive): * FSDP QDoRA: this is just as memory efficient and scalable as FSDP/QLoRA, and critically is also as accurate for continued pre-training as full weight training. * Cold Compress: a KV cache compression toolkit that lets you scale sequence length without impacting speed. * colbert-small: a state-of-the-art retriever at only 33M params * JaColBERTv2.5: a new state-of-the-art retriever on all Japanese benchmarks. * gpu.cpp: portable GPU compute for C++ with WebGPU. * Claudette: a better Anthropic API SDK. They also recently released FastHTML, a new way to create modern interactive web apps. Jeremy recently released a 1-hour “Getting started” tutorial on YouTube; while this isn’t AI-related per se, it’s close to home for any AI Engineer looking to iterate quickly on new products: In this episode we broke down 1) how they recruit, 2) how they organize what to research, and 3) how the community comes together. At the end, Jeremy gave us a sneak peek at something new that he’s working on that he calls dialogue engineering: “So I've created a new approach. It's not called prompt engineering. I'm creating a system for doing dialogue engineering. It's currently called AI magic. I'm doing most of my work in this system and it's making me much more productive than I was before I used it.” He explains it a bit more ~44:53 in the pod, but we’ll just have to wait for the public release to figure out exactly what he means. 
Timestamps * [00:00:00] Intro by Suno AI * [00:03:02] Continuous Pre-Training is Here * [00:06:07] Schedule-Free Optimizers and Learning Rate Schedules * [00:07:08] Governance and Structural Issues within OpenAI and Other AI Labs * [00:13:01] How Answer.ai works * [00:23:40] How to Recruit Productive Researchers * [00:27:45] Building a new BERT * [00:31:57] FSDP, QLoRA, and QDoRA: Innovations in Fine-Tuning Large Models * [00:36:36] Research and Development on Model Inference Optimization * [00:39:49] FastHTML for Web Application Development * [00:46:53] AI Magic & Dialogue Engineering * [00:52:19] AI wishlist & predictions Show Notes * Jeremy Howard * Previously on Latent Space: The End of Finetuning, NeurIPS Startups * Answer.ai * Fast.ai * FastHTML * answerai-colbert-small-v1 * gpu.cpp * Eric Ries * Aaron DeFazio * Yi Tai * Less Wright * Benjamin Warner * Benjamin Clavié * Jono Whitaker * Austin Huang * Eric Gilliam * Tim Dettmers * Colin Raffel * Mark Saroufim * Sebastian Raschka * Carson Gross * Simon Willison * Sepp Hochreiter * Llama3.1 episode * Snowflake Arctic * Ranger Optimizer * Gemma.cpp * HTMX * UL2 * BERT * DeBERTa * Efficient finetuning of Llama 3 with FSDP QDoRA * xLSTM Transcript Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. Swyx [00:00:14]: And today we'
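    As promised above, here is a rough sketch of the single-GPU QLoRA recipe that FSDP+QLoRA builds on, using the Hugging Face transformers/peft/bitsandbytes interfaces. The model name, rank, and target modules are placeholders; Answer.AI's contribution is the FSDP sharding layered on top so a quantized 70B fits across two 24GB consumer GPUs, which this sketch does not include.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NormalFloat base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",            # placeholder; the headline result used a 70B
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16,                                    # low-rank adapter size
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],     # attach adapters to attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the small LoRA matrices are trained
```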

    59 min
  7. 7 AUG

    Segment Anything 2: Demo-first Model Development

    Because of the nature of SAM, this is more video heavy than usual. See our YouTube! Because vision is first among equals in multimodality, and yet SOTA vision language models are closed, we’ve always had an interest in learning what’s next in vision. Our first viral episode was Segment Anything 1, and we have since covered LLaVA, IDEFICS, Adept, and Reka. But just like with Llama 3, FAIR holds a special place in our hearts as the New Kings of Open Source AI. The list of sequels better than the originals is usually very short, but SAM 2 delighted us by not only being a better image segmentation model than SAM 1, but also conclusively and inexpensively solving video segmentation in just as elegant a way as SAM 1 did for images, and releasing everything to the community as Apache 2.0/CC-BY 4.0. “In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM).” Surprisingly Efficient The paper reports that SAM 2 was trained on 256 A100 GPUs for 108 hours (59% more than SAM 1). Taking the upper-end $2 A100 cost off gpulist.ai means SAM 2 cost ~$50k to train if it had an external market-rate cost (back-of-envelope at the end of this entry) - surprisingly cheap for adding video understanding! The newly released SA-V dataset is also the largest video segmentation dataset to date, with careful attention given to scene/object/geographical diversity, including that of annotators. In some ways, we are surprised that SOTA video segmentation can be done on only ~50,000 videos (and 640k masklet annotations). Model-in-the-loop Data Engine for Annotations and Demo-first Development Similar to SAM 1, a 3-Phase Data Engine helped greatly in bootstrapping this dataset. As Nikhila says in the episode, the demo you see wasn’t just for show, they actually used this same tool to do annotations for the model that is now demoed in the tool: “With the original SAM, we put a lot of effort in building a high-quality demo. And the other piece here is that the demo is actually the annotation tool. So we actually use the demo as a way to improve our annotation tool. And so then it becomes very natural to invest in building a good demo because it speeds up your annotation and improves the data quality, and that will improve the model quality. With this approach, we found it to be really successful.” An incredible 90% speedup in annotation happened due to this virtuous cycle, which helped SA-V reach such scale. Building the demo also helped the team live the context that their own downstream users, like Roboflow, would experience, and forced them to make choices accordingly. As Nikhila says: “It's a really encouraging trend for not thinking about only the new model capability, but what sort of applications folks want to build with models as a result of that downstream. I think it also really forces you to think about many things that you might postpone. For example, efficiency. For a good demo experience, making it real time is super important. No one wants to wait. And so it really forces you to think about these things much sooner and actually makes us think about what kind of image encoder we want to use, or other hardware efficiency improvements. So those kinds of things, I think, become a first-class citizen when you put the demo first.” Indeed, the team swapped out standard ViT-H Vision Transformers for Hiera (Hierarchical) Vision Transformers as a result of efficiency considerations. 
Memory Attention Speaking of architecture, the model design is probably the sleeper hit of a project filled with hits. The team adapted SAM 1 to video by adding streaming memory for real-time video processing: Specifically adding memory attention, memory encoder, and memory bank, which surprisingly ablated better than more intuitive but complex architectures like Gated Recurrent Units. One has to wonder if streaming memory can be added to pure lan
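    As promised above, the back-of-envelope behind that ~$50k training-cost estimate, using the figures from the SAM 2 paper plus the quoted $2/hr A100 price:

```python
gpus, hours, usd_per_gpu_hour = 256, 108, 2.0
print(f"~${gpus * hours * usd_per_gpu_hour:,.0f}")   # ~$55,296, i.e. on the order of $50k
```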

    1h 4m
  8. 2 AUG

    The Winds of AI Winter (Q2 Four Wars Recap) + ChatGPT Voice Mode Preview

    Thank you for 1m downloads of the podcast and 2m readers of the Substack! 🎉 This is the audio discussion following The Winds of AI Winter essay that also serves as a recap of Q2 2024 in AI viewed through the lens of our Four Wars framework. Enjoy! Full Video Discussion Full show notes are here. Timestamps * [00:00:00] Intro Song by Suno.ai * [00:02:01] Swyx and Alessio in Singapore * [00:05:49] GPU Rich vs Poors: Frontier Labs * [00:06:35] GPU Rich Frontier Models: Claude 3.5 * [00:10:37] GPU Rich helping Poors: Llama 3.1: The Synthetic Data Model * [00:15:41] GPU Rich helping Poors: Frontier Labs Vibe Shift - Phi 3, Gemma 2 * [00:18:26] GPU Rich: Mistral Large * [00:21:56] GPU Rich: Nvidia + FlashAttention 3 * [00:23:45] GPU Rich helping Poors: Noam Shazeer & Character.AI * [00:28:14] GPU Poors: On Device LLMs: Mozilla Llamafile, Chrome (Gemini Nano), Apple Intelligence * [00:35:33] Quality Data Wars: NYT vs The Atlantic lawyer up vs partner up * [00:37:41] Quality Data Wars: Reddit, ScarJo, RIAA vs Udio & Suno * [00:41:03] Quality Data Wars: Synthetic Data, Jagged Intelligence, AlphaProof * [00:45:33] Multimodality War: ChatGPT Voice Mode, OpenAI demo at AIEWF * [00:47:34] Multimodality War: Meta Llama 3 multimodality + Chameleon * [00:50:54] Multimodality War: PaliGemma + CoPaliGemma * [00:52:55] Renaming Rag/Ops War to LLM OS War * [00:55:31] LLM OS War: Ops War: Prompt Management vs Gateway vs Observability * [01:02:57] LLM OS War: BM42 Vector DB Wars, Memory Databases, GraphRAG * [01:06:15] LLM OS War: Agent Tooling * [01:08:26] LLM OS War: Agent Protocols * [01:10:43] Trend: Commoditization of Intelligence * [01:16:45] Trend: Vertical Service as Software, AI Employees, Brightwave, Dropzone * [01:20:44] Trend: Benchmark Frontiers after MMLU * [01:23:31] Crowdstrike will save us from Skynet * [01:24:30] Bonus: ChatGPT Advanced Voice Mode Demo * [01:25:37] Voice Mode: Storytelling * [01:27:55] Voice Mode: Accents * [01:31:48] Voice Mode: Accent Detection * [01:35:00] Voice Mode: Nonverbal Emotions * [01:37:53] Voice Mode: Multiple Voices in One * [01:40:52] Voice Mode: Energy Levels Detection * [01:42:03] Voice Mode: Multilinguality * [01:43:53] Voice Mode: Shepard Tone * [01:46:57] Voice Mode: Generating Tones * [01:49:39] Voice Mode: Interruptions don't work * [01:49:55] Voice Mode: Reverberations * [01:51:37] Voice Mode: Mimicry doesn't work Transcript Charlie [00:01:08]: Welcome back, listeners. This is your AI co-host, Charlie. It's been a few months since we took a step back from the interview format and talked about the show. We're happy to share that we have crossed one million downloads and two million reads on Substack. Woo-hoo. We are really grateful to those of you who keep tuning in and sharing us with your friends, especially those of you who watch and comment on our new YouTube channel, where we are trying to grow next. For a special millionaire edition, Swyx and Alessio are finally back in person in sunny Singapore to discuss the big vibe shift in the last three months, that we are calling the Winds of AI Winter. We also discuss my nemesis, ChatGPT Advanced Voice Mode, with a special treat for those who stay till the end. Now, more than ever, watch out and take care. Alessio [00:02:02]: Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and today we're in the Singapore studio with Swyx. Swyx [00:02:11]: Hey, this is our long-awaited one-on-one episode. I don't know how long ago the previous one was. 
Do you remember? Three, four months? Alessio [00:02:20]: Yeah, it's been a while. Swyx [00:02:22]: People really enjoyed it. It's just really, I think our travel schedules have been really difficult to get this stuff together. And then we also had like a decent backlog of guests for a while. I think we've kind of depleted that backlog now and we need to build it up again. But it's been busy and there's been a lot of news. So we actually get to

    1h 55m
