The Information Bottleneck

Ravid Shwartz-Ziv & Allen Roush

Two AI researchers, Ravid Shwartz-Ziv and Allen Roush, discuss the latest trends, news, and research in generative AI, LLMs, GPUs, and cloud systems.

  1. Reinventing AI From Scratch with Yaroslav Bulatov

    2 DAYS AGO

    Yaroslav Bulatov helped build the AI era from the inside, as one of the earliest researchers at both OpenAI and Google Brain. Now he wants to tear it all down and start over. Modern deep learning, he argues, is up to 100x more wasteful than it needs to be - a Frankenstein of hacks designed for the wrong hardware. With a power wall approaching in two years, Yaroslav is leading an open effort to reinvent AI from scratch: no backprop, no legacy assumptions, just the benefit of hindsight and AI agents that compress decades of research into months. Along the way, we dig into why AGI is a "religious question," how a sales guy with no ML background became one of his most productive contributors, and why the Muon optimizer, one of the biggest recent breakthroughs, could only have been discovered by a non-expert.

    Timeline:
    00:12 — Introduction and Yaroslav's background at OpenAI and Google Brain
    01:16 — Why deep learning isn't such a good idea
    02:03 — The three definitions of AGI: religious, financial, and vibes-based
    07:52 — The SAI framework: do we need the term AGI at all?
    10:58 — What matters more than AGI: efficiency and refactoring the AI stack
    13:28 — Jevons paradox and the coming energy wall
    14:49 — The recipe: replaying 70 years of AI with hindsight
    17:23 — Memory, energy, and gradient checkpointing
    18:34 — Why you can't just optimize the current stack (the recurrent laryngeal nerve analogy)
    21:05 — What a redesigned AI might look like: hierarchical message passing
    22:31 — Can a small team replicate decades of research?
    24:23 — Why non-experts outperform domain specialists
    27:42 — The GPT-2 benchmark: what success looks like
    29:01 — Ian Goodfellow, Theano, and the origins of TensorFlow
    30:12 — The Muon optimizer origin story and beating Google on ImageNet
    36:16 — AI coding agents for software engineering and research
    40:12 — 10-year outlook and the voice-first workflow
    42:23 — Why start with text over multimodality
    45:13 — Are AI labs like SSI on the right track?
    48:52 — Getting rid of backprop — and maybe math itself
    53:57 — The state of ML academia and NeurIPS culture
    56:41 — The Sutra group challenge: inventing better learning algorithms

    Music: "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0. "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.

    About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.

    58 min
  2. Why Healthcare Is AI's Hardest and Most Important Problem with Kyunghyun Cho (NYU)

    24 MAR

    We talk with Kyunghyun Cho, Professor of Health Statistics and Professor of Computer Science and Data Science at New York University and a former Executive Director at Genentech, about why healthcare might be the most important and most difficult domain for AI to transform. Kyunghyun shares his vision for a future where patients own their own medical records, proposes a provocative idea for running continuous society-level clinical trials by having doctors "toss a coin" between plausible diagnoses, and explains why drug discovery's stage-wise pipeline has hit a wall that only end-to-end AI thinking can break through. We also get into GLP-1 drugs and why they're more mysterious than people realize, the brutal economics of antibiotic research, how language models trained across scientific literature and clinical data could compress 50 years of drug development into five, and what Kyunghyun would do with $10 billion (spoiler: buy a hospital network in the Midwest). We wrap up with a great discussion on the rise of professor-founded "neo-labs," why academia got spoiled during the deep learning boom, and an encouraging message for PhD students who feel lost right now.

    Timeline:
    (00:00) Intro and welcome
    (01:25) Why healthcare is uniquely hard
    (04:46) Who owns your medical records? — The case for patient-controlled data and tapping your phone at the doctor's office
    (06:43) Centralized vs. decentralized healthcare — comparing Israel, Korea, and the US
    (13:19) Why most existing health data isn't as useful as we think — selection bias and the lack of randomization
    (16:53) The "toss a coin" proposal — continuous clinical trials through automated randomization, and the surprising connection to LLM sampling
    (23:07) Drug discovery's broken pipeline — why stage-wise optimization is failing and why we need end-to-end thinking
    (28:30) Why the current system is already failing society — wearables, preventive care, and the case for urgency
    (31:13) Allen's personal healthcare journey and the GLP-1 conversation
    (33:13) GLP-1 deep dive — 40 years from discovery to weight loss drugs, brain receptors, and embracing uncertainty
    (36:28) Why antibiotic R&D is "economic suicide" and how AI can help
    (42:52) Language models in the clinic and the lab — from clinical notes to back-propagating clinical outcomes, all the way to molecular design
    (48:04) Do you need domain expertise, or can you throw compute at it?
    (54:30) The $10 billion question — distributed GPU clouds and a patient-in-the-loop drug discovery system
    (58:28) Vertical scaling vs. horizontal scaling for healthcare AI
    (1:01:06) AI regulation — who's missing from the conversation and why regulation should follow deployment
    (1:06:52) Professors as founders and the "neo-lab" phenomenon — how Ilya cracked the code
    (1:11:18) Can neo-labs actually ship products? Why researchers should do research
    (1:13:09) Academia got spoiled — the deep learning anomaly is ending, and that's okay
    (1:16:07) Closing message — why it's a great time to be a PhD student and researcher

    1hr 18min
  3. Diffusion LLM & Why the Future of AI Won't Be Autoregressive - Stefano Ermon (Stanford / Inception)

    19 MAR

    In this episode, we talk with Stefano Ermon, Stanford professor, co-founder & CEO of Inception AI, and co-inventor of DDIM, FlashAttention, DPO, and score-based/diffusion models, about why diffusion-based language models may overtake the autoregressive paradigm that dominates today's LLMs. We start with the fundamentals: what diffusion models actually are, and why iterative refinement (starting from noise, progressively denoising) offers structural advantages over autoregressive generation. From there, we dive into the technical core of diffusion LLMs. Stefano explains how discrete diffusion works on text, why masking is just one of many possible noise processes, and how the mathematics of score matching carries over from the continuous image setting with surprising elegance. A major theme is the inference advantage. Because diffusion models produce multiple tokens in parallel, they can be dramatically faster than autoregressive models at inference time. Stefano argues this fundamentally changes the cost-quality Pareto frontier, and becomes especially powerful in RL-based post-training. We also discuss Inception AI's Mercury II model, which Stefano describes as best-in-class for latency-constrained tasks like voice agents and code completion. In the final part, we get into broader questions - why transformers work so well, research advice for PhD students, whether recursive self-improvement is imminent, the real state of AI coding tools, and Stefano's journey from academia to startup founder.

    TIMESTAMPS
    0:12 – Introduction
    1:08 – Origins of diffusion models: from GANs to score-based models in 2019
    3:13 – Diffusion vs. autoregressive: the typewriter vs. editor analogy
    4:43 – Speed, creativity, and quality trade-offs between the two approaches
    7:44 – Temperature and sampling in diffusion LLMs — why it's more subtle than you think
    9:56 – Can diffusion LLMs scale? Inception AI and Gemini Diffusion as proof points
    11:50 – State space models and hybrid transformer architectures
    13:03 – Scaling laws for diffusion: pre-training, post-training, and test-time compute
    14:33 – Ecosystem and tooling: what transfers and what doesn't
    16:58 – From images to text: how discrete diffusion actually works
    19:59 – Theory vs. practice in deep learning
    21:50 – Loss functions and scoring rules for generative models
    23:12 – Mercury II and where diffusion LLMs already win
    26:20 – Creativity, slop, and output diversity in parallel generation
    28:43 – Hardware for diffusion models: why current GPUs favor autoregressive workloads
    30:56 – Optimization algorithms and managing technical risk at a startup
    32:46 – Why do transformers work so well?
    33:30 – Research advice for PhD students: focus on inference
    34:57 – Recursive self-improvement and AGI timelines
    35:56 – Will AI replace software engineers? Real-world experience at Inception
    37:54 – Professor vs. startup founder: different execution, similar mission
    39:56 – The founding story of Inception AI — from ICML Best Paper to company
    42:30 – The researcher-to-founder pipeline and big funding rounds
    45:02 – PhD vs. industry in 2026: the widening financial gap
    47:30 – The industry in 5-10 years: Stefano's outlook

    49 min
  4. Training Is Nothing Like Learning with Naomi Saphra (Harvard)

    13 MAR

    Naomi Saphra, Kempner Research Fellow at Harvard and incoming Assistant Professor at Boston University, joins us to explain why you can't do interpretability without understanding training dynamics, in the same way you can't do biology without evolution. Naomi argues that many structures researchers find inside trained models are vestigial: they mattered early in training but are meaningless by the end. Grokking is one case of a broader phenomenon: models go through multiple consecutive phase transitions during training, driven by symmetry breaking and head specialization, but the smooth loss curve hides all of it. We talk about why training is nothing like human learning, and why our intuitions about what's hard for models are consistently wrong - code in pretraining helps language reasoning, tokenization drives behaviors people attribute to deeper cognition, and language already encodes everything humans care about. We also get into why SAEs are basically topic models, the Platonic representation hypothesis, using AI to decode animal communication, and why non-determinism across training runs is a real problem that RL and MoE might be making worse.

    Timeline:
    (00:12) Introduction and guest welcome
    (01:01) Why training dynamics matter - the evolutionary biology analogy
    (03:05) Jennifer Aniston neurons and the danger of biological parallels
    (04:48) What is grokking and why it's one instance of a broader phenomenon
    (08:25) Phase transitions, symmetry breaking, and head specialization
    (11:53) Double descent, overfitting, and the death of classical train-test splits
    (15:10) Training is nothing like learning
    (16:08) Scaling axes - data, model size, compute, and why they're not interchangeable
    (19:29) Data quality, code as reasoning fuel, and GPT-2's real contribution
    (20:43) Multilingual models and the interlingua hypothesis
    (25:58) The Platonic representation hypothesis and why image classification was always multimodal
    (29:12) Sparse autoencoders, interpretability, and Marr's levels
    (37:32) Can we ever truly understand what models know?
    (43:59) The language modality chauvinist argument
    (51:55) Vision, redundancy, and self-supervised learning
    (57:18) World models - measurable capabilities over philosophical definitions
    (1:00:14) Is coding really a solved task?
    (1:04:18) Non-determinism, scaling laws, and why one training run isn't enough
    (1:10:12) Naomi's new lab at BU and recruiting

    1hr 12min
  5. EP28: How to Control a Stochastic Agent with Stefano Soatto (VP AWS / Prof. UCLA)

    6 MAR

    Stefano Soatto, VP for AI at AWS and Professor at UCLA, the person responsible for agentic AI at AWS, joins us to explain why building reliable AI agents is fundamentally a control theory problem. Stefano sees LLMs as stochastic dynamical systems that need to be controlled, not just prompted. He introduces "strands coding," a new framework AWS is building that sits between vibe coding and spec coding: you write a skeleton with AI functions constrained by pre- and post-conditions, verifying intent before a single line of code is generated. The surprising part: even as AI coding adoption goes up, developer trust in the output is going down. We go deep into the philosophy of models and the world. Stefano argues that the dichotomy between "language models" and "world models" doesn't really exist: a reasoning engine trained on rich enough data is a world model. He walks us through why naive realism is indefensible, how reverse diffusion was originally intended to show that models can't be identical to reality, and why that matters now. We also discuss three types of information - Shannon, algorithmic, and conceptual - and why algorithmic information is the one that actually matters to agents. Synthetic data doesn't add Shannon information, but it adds algorithmic information, which is why it works. Intelligence isn't about scaling to Solomonoff's universal induction; it's about learning to solve new problems fast.

    Takeaways:
    Vibe coding is local feedback control with high cognitive load; spec coding is open-loop global control with silent failures; neither scales well alone.
    Trust in AI-generated code is declining even as adoption rises.
    The distinction between next-token prediction and world models is mostly nomenclature - reasoning engines operating on multimodal data are world models.
    Algorithmic information, not Shannon information, is what matters in the agentic setting.
    Intelligence isn't minimizing inference uncertainty - it's minimizing time to solve unforeseen tasks.
    The intent gap between user and model cannot be fully automated or delegated.

    Timeline:
    (00:13) Introduction and guest welcome
    (01:12) How the agentic era changed machine learning
    (06:11) Vibe coding one year later
    (07:23) Vibe vs. spec vs. strands coding
    (14:30) Why English is not a programming language
    (16:36) Constrained generation and agent choreography
    (20:44) Diffusion models vs. autoregressive models
    (25:59) The Platonic representation hypothesis and naive realism
    (31:14) Synthetic data and the information bottleneck
    (36:22) Three types of information: Shannon, algorithmic, conceptual
    (38:47) Scaling laws and Solomonoff induction
    (42:14) World models and the Goethian vs. Marrian approach
    (49:00) Encoding vs. generation and JEPA-style training
    (55:50) Are language models already world models?
    (59:13) Closing thoughts on trust, education, and responsibility

    1hr 3min
  6. EP27: Medical Foundation Models - with Tanishq Abraham (Sophont.AI)

    2 MAR

    Tanishq Abraham, CEO and co-founder of Sophont.ai, joins us to talk about building foundation models specifically for medicine. Sophont is trying to be something like an OpenAI or Anthropic for healthcare - training models across pathology, neuroimaging, and clinical text, to eventually fuse them into one multimodal system. The surprising part: their pathology model trained on 12,000 public slides performs on par with models trained on millions of private ones. Data quality beats data quantity. We talk about what actually excites Tanishq, which is not replacing doctors, but finding things doctors can't see: AI predicting gene mutations from a tissue slide, or cardiovascular risk from an eye scan. We also talk about regulation and how the picture is less scary than people assume. Text-based clinical decision support can ship without FDA approval. Pharma partnerships offer near-term impact. The five-to-ten-year timeline people fear is really about drug discovery, not all of medical AI.

    Takeaways:
    The real promise of medical AI is finding hidden signals in existing data, not just automating doctors.
    Small, curated public datasets can rival massive private ones.
    Multimodal fusion is the goal, but you need strong individual encoders first.
    AI research itself might get automated sooner than biology or chemistry.
    FDA regulation has more flexibility than most people think.

    Timeline:
    (00:12) Introduction and guest welcome
    (02:32) Anthropic's ad about ChatGPT ads
    (07:26) xAI merging into SpaceX
    (13:32) Vibe coding one year later
    (17:00) Claude Code and agentic workflows
    (21:52) Can AI automate AI research?
    (26:57) What is medical AI?
    (31:06) Sophont as a frontier medical AI lab
    (33:52) Public vs. private data - 12K slides vs. millions
    (36:43) Domain expertise vs. scaling
    (41:54) Cancer, diabetes, and personal stakes
    (47:52) Classification vs. prediction in medicine
    (50:36) When doctors disagree
    (54:43) Quackery and AI
    (57:15) Uncertainty in medical AI
    (1:03:11) Will AI replace doctors?
    (1:07:24) Self-supervised learning on sleep data
    (1:10:10) Aligning modalities
    (1:13:17) FDA regulation
    (1:22:28) Closing

    1hr 26min
  7. EP26: Measuring Intelligence in the Wild - Arena and the Future of AI Evaluation

    24 FEB

    Anastasios Angelopoulos, Co-Founder and CEO of Arena AI (formerly LMArena), joins us to talk about why static benchmarks are failing, how human preference data actually works under the hood, and what it takes to be the "gold standard" of AI evaluation. Anastasios sits at a fascinating intersection - a theoretical statistician running the platform that every major lab watches when they release a model. We talk about the messiness of AI-generated code slop (yes, he hides Claude's commits too), then dig into the statistical machinery that powers Arena's leaderboards and why getting evaluation right is harder than most people think. We explore why style control is both necessary and philosophically tricky: you can regress away markdown headers and response length, but separating style from substance is a genuinely unsolved causal inference problem. We also get into why users are surprisingly good judges of model quality, how Arena serves as a pre-release testing ground for labs shipping stealth models under codenames, and whether the fragmentation of the AI market (Anthropic going enterprise, OpenAI going consumer, everyone going multimodal) is actually a feature, not a bug. Plus, we discuss the role of rigorous statistics in the age of "just run it again," why structured decoding can hurt model performance, and what Arena's 2026 roadmap looks like.

    Timeline:
    (00:12) Introduction and Anastasios's Background
    (00:55) What Arena Does and Why Static Benchmarks Aren't Enough
    (02:26) Coverage of Use Cases - Is There Enough?
    (04:22) Style Control and the Bradley-Terry Methodology
    (08:35) Can You Actually Separate Style from Substance?
    (10:24) Measuring Slop - And the Anti-Slop Paper Plug
    (11:52) Can Users Judge Factual Correctness?
    (13:31) Tool Use and Agentic Evaluation on Arena
    (14:14) Intermediate Feedback Signals Beyond Final Preference
    (15:30) Tool Calling Accuracy and Code Arena
    (17:42) AI-Generated Code Slop and Hiding Claude's Commits
    (19:49) Do We Need Separate Code Streams for Humans and LLMs?
    (20:01) RL Flywheels and Arena's Preference Data
    (21:16) Focus as a Startup - Being the Evaluation Company
    (22:16) Structured vs. Unconstrained Generation
    (25:00) The Role of Rigorous Statistics in the Age of AI
    (29:23) LLM Sampling Parameters and Evaluation Complexity
    (30:56) Model Versioning and the Frequentist Approach to Fairness
    (32:12) Quantization and Its Effects on Model Quality
    (33:10) Pre-Release Testing and Stealth Models
    (34:23) Transparency - What to Share with the Public vs. Labs
    (36:27) When Winning Models Don't Get Released
    (36:59) Why Users Keep Coming Back to Arena
    (38:19) Market Fragmentation and Arena's Future Value
    (39:37) Custom Evaluation Frameworks for Specific Users
    (40:03) Arena's 2026 Roadmap - Science, Methodology, and New Paradigms
    (42:15) The Economics of Free Inference
    (43:13) Hiring and Closing Thoughts

    45 min
  8. EP25: Personalization, Data, and the Chaos of Fine-Tuning with Fred Sala (UW-Madison / Snorkel AI)

    17 FEB

    Fred Sala, Assistant Professor at UW-Madison and Chief Scientist at Snorkel AI, joins us to talk about why personalization might be the next frontier for LLMs, why data still matters more than architecture, and how weak supervision refuses to die. Fred sits at a rare intersection, building the theory of data-centric AI in academia while shipping it to enterprise clients at Snorkel. We talk about the chaos of OpenClaw (the personal AI assistant that's getting people hacked the old-fashioned way, via open ports), then focus on one of the most important questions: how do you make a model truly yours? We dig into why prompting your preferences doesn't scale, why even LoRA might be too expensive for per-user personalization, and why activation steering methods like REFT could be the sweet spot. We also explore self-distillation for continual learning, the unsolved problem of building realistic personas for evaluation, and Fred's take on the data vs. architecture debate (spoiler: data is still undervalued). Plus, we discuss why the internet's "Ouroboros effect" might not doom pre-training as much as people fear, and what happens when models become smarter than the humans who generate their training data.

    Takeaways:
    Personalization requires ultra-efficient methods - even one LoRA per user is probably too expensive; activation steering is the promising middle ground.
    The "pink elephant problem" makes prompt-based personalization fundamentally limited - telling a model what not to do often makes it do it more.
    Self-distillation can enable on-policy continual learning without expensive RL reward functions, dramatically reducing catastrophic forgetting.
    Data is still undervalued relative to architecture and compute, especially high-quality post-training data, which is actually improving, not getting worse.
    Weak supervision principles are alive and well inside modern LLM data pipelines, even if people don't call it that anymore.

    Timeline:
    (00:13) Introduction and Fred's Background
    (00:39) OpenClaw — The Personal AI Assistant Taking Over Macs
    (03:43) Agent Security Risks and the Privacy Problem
    (05:13) Cloud Code, Permissions, and Living Dangerously
    (07:47) AI Social Media and Agents Talking to Each Other
    (08:56) AI Persuasion and Competitive Debate
    (09:51) Self-Distillation for Continual Learning
    (12:43) What Does Continual Learning Actually Mean?
    (14:12) Updating Weights on the Fly — A Grand Challenge
    (15:09) The Personalization Problem — Motivation and Use Cases
    (17:41) The Pink Elephant Problem with Prompt-Based Personalization
    (19:58) Taxonomy of Personalization — Preferences vs. Tone vs. Style
    (21:31) Activation Steering, REFT, and Parameter-Efficient Fine-Tuning
    (27:00) Evaluating Personalization — Benchmarks and Personas
    (31:14) Unlearning and Un-Personalization
    (31:51) Cultural Alignment as Group-Level Personalization
    (41:00) Can LLM Personas Replace Surveys and Polling?
    (44:32) Is Continued Pre-Training Still Relevant?
    (46:28) Data vs. Architecture — What Matters More?
    (52:25) Multi-Epoch Training — Is It Over?
    (54:53) What Makes Good Data? Matching Real-World Usage
    (59:23) Decomposing Uncertainty for Better Data Selection
    (1:01:52) Mapping Human Difficulty to Model Difficulty
    (1:04:49) Scaling Small Ideas — From Academic Proof to Frontier Models
    (1:12:01) What Happens When Models Surpass Human Training Data?
    (1:15:24) Closing Thoughts

    1hr 16min
