AI Post Transformers

mcgrof

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

  1. 22 HR AGO

    Simple Self-Distillation for Better Code Generation

    This episode explores Apple’s paper on whether code models can improve through an extremely simple form of self-distillation: fine-tuning on their own sampled code outputs without using a stronger teacher, execution feedback, verifiers, or reinforcement learning. It situates that idea within the broader history of knowledge distillation and post-training, comparing it to earlier work like Hinton’s distillation, sequence-level distillation, Born Again Networks, Noisy Student, and newer on-policy language model distillation. The discussion focuses on why code generation is a particularly revealing testbed, since benchmarks like pass@1 and pass@k make it easier to tell whether self-distillation is uncovering latent capability or just repackaging errors. A listener would find it interesting because the paper challenges a core assumption in modern model improvement: that meaningful gains require expensive external supervision rather than a surprisingly cheap training loop around the model itself.

    Sources:
    1. Embarrassingly Simple Self-Distillation Improves Code Generation — Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang, 2026 http://arxiv.org/abs/2604.01193
    2. Distilling the Knowledge in a Neural Network — Geoffrey Hinton, Oriol Vinyals, Jeff Dean, 2015 https://scholar.google.com/scholar?q=Distilling+the+Knowledge+in+a+Neural+Network
    3. Sequence-Level Knowledge Distillation — Yoon Kim, Alexander M. Rush, 2016 https://scholar.google.com/scholar?q=Sequence-Level+Knowledge+Distillation
    4. Born Again Neural Networks — Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, Anima Anandkumar, 2018 https://scholar.google.com/scholar?q=Born+Again+Neural+Networks
    5. Self-training with Noisy Student improves ImageNet classification — Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le, 2020 https://scholar.google.com/scholar?q=Self-training+with+Noisy+Student+improves+ImageNet+classification
    6. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes — Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, Olivier Bachem, 2024 https://scholar.google.com/scholar?q=On-Policy+Distillation+of+Language+Models:+Learning+from+Self-Generated+Mistakes
    7. Evaluating Large Language Models Trained on Code — Mark Chen, Jerry Tworek, Heewoo Jun, et al., 2021 https://scholar.google.com/scholar?q=Evaluating+Large+Language+Models+Trained+on+Code
    8. Program Synthesis with Large Language Models — Jacob Austin, Augustus Odena, Maxwell Nye, et al., 2021 https://scholar.google.com/scholar?q=Program+Synthesis+with+Large+Language+Models
    9. Measuring Coding Challenge Competence With APPS — Dan Hendrycks, Collin Burns, Steven Basart, et al., 2021 https://scholar.google.com/scholar?q=Measuring+Coding+Challenge+Competence+With+APPS
    10. Self-Consistency Improves Chain of Thought Reasoning in Language Models — Xuezhi Wang, Jason Wei, Dale Schuurmans, et al., 2022 https://scholar.google.com/scholar?q=Self-Consistency+Improves+Chain+of+Thought+Reasoning+in+Language+Models
    11. DeepSeek-R1 — DeepSeek-AI, 2025 https://scholar.google.com/scholar?q=DeepSeek-R1
    12. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code — Prasenjit Jain, et al., 2024 https://scholar.google.com/scholar?q=LiveCodeBench:+Holistic+and+Contamination+Free+Evaluation+of+Large+Language+Models+for+Code
    13. SelfCodeAlign: Self-Alignment for Code Generation — Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro von Werra, Arjun Guha, Lingming Zhang, 2024 https://scholar.google.com/scholar?q=SelfCodeAlign:+Self-Alignment+for+Code+Generation
    14. Iterative Self-Training for Code Generation via Reinforced Re-Ranking — Nikita Sorokin, Ivan Sedykh, Valentin Malykh, 2025 https://scholar.google.com/scholar?q=Iterative+Self-Training+for+Code+Generation+via+Reinforced+Re-Ranking
    15. On the Role of Temperature Sampling in Test-Time Scaling — Yuheng Wu, Azalia Mirhoseini, Thierry Tambe, 2025 https://scholar.google.com/scholar?q=On+the+Role+of+Temperature+Sampling+in+Test-Time+Scaling
    16. OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement — Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, Xiang Yue, 2024 https://scholar.google.com/scholar?q=OpenCodeInterpreter:+Integrating+Code+Generation+with+Execution+and+Refinement
    17. GenX: Mastering Code and Test Generation with Execution Feedback — Nan Wang, Yafei Liu, Chen Chen, Haonan Lu, 2024 https://scholar.google.com/scholar?q=GenX:+Mastering+Code+and+Test+Generation+with+Execution+Feedback
    18. InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback — John Yang, Akshara Prabhakar, Karthik Narasimhan, Shunyu Yao, 2023 https://scholar.google.com/scholar?q=InterCode:+Standardizing+and+Benchmarking+Interactive+Coding+with+Execution+Feedback
    19. AI Post Transformers: Evolving Language Models Without Labels: EVOL-RL — Hal Turing & Dr. Ada Shannon, Fri, https://podcast.do-not-panic.com/episodes/evolving-language-models-without-labels-evol-rl/
    20. AI Post Transformers: Lp-Reg: Low-Probability Tokens Sustain RL Exploration — Hal Turing & Dr. Ada Shannon, Sun, https://podcast.do-not-panic.com/episodes/lp-reg-low-probability-tokens-sustain-rl-exploration/
    21. AI Post Transformers: NeurIPS 2025: SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data — Hal Turing & Dr. Ada Shannon, Sat, https://podcast.do-not-panic.com/episodes/neurips-2025-serl-self-play-reinforcement-learning-for-large-language-models-wit/
    22. AI Post Transformers: LLM Benchmark Robustness to Linguistic Variation — Hal Turing & Dr. Ada Shannon, Tue, https://podcast.do-not-panic.com/episodes/llm-benchmark-robustness-to-linguistic-variation/

    Interactive Visualization: Simple Self-Distillation for Better Code Generation
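    Since the discussion hinges on pass@1 and pass@k, it may help to see the standard unbiased pass@k estimator from the Codex evaluation paper cited in the sources ("Evaluating Large Language Models Trained on Code"): given n samples per problem of which c pass the unit tests, it estimates the probability that at least one of k drawn samples is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples, drawn without replacement from n total
    of which c are correct, passes the tests."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a correct sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 5 of 20 samples correct: pass@1 = 1 - 15/20 = 0.25
pass_at_k(20, 5, 1)
```

    Comparing pass@1 against pass@k at larger k is what lets the paper distinguish latent capability (high pass@k) from what the model reliably produces on the first try (pass@1).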

  2. 2 DAYS AGO

    MetaClaw: Just Talk and Continual Agent Adaptation

    This episode takes up the thread from the published episode "MAML and the Basics of Meta-Learning" and shows how those ideas reappear in a much messier setting: a live agent that has to keep improving while it is already deployed. Instead of treating meta-learning as a clean laboratory exercise, the discussion follows MetaClaw as a continual agent system built for changing real workloads, where coding assistants, research agents, and other LLM-based tools face drift in tasks, tools, and failure modes. The hosts frame the paper as a concrete answer to a practical question: how an agent can keep learning on the job rather than waiting for the next full retraining cycle.

    The conversation focuses on MetaClaw’s two-speed adaptation design. The fast path updates behavior immediately through an external skill library, where failures are distilled into reusable behavioral instructions that can be injected at inference time; the slow path consolidates some of those lessons later through lightweight parameter updates. The hosts unpack the paper’s core formulation of the meta-model as base parameters plus skills, and they explain why that split matters for continual meta-learning: the agent is not only learning facts or storing transcripts, but improving its ability to adapt across a stream of tasks. They also dig into the process reward model, which scores intermediate reasoning and action steps, and the paper’s support-query separation, which keeps skill creation and later reinforcement updates from collapsing into stale self-training.

    A large part of the episode is about the systems implications of making that loop work in the wild. The hosts examine the paper’s zero-downtime claim in its narrower sense: skill updates can land during live use, while LoRA-based policy optimization is pushed into idle windows detected through sleep schedules, keyboard inactivity, and calendar availability, then swapped back into service later. That makes this episode a useful bridge not only from "MAML and the Basics of Meta-Learning" but, secondarily, from "Doc-to-LoRA: Internalizing Context as LoRA," because the slow adaptation path is explicitly about compressing recurring lessons into lightweight weight changes. The result is a detailed discussion of how MetaClaw tries to turn adaptation into an operational loop rather than a one-shot training event.

    Sources:
    1. MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild — Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, Zeyu Zheng, Cihang Xie, Huaxiu Yao, 2026 http://arxiv.org/abs/2603.17187
    2. Reflexion: Language Agents with Verbal Reinforcement Learning — Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao, 2023 https://scholar.google.com/scholar?q=Reflexion:+Language+Agents+with+Verbal+Reinforcement+Learning
    3. Voyager: An Open-Ended Embodied Agent with Large Language Models — Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar, 2023 https://scholar.google.com/scholar?q=Voyager:+An+Open-Ended+Embodied+Agent+with+Large+Language+Models
    4. ExpeL: LLM Agents Are Experiential Learners — Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, Gao Huang, 2023 https://scholar.google.com/scholar?q=ExpeL:+LLM+Agents+Are+Experiential+Learners
    5. Agent Lightning: Train ANY AI Agents with Reinforcement Learning — Xufang Luo, Yuge Zhang, Zhiyuan He, Zilong Wang, Siyun Zhao, Dongsheng Li, Luna K. Qiu, Yuqing Yang, 2025 https://scholar.google.com/scholar?q=Agent+Lightning:+Train+ANY+AI+Agents+with+Reinforcement+Learning
    6. ReAct: Synergizing Reasoning and Acting in Language Models — Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao, 2022 https://scholar.google.com/scholar?q=ReAct:+Synergizing+Reasoning+and+Acting+in+Language+Models
    7. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks — Chelsea Finn, Pieter Abbeel, Sergey Levine, 2017 https://scholar.google.com/scholar?q=Model-Agnostic+Meta-Learning+for+Fast+Adaptation+of+Deep+Networks
    8. LoRA: Low-Rank Adaptation of Large Language Models — Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, 2021 https://scholar.google.com/scholar?q=LoRA:+Low-Rank+Adaptation+of+Large+Language+Models
    9. Who is introducing the failure? Automatically attributing failures of multi-agent systems via spectrum analysis — not verified from snippet, recent (exact year not verified from snippet) https://scholar.google.com/scholar?q=Who+is+introducing+the+failure?+Automatically+attributing+failures+of+multi-agent+systems+via+spectrum+analysis
    10. Weak-to-strong generalization with failure trajectories: A tree-based approach to elicit optimal policy in strong models — not verified from snippet, recent (exact year not verified from snippet) https://scholar.google.com/scholar?q=Weak-to-strong+generalization+with+failure+trajectories:+A+tree-based+approach+to+elicit+optimal+policy+in+strong+models
    11. Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories — not verified from snippet, recent (exact year not verified from snippet) https://scholar.google.com/scholar?q=Understanding+Code+Agent+Behaviour:+An+Empirical+Study+of+Success+and+Failure+Trajectories
    12. Twosome: An efficient online framework to align LLMs with embodied environments via reinforcement learning — not verified from snippet, recent (exact year not verified from snippet) https://scholar.google.com/scholar?q=Twosome:+An+efficient+online+framework+to+align+LLMs+with+embodied+environments+via+reinforcement+learning
    13. AI Post Transformers: MAML and the Basics of Meta-Learning — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-29-maml-and-the-basics-of-meta-learning-7d449f.mp3
    14. AI Post Transformers: Experiential Reinforcement Learning: Internalizing Reflection for Better Policy Training — Hal Turing & Dr. Ada Shannon, Fri, https://podcast.do-not-panic.com/episodes/experiential-reinforcement-learning-internalizing-reflection-for-better-policy-t/
    15. AI Post Transformers: Mem0: Scalable Long-Term Memory for AI Agents — Hal Turing & Dr. Ada Shannon, Tue, https://podcast.do-not-panic.com/episodes/mem0-scalable-long-term-memory-for-ai-agents/
    16. AI Post Transformers: NeurIPS 2025: A-Mem: Agentic Memory for LLM Agents — Hal Turing & Dr. Ada Shannon, Sat, https://podcast.do-not-panic.com/episodes/neurips-2025-a-mem-agentic-memory-for-llm-agents/
    17. AI Post Transformers: Evolving Language Models Without Labels: EVOL-RL — Hal Turing & Dr. Ada Shannon, Fri, https://podcast.do-not-panic.com/episodes/evolving-language-models-without-labels-evol-rl/
    18. AI Post Transformers: NeurIPS 2025: Reward Reasoning Model — Hal Turing & Dr. Ada Shannon, Sat, https://podcast.do-not-panic.com/episodes/neurips-2025-reward-reasoning-model/
    19. AI Post Transformers: Generalist Reward Modeling with Inference-Time Scaling — Hal Turing & Dr. Ada Shannon, Tue, https://podcast.do-not-panic.com/episodes/generalist-reward-modeling-with-inference-time-scaling/
    20. AI Post Transformers: LLM Benchmark Robustness to Linguistic Variation — Hal Turing & Dr. Ada Shannon, Tue, https://podcast.do-not-panic.com/episodes/llm-benchmark-robustness-to-linguistic-variation/

    Interactive Visualization: MetaClaw: Just Talk and Continual Agent Adaptation
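    To make the fast path concrete, here is a deliberately minimal, hypothetical sketch of a skill library that distills failures into reusable behavioral instructions injected at inference time. The names (SkillLibrary, add_from_failure, inject) and the keyword-matching retrieval are invented for illustration; MetaClaw's actual mechanism also involves the process reward model and support-query separation described above.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    trigger: str       # rough description of when the lesson applies
    instruction: str   # behavioral instruction distilled from a failure

@dataclass
class SkillLibrary:
    skills: list = field(default_factory=list)

    def add_from_failure(self, trigger: str, lesson: str) -> None:
        # Fast path: behavior changes immediately, with no weight update.
        self.skills.append(Skill(trigger=trigger, instruction=lesson))

    def inject(self, prompt: str, task: str) -> str:
        # At inference time, prepend lessons whose trigger matches the task.
        relevant = [s.instruction for s in self.skills if s.trigger in task]
        if not relevant:
            return prompt
        header = "\n".join(f"- {r}" for r in relevant)
        return f"Lessons from past failures:\n{header}\n\n{prompt}"

lib = SkillLibrary()
lib.add_from_failure("sql", "Quote identifiers that contain spaces.")
augmented = lib.inject("Write the migration.", "sql schema task")
```

    The slow path would then periodically compress recurring entries like these into LoRA-style weight updates during detected idle windows.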

  3. 3 DAYS AGO

    Doc-to-LoRA: Internalizing Context as LoRA

    This episode explores Doc-to-LoRA, a method for turning an entire document into a lightweight LoRA adapter so a language model can answer later questions without repeatedly rereading the source text. It explains how the paper combines context distillation, LoRA fine-tuning, and a Perceiver-style hypernetwork that ingests variable-length documents and emits fixed-size parameter updates, using chunking to handle longer inputs. The discussion highlights reported results such as near-perfect zero-shot performance on synthetic long-context retrieval beyond 32K tokens and improved efficiency on long-document question answering through lower update latency, lower peak memory use, and reduced KV-cache costs at inference time. It also digs into the systems argument behind the work, framing reusable internalized memory as a different primitive from prompting, while questioning how well the approach holds up outside limited-query evaluations and whether its benefits persist against alternatives like prompt compression or keeping context externally.

    Sources:
    1. Doc-to-LoRA: Internalizing Context as LoRA https://arxiv.org/pdf/2602.15902
    2. 2603.13875 https://arxiv.org/abs/2603.13875
    3. 2510.03215 https://arxiv.org/abs/2510.03215
    4. LoRA: Low-Rank Adaptation of Large Language Models — Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, 2022 https://scholar.google.com/scholar?q=LoRA:+Low-Rank+Adaptation+of+Large+Language+Models
    5. QLoRA: Efficient Finetuning of Quantized LLMs — Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 2023 https://scholar.google.com/scholar?q=QLoRA:+Efficient+Finetuning+of+Quantized+LLMs
    6. AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning — Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, Tuo Zhao, 2023 https://scholar.google.com/scholar?q=AdaLoRA:+Adaptive+Budget+Allocation+for+Parameter-Efficient+Fine-Tuning
    7. DoRA: Weight-Decomposed Low-Rank Adaptation — Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Min-Hung Chen, 2024 https://scholar.google.com/scholar?q=DoRA:+Weight-Decomposed+Low-Rank+Adaptation
    8. HyperNetworks — David Ha, Andrew Dai, Quoc V. Le, 2016 https://scholar.google.com/scholar?q=HyperNetworks
    9. Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks — Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa Dehghani, James Henderson, 2021 https://scholar.google.com/scholar?q=Parameter-efficient+Multi-task+Fine-tuning+for+Transformers+via+Shared+Hypernetworks
    10. HyperPrompt: Prompt-based Task-Conditioning of Transformers — Yun He, Huaixiu Steven Zheng, Yi Tay, Jai Gupta, Yu Du, Vamsi Aribandi, Zhe Zhao, Yaguang Li, Zhao Chen, Donald Metzler, Heng-Tze Cheng, Ed H. Chi, 2022 https://scholar.google.com/scholar?q=HyperPrompt:+Prompt-based+Task-Conditioning+of+Transformers
    11. Doc-to-LoRA: Learning to Instantly Internalize Contexts — Rujikorn Charakorn, Edoardo Cetin, Shinnosuke Uesaka, Robert Tjarko Lange, 2026 https://scholar.google.com/scholar?q=Doc-to-LoRA:+Learning+to+Instantly+Internalize+Contexts
    12. Text-to-LoRA: Instant Transformer Adaption — Rujikorn Charakorn, Edoardo Cetin, Yujin Tang, Robert Tjarko Lange, 2025 https://scholar.google.com/scholar?q=Text-to-LoRA:+Instant+Transformer+Adaption
    13. Generative Adapter: Contextualizing Language Models in Parameters with a Single Forward Pass — Tianyu Chen, Huanran Fang, Patrick Xia, Xiaodong Liu, Benjamin Van Durme, Luke Zettlemoyer, Jianfeng Gao, Hao Cheng, 2025 https://scholar.google.com/scholar?q=Generative+Adapter:+Contextualizing+Language+Models+in+Parameters+with+a+Single+Forward+Pass
    14. Cartridges: Lightweight and General-Purpose Long Context Representations via Self-Study — Sabri Eyuboglu, Ryan Saul Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Ruoyu Liu, William Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, Christopher Re, 2025 https://scholar.google.com/scholar?q=Cartridges:+Lightweight+and+General-Purpose+Long+Context+Representations+via+Self-Study
    15. Propagating Knowledge Updates to LMs through Distillation — Suchin Padmanabhan, Yoon Kim Onoe, Michael Zhang, Greg Durrett, Eunsol Choi, 2023 https://scholar.google.com/scholar?q=Propagating+Knowledge+Updates+to+LMs+through+Distillation
    16. LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression — Zefan Pan, Qipeng Wu, Hao Jiang, Mengzhou Xia, Xuefei Luo, Jiaqi Zhang, Qingyu Lin, Viktor Ruhle, Yi Yang, Chin-Yew Lin, H. Vicky Zhao, Lidong Qiu, Dongmei Zhang, 2024 https://scholar.google.com/scholar?q=LLMLingua-2:+Data+Distillation+for+Efficient+and+Faithful+Task-Agnostic+Prompt+Compression
    17. RazorAttention: Efficient KV Cache Compression Through Retrieval Heads — Hanlin Tang et al., 2024 https://scholar.google.com/scholar?q=RazorAttention:+Efficient+KV+Cache+Compression+Through+Retrieval+Heads
    18. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — Yu Fu et al., 2024/2025 https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
    19. How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? — Sergey Pletenev et al., 2025 https://scholar.google.com/scholar?q=How+Much+Knowledge+Can+You+Pack+into+a+LoRA+Adapter+without+Harming+LLM?
    20. Can Fine-Tuning Erase Your Edits? On the Fragile Coexistence of Knowledge Editing and Adaptation — Yinjie Cheng et al., 2025 https://scholar.google.com/scholar?q=Can+Fine-Tuning+Erase+Your+Edits?+On+the+Fragile+Coexistence+of+Knowledge+Editing+and+Adaptation
    21. Memorization in In-Context Learning — Shahriar Golchin et al., 2024 https://scholar.google.com/scholar?q=Memorization+in+In-Context+Learning
    22. In-Context Learning can Perform Continual Learning Like Humans — Liuwang Kang et al., 2025 https://scholar.google.com/scholar?q=In-Context+Learning+can+Perform+Continual+Learning+Like+Humans
    23. AI Post Transformers: LoRA: Low-Rank Adaptation of Large Language Models — Hal Turing & Dr. Ada Shannon, Fri, https://podcast.do-not-panic.com/episodes/lora-low-rank-adaptation-of-large-language-models/
    24. AI Post Transformers: ShadowKV: High-Throughput Long-Context LLM Inference — Hal Turing & Dr. Ada Shannon, Wed, https://podcast.do-not-panic.com/episodes/shadowkv-high-throughput-long-context-llm-inference/
    25. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
    26. AI Post Transformers: Mem0: Scalable Long-Term Memory for AI Agents — Hal Turing & Dr. Ada Shannon, Tue, https://podcast.do-not-panic.com/episodes/mem0-scalable-long-term-memory-for-ai-agents/
    27. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon, Sun, https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/
    28. AI Post Transformers: ComoRAG: Cognitively Inspired Narrative Reasoning — Hal Turing & Dr. Ada Shannon, Tue, https://podcast.do-not-panic.com/episodes/comorag-cognitively-inspired-narrative-reasoning/

    Interactive Visualization: Doc-to-LoRA: Internalizing Context as LoRA
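    Since the episode turns on a hypernetwork emitting fixed-size parameter updates, a small sketch of the underlying LoRA arithmetic may help: the frozen base weight W gets a low-rank side delta (alpha/r) * B @ A. Here random arrays stand in for the factors a Perceiver-style hypernetwork would emit per document; the dimensions and scaling are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 32, 32, 4, 8.0   # hypothetical sizes, rank r << d

W = rng.normal(size=(d_out, d_in))       # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01    # "down" projection factor
B0 = np.zeros((d_out, r))                # "up" factor; zero init makes the
                                         # adapter a no-op before any update

def forward(x, B):
    # LoRA forward: y = W x + (alpha / r) * B (A x); W is never modified,
    # so swapping B (and A) swaps the internalized document.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
y_base = forward(x, B0)                  # identical to W @ x
B_doc = rng.normal(size=(d_out, r))      # stand-in for a hypernetwork output
y_doc = forward(x, B_doc)                # behavior shifted by the document
```

    The appeal for inference is visible even in the sketch: the adapter costs O(r * (d_in + d_out)) extra parameters per document, while nothing about the prompt or KV cache has to carry the document itself.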

  4. 3 DAYS AGO

    MAML and the Basics of Meta-Learning

    This episode explores meta-learning through the lens of MAML, explaining how it differs from ordinary supervised learning and standard transfer learning by explicitly training models to adapt quickly to new tasks after just one or a few gradient updates. It walks through the core idea of optimizing for post-update performance, including the role of second-order meta-gradients and the simpler first-order approximation, while placing MAML within the broader landscape of few-shot and gradient-based meta-learning. The discussion also highlights why the paper mattered across multiple domains, covering not just classification benchmarks like Omniglot and MiniImagenet but also regression with sinusoid fitting and reinforcement learning with fast-adapting policies. A listener would find it interesting because it turns a buzzword-heavy area into a concrete framework for thinking about how models can learn to learn, setting up deeper discussions about newer systems built on these ideas.

    Sources:
    1. MAML and the Basics of Meta-Learning https://arxiv.org/pdf/1703.03400
    2. https://par.nsf.gov/servlets/purl/10427895 https://par.nsf.gov/servlets/purl/10427895
    3. Optimization as a Model for Few-Shot Learning — Sachin Ravi, Hugo Larochelle, 2017 https://scholar.google.com/scholar?q=Optimization+as+a+Model+for+Few-Shot+Learning
    4. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks — Chelsea Finn, Pieter Abbeel, Sergey Levine, 2017 https://scholar.google.com/scholar?q=Model-Agnostic+Meta-Learning+for+Fast+Adaptation+of+Deep+Networks
    5. RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning — Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel, 2016 https://scholar.google.com/scholar?q=RL^2:+Fast+Reinforcement+Learning+via+Slow+Reinforcement+Learning
    6. Meta-Learning in Neural Networks: A Survey — Timothy M. Hospedales, Antreas Antoniou, Paul Micaelli, Amos J. Storkey, 2021 https://scholar.google.com/scholar?q=Meta-Learning+in+Neural+Networks:+A+Survey
    7. Matching Networks for One Shot Learning — Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra, 2016 https://scholar.google.com/scholar?q=Matching+Networks+for+One+Shot+Learning
    8. Prototypical Networks for Few-shot Learning — Jake Snell, Kevin Swersky, Richard Zemel, 2017 https://scholar.google.com/scholar?q=Prototypical+Networks+for+Few-shot+Learning
    9. A Closer Look at Few-shot Classification — Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, Jia-Bin Huang, 2019 https://scholar.google.com/scholar?q=A+Closer+Look+at+Few-shot+Classification
    10. Generalizing from a Few Examples — Yaqing Wang, Quanming Yao, James T. Kwok, Lionel M. Ni, 2020 https://scholar.google.com/scholar?q=Generalizing+from+a+Few+Examples
    11. Meta-SGD: Learning to Learn Quickly for Few-Shot Learning — Zhenguo Li, Fengwei Zhou, Fei Chen, Hang Li, 2017 https://scholar.google.com/scholar?q=Meta-SGD:+Learning+to+Learn+Quickly+for+Few-Shot+Learning
    12. On First-Order Meta-Learning Algorithms — Alex Nichol, Joshua Achiam, John Schulman, 2018 https://scholar.google.com/scholar?q=On+First-Order+Meta-Learning+Algorithms
    13. How to Train Your MAML — Antreas Antoniou, Harrison Edwards, Amos Storkey, 2019 https://scholar.google.com/scholar?q=How+to+Train+Your+MAML
    14. Meta-learning with Differentiable Closed-Form Solvers — Luca Bertinetto, Joao F. Henriques, Philip H. S. Torr, Andrea Vedaldi, 2018 https://scholar.google.com/scholar?q=Meta-learning+with+Differentiable+Closed-Form+Solvers
    15. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables — Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, Deirdre Quillen, 2019 https://scholar.google.com/scholar?q=Efficient+Off-Policy+Meta-Reinforcement+Learning+via+Probabilistic+Context+Variables
    16. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning — Ronald J. Williams, 1992 https://scholar.google.com/scholar?q=Simple+Statistical+Gradient-Following+Algorithms+for+Connectionist+Reinforcement+Learning
    17. Trust Region Policy Optimization — John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, Pieter Abbeel, 2015 https://scholar.google.com/scholar?q=Trust+Region+Policy+Optimization
    18. Proximal Policy Optimization Algorithms — John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, 2017 https://scholar.google.com/scholar?q=Proximal+Policy+Optimization+Algorithms
    19. Meta-Learning with Memory-Augmented Neural Networks — Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, Timothy Lillicrap, 2016 https://scholar.google.com/scholar?q=Meta-Learning+with+Memory-Augmented+Neural+Networks
    20. Learning to Learn by Gradient Descent by Gradient Descent — Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Nando de Freitas, 2016 https://scholar.google.com/scholar?q=Learning+to+Learn+by+Gradient+Descent+by+Gradient+Descent
    21. Transformers learn in-context by gradient descent — Johannes von Oswald et al., 2022 https://scholar.google.com/scholar?q=Transformers+learn+in-context+by+gradient+descent
    22. In-context Learning and Gradient Descent Revisited — Gilad Deutch et al., 2023 https://scholar.google.com/scholar?q=In-context+Learning+and+Gradient+Descent+Revisited
    23. Low-Rank Few-Shot Adaptation of Vision-Language Models — Maxime Zanella and Ismail Ben Ayed, 2024 https://scholar.google.com/scholar?q=Low-Rank+Few-Shot+Adaptation+of+Vision-Language+Models
    24. Meta-Adapter: An Online Few-shot Learner for Vision-Language Model — Cheng Cheng et al., 2023 https://scholar.google.com/scholar?q=Meta-Adapter:+An+Online+Few-shot+Learner+for+Vision-Language+Model
    25. Cross-Domain Few-Shot Learning via Adaptive Transformer Networks — Naeem Paeedeh et al., 2024 https://scholar.google.com/scholar?q=Cross-Domain+Few-Shot+Learning+via+Adaptive+Transformer+Networks
    26. Few-shot Adaptation of Multi-modal Foundation Models: A Survey — Fan Liu et al., 2024 https://scholar.google.com/scholar?q=Few-shot+Adaptation+of+Multi-modal+Foundation+Models:+A+Survey
    27. AI Post Transformers: In-Context Learning as Implicit Learning Algorithms — Hal Turing & Dr. Ada Shannon, Wed, https://podcast.do-not-panic.com/episodes/in-context-learning-as-implicit-learning-algorithms/
    28. AI Post Transformers: NVIDIA: TTT-E2E: Unlocking Long-Context Learning via End-to-End Test-Time Training — Hal Turing & Dr. Ada Shannon, Sat, https://podcast.do-not-panic.com/episodes/nvidia-ttt-e2e-unlocking-long-context-learning-via-end-to-end-test-time-training/
    29. AI Post Transformers: Zero-Shot Context Generalization in Reinforcement Learning from Few Training Contexts — Hal Turing & Dr. Ada Shannon, Tue, https://podcast.do-not-panic.com/episodes/zero-shot-context-generalization-in-reinforcement-learning-from-few-training-con/
    30. AI Post Transformers: A 2024 Survey Analyzing Generalization in Deep Reinforcement Learning — Hal Turing & Dr. Ada Shannon, Fri, https://podcast.do-not-panic.com/episodes/a-2024-survey-analyzing-generalization-in-deep-reinforcement-learning/

    Interactive Visualization: MAML and the Basics of Meta-Learning
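    The inner/outer loop of MAML can be made concrete with a deliberately tiny example using the first-order approximation mentioned above: a scalar parameter theta, and tasks whose losses are simple quadratics L_i(theta) = (theta - a_i)^2. This is an illustrative toy, not the paper's sinusoid or RL setups; in full MAML the outer gradient would also differentiate through the inner update.

```python
# Toy first-order MAML (FOMAML): each task i asks the scalar model theta
# to match a target a_i, with loss L_i(theta) = (theta - a_i)**2.

def grad(theta: float, a: float) -> float:
    return 2.0 * (theta - a)  # dL/dtheta for one task

def fomaml_step(theta: float, tasks, inner_lr=0.1, meta_lr=0.05) -> float:
    meta_grad = 0.0
    for a in tasks:
        # Inner loop: one adaptation step on the task's own loss.
        theta_i = theta - inner_lr * grad(theta, a)
        # Outer loop (first-order): gradient of the post-update loss,
        # evaluated at the adapted parameters, applied to theta.
        meta_grad += grad(theta_i, a)
    return theta - meta_lr * meta_grad / len(tasks)

theta = 5.0
tasks = [-1.0, 0.0, 1.0]   # hypothetical task targets
for _ in range(200):
    theta = fomaml_step(theta, tasks)
# theta converges toward the initialization (here 0.0, the targets' mean)
# from which one inner step does best on average across tasks.
```

    The point of the toy is the structure, not the problem: the meta-update optimizes post-adaptation performance, which is exactly what separates MAML from plain multi-task training.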

  5. 4 DAYS AGO

    Agentic AI and the Next Intelligence Explosion

    This episode explores a perspective paper arguing that the next major leap in AI may come less from scaling a single model and more from organizing intelligence across agents, tools, humans, and institutions. It explains key ideas including agentic AI, multi-agent reasoning, human-AI centaurs, and “societies of thought,” where useful reasoning may emerge through internal dialogue among specialized perspectives rather than just longer single-threaded outputs. The discussion contrasts straightforward parameter scaling with the harder problem of organizational design, emphasizing that collective intelligence only works under specific conditions such as good communication, balanced participation, and careful aggregation. Listeners would find it interesting because it reframes the usual singularity story into a concrete debate about coordination, role design, and whether intelligence scales socially as much as technically.

    Sources:
    1. Agentic AI and the next intelligence explosion — James Evans, Benjamin Bratton, Blaise Agüera y Arcas, 2026 http://arxiv.org/abs/2603.20639
    2. Evidence for a Collective Intelligence Factor in the Performance of Human Groups — Anita Williams Woolley, Christopher F. Chabris, Alex Pentland, Nada Hashmi, Thomas W. Malone, 2010 https://scholar.google.com/scholar?q=Evidence+for+a+Collective+Intelligence+Factor+in+the+Performance+of+Human+Groups
    3. AI-enhanced Collective Intelligence — Hao Cui, Taha Yasseri, 2024 https://scholar.google.com/scholar?q=AI-enhanced+Collective+Intelligence
    4. Artificial Intelligence for Collective Intelligence: a National-scale Research Strategy — Seth Bullock and many coauthors, 2024 https://scholar.google.com/scholar?q=Artificial+Intelligence+for+Collective+Intelligence:+a+National-scale+Research+Strategy
    5. Artificial Intelligence versus Collective Intelligence — Harry Halpin, 2025 https://scholar.google.com/scholar?q=Artificial+Intelligence+versus+Collective+Intelligence
    6. Man-Computer Symbiosis — J. C. R. Licklider, 1960 https://scholar.google.com/scholar?q=Man-Computer+Symbiosis
    7. Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality — Fabrizio Dell'Acqua, Edward McFowland III, Ethan Mollick, Hila Lifshitz-Assaf, Katherine C. Kellogg, Saran Rajendran, Lisa Krayer, Francois Candelon, Karim R. Lakhani, 2023 https://scholar.google.com/scholar?q=Navigating+the+Jagged+Technological+Frontier:+Field+Experimental+Evidence+of+the+Effects+of+AI+on+Knowledge+Worker+Productivity+and+Quality
    8. When Combinations of Humans and AI are Useful: A Systematic Review and Meta-analysis — Michelle Vaccaro, Abdullah Almaatouq, Thomas W. Malone, 2024 https://scholar.google.com/scholar?q=When+Combinations+of+Humans+and+AI+are+Useful:+A+Systematic+Review+and+Meta-analysis
    9. Effective Generative AI: The Human-Algorithm Centaur — Soroush Saghafian, Lihi Idan, 2024 https://scholar.google.com/scholar?q=Effective+Generative+AI:+The+Human-Algorithm+Centaur
    10. Reasoning Models Generate Societies of Thought — Junsol Kim, Shiyang Lai, Nino Scherrer, Blaise Aguera y Arcas, James Evans, 2026 https://scholar.google.com/scholar?q=Reasoning+Models+Generate+Societies+of+Thought
    11. Self-Consistency Improves Chain of Thought Reasoning in Language Models — Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou, 2022 https://scholar.google.com/scholar?q=Self-Consistency+Improves+Chain+of+Thought+Reasoning+in+Language+Models
    12. Improving Factuality and Reasoning in Language Models through Multiagent Debate — Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch, 2024 https://scholar.google.com/scholar?q=Improving+Factuality+and+Reasoning+in+Language+Models+through+Multiagent+Debate
    13. DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning — Daya Guo and many coauthors, 2025 https://scholar.google.com/scholar?q=DeepSeek-R1+Incentivizes+Reasoning+in+LLMs+through+Reinforcement+Learning
    14. CAMEL: Communicative Agents for "Mind" Exploration of Large Scale Language Model Society — Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, Bernard Ghanem, 2023 https://scholar.google.com/scholar?q=CAMEL:+Communicative+Agents+for+"Mind"+Exploration+of+Large+Scale+Language+Model+Society
    15. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation — Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, Chi Wang, 2024 https://scholar.google.com/scholar?q=AutoGen:+Enabling+Next-Gen+LLM+Applications+via+Multi-Agent+Conversation
    16. A Survey on LLM-based Multi-agent Systems: Workflow, Infrastructure, and Challenges — Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, Yi Yang, 2024 https://scholar.google.com/scholar?q=A+Survey+on+LLM-based+Multi-agent+Systems:+Workflow,+Infrastructure,+and+Challenges
    17. Deep Reinforcement Learning from Human Preferences — P. F. Christiano et al., 2017 https://scholar.google.com/scholar?q=Deep+Reinforcement+Learning+from+Human+Preferences
    18. Constitutional AI: Harmlessness from AI Feedback — Y. Bai et al., 2022 https://scholar.google.com/scholar?q=Constitutional+AI:+Harmlessness+from+AI+Feedback
    19. Large AI Models Are Cultural and Social Technologies — H. Farrell, A. Gopnik, C. Shalizi, J. Evans, 2025 https://scholar.google.com/scholar?q=Large+AI+Models+Are+Cultural+and+Social+Technologies
    20. Governing the Commons: The Evolution of Institutions for Collective Action — E. Ostrom, 1990 https://scholar.google.com/scholar?q=Governing+the+Commons:+The+Evolution+of+Institutions+for+Collective+Action
    21. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them — Mirac Suzgun, Nathan Scales, Nathanael Scharli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, Jason Wei, 2022 https://scholar.google.com/scholar?q=Challenging+BIG-Bench+Tasks+and+Whether+Chain-of-Thought+Can+Solve+Them
    22. SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning — Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao, 2025 https://scholar.google.com/scholar?q=SoftCoT++:+Test-Time+Scaling+with+Soft+Chain-of-Thought+Reasoning
    23. Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning — Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao and others, 2025 https://scholar.google.com/scholar?q=Exploring+the+Limit+of+Outcome+Reward+for+Learning+Mathematical+Reasoning
    24. Linking Process to Outcome: Conditional Reward Modeling for LLM Reasoning — Zheng Zhang, Ziwei Shan, Kaitao Song, Yexin Li, Kan Ren, 2025 https://scholar.google.com/scholar?q=Linking+Process+to+Outcome:+Conditional+Reward+Modeling+for+LLM+Reasoning
    25. SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward — Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, Xiangyu Yue, 2025 https://scholar.google.com/scholar?q=SophiaVL-R1:+Reinforcing+MLLMs+Reasoning+with+Thinking+Reward
    26. Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions — Eric Zelikman, Qian Huang and others, 2022 https://scholar.google.com/scholar?q=Parsel:+Algorithmic+Reasoning+with+Language+Models+by+Composing+Decompositions
    27. Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning — Haozhe Wang, Qixin Xu, Che Liu, Junhong Wu, Fangzhen Lin, Wenhu Chen, 2025 https://scholar.google.com/scholar?q=Emergent+Hierarchical+Reasoning+in+LLMs+through+Reinforcement+Learning
    28.
Debate4MATH: Multi-Agent Debate for Fine-Grained Reasoning in Math — Shaowei Zhang, Deyi Xiong, 2025 https://scholar.google.com/scholar?q=Debate4MATH:+Multi-Agent+Debate+for+Fine-Grained+Reasoning+in+Math 29. Learning to Break: Knowledge-Enhanced Reasoning in Multi-Agent Debate System — Haotian Wang, Xiyuan Du, Weijiang Yu, Qianglong Chen, Kun Zhu, Zheng Chu, Lian Yan, Yi Guan, 2025 https://scholar.google.com/scholar?q=Learning+to+Break:+Knowledge-Enhanced+Reasoning+in+Multi-Agent+Debate+System 30. AI Post Transformers: Reasoning Models Generate Societies of Thought — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/reasoning-models-generate-societies-of-thought/ 31. AI Post Transformers: HyperAgents and Metacognitive Self-Improvement — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-26-hyperagents-and-metacognitive-self-impro-de711a.mp3 32. AI Post Transformers: Bloom: an open source tool for automated behavioral evaluations — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/bloom-an-open-source-tool-for-automated-behavioral-evaluations/ 33. AI Post Transformers: NeurIPS 2025: Reward Reasoning Model — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/neurips-2025-reward-reasoning-model/ 34. AI Post Transformers: MASA: Meta-Awareness via Self-Alignment Reinforcement Learning — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/masa-meta-awareness-via-self-alignment-reinforcement-learning/ 35. AI Post Transformers: Evolving Language Models Without Labels: EVOL-RL — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/evolving-language-models-without-labels-evol-rl/ 36. AI Post Transformers: LeCun's AMI Energy-Based Models and the Path to Autonomous Intelligence — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/lecuns-ami-energy-based-models-and-the-path-to-autonomous-intelligence/
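The "careful aggregation" condition the episode highlights can be made concrete with the simplest aggregator used in the cited self-consistency line of work: sample several answers (here, from hypothetical specialized agents) and take a thresholded plurality vote. The function name, threshold, and example answers below are illustrative, not from the paper.

```python
from collections import Counter

def aggregate(answers, min_agreement=0.5):
    """Plurality vote over independent agent answers, in the spirit of
    self-consistency (Wang et al., 2022). Returns the winning answer and
    whether it clears an agreement threshold."""
    counts = Counter(answers)
    best, votes = counts.most_common(1)[0]
    confident = votes / len(answers) >= min_agreement
    return best, confident

# Five hypothetical specialized agents answer the same question.
answers = ["42", "42", "41", "42", "43"]
best, confident = aggregate(answers)
# best == "42"; 3/5 agreement clears the 0.5 threshold
```

The point the episode makes is that this is the easy case: a discrete answer space and independent voters. Balanced participation and communication structure matter precisely when those assumptions fail.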

  6. 5 DAYS AGO

    Episode: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

    This episode explores TurboQuant, a method for compressing high-dimensional vectors online without learning a dataset-specific codebook first, aimed at settings like LLM KV-cache compression and approximate nearest neighbor search. It explains why vector quantization is a different problem from ordinary weight quantization, and why preserving inner products can matter just as much as minimizing reconstruction error for retrieval quality and attention behavior. The discussion focuses on the paper’s central idea that a random rotation can regularize vectors enough for simple scalar quantization to approach information-theoretic distortion limits, at least under the paper’s theoretical assumptions. Listeners would find it interesting because it connects rate-distortion theory to concrete systems bottlenecks in modern AI, while also critically examining where the paper’s theoretical strength outpaces its empirical validation. Sources: 1. TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni, 2025 http://arxiv.org/abs/2504.19874 2. Product Quantization for Nearest Neighbor Search — Herve Jegou, Matthijs Douze, Cordelia Schmid, 2011 https://scholar.google.com/scholar?q=Product+Quantization+for+Nearest+Neighbor+Search 3. Quantization based Fast Inner Product Search — Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, David Simcha, 2016 https://scholar.google.com/scholar?q=Quantization+based+Fast+Inner+Product+Search 4. Norm-Explicit Quantization: Improving Vector Quantization for Maximum Inner Product Search — Xinyan Dai, Xiao Yan, Kelvin K. W. Ng, Jiu Liu, James Cheng, 2020 https://scholar.google.com/scholar?q=Norm-Explicit+Quantization:+Improving+Vector+Quantization+for+Maximum+Inner+Product+Search 5. 
Accelerating Large-Scale Inference with Anisotropic Vector Quantization — Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, Sanjiv Kumar, 2020 https://scholar.google.com/scholar?q=Accelerating+Large-Scale+Inference+with+Anisotropic+Vector+Quantization 6. QJL: 1-bit Quantized JL Transform for KV Cache Quantization with Zero Overhead — Amir Zandieh, Majid Daliri, Insu Han, 2024 https://scholar.google.com/scholar?q=QJL:+1-bit+Quantized+JL+Transform+for+KV+Cache+Quantization+with+Zero+Overhead 7. PolarQuant: Quantizing KV Caches with Polar Transformation — Insu Han, Praneeth Kacham, Amin Karbasi, Vahab Mirrokni, Amir Zandieh, 2025 https://scholar.google.com/scholar?q=PolarQuant:+Quantizing+KV+Caches+with+Polar+Transformation 8. Practical and Asymptotically Optimal Quantization of High-Dimensional Vectors in Euclidean Space for Approximate Nearest Neighbor Search — Jianyang Gao, Yutong Gou, Yuexuan Xu, Yongyi Yang, Cheng Long, Raymond Chi-Wing Wong, 2024 https://scholar.google.com/scholar?q=Practical+and+Asymptotically+Optimal+Quantization+of+High-Dimensional+Vectors+in+Euclidean+Space+for+Approximate+Nearest+Neighbor+Search 9. KIVI: A Tuning-Free Asymmetric 2-bit Quantization for KV Cache — Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu, 2024 https://scholar.google.com/scholar?q=KIVI:+A+Tuning-Free+Asymmetric+2-bit+Quantization+for+KV+Cache 10. KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs — Zunhai Su, Kehong Yuan, 2025 https://scholar.google.com/scholar?q=KVSink:+Understanding+and+Enhancing+the+Preservation+of+Attention+Sinks+in+KV+Cache+Quantization+for+LLMs 11. 
ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification — Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang, 2024 https://scholar.google.com/scholar?q=ZipCache:+Accurate+and+Efficient+KV+Cache+Quantization+with+Salient+Token+Identification 12. AKVQ-VL: Attention-Aware KV Cache Adaptive 2-Bit Quantization for Vision-Language Models — Zunhai Su, Wang Shen, Linge Li, Zhe Chen, Hanyu Wei, Huangqi Yu, Kehong Yuan, 2025 https://scholar.google.com/scholar?q=AKVQ-VL:+Attention-Aware+KV+Cache+Adaptive+2-Bit+Quantization+for+Vision-Language+Models 13. SpinQuant: LLM Quantization with Learned Rotations — Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort, 2024 https://scholar.google.com/scholar?q=SpinQuant:+LLM+Quantization+with+Learned+Rotations 14. Rotate, Clip, and Partition: Towards W2A4KV4 Quantization by Integrating Rotation and Learnable Non-uniform Quantizer — Euntae Choi, Sumin Song, Woosang Lim, Sungjoo Yoo, 2025 https://scholar.google.com/scholar?q=Rotate,+Clip,+and+Partition:+Towards+W2A4KV4+Quantization+by+Integrating+Rotation+and+Learnable+Non-uniform+Quantizer 15. Locally-Adaptive Quantization for Streaming Vector Search — Cecilia Aguerrebere, Mark Hildebrand, Ishwar Singh Bhati, Theodore Willke, Mariano Tepper, 2024 https://scholar.google.com/scholar?q=Locally-Adaptive+Quantization+for+Streaming+Vector+Search 16. Sampling Methods for Inner Product Sketching — Majid Daliri, Juliana Freire, Christopher Musco, Aecio Santos, Haoxiang Zhang, 2024 https://scholar.google.com/scholar?q=Sampling+Methods+for+Inner+Product+Sketching 17. AI Post Transformers: Memory Traffic Saturation in Transformer Decode — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-20-memory-traffic-saturation-in-transformer-cd4961.mp3 18. AI Post Transformers: LAQ for Smarter KV Cache Eviction — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-23-laq-for-smarter-kv-cache-eviction-3ea2b8.mp3 19. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3 20. AI Post Transformers: AWQ: On-Device LLM Compression and Acceleration — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/awq-on-device-llm-compression-and-acceleration/ 21. AI Post Transformers: Sentence-BERT: Siamese Networks for Sentence Embeddings — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/sentence-bert-siamese-networks-for-sentence-embeddings/ 22. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3
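A minimal sketch of the rotation idea the episode describes, under simplifying assumptions: a Haar-random orthogonal rotation spreads a badly scaled vector's energy evenly across coordinates, after which a single uniform scalar quantizer per vector works well. TurboQuant's actual construction, bit allocation, and distortion analysis differ; this only illustrates why "rotate, then quantize with something simple" is plausible.

```python
import numpy as np

def random_rotation(d, seed=0):
    # Haar-distributed orthogonal matrix via QR of a Gaussian matrix,
    # with sign correction so the distribution is uniform.
    g = np.random.default_rng(seed).standard_normal((d, d))
    q, r = np.linalg.qr(g)
    return q * np.sign(np.diag(r))

def quantize(x, bits=6):
    # After rotation, coordinates share a similar scale, so one uniform
    # per-vector scalar quantizer suffices.
    half = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) + 1e-12
    codes = np.round(x / scale * half).astype(np.int8)
    return codes, scale

def dequantize(codes, scale, bits=6):
    half = 2 ** (bits - 1) - 1
    return codes.astype(np.float64) * scale / half

rng = np.random.default_rng(1)
d = 128
R = random_rotation(d)
x = rng.standard_normal(d) * np.linspace(0.1, 10.0, d)  # badly conditioned coordinates
codes, scale = quantize(R @ x)            # quantize in the rotated basis
x_hat = R.T @ dequantize(codes, scale)    # rotate back to reconstruct
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

Because the rotation is orthogonal, it preserves norms and inner products exactly, so all distortion comes from the scalar quantizer, which is the property that makes the rate-distortion analysis tractable.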

  7. 5 DAYS AGO

    Splitwise: Phase-Split LLM Inference

    This episode examines Splitwise: Efficient Generative LLM Inference Using Phase Splitting, a 2024 systems paper from researchers at the University of Washington and Microsoft, and centers the discussion on a simple claim with large deployment consequences: prompt prefill and token decode are different enough that they should not necessarily run on the same hardware. The hosts walk through the basic mechanics of generative inference, explaining prefill as the parallel, compute-heavy stage that processes the prompt, and decode as the sequential, KV-cache-driven stage that generates tokens one by one. That distinction sets up the paper’s core argument that modern serving stacks are paying a penalty by treating inference as a uniform workload when its phases are constrained by very different resources. The conversation stays focused on why that split matters in practice. It unpacks phase heterogeneity in terms of throughput, latency, utilization, memory pressure, and power draw, and explains why decode can remain bottlenecked by memory bandwidth and capacity even on newer accelerators with far more raw FLOPs. From there, the episode explores Splitwise’s broader systems framing: if compute is scaling faster than memory, then assigning prefill to high-throughput hardware and decode to cheaper or lower-power machines may be a more realistic datacenter strategy than continuing to push everything through one homogeneous GPU fleet. The hosts also emphasize power-normalized evaluation as a more honest lens for operators than simple box-for-box performance comparisons. Along the way, the episode places Splitwise in public context alongside ORCA, PagedAttention, and SARATHI without losing its anchor. Those earlier systems are used to clarify what Splitwise does and does not claim: continuous batching, KV-cache-aware memory management, and batch reshaping all improve serving efficiency, but they do not eliminate the underlying asymmetry between prefill and decode. 
The result is a grounded discussion of phase splitting as a deployment decision rather than a purely algorithmic trick, with particular attention to where prefill-decode disaggregation looks compelling, where it depends on the realities of cluster design, and where the limits of PD disaggregation still leave open systems questions. Sources: 1. Splitwise: Efficient generative LLM inference using phase splitting — Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini, 2023 http://arxiv.org/abs/2311.18677 2. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024 http://arxiv.org/abs/2401.09670 3. DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference — Yongtong Wu, Shaoyuan Chen, Yinmin Zhong, Rilin Huang, Yixuan Tan, Wentao Zhang, Liyue Zhang, Shangyan Zhou, Yuxuan Liu, Shunfeng Zhou, Mingxing Zhang, Xin Jin, Panpan Huang, 2026 http://arxiv.org/abs/2602.21548 4. Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving — Zongze Li, Jingyu Liu, Zach Xu, Yineng Zhang, Tahseen Rabbani, Ce Zhang, 2026 http://arxiv.org/abs/2603.13358 5. Orca: A Distributed Serving System for Transformer-Based Generative Models — Gyeong-In Yu, Joo Seong Jeong, Gun-Woo Kim, Seungtae Kim, Byung-Gon Chun, 2022 https://scholar.google.com/scholar?q=Orca:+A+Distributed+Serving+System+for+Transformer-Based+Generative+Models 6. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 7. 
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills — Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nitin Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, 2023 https://scholar.google.com/scholar?q=SARATHI:+Efficient+LLM+Inference+by+Piggybacking+Decodes+with+Chunked+Prefills 8. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024 https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving 9. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving — Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2024 https://scholar.google.com/scholar?q=Mooncake:+A+KVCache-centric+Disaggregated+Architecture+for+LLM+Serving 10. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang, 2025 https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse 11. HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse — Yuwei An, Yihua Cheng, Seo Jin Park, Junchen Jiang, 2025 https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse 12. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation — Shihao Wang, Jiahao Chen, Yanqi Pan, Hao Huang and colleagues, 2026 https://scholar.google.com/scholar?q=ProphetKV:+User-Query-Driven+Selective+Recomputation+for+Efficient+KV+Cache+Reuse+in+Retrieval-Augmented+Generation 13. 
Accelerating LLM Inference with Staged Speculative Decoding — Benjamin Spector, Chris Re, 2023 https://scholar.google.com/scholar?q=Accelerating+LLM+Inference+with+Staged+Speculative+Decoding 14. SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices — Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin, 2024 https://scholar.google.com/scholar?q=SpecExec:+Massively+Parallel+Speculative+Decoding+for+Interactive+LLM+Inference+on+Consumer+Devices 15. KVDirect: Distributed Disaggregated LLM Inference — Shiyang Chen, Rain Jiang, Dezhi Yu, Jinlai Xu, Mengyuan Chao and colleagues, 2025 https://scholar.google.com/scholar?q=KVDirect:+Distributed+Disaggregated+LLM+Inference 16. Arrow: Adaptive Scheduling Mechanisms for Disaggregated LLM Inference Architecture — Yu Wu, Tongxuan Liu, Yuting Zeng, Siyu Wu, Jun Xiong, Xianzhe Dong and colleagues, 2025 https://scholar.google.com/scholar?q=Arrow:+Adaptive+Scheduling+Mechanisms+for+Disaggregated+LLM+Inference+Architecture 17. WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-based Dynamic Scheduling — Jingqi Feng, Yukai Huang, Rui Zhang, Sicheng Liang, Ming Yan, Jie Wu, 2025 https://scholar.google.com/scholar?q=WindServe:+Efficient+Phase-Disaggregated+LLM+Serving+with+Stream-based+Dynamic+Scheduling 18. AI Post Transformers: CacheSlide: Position-Aware KV Cache Reuse for Agent LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-cacheslide-position-aware-kv-cache-reuse-cd59c7.mp3 19. AI Post Transformers: SGLang: Efficient Language Model Program Execution — Hal Turing & Dr. Ada Shannon, https://podcast.do-not-panic.com/episodes/sglang-efficient-language-model-program-execution/ 20. AI Post Transformers: Episode: Speculative Speculative Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-speculative-speculative-decoding-1b7a10.mp3 21. 
AI Post Transformers: xLLM: Co-Locating Online and Offline LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-xllm-co-locating-online-and-offline-llm-10bb81.mp3 22. AI Post Transformers: FAST26: Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/fast26-bidaw-enhancing-key-value-caching-for-interactive-llm-serving-via-bidirec/ 23. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
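The prefill/decode asymmetry the hosts describe can be made concrete with a back-of-envelope arithmetic-intensity calculation: prefill amortizes each weight read over every prompt token, while decode streams the full weights plus the KV cache for a single new token. All model shapes below are hypothetical and not taken from the Splitwise paper.

```python
# Roofline-style estimate of FLOPs per byte moved in each inference phase.
def arithmetic_intensity(n_params, tokens, kv_bytes_read, dtype_bytes=2):
    flops = 2 * n_params * tokens                         # ~2 FLOPs per parameter per token
    bytes_moved = n_params * dtype_bytes + kv_bytes_read  # weight traffic + KV traffic
    return flops / bytes_moved

n_params = 13e9          # hypothetical 13B-parameter model in fp16
layers, hidden = 40, 5120
prompt_len = 2048

# Decode re-reads K and V for every layer and every cached token (fp16 = 2 bytes).
kv_bytes = 2 * layers * prompt_len * hidden * 2

prefill_ai = arithmetic_intensity(n_params, prompt_len, kv_bytes_read=0)
decode_ai = arithmetic_intensity(n_params, 1, kv_bytes_read=kv_bytes)
# prefill_ai lands orders of magnitude above decode_ai:
# compute-bound prefill vs memory-bandwidth-bound decode.
```

Roughly, prefill intensity scales with prompt length while decode stays near one FLOP per byte, which is exactly the gap that makes running the two phases on identically provisioned GPUs wasteful and motivates phase splitting.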

  8. 6 DAYS AGO

    Episode: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation

    This episode explores a systems paper on speeding up retrieval-augmented generation by reusing transformer KV cache state more intelligently, instead of recomputing long retrieved prompts from scratch on every request. It explains why RAG often improves grounding yet suffers from high time-to-first-token, especially when multiple retrieved chunks must be prefetched and encoded together. The discussion focuses on the paper’s central argument that naive chunk-level cache reuse breaks important cross-chunk interactions, and that the proposed FusionRAG Cache tries to preserve quality through offline chunk enrichment and selective online recomputation. A listener would find it interesting because it connects familiar RAG concepts to the real serving bottlenecks that determine whether enterprise assistants feel practical or painfully slow. Sources: 1. From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Jiahao Wang, Weiyu Xie, Mingxing Zhang, Boxing Zhang, Jianwei Dong, Yuening Zhu, Chen Lin, Jinqi Tang, Yaochen Han, Zhiyuan Ai, Xianglin Chen, Yongwei Wu, Congfeng Jiang, 2026 http://arxiv.org/abs/2601.12904v1 2. REALM: Retrieval-Augmented Language Model Pre-Training — Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Ming-Wei Chang, 2020 https://scholar.google.com/scholar?q=REALM:+Retrieval-Augmented+Language+Model+Pre-Training 3. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Kuttler, Mike Lewis, Wen-tau Yih, Tim Rocktaschel, Sebastian Riedel, Douwe Kiela, 2020 https://scholar.google.com/scholar?q=Retrieval-Augmented+Generation+for+Knowledge-Intensive+NLP+Tasks 4. 
Few-shot Learning with Retrieval Augmented Language Models — Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, Edouard Grave, 2022 https://scholar.google.com/scholar?q=Few-shot+Learning+with+Retrieval+Augmented+Language+Models 5. Retrieval-Augmented Generation for Large Language Models: A Survey — Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, Haofen Wang, 2023 https://scholar.google.com/scholar?q=Retrieval-Augmented+Generation+for+Large+Language+Models:+A+Survey 6. APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding — Xinyu Yang, Tianqi Chen, Beidi Chen, 2025 https://scholar.google.com/scholar?q=APE:+Faster+and+Longer+Context-Augmented+Generation+via+Adaptive+Parallel+Encoding 7. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion — Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 2024 https://scholar.google.com/scholar?q=CacheBlend:+Fast+Large+Language+Model+Serving+for+RAG+with+Cached+Knowledge+Fusion 8. Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation — Shubham Agarwal, Sai Narayan Sundaresan, Subrata Mitra, Deb Mahapatra, Tong Yu, Shiv Saini, 2025 https://scholar.google.com/scholar?q=Cache-Craft:+Managing+Chunk-Caches+for+Efficient+Retrieval-Augmented+Generation 9. From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Jiahao Wang, Weiyu Xie, Mingxing Zhang, Boxing Zhang, Jianwei Dong, Yuening Zhu, Chen Lin, Jinqi Tang, Yaochen Han, Zhiyuan Ai, Xianglin Chen, Yongwei Wu, Congfeng Jiang, 2026 https://scholar.google.com/scholar?q=From+Prefix+Cache+to+Fusion+RAG+Cache:+Accelerating+LLM+Inference+in+Retrieval-Augmented+Generation 10. 
HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse — Yuwei An, Yihua Cheng, Seo Jin Park, Junchen Jiang, 2025 https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse 11. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang, 2025 https://scholar.google.com/scholar?q=Kvlink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse 12. Sparse Attention across Multiple-context KV Cache — authors unverified, 2025/2026 https://scholar.google.com/scholar?q=Sparse+Attention+across+Multiple-context+KV+Cache 13. CacheClip: Accelerating RAG with Effective KV Cache Reuse — authors unverified, 2025/2026 https://scholar.google.com/scholar?q=CacheClip:+Accelerating+RAG+with+Effective+KV+Cache+Reuse 14. CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference — authors unverified, 2025/2026 https://scholar.google.com/scholar?q=CHESS:+Context-aware+Hierarchical+Efficient+Semantic+Selection+for+Long-Context+LLM+Inference 15. BudgetMem: Learning Selective Memory Policies for Cost-Efficient Long-Context Processing in Language Models — authors unverified, 2025/2026 https://scholar.google.com/scholar?q=BudgetMem:+Learning+Selective+Memory+Policies+for+Cost-Efficient+Long-Context+Processing+in+Language+Models 16. TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval — authors unverified, 2025/2026 https://scholar.google.com/scholar?q=TeleRAG:+Efficient+Retrieval-Augmented+Generation+Inference+with+Lookahead+Retrieval 17. Understanding and Optimizing Multi-Stage AI Inference Pipelines — authors unverified, 2025/2026 https://scholar.google.com/scholar?q=Understanding+and+Optimizing+Multi-Stage+AI+Inference+Pipelines 18. AI Post Transformers: Episode: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-21-lookaheadkv-fast-and-accurate-kv-c9d436.mp3 19. AI Post Transformers: CacheSlide: Position-Aware KV Cache Reuse for Agent LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-cacheslide-position-aware-kv-cache-reuse-cd59c7.mp3 20. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3
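A toy sketch of the chunk-reuse-plus-selective-recomputation pattern discussed above, with an invented ChunkKVCache class standing in for a real serving stack. Real systems cache transformer KV tensors and recompute attention states; here "states" are just token lists and encode_fn stands in for a full prefill pass, so only the bookkeeping pattern carries over.

```python
import hashlib

class ChunkKVCache:
    """Toy chunk-level cache: cold chunks pay full prefill, warm non-leading
    chunks re-encode only a small prefix so they can interact across the
    chunk boundary, and the leading chunk is reused exactly (prefix caching)."""

    def __init__(self, encode_fn, recompute_fraction=0.15):
        self.encode_fn = encode_fn
        self.recompute_fraction = recompute_fraction
        self.store = {}  # chunk hash -> cached per-token states

    def _key(self, chunk):
        return hashlib.sha256(chunk.encode()).hexdigest()

    def get_states(self, chunks):
        states, recomputed = [], 0
        for i, chunk in enumerate(chunks):
            cached = self.store.get(self._key(chunk))
            if cached is None:
                full = self.encode_fn(chunk)      # cold chunk: full prefill
                self.store[self._key(chunk)] = full
                states.append(full)
                recomputed += len(full)
            elif i == 0:
                states.append(cached)             # leading chunk: exact prefix reuse
            else:
                # Selective recomputation: refresh only a boundary prefix and
                # splice it onto the cached tail (a real system would run the
                # prefill for just those n tokens).
                n = max(1, int(len(cached) * self.recompute_fraction))
                fresh = self.encode_fn(chunk)
                states.append(fresh[:n] + cached[n:])
                recomputed += n
        return states, recomputed

# Demo with a toy "encoder": the first request pays full prefill, the repeat
# request re-encodes only the boundary prefixes.
enc = lambda chunk: list(chunk)
cache = ChunkKVCache(enc)
chunks = ["aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc"]
_, cold = cache.get_states(chunks)   # 30 tokens encoded
_, warm = cache.get_states(chunks)   # 2 tokens re-encoded
```

The design question the episode dwells on lives in that `recompute_fraction` knob: too small and cross-chunk interactions are lost (the failure mode of naive chunk reuse), too large and the time-to-first-token savings evaporate.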
