AI Post Transformers

mcgrof

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

  1. 2D AGO

    Titans: Learning to Memorize at Test Time

    This episode explores the Titans paper’s proposal to pair standard attention with a separate learned long-term memory that updates during inference, aiming to preserve distant information without paying full quadratic attention costs across very long sequences. It situates that idea against earlier approaches such as Neural Turing Machines, Transformer-XL, Compressive Transformers, Memorizing Transformers, and linear-attention recurrent models, highlighting the recurring tradeoff between precise recall and scalable memory. The discussion focuses on the paper’s most distinctive claim: memory writes are driven by a loss-based notion of surprise, making test-time memory updates look more like small online learning steps than a simple cache. Listeners would find it interesting because it gets at a central open question in modern AI systems design: whether neural networks can gain durable, useful memory at inference time without becoming too unstable, expensive, or operationally awkward to deploy. Sources: 1. Titans: Learning to Memorize at Test Time https://arxiv.org/pdf/2501.00663 2. Neural Turing Machines — Alex Graves, Greg Wayne, Ivo Danihelka, 2014 https://arxiv.org/abs/1410.5401 3. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context — Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov, 2019 https://arxiv.org/abs/1901.02860 4. Compressive Transformers for Long-Range Sequence Modelling — Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, Timothy P. Lillicrap, 2020 https://openreview.net/forum?id=SylKikSYDH 5. Memorizing Transformers — Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, Christian Szegedy, 2022 https://arxiv.org/abs/2203.08913 6. 
Learning to (learn at test time): RNNs with expressive hidden states — Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, and Sanmi Koyejo, 2024 https://scholar.google.com/scholar?q=Learning+to+(learn+at+test+time):+RNNs+with+expressive+hidden+states 7. Gated Delta Networks: Improving Mamba2 with Delta Rule — Songlin Yang, Jan Kautz, and Ali Hatamizadeh, 2024 https://scholar.google.com/scholar?q=Gated+Delta+Networks:+Improving+Mamba2+with+Delta+Rule 8. RULER: What's the Real Context Size of Your Long-Context Language Models? — Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg, 2024 https://scholar.google.com/scholar?q=RULER:+What's+the+Real+Context+Size+of+Your+Long-Context+Language+Models? 9. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack — Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev, 2024 https://scholar.google.com/scholar?q=BABILong:+Testing+the+Limits+of+LLMs+with+Long+Context+Reasoning-in-a-Haystack 10. ATLAS: Learning to Optimally Memorize the Context at Test Time — Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni, 2025 https://scholar.google.com/scholar?q=ATLAS:+Learning+to+Optimally+Memorize+the+Context+at+Test+Time 11. KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference — Alireza Nadali, Patrick Cooper, Ashutosh Trivedi, Alvaro Velasquez, 2026 https://scholar.google.com/scholar?q=KV-Fold:+One-Step+KV-Cache+Recurrence+for+Long-Context+Inference 12. SCBench: A KV Cache-Centric Analysis of Long-Context Methods — Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu, 2024/2025 https://scholar.google.com/scholar?q=SCBench:+A+KV+Cache-Centric+Analysis+of+Long-Context+Methods 13. 
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling — Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen, 2024 https://scholar.google.com/scholar?q=Samba:+Simple+Hybrid+State+Space+Models+for+Efficient+Unlimited+Context+Language+Modeling 14. Longhorn: State Space Models are Amortized Online Learners — Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, Qiang Liu, 2024 https://scholar.google.com/scholar?q=Longhorn:+State+Space+Models+are+Amortized+Online+Learners 15. Retrieval meets Long Context Large Language Models — Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, Bryan Catanzaro, 2023 https://scholar.google.com/scholar?q=Retrieval+meets+Long+Context+Large+Language+Models 16. Augmenting Language Models with Long-Term Memory — Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei, 2023 https://scholar.google.com/scholar?q=Augmenting+Language+Models+with+Long-Term+Memory 17. Test-Time Training Provably Improves Transformers as In-Context Learners — Halil Alperen Gozeten, M. Emrullah Ildiz, Xuechen Zhang, Mahdi Soltanolkotabi, Marco Mondelli, Samet Oymak, 2025 https://scholar.google.com/scholar?q=Test-Time+Training+Provably+Improves+Transformers+as+In-Context+Learners 18. AI Post Transformers: δ-mem and Online Memory for LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-13-d-mem-and-online-memory-for-llms-6622fa.mp3 19. AI Post Transformers: In-Place Test-Time Training for Transformers — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3 20. AI Post Transformers: Gated Linear Attention for Efficient Long Sequences — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-18-gated-linear-attention-for-efficient-lon-c858ab.mp3 21. 
AI Post Transformers: MELT: Decoupling Compute From Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-13-melt-decoupling-compute-from-memory-26430c.mp3 22. AI Post Transformers: Long Context Pre-Training with Lighthouse Attention — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-13-long-context-pre-training-with-lighthous-e85bbe.mp3 23. AI Post Transformers: Training Million-Token LLMs Beyond the Memory Barrier — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-04-training-million-token-llms-beyond-the-m-324edc.mp3 24. AI Post Transformers: Recursive Language Models for Arbitrarily Long Prompts — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-recursive-language-models-for-arbitraril-fbcd1c.mp3 25. AI Post Transformers: How Induction Heads Emerge in Transformers — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-03-how-induction-heads-emerge-in-transforme-a7bfcb.mp3 26. AI Post Transformers: DeepSeek-V4 and Practical Million-Token Context — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-25-deepseek-v4-and-practical-million-token-6f4de1.mp3
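
    The surprise-driven write described in this episode can be illustrated with a toy example. The sketch below is an assumption-laden simplification, not the paper's method: it uses a small linear memory matrix, a squared-error loss between the memory's prediction and the target value, and one plain gradient step, with the gradient norm standing in for surprise. The names predict, surprise_update, and lr are hypothetical.

```python
def predict(memory, key):
    """Read from a linear memory: returns memory @ key."""
    return [sum(w * k for w, k in zip(row, key)) for row in memory]


def surprise_update(memory, key, value, lr=0.1):
    """Take one online gradient step on the loss ||memory @ key - value||^2.

    Returns (new_memory, surprise), where surprise is the gradient norm.
    Assumption: a plain gradient step on a linear memory, chosen for brevity.
    """
    error = [p - v for p, v in zip(predict(memory, key), value)]
    # Gradient of the squared error with respect to the memory matrix.
    grad = [[2.0 * e * k for k in key] for e in error]
    surprise = sum(g * g for row in grad for g in row) ** 0.5
    new_memory = [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(memory, grad)
    ]
    return new_memory, surprise


# Hypothetical 2x2 memory and a single key/value pair, written twice.
memory = [[0.0, 0.0], [0.0, 0.0]]
key, value = [1.0, 0.0], [0.5, -0.5]
memory, first_surprise = surprise_update(memory, key, value)
memory, second_surprise = surprise_update(memory, key, value)
print(first_surprise, second_surprise)
```

    Writing the same key and value twice yields a smaller surprise the second time, which is the sense in which the update behaves like a small learning step rather than a cache insert.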

  2. 3D AGO

    Affordable Large-Scale Decoding Through Model-System Co-Design

    This episode explores the paper’s claim that decoding cost in large language models is driven less by raw parameter counts and more by hardware-level behavior during autoregressive generation, especially memory bandwidth pressure from the KV cache. It explains why metrics like total or activated parameters can be misleading cost proxies, and walks through the tradeoffs among standard attention, grouped-query variants, and newer approaches such as MFA that aim to preserve expressive power while reducing cache overhead. The discussion also highlights the paper’s central systems argument: attention and FFN layers have very different performance bottlenecks, so separating them through Attention-FFN Disaggregation can make large models cheaper to serve without sacrificing capability. A listener would find it interesting for its concrete, skeptical look at why inference efficiency depends on model-system co-design rather than headline model size alone. Sources: 1. Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — StepFun, Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li, Jingcheng Hu, Ka Man Lo, Ailin Huang, Binxing Jiao, Bo Li, Boyu Chen, Changxin Miao, Chang Lou, Chen Hu, Chen Xu, Chenfeng Yu, Chengyuan Yao, Daokuan Lv, Dapeng Shi, Deshan Sun, Ding Huang, Dingyuan Hu, Dongqing Pang, Enle Liu, Fajie Zhang, Fanqi Wan, Gulin Yan, Han Zhang, Han Zhou, Hanghao Wu, Hangyu Guo, Hanqi Chen, Hanshan Zhang, Hao Wu, Haocheng Zhang, Haolong Yan, Haoran Lv, Haoran Wei, Hebin Zhou, Heng Wang, Heng Wang, Hongxin Li, Hongyu Zhou, Hongyuan Wang, Huiyong Guo, Jia Wang, Jiahao Gong, Jialing Xie, Jian Zhou, Jianjian Sun, Jiaoren Wu, Jiaran Zhang, Jiayu Liu, Jie Cheng, Jie Luo, Jie Yan, Jie Yang, Jieyi Hou, Jinguang Zhang, Jinlan
Cao, Jisheng Yin, Junfeng Liu, Junhao Huang, Junzhe Lin, Kaijun Tan, Kaixiang Li, Kang An, Kangheng Lin, Kenkun Liu, Lei Yang, Liang Zhao, Liangyu Chen, Lieyu Shi, Liguo Tan, Lin Lin, Lin Zhang, Lina Chen, Liwen Huang, Liying Shi, Longlong Gu, Mei Chen, Mengqiang Ren, Ming Li, Mingzhe Chen, Na Wang, Nan Wu, Qi Han, Qian Zhao, Qiang Zhang, Qianni Liu, Qiaohui Chen, Qiling Wu, Qinglin He, Qinyuan Tan, Qiufeng Wang, Qiuping Wu, Qiuyan Liang, Quan Sun, Rui Li, Ruihang Miao, Ruosi Wan, Ruyan Guo, Shangwu Zhong, Shaoliang Pang, Shengjie Fan, Shijie Shang, Shilei Jiang, Shiliang Yang, Shiming Hao, Shuli Gao, Siming Huang, Siqi Liu, Tiancheng Cao, Tianhao Cheng, Tianhao Peng, Wang You, Wei Ji, Wen Sun, Wenjin Deng, Wenqing He, Wenzhen Zheng, Xi Chen, Xiangwen Kong, Xianzhen Luo, Xiaobo Yang, Xiaojia Liu, Xiaoxiao Ren, Xin Han, Xin Li, Xin Wu, Xu Zhao, Yanan Wei, Yang Li, Yangguang Li, Yangshijie Xu, Yanming Xu, Yaqiang Shi, Yeqing Shen, Yi Yang, Yifei Yang, Yifeng Gong, Yihan Chen, Yijing Yang, Yinmin Zhang, Yizhuang Zhou, Yuanhao Ding, Yuantao Fan, Yuanzhen Yang, Yuchu Luo, Yue Peng, Yufan Lu, Yuhang Deng, Yuhe Yin, Yujie Liu, Yukun Chen, Yuling Zhao, Yun Mou, Yunlong Li, Yunzhou Ju, Yusheng Li, Yuxiang Yang, Yuxiang Zhang, Yuyang Chen, Zejia Weng, Zhe Xie, Zheng Ge, Zheng Gong, Zhenyi Lu, Zhewei Huang, Zhichao Chang, Zhiguo Huang, Zhirui Wang, Zidong Yang, Zili Wang, Ziqi Wang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Xiangyu Zhang, 2025 http://arxiv.org/abs/2507.19427 2. Fast Transformer Decoding: One Write-Head is All You Need — Noam Shazeer, 2019 https://scholar.google.com/scholar?q=Fast+Transformer+Decoding:+One+Write-Head+is+All+You+Need 3. 
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints — Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai, 2023 https://scholar.google.com/scholar?q=GQA:+Training+Generalized+Multi-Query+Transformer+Models+from+Multi-Head+Checkpoints 4. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — Zhihong Shao and DeepSeek-AI et al., 2024 https://scholar.google.com/scholar?q=DeepSeek-V2:+A+Strong,+Economical,+and+Efficient+Mixture-of-Experts+Language+Model 5. Multi-matrix Factorization Attention — Jingcheng Hu, Houyi Li, Yinmin Zhang, Zili Wang, Shuigeng Zhou, Xiangyu Zhang, Heung-Yeung Shum, Daxin Jiang, 2024 https://scholar.google.com/scholar?q=Multi-matrix+Factorization+Attention 6. Splitwise: Efficient generative LLM inference using phase splitting — Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, Ricardo Bianchini, 2023 https://scholar.google.com/scholar?q=Splitwise:+Efficient+generative+LLM+inference+using+phase+splitting 7. P/D-Serve: Serving Disaggregated Large Language Model at Scale — Yibo Jin, Tao Wang, Huimin Lin and Huawei colleagues, 2024 https://scholar.google.com/scholar?q=P/D-Serve:+Serving+Disaggregated+Large+Language+Model+at+Scale 8. MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism — Ruidong Zhu, Ziheng Jiang, Chao Jin and ByteDance colleagues, 2025 https://scholar.google.com/scholar?q=MegaScale-Infer:+Serving+Mixture-of-Experts+at+Scale+with+Disaggregated+Expert+Parallelism 9. Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — StepFun et al., 2025 https://scholar.google.com/scholar?q=Step-3+is+Large+yet+Affordable:+Model-system+Co-design+for+Cost-effective+Decoding 10. DeepSeek-V3 Technical Report — DeepSeek-AI et al., 2024 https://scholar.google.com/scholar?q=DeepSeek-V3+Technical+Report 11. 
Qwen3 MoE 235B — Qwen Team / Alibaba researchers, 2025 https://scholar.google.com/scholar?q=Qwen3+MoE+235B 12. Prefill-Decode Disaggregation — Relevant serving-systems authors cited as [18, 31], 2024-2025 https://scholar.google.com/scholar?q=Prefill-Decode+Disaggregation 13. Kimi K2 Technical Report — Moonshot AI et al., 2025 https://scholar.google.com/scholar?q=Kimi+K2+Technical+Report 14. MiniMax M1 — MiniMax researchers, 2025 https://scholar.google.com/scholar?q=MiniMax+M1 15. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang et al., 2025 https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse 16. HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse — Yuwei An et al., 2025 https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse 17. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation — Shihao Wang et al., 2026 https://scholar.google.com/scholar?q=ProphetKV:+User-Query-Driven+Selective+Recomputation+for+Efficient+KV+Cache+Reuse+in+Retrieval-Augmented+Generation 18. HyperAttention: Long-context Attention in Near-Linear Time — Insu Han et al., 2023 https://scholar.google.com/scholar?q=HyperAttention:+Long-context+Attention+in+Near-Linear+Time 19. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention — Tsendsuren Munkhdalai et al., 2024 https://scholar.google.com/scholar?q=Leave+No+Context+Behind:+Efficient+Infinite+Context+Transformers+with+Infini-attention 20. Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning — Ling Team et al., 2025 https://scholar.google.com/scholar?q=Every+Attention+Matters:+An+Efficient+Hybrid+Architecture+for+Long-Context+Reasoning 21. 
KVDirect: Distributed Disaggregated LLM Inference — Shiyang Chen et al., 2024 https://scholar.google.com/scholar?q=KVDirect:+Distributed+Disaggregated+LLM+Inference 22. HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment — Youhe Jiang et al., 2025 https://scholar.google.com/scholar?q=HexGen-2:+Disaggregated+Generative+Inference+of+LLMs+in+Heterogeneous+Environment 23. GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference — Yu Han et al., 2025 https://scholar.google.com/scholar?q=GRACE-MoE:+Grouping+and+Replication+with+Locality-Aware+Routing+for+Efficient+Distributed+MoE+Inference 24. AI Post Transformers: JANUS for Scalable MoE Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-janus-for-scalable-moe-inference-78ae30.mp3 25. AI Post Transformers: Prefill-as-a-Service for Cross-Datacenter KV Cache — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-19-prefill-as-a-service-for-cross-datacente-7560be.mp3 26. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3 27. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3 28. AI Post Transformers: NanoFlow and the Future of LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-nanoflow-and-the-future-of-llm-serving-7429c9.mp3 29. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3 30. 
AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-04-speculative-decoding-in-real-vllm-s
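
    The KV-cache bandwidth argument in this episode comes down to simple arithmetic, sketched here. The model shape (32 layers, 32 query heads, head dimension 128, 2-byte values, 4096-token context) is a hypothetical configuration chosen for round numbers; it is not taken from the Step-3 paper, and the function name is an assumption.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache stored per token: one key and one value vector
    per KV head per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, not from any cited paper.
layers, query_heads, head_dim = 32, 32, 128

# Standard multi-head attention keeps one KV head per query head.
mha_bytes = kv_cache_bytes_per_token(layers, query_heads, head_dim)
# A grouped-query variant with 8 KV heads shares each KV head across 4 query heads.
gqa_bytes = kv_cache_bytes_per_token(layers, 8, head_dim)

# Each decode step reads the whole cache for the sequence once.
context_len = 4096
print(mha_bytes * context_len, gqa_bytes * context_len)
```

    In this configuration, standard attention reads 2 GiB of cache per sequence on every decode step, and the grouped-query variant reads a quarter of that. Parameter counts do not appear anywhere in the calculation.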

  3. 3D AGO

    NanoFlow and the Future of LLM Serving

    This episode explores how a line of recent systems papers culminates in NanoFlow, a serving approach that breaks LLM inference into very small “nano-batches” so different GPU-intensive operations can overlap instead of running in strict sequence. It explains the shift from thinking only about memory bottlenecks such as KV-cache movement and fragmentation toward a more nuanced claim: even if some kernels are memory-bound, overall serving throughput can still be limited by underused compute when prefill and decode are serialized. The discussion walks through the progression from micro-batching and iteration-level scheduling to chunked prefill, then shows how NanoFlow extends that logic with an auto-searched schedule that jointly chooses nano-batch size, operation ordering, and GPU resource allocation. A listener would find it interesting because it frames LLM serving not as a single-kernel optimization problem but as a broader question of hardware utilization, scheduling strategy, and the economics of running large models efficiently at scale. Sources: 1. NanoFlow and the Future of LLM Serving https://www.usenix.org/system/files/osdi25-zhu-kan.pdf 2. 2601.11822v1 https://arxiv.org/html/2601.11822v1 3. 2410.18038v2 https://arxiv.org/html/2410.18038v2 4. https://www.usenix.org/system/files/osdi24-agrawal.pdf 5. 1811.06965 https://arxiv.org/pdf/1811.06965 6. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Re, 2022 https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness 7. Orca: A Distributed Serving System for Transformer-Based Generative Models — Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun, 2022 https://scholar.google.com/scholar?q=Orca:+A+Distributed+Serving+System+for+Transformer-Based+Generative+Models 8. 
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve — Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, Ramachandran Ramjee, 2024 https://scholar.google.com/scholar?q=Taming+Throughput-Latency+Tradeoff+in+LLM+Inference+with+Sarathi-Serve 9. NanoFlow: Towards Optimal Large Language Model Serving Throughput — Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Ziren Wang, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci, 2025 https://scholar.google.com/scholar?q=NanoFlow:+Towards+Optimal+Large+Language+Model+Serving+Throughput 10. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 11. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024 https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving 12. ASPEN: Breaking Operator Barriers for Efficient Parallelization of Deep Neural Networks — Jongseok Park, Kyungmin Bin, Gibum Park, Sangtae Ha, Kyunghan Lee, 2023 https://scholar.google.com/scholar?q=ASPEN:+Breaking+Operator+Barriers+for+Efficient+Parallelization+of+Deep+Neural+Networks 13. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention — Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar, 2024 https://scholar.google.com/scholar?q=vAttention:+Dynamic+Memory+Management+for+Serving+LLMs+without+PagedAttention 14. 
MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool — Cunchen Hu et al., 2024 https://scholar.google.com/scholar?q=MemServe:+Context+Caching+for+Disaggregated+LLM+Serving+with+Elastic+Memory+Pool 15. DynaServe: Unified and Elastic Execution for Dynamic Disaggregated LLM Serving — Chaoyi Ruan et al., 2025 https://scholar.google.com/scholar?q=DynaServe:+Unified+and+Elastic+Execution+for+Dynamic+Disaggregated+LLM+Serving 16. LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention — Shang Yang et al., 2025 https://scholar.google.com/scholar?q=LServe:+Efficient+Long-sequence+LLM+Serving+with+Unified+Sparse+Attention 17. PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference — Dongjie Yang et al., 2024 https://scholar.google.com/scholar?q=PyramidInfer:+Pyramid+KV+Cache+Compression+for+High-throughput+LLM+Inference 18. ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference — Xiang Liu et al., 2025 https://scholar.google.com/scholar?q=ChunkKV:+Semantic-Preserving+KV+Cache+Compression+for+Efficient+Long-Context+LLM+Inference 19. Inference-Time Hyper-Scaling with KV Cache Compression — Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti, 2025 https://scholar.google.com/scholar?q=Inference-Time+Hyper-Scaling+with+KV+Cache+Compression 20. Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving — Ke Cheng et al., 2024 https://scholar.google.com/scholar?q=Slice-Level+Scheduling+for+High+Throughput+and+Load+Balanced+LLM+Serving 21. Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction — Haoran Qiu et al., 2024 https://scholar.google.com/scholar?q=Efficient+Interactive+LLM+Serving+with+Proxy+Model-based+Sequence+Length+Prediction 22. 
Deferred Continuous Batching in Resource-Efficient Large Language Model Serving — Yongjun He, Yao Lu, Gustavo Alonso, 2024 https://scholar.google.com/scholar?q=Deferred+Continuous+Batching+in+Resource-Efficient+Large+Language+Model+Serving 23. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3 24. AI Post Transformers: LAPS for Length-Aware LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3 25. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3 26. AI Post Transformers: Continuous Batching for LLM Inference: Throughput and Latency Gains — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/continuous-batching-for-llm-inference-throughput-and-latency-gains/ 27. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3 28. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3 29. AI Post Transformers: CacheFlow and 3D-Parallel KV Cache Restoration — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-01-cacheflow-and-3d-parallel-kv-cache-resto-8db883.mp3 30. AI Post Transformers: ScoutAttention for Efficient KV Cache Offloading — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-24-scoutattention-for-efficient-kv-cache-of-b26699.mp3 31. 
AI Post Transformers: FlashFuser and Hopper-Era FFN Kernel Fusion — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-flashfuser-and-hopper-era-ffn-kernel-fus-e1fce9.mp3
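
    The overlap idea behind nano-batching can be pictured with a toy timeline. The sketch below is only a model of the intuition, under strong assumptions: one compute-bound stage and one memory-bound stage per batch, two fully independent resources, and stage times that divide evenly across nano-batches. NanoFlow's real scheduler and resource sharing are not reproduced here, and the function names and durations are hypothetical.

```python
def serialized_time(compute, memory):
    """Whole batch runs the compute stage, then the memory stage."""
    return compute + memory


def overlapped_time(compute, memory, num_nano_batches):
    """Two-stage pipeline: while nano-batch i is in the memory stage,
    nano-batch i+1 may already be in the compute stage."""
    c = compute / num_nano_batches
    m = memory / num_nano_batches
    compute_free = 0.0  # time at which the compute resource is next idle
    memory_free = 0.0   # time at which the memory resource is next idle
    for _ in range(num_nano_batches):
        compute_done = compute_free + c
        compute_free = compute_done
        # The memory stage waits for both its input and the memory resource.
        memory_start = max(compute_done, memory_free)
        memory_free = memory_start + m
    return memory_free


# Hypothetical stage durations in arbitrary time units.
print(serialized_time(6.0, 4.0))     # 10.0
print(overlapped_time(6.0, 4.0, 4))  # 7.0
```

    With one nano-batch the pipeline degenerates to the serialized case. With more, the total time approaches the longer of the two stages, which is why underused compute can limit throughput even when individual kernels are memory-bound.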

  4. 3D AGO

    Serving MoE Models with Disaggregated Expert Parallelism

    This episode explores MegaScale-Infer, a systems paper on serving large mixture-of-experts language models by separating the attention path from the expert feed-forward path and scheduling them independently. It explains why MoE models can look efficient on paper yet still waste GPU capacity in practice, especially during decode, where KV-cache-heavy attention and uneven expert routing create very different bottlenecks. The discussion focuses on the paper’s core argument for disaggregated expert parallelism and a ping-pong microbatch pipeline designed to keep both attention and expert GPUs busy instead of leaving one side idle. Listeners would find it interesting for its clear look at the gap between model architecture and real-world serving performance, including a pointed debate over whether strong decode benchmarks actually translate into better end-to-end user latency. Sources: 1. MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism — Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, Xin Liu, 2025 http://arxiv.org/abs/2504.02263 2. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism — Yanping Huang, Yonglong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Miaosen Wang, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Zhifeng Chen, 2019 https://scholar.google.com/scholar?q=GPipe:+Efficient+Training+of+Giant+Neural+Networks+using+Pipeline+Parallelism 3. PipeDream: Generalized Pipeline Parallelism for DNN Training — Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, Phil Gibbons, Matei Zaharia, 2019 https://scholar.google.com/scholar?q=PipeDream:+Generalized+Pipeline+Parallelism+for+DNN+Training 4. 
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM — Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Nitin Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Matei Zaharia, 2021 https://scholar.google.com/scholar?q=Efficient+Large-Scale+Language+Model+Training+on+GPU+Clusters+Using+Megatron-LM 5. Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache — Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin, 2024 https://scholar.google.com/scholar?q=Infinite-LLM:+Efficient+LLM+Service+for+Long+Context+with+DistAttention+and+Distributed+KVCache 6. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024 https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving 7. Splitwise: Efficient Generative LLM Inference using Phase Splitting — Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, Ricardo Bianchini, 2023 https://scholar.google.com/scholar?q=Splitwise:+Efficient+Generative+LLM+Inference+using+Phase+Splitting 8. MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs — Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica, 2024 https://scholar.google.com/scholar?q=MoE-Lightning:+High-Throughput+MoE+Inference+on+Memory-constrained+GPUs 9. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — DeepSeek-AI et al., 2024 https://scholar.google.com/scholar?q=DeepSeek-V2:+A+Strong,+Economical,+and+Efficient+Mixture-of-Experts+Language+Model 10. 
Toward Efficient Inference for Mixture of Experts — Haiyang Huang, Newsha Ardalani, Anna Sun, Liu Ke, Hsien-Hsin S. Lee, Shruti Bhosale, Carole-Jean Wu, Benjamin Lee, 2024 https://scholar.google.com/scholar?q=Toward+Efficient+Inference+for+Mixture+of+Experts 11. AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference — Shuzhang Zhong, Ling Liang, Yuan Wang, Runsheng Wang, Ru Huang, Meng Li, 2024 https://scholar.google.com/scholar?q=AdapMoE:+Adaptive+Sensitivity-based+Expert+Gating+and+Management+for+Efficient+MoE+Inference 12. HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference — Shuzhang Zhong, Yanfan Sun, Ling Liang, Runsheng Wang, Ru Huang, Meng Li, 2025 https://scholar.google.com/scholar?q=HybriMoE:+Hybrid+CPU-GPU+Scheduling+and+Cache+Management+for+Efficient+MoE+Inference 13. Oracle-MoE: Locality-preserving Routing in the Oracle Space for Memory-constrained Large Language Model Inference — Jixian Zhou, Fang Dong, Ruijun Huang, Hengjie Cao, Mengyi Chen, Yifeng Yang, Anrui Chen, Mingzhi Dong, Yujiang Wang, Dongsheng Li, David A. Clifton, Qin Lv, Rui Zhu, Chun Zhang, Fan Yang, Tun Lu, Ning Gu, Li Shang, 2025 https://scholar.google.com/scholar?q=Oracle-MoE:+Locality-preserving+Routing+in+the+Oracle+Space+for+Memory-constrained+Large+Language+Model+Inference 14. Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism — Xinglin Pan, Shaohuai Shi, Wenxiang Lin, Yuxin Wang, Zhenheng Tang, Wei Wang, Xiaowen Chu, 2025 https://scholar.google.com/scholar?q=Efficient+MoE+Inference+with+Fine-Grained+Scheduling+of+Disaggregated+Expert+Parallelism 15. AI Post Transformers: JANUS for Scalable MoE Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-janus-for-scalable-moe-inference-78ae30.mp3 16. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3 17. AI Post Transformers: Prefill-as-a-Service for Cross-Datacenter KV Cache — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-19-prefill-as-a-service-for-cross-datacente-7560be.mp3 18. AI Post Transformers: NanoFlow and the Future of LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-nanoflow-and-the-future-of-llm-serving-7429c9.mp3 19. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3 20. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3 21. AI Post Transformers: LAPS for Length-Aware LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3 22. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
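
    Uneven expert routing, one of the bottlenecks named in this episode, can be made concrete by counting tokens per expert. The sketch below assumes top-2 routing over four experts with made-up assignments; the helper names and the max-over-mean imbalance metric are illustrative choices, not something taken from MegaScale-Infer.

```python
from collections import Counter


def expert_load(assignments, num_experts):
    """Count how many token slots each expert receives."""
    counts = Counter(expert for token in assignments for expert in token)
    return [counts.get(expert, 0) for expert in range(num_experts)]


def imbalance(load):
    """Ratio of the busiest expert's load to the mean load (1.0 is balanced)."""
    mean = sum(load) / len(load)
    return max(load) / mean


# Hypothetical top-2 routing decisions for six tokens over four experts.
assignments = [(0, 1), (0, 2), (0, 1), (0, 3), (1, 0), (0, 2)]
load = expert_load(assignments, num_experts=4)
print(load, imbalance(load))
```

    When one expert receives twice the mean load, the GPUs hosting the other experts wait on it. That idle time is the kind of waste that a model's nominal activated-parameter count does not show.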

  5. 3D AGO

    The Sparsity Wall: What Reiner Pope Told Dwarkesh About MoE and Sparse Attention

    A two-paper deep dive framed around the Dwarkesh Patel x Reiner Pope blackboard lecture on training and serving frontier LLMs. The hosts work through "Unified Scaling Laws for Routed Language Models" (Clark et al., DeepMind 2022, arXiv 2202.01169) for the mixture-of-experts side and the DeepSeek sparse-attention paper (arXiv 2512.02556) for the attention side, treating Pope's blackboard framing on the podcast as the pedagogical lens. The episode separates what the papers establish from what Pope's practitioner intuition adds on top, with particular attention to how MoE on the FFN side and sparse attention on the QK side attack independent cost pools and can compound rather than compete. Sources: 1. arXiv 2202.01169 — "Unified Scaling Laws for Routed Language Models" https://arxiv.org/pdf/2202.01169 2. arXiv 2512.02556 — DeepSeek sparse-attention paper https://arxiv.org/pdf/2512.02556 3. Dwarkesh Podcast — "Reiner Pope: The math behind how LLMs are trained and served" (April 29 2026) https://www.dwarkesh.com/p/reiner-pope Transcript: https://gist.github.com/dwarkeshsp/79100f0fdeed69d76241903bb0604dbe 4. Older MoE context: GShard (arXiv 2006.16668), Switch Transformer (arXiv 2101.03961) 5. Chinchilla scaling laws (arXiv 2203.15556) — referenced in the Pope episode Interactive Visualization: The Sparsity Wall: What Reiner Pope Told Dwarkesh About MoE and Sparse Attention
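    The "independent cost pools" point in the description above can be illustrated with a toy cost model. This is a sketch under loud assumptions, not a calculation from either paper or from the lecture: the per-layer split (60 units of FFN cost, 40 units of attention cost), the active-expert fraction, and the kept-attention fraction are hypothetical numbers, and `layer_cost` is a made-up helper.

```python
def layer_cost(ffn_cost, attn_cost, ffn_active_fraction=1.0, attn_kept_fraction=1.0):
    """Toy per-layer cost: two separate pools, each scaled by its own sparsity factor."""
    return ffn_cost * ffn_active_fraction + attn_cost * attn_kept_fraction


# Hypothetical dense baseline: 60 units of FFN work, 40 units of attention work.
FFN, ATTN = 60.0, 40.0

dense = layer_cost(FFN, ATTN)
# Assumed MoE setting: only a quarter of the FFN parameters are active per token.
moe_only = layer_cost(FFN, ATTN, ffn_active_fraction=0.25)
# Assumed sparse-attention setting: only an eighth of the attention work is kept.
sparse_only = layer_cost(FFN, ATTN, attn_kept_fraction=0.125)
both = layer_cost(FFN, ATTN, ffn_active_fraction=0.25, attn_kept_fraction=0.125)

# Because the two techniques touch different pools, their savings simply add up.
saving_moe = dense - moe_only
saving_sparse = dense - sparse_only
saving_both = dense - both
print(dense, moe_only, sparse_only, both)
print(saving_both == saving_moe + saving_sparse)
```

    In this toy model neither technique reduces what the other can save, which is the sense in which they compound rather than compete. Real systems add costs the model ignores, such as routing and communication overhead.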

  6. MAY 15

    Agentic AI as a Path to AGI

    This episode explores a position paper arguing that agentic AI systems, built from task decomposition, routing, specialized components, and explicit graph-like workflows, may offer a more credible path to AGI than simply scaling a single monolithic model. It examines how the paper frames AGI through both broad competence across environments and efficient skill acquisition, then asks whether real-world tasks are structured enough for modular systems to outperform one-model-fits-all approaches. The discussion connects that claim to prior work on universal intelligence, compositional generalization, graph-based inductive biases, hierarchical planning, and modular prompting, while stressing that the core debate is about whether intelligence needs external structure rather than just more parameters. A listener would find it interesting for its sharp, theory-driven challenge to the dominant scaling narrative and its concrete attempt to formalize when multi-agent systems should have an advantage. Sources: 1. Agentic AI as a Path to AGI https://arxiv.org/pdf/2605.12966 2. HTN Planning: Complexity and Expressivity — Kutluhan Erol, James Hendler, Dana S. Nau, 1994 https://scholar.google.com/scholar?q=HTN+Planning:+Complexity+and+Expressivity 3. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition — Thomas G. Dietterich, 2000 https://scholar.google.com/scholar?q=Hierarchical+Reinforcement+Learning+with+the+MAXQ+Value+Function+Decomposition 4. Decomposed Prompting: A Modular Approach for Solving Complex Tasks — Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, Ashish Sabharwal, 2022 https://scholar.google.com/scholar?q=Decomposed+Prompting:+A+Modular+Approach+for+Solving+Complex+Tasks 5. 
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face — Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang, 2023 https://scholar.google.com/scholar?q=HuggingGPT:+Solving+AI+Tasks+with+ChatGPT+and+its+Friends+in+Hugging+Face 6. Graph of Thoughts: Solving Elaborate Problems with Large Language Models — Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, Torsten Hoefler, 2023 https://scholar.google.com/scholar?q=Graph+of+Thoughts:+Solving+Elaborate+Problems+with+Large+Language+Models 7. TDAG: A Multi-Agent Framework based on Dynamic Task Decomposition and Agent Generation — Yaoxiang Wang, Zhiyong Wu, Junfeng Yao, Jinsong Su, 2024 https://scholar.google.com/scholar?q=TDAG:+A+Multi-Agent+Framework+based+on+Dynamic+Task+Decomposition+and+Agent+Generation 8. DAWN: Distributed LLM Multi-Agent Workflow Synthesis — Guancheng Wan, Mo Zhou, Ziyi Wang, Xiaoran Shang, Eric Hanchen Jiang, Guibin Zhang, Jinhe Bi, Yunpu Ma, Zaixi Zhang, Ke Liang, Wenke Huang, 2026 https://scholar.google.com/scholar?q=DAWN:+Distributed+LLM+Multi-Agent+Workflow+Synthesis 9. From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents — Ling Yue, Kushal Raj Bhandari, Ching-Yun Ko, Dhaval Patel, Shuxin Lin, Nianjun Zhou, Jianxi Gao, Pin-Yu Chen, Shaowu Pan, 2026 https://scholar.google.com/scholar?q=From+Static+Templates+to+Dynamic+Runtime+Graphs:+A+Survey+of+Workflow+Optimization+for+LLM+Agents 10. A Generalist Agent — Scott Reed et al., 2022 https://scholar.google.com/scholar?q=A+Generalist+Agent 11. The Measure of Intelligence — François Chollet, 2019 https://scholar.google.com/scholar?q=The+Measure+of+Intelligence 12. On the Measure of Intelligence — Shane Legg and Marcus Hutter, 2007 https://scholar.google.com/scholar?q=On+the+Measure+of+Intelligence 13. 
Relational Inductive Biases, Deep Learning, and Graph Networks — Peter W. Battaglia et al., 2018 https://scholar.google.com/scholar?q=Relational+Inductive+Biases,+Deep+Learning,+and+Graph+Networks 14. No Free Lunch Theorems for Optimization — David H. Wolpert and William G. Macready, 1997 https://scholar.google.com/scholar?q=No+Free+Lunch+Theorems+for+Optimization 15. Scaling can lead to compositional generalization — Florian Redhardt, Yassir Akram, Simon Schug, 2025 https://scholar.google.com/scholar?q=Scaling+can+lead+to+compositional+generalization 16. Single-agent or Multi-agent Systems? Why Not Both? — Mingyan Gao et al., 2025 https://scholar.google.com/scholar?q=Single-agent+or+Multi-agent+Systems?+Why+Not+Both? 17. When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail — Xiaoxiao Li, 2026 https://scholar.google.com/scholar?q=When+Single-Agent+with+Skills+Replace+Multi-Agent+Systems+and+When+They+Fail 18. Decomposition Dilemmas: Does Claim Decomposition Boost or Burden Fact-Checking Performance? — Qisheng Hu, Quanyu Long, Wenya Wang, 2024/2025 https://scholar.google.com/scholar?q=Decomposition+Dilemmas:+Does+Claim+Decomposition+Boost+or+Burden+Fact-Checking+Performance? 19. Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research — Qianqian Zhang et al., 2025 https://scholar.google.com/scholar?q=Unifying+Language+Agent+Algorithms+with+Graph-based+Orchestration+Engine+for+Reproducible+Agent+Research 20. AI Post Transformers: ASI-Evolve for Data, Architectures, and RL — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-05-asi-evolve-for-data-architectures-and-rl-197b2b.mp3 21. AI Post Transformers: Kimi K2.5 and Visual Agent Swarms — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-24-kimi-k25-and-visual-agent-swarms-7d04d7.mp3 22. AI Post Transformers: AI Co-Mathematician for Mathematical Research — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-14-ai-co-mathematician-for-mathematical-res-4aa2d4.mp3 23. AI Post Transformers: TMAS: Scaling Test-Time Compute with Multi-Agent Synergy — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-14-tmas-scaling-test-time-compute-with-mult-3abe7a.mp3 24. AI Post Transformers: Agentic Discovery for Test-Time Scaling — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-12-agentic-discovery-for-test-time-scaling-f9a81f.mp3 25. AI Post Transformers: AgenticQwen and Small Industrial Tool Agents — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-27-agenticqwen-and-small-industrial-tool-ag-dc676d.mp3 26. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3 27. AI Post Transformers: Neural Computers as Learned Latent Runtimes — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-11-neural-computers-as-learned-latent-runti-9fa282.mp3 Interactive Visualization: Agentic AI as a Path to AGI

  7. MAY 15

    Air Force One, Jensen Huang, and Anthropic's 2028 Memo

    A close reading of Anthropic's policy post "2028: Two Scenarios for Global AI Leadership" as what it actually is: a carefully timed corporate advocacy document, not a peer-reviewed paper. The episode unpacks the post's central "distillation attacks" framing, distinguishes the four very different things that label gets used for, and weighs Anthropic's policy recommendations against the empirical literature on whether unauthorized knowledge distillation is technically deterrable (citing Trace Rewriting, arXiv 2602.15143, and Watermark Robustness Against Distillation, arXiv 2502.11598). It situates the post in the news cycle of President Trump's May 13–15 2026 state visit to Beijing, the inclusion of Nvidia's Jensen Huang in the delegation, and the H200 clearance for roughly ten Chinese firms, a policy direction that diverges from what the post advocates. Mistral AI's Ministral 3 cascade-distillation work serves as the empirical lens for what compact-model distillation actually transfers in practice. The episode acknowledges legitimate underlying concerns about frontier-capability spread while declining to treat the post as research evidence. Sources (selected; the full citation list will be folded into the script): Anthropic — "2028: Two Scenarios for Global AI Leadership" https://www.anthropic.com/research/2028-ai-leadership Anthropic — "Detecting and preventing distillation attacks" (Feb 2026) https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks Nathan Lambert — "The distillation panic" https://www.interconnects.ai/p/the-distillation-panic TIME — "How A.I.
Was the Elephant in the Room at the Trump-Xi Summit" https://time.com/article/2026/05/15/trump-xi-us-china-summit-ai-semiconductor-chips/ Bloomberg — "Nvidia's Huang Joins Trump's China Trip as Last-Minute Addition" https://www.bloomberg.com/news/articles/2026-05-13/nvidia-s-huang-joins-trump-s-china-trip-as-last-minute-addition CNBC — "Trump-Xi summit revives China tech rally hopes as U.S. clears Nvidia H200 sales" https://www.cnbc.com/2026/05/14/trump-xi-meeting-china-stocks-ai-rally.html CFR — "At the Trump-Xi Summit, China Will Have the Upper Hand" https://www.cfr.org/articles/at-the-trump-xi-summit-china-will-have-the-upper-hand CFR — "How Trump Should Approach AI Talks With China" https://www.cfr.org/articles/how-trump-should-approach-ai-talks-with-china-targeted-dialogue-maximum-pressure IAPS — "AI Distillation Attacks: The Case for Targeted Government Intervention" https://www.iaps.ai/research/ai-distillation-attacks Chatham House — "Anthropic's feud with the Pentagon reveals the limits of AI governance" https://www.chathamhouse.org/2026/03/anthropics-feud-pentagon-reveals-limits-ai-governance Small Wars Journal — "Selective Virtue: Anthropic, the Pentagon, and the Contradictions of AI Governance" https://smallwarsjournal.com/2026/04/29/selective-virtue-anthropic-the-pentagon-ai-governance/ arXiv 2602.15143 — "Protecting Language Models Against Unauthorized Distillation through Trace Rewriting" arXiv 2502.11598 — "Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?" 
Ministral 3 — discussed in the AI Post Transformers episode "Ministral 3: Cascade Distillation for Long-Context Multimodal Models" AI Post Transformers — "Dario Amodei: Machines of Loving Grace" https://podcast.do-not-panic.com/episodes/dario-amodei-machines-of-loving-grace/ AI Post Transformers — "Dario Amodei: The Adolescence of Technology" https://podcast.do-not-panic.com/episodes/dario-amodei-the-adolescence-of-technology/ AI Post Transformers — "Trace Rewriting Against Unauthorized LLM Distillation" (covers arXiv 2602.15143 / Xinhang Ma et al. WashU, with the watermark-radioactivity literature as comparison) Interactive Visualization: Air Force One, Jensen Huang, and Anthropic's 2028 Memo

  8. MAY 15

    Causal-JEPA for Object-Level World Models

    This episode explores Causal-JEPA, a world-modeling approach that masks whole object trajectories rather than image patches to force a model to reason about interactions between entities. It explains how the method combines object-centric representations with JEPA-style latent prediction, asking the model to reconstruct hidden objects from scene context and then predict future dynamics, instead of relying on pixel reconstruction or simple autoregressive rollouts. The discussion highlights the paper’s core argument that this training setup makes counterfactual and causal reasoning more necessary by blocking shortcut strategies like temporal interpolation and self-contained single-object motion prediction. Listeners would find it interesting for its sharp comparison between patch-based scaling and object-centric structure, and for its claim that better world models may come from making interaction reasoning unavoidable rather than merely possible. Sources: 1. Causal-JEPA: Learning World Models through Object-Level Latent Interventions — Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, Randall Balestriero, 2026 http://arxiv.org/abs/2602.11389 2. MONet: Unsupervised Scene Decomposition and Representation — Christopher P. Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, Alexander Lerchner, 2019 https://scholar.google.com/scholar?q=MONet:+Unsupervised+Scene+Decomposition+and+Representation 3. Multi-Object Representation Learning with Iterative Variational Inference — Klaus Greff, Raphael Lopez Kaufman, Rishabh Kabra, Nick Watters, Chris Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, Alexander Lerchner, 2019 https://scholar.google.com/scholar?q=Multi-Object+Representation+Learning+with+Iterative+Variational+Inference 4. 
Object-Centric Learning with Slot Attention — Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, Thomas Kipf, 2020 https://scholar.google.com/scholar?q=Object-Centric+Learning+with+Slot+Attention 5. Bridging the Gap to Real-World Object-Centric Learning — Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, Dominik Zietlow, Tianjun Xiao, Carl-Johann Simon-Gabriel, Tong He, Zheng Zhang, Bernhard Schölkopf, Thomas Brox, Francesco Locatello, 2023 https://scholar.google.com/scholar?q=Bridging+the+Gap+to+Real-World+Object-Centric+Learning 6. A Path Towards Autonomous Machine Intelligence — Yann LeCun, 2022 https://scholar.google.com/scholar?q=A+Path+Towards+Autonomous+Machine+Intelligence 7. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture — Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas, 2023 https://scholar.google.com/scholar?q=Self-Supervised+Learning+from+Images+with+a+Joint-Embedding+Predictive+Architecture 8. Revisiting Feature Prediction for Learning Visual Representations from Video — Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas, 2024 https://scholar.google.com/scholar?q=Revisiting+Feature+Prediction+for+Learning+Visual+Representations+from+Video 9. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning — Mido Assran, Adrien Bardes, David Fan and many others including Yann LeCun, Michael Rabbat, Nicolas Ballas, 2025 https://scholar.google.com/scholar?q=V-JEPA+2:+Self-Supervised+Video+Models+Enable+Understanding,+Prediction+and+Planning 10. CLEVRER: CoLlision Events for Video REpresentation and Reasoning — Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, Joshua B. 
Tenenbaum, 2020 https://scholar.google.com/scholar?q=CLEVRER:+CoLlision+Events+for+Video+REpresentation+and+Reasoning 11. Counterfactual VQA: A Cause-Effect Look at Language Bias — Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, Ji-Rong Wen, 2021 https://scholar.google.com/scholar?q=Counterfactual+VQA:+A+Cause-Effect+Look+at+Language+Bias 12. What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models — Letian Zhang, Xiaotong Zhai, Zhongkai Zhao, Yongshuo Zong, Xin Wen, Bingchen Zhao, 2024 https://scholar.google.com/scholar?q=What+If+the+TV+Was+Off?+Examining+Counterfactual+Reasoning+Abilities+of+Multi-modal+Language+Models 13. ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos — Te-Lin Wu, Zi-Yi Dou, Qingyuan Hu, Yu Hou, Nischal Reddy Chandra, Marjorie Freedman, Ralph M. Weischedel, Nanyun Peng, 2023 https://scholar.google.com/scholar?q=ACQUIRED:+A+Dataset+for+Answering+Counterfactual+Questions+In+Real-Life+Videos 14. Towards Causal Representation Learning — Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, Yoshua Bengio, 2021 https://scholar.google.com/scholar?q=Towards+Causal+Representation+Learning 15. Interventional Causal Representation Learning — Kartik Ahuja, Divyat Mahajan, Yixin Wang, Yoshua Bengio, 2023 https://scholar.google.com/scholar?q=Interventional+Causal+Representation+Learning 16. Desiderata for Representation Learning: A Causal Perspective — Yixin Wang, Michael I. Jordan, 2024 https://scholar.google.com/scholar?q=Desiderata+for+Representation+Learning:+A+Causal+Perspective 17. Provably Learning Object-Centric Representations — Stefan Bauer, Bernhard Schölkopf and collaborators, 2023 https://scholar.google.com/scholar?q=Provably+Learning+Object-Centric+Representations 18. 
SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models — Ziyi Wu, Nikita Dvornik, Klaus Greff, Thomas Kipf, Animesh Garg, 2022 https://scholar.google.com/scholar?q=SlotFormer:+Unsupervised+Visual+Dynamics+Simulation+with+Object-Centric+Models 19. Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions — Angel Villar-Corrales, Ismail Wahdan, Sven Behnke, 2023 https://scholar.google.com/scholar?q=Object-Centric+Video+Prediction+via+Decoupling+of+Object+Dynamics+and+Interactions 20. DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning — Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto, 2024 https://scholar.google.com/scholar?q=DINO-WM:+World+Models+on+Pre-trained+Visual+Features+enable+Zero-shot+Planning 21. Conditional Object-Centric Learning from Video — Thomas Kipf, Gamaleldin F. Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy, Klaus Greff, 2022 https://scholar.google.com/scholar?q=Conditional+Object-Centric+Learning+from+Video 22. Attention over Learned Object Embeddings Enables Complex Visual Reasoning — David Ding, Felix Hill, Adam Santoro, Malcolm Reynolds, Matt Botvinick, 2021 https://scholar.google.com/scholar?q=Attention+over+Learned+Object+Embeddings+Enables+Complex+Visual+Reasoning 23. Dyn-O: Building Structured World Models with Object-Centric Representations — Zizhao Wang, Kaixin Wang, Li Zhao, Peter Stone, Jiang Bian, 2025 https://scholar.google.com/scholar?q=Dyn-O:+Building+Structured+World+Models+with+Object-Centric+Representations 24. Learning Interactive World Model for Object-Centric Reinforcement Learning — Fan Feng, Phillip Lippe, Sara Magliacane, 2025 https://scholar.google.com/scholar?q=Learning+Interactive+World+Model+for+Object-Centric+Reinforcement+Learning 25. 
Object-Centric World Model for Language-Guided Manipulation — Youngjoon Jeong, Junha Chun, Soonwoo Cha, Taesup Kim, 2025 https://scholar.google.com/scholar?q=Object-Centric+World+Model+for+Language-Guided+Manipulation 26. Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model — Dongwon Kim, Gawon Seo, Jinsung Lee, Minsu Cho, Suha Kwak, 2026 https://scholar.google.com/scholar?q=Planning+in+8+Tokens:+A+Compact+Discrete+Tokenizer+for+Latent+World+Model 27. Learning nonparametric latent causal graphs with unknown interventions — Yibo Jiang, Bryon Aragam, 2023 https://scholar.google.com/scholar?q=Learning+nonparametric+latent+causal+graphs+with+unknown+interventions 28. Learning Linear Causal Representations from Interventions under General Nonlinear Mixing — Simon Buchholz, Goutham Rajendran, Elan Rosenfeld, Bryon Aragam, Bernhard Schölkopf, Pradeep Ravikumar, 2023 https://scholar.google.com/scholar?q=Learning+Linear+Causal+Representations+from+Interventions+under+General+Nonlinear+Mixing 29. AI Post Transformers: LeWorldModel: Stable Joint-Embedding World Models from Pixels — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-leworldmodel-stable-joint-embedding-worl-650f9f.mp3 30. AI Post Transformers: Learning Latent Action World Models from Video — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-learning-latent-action-world-models-from-1570a4.mp3 31. AI Post Transformers: DreamerV3 World Models Across 150 Tasks — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-20-dreamerv3-world-models-across-150-tasks-af5edb.mp3 Interactive Visualization: Causal-JEPA for Object-Level World Models
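    The masking idea in the description above (hide whole object trajectories instead of isolated patches) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the scene is already encoded as a grid of `num_steps` x `num_objects` slots, and every function name here is hypothetical.

```python
import random


def object_trajectory_mask(num_steps, num_objects, num_masked, rng):
    """Hide whole trajectories: a chosen object is masked at every timestep."""
    hidden = set(rng.sample(range(num_objects), num_masked))
    return [[obj in hidden for obj in range(num_objects)] for _ in range(num_steps)]


def independent_cell_mask(num_steps, num_objects, mask_ratio, rng):
    """Contrast case: mask each (timestep, object) cell independently."""
    return [[rng.random() < mask_ratio for _ in range(num_objects)]
            for _ in range(num_steps)]


def fully_hidden_objects(mask):
    """Objects with no visible timestep, so interpolating their own past is impossible."""
    num_objects = len(mask[0])
    return [obj for obj in range(num_objects) if all(row[obj] for row in mask)]


rng = random.Random(0)
traj_mask = object_trajectory_mask(num_steps=6, num_objects=4, num_masked=1, rng=rng)
cell_mask = independent_cell_mask(num_steps=6, num_objects=4, mask_ratio=0.25, rng=rng)

print("objects hidden for the whole clip:", fully_hidden_objects(traj_mask))
print("cells hidden by independent masking:", sum(sum(row) for row in cell_mask))
```

    With trajectory masking, the hidden object has no visible frames of its own to interpolate between, so any reconstruction has to lean on the other objects in the scene. Independent cell masking usually leaves some of an object's own timesteps visible, which is the kind of shortcut the description says the method is designed to block.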
