AI Post Transformers

mcgrof

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

  1. 2 DAYS AGO

    Agentic AI as a Path to AGI

    This episode explores a position paper arguing that agentic AI systems, built from task decomposition, routing, specialized components, and explicit graph-like workflows, may offer a more credible path to AGI than simply scaling a single monolithic model. It examines how the paper frames AGI through both broad competence across environments and efficient skill acquisition, then asks whether real-world tasks are structured enough for modular systems to outperform one-model-fits-all approaches. The discussion connects that claim to prior work on universal intelligence, compositional generalization, graph-based inductive biases, hierarchical planning, and modular prompting, while stressing that the core debate is about whether intelligence needs external structure rather than just more parameters. A listener would find it interesting for its sharp, theory-driven challenge to the dominant scaling narrative and its concrete attempt to formalize when multi-agent systems should have an advantage. Sources: 1. Agentic AI as a Path to AGI https://arxiv.org/pdf/2605.12966 2. HTN Planning: Complexity and Expressivity — Kutluhan Erol, James Hendler, Dana S. Nau, 1994 https://scholar.google.com/scholar?q=HTN+Planning:+Complexity+and+Expressivity 3. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition — Thomas G. Dietterich, 2000 https://scholar.google.com/scholar?q=Hierarchical+Reinforcement+Learning+with+the+MAXQ+Value+Function+Decomposition 4. Decomposed Prompting: A Modular Approach for Solving Complex Tasks — Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, Ashish Sabharwal, 2022 https://scholar.google.com/scholar?q=Decomposed+Prompting:+A+Modular+Approach+for+Solving+Complex+Tasks 5. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face — Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, Yueting Zhuang, 2023 https://scholar.google.com/scholar?q=HuggingGPT:+Solving+AI+Tasks+with+ChatGPT+and+its+Friends+in+Hugging+Face 6. Graph of Thoughts: Solving Elaborate Problems with Large Language Models — Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, Torsten Hoefler, 2023 https://scholar.google.com/scholar?q=Graph+of+Thoughts:+Solving+Elaborate+Problems+with+Large+Language+Models 7. TDAG: A Multi-Agent Framework based on Dynamic Task Decomposition and Agent Generation — Yaoxiang Wang, Zhiyong Wu, Junfeng Yao, Jinsong Su, 2024 https://scholar.google.com/scholar?q=TDAG:+A+Multi-Agent+Framework+based+on+Dynamic+Task+Decomposition+and+Agent+Generation 8. DAWN: Distributed LLM Multi-Agent Workflow Synthesis — Guancheng Wan, Mo Zhou, Ziyi Wang, Xiaoran Shang, Eric Hanchen Jiang, Guibin Zhang, Jinhe Bi, Yunpu Ma, Zaixi Zhang, Ke Liang, Wenke Huang, 2026 https://scholar.google.com/scholar?q=DAWN:+Distributed+LLM+Multi-Agent+Workflow+Synthesis 9. From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents — Ling Yue, Kushal Raj Bhandari, Ching-Yun Ko, Dhaval Patel, Shuxin Lin, Nianjun Zhou, Jianxi Gao, Pin-Yu Chen, Shaowu Pan, 2026 https://scholar.google.com/scholar?q=From+Static+Templates+to+Dynamic+Runtime+Graphs:+A+Survey+of+Workflow+Optimization+for+LLM+Agents 10. A Generalist Agent — Scott Reed et al., 2022 https://scholar.google.com/scholar?q=A+Generalist+Agent 11. 
On the Measure of Intelligence — François Chollet, 2019 https://scholar.google.com/scholar?q=On+the+Measure+of+Intelligence 12. Universal Intelligence: A Definition of Machine Intelligence — Shane Legg and Marcus Hutter, 2007 https://scholar.google.com/scholar?q=Universal+Intelligence:+A+Definition+of+Machine+Intelligence 13. Relational Inductive Biases, Deep Learning, and Graph Networks — Peter W. Battaglia et al., 2018 https://scholar.google.com/scholar?q=Relational+Inductive+Biases,+Deep+Learning,+and+Graph+Networks 14. No Free Lunch Theorems for Optimization — David H. Wolpert and William G. Macready, 1997 https://scholar.google.com/scholar?q=No+Free+Lunch+Theorems+for+Optimization 15. Scaling can lead to compositional generalization — Florian Redhardt, Yassir Akram, Simon Schug, 2025 https://scholar.google.com/scholar?q=Scaling+can+lead+to+compositional+generalization 16. Single-agent or Multi-agent Systems? Why Not Both? — Mingyan Gao et al., 2025 https://scholar.google.com/scholar?q=Single-agent+or+Multi-agent+Systems?+Why+Not+Both? 17. When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail — Xiaoxiao Li, 2026 https://scholar.google.com/scholar?q=When+Single-Agent+with+Skills+Replace+Multi-Agent+Systems+and+When+They+Fail 18. Decomposition Dilemmas: Does Claim Decomposition Boost or Burden Fact-Checking Performance? — Qisheng Hu, Quanyu Long, Wenya Wang, 2024/2025 https://scholar.google.com/scholar?q=Decomposition+Dilemmas:+Does+Claim+Decomposition+Boost+or+Burden+Fact-Checking+Performance? 19. Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research — Qianqian Zhang et al., 2025 https://scholar.google.com/scholar?q=Unifying+Language+Agent+Algorithms+with+Graph-based+Orchestration+Engine+for+Reproducible+Agent+Research 20. AI Post Transformers: ASI-Evolve for Data, Architectures, and RL — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-05-asi-evolve-for-data-architectures-and-rl-197b2b.mp3 21. AI Post Transformers: Kimi K2.5 and Visual Agent Swarms — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-24-kimi-k25-and-visual-agent-swarms-7d04d7.mp3 22. AI Post Transformers: AI Co-Mathematician for Mathematical Research — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-14-ai-co-mathematician-for-mathematical-res-4aa2d4.mp3 23. AI Post Transformers: TMAS: Scaling Test-Time Compute with Multi-Agent Synergy — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-14-tmas-scaling-test-time-compute-with-mult-3abe7a.mp3 24. AI Post Transformers: Agentic Discovery for Test-Time Scaling — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-12-agentic-discovery-for-test-time-scaling-f9a81f.mp3 25. AI Post Transformers: AgenticQwen and Small Industrial Tool Agents — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-27-agenticqwen-and-small-industrial-tool-ag-dc676d.mp3 26. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3 27. AI Post Transformers: Neural Computers as Learned Latent Runtimes — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-11-neural-computers-as-learned-latent-runti-9fa282.mp3 Interactive Visualization: Agentic AI as a Path to AGI
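
    To make the decomposition-and-routing pattern concrete, here is a minimal Python sketch (illustrative only, not code from the paper; the handler names and the routing rule are invented): a task is decomposed into subtasks, each subtask is routed to a specialized component, and the results are aggregated along an explicit workflow.

        from typing import Callable, Dict, List

        def summarizer(text: str) -> str:
            # Specialized component: produce a short summary (stubbed out here).
            return "summary: " + text[:40]

        def calculator(text: str) -> float:
            # Specialized component: add up any plain numbers found in the text.
            return sum(float(t) for t in text.split() if t.replace(".", "", 1).isdigit())

        HANDLERS: Dict[str, Callable[[str], object]] = {
            "summarize": summarizer,
            "compute": calculator,
        }

        def decompose(task: str) -> List[str]:
            # Toy decomposition: add a compute step only when the task contains digits.
            steps = ["summarize"]
            if any(ch.isdigit() for ch in task):
                steps.append("compute")
            return steps

        def run_workflow(task: str) -> Dict[str, object]:
            # Explicit workflow: decompose, route each step to its handler, aggregate.
            return {step: HANDLERS[step](task) for step in decompose(task)}

        print(run_workflow("Summarize the report: revenue grew 12.5 percent to 3.4 billion"))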

  2. 2 DAYS AGO

    Air Force One, Jensen Huang, and Anthropic's 2028 Memo

A close reading of Anthropic's policy post "2028: Two Scenarios for Global AI Leadership" as what it actually is — a carefully timed corporate advocacy document, not a peer-reviewed paper. The episode unpacks the post's central "distillation attacks" framing, distinguishes the four very different things that label gets used for, and weighs Anthropic's policy recommendations against the empirical literature on whether unauthorized knowledge distillation is technically deterrable (citing Trace Rewriting, arXiv 2602.15143, and Watermark Robustness Against Distillation, arXiv 2502.11598). It situates the post in the news cycle of President Trump's May 13–15, 2026 state visit to Beijing, the inclusion of Nvidia's Jensen Huang in the delegation, and the H200 clearance for roughly ten Chinese firms — a policy direction that diverges from what the post advocates. Mistral AI's Ministral 3 cascade-distillation work serves as the empirical lens for what compact-model distillation actually transfers in practice. The episode acknowledges legitimate underlying concerns about frontier-capability spread while declining to treat the post as research evidence. Sources (selected; the full citation list will be folded into the script): Anthropic — "2028: Two Scenarios for Global AI Leadership" https://www.anthropic.com/research/2028-ai-leadership Anthropic — "Detecting and preventing distillation attacks" (Feb 2026) https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks Nathan Lambert — "The distillation panic" https://www.interconnects.ai/p/the-distillation-panic TIME — "How A.I. Was the Elephant in the Room at the Trump-Xi Summit" https://time.com/article/2026/05/15/trump-xi-us-china-summit-ai-semiconductor-chips/ Bloomberg — "Nvidia's Huang Joins Trump's China Trip as Last-Minute Addition" https://www.bloomberg.com/news/articles/2026-05-13/nvidia-s-huang-joins-trump-s-china-trip-as-last-minute-addition CNBC — "Trump-Xi summit revives China tech rally hopes as U.S. clears Nvidia H200 sales" https://www.cnbc.com/2026/05/14/trump-xi-meeting-china-stocks-ai-rally.html CFR — "At the Trump-Xi Summit, China Will Have the Upper Hand" https://www.cfr.org/articles/at-the-trump-xi-summit-china-will-have-the-upper-hand CFR — "How Trump Should Approach AI Talks With China" https://www.cfr.org/articles/how-trump-should-approach-ai-talks-with-china-targeted-dialogue-maximum-pressure IAPS — "AI Distillation Attacks: The Case for Targeted Government Intervention" https://www.iaps.ai/research/ai-distillation-attacks Chatham House — "Anthropic's feud with the Pentagon reveals the limits of AI governance" https://www.chathamhouse.org/2026/03/anthropics-feud-pentagon-reveals-limits-ai-governance Small Wars Journal — "Selective Virtue: Anthropic, the Pentagon, and the Contradictions of AI Governance" https://smallwarsjournal.com/2026/04/29/selective-virtue-anthropic-the-pentagon-ai-governance/ arXiv 2602.15143 — "Protecting Language Models Against Unauthorized Distillation through Trace Rewriting" arXiv 2502.11598 — "Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?" 
Ministral 3 — discussed in the AI Post Transformers episode "Ministral 3: Cascade Distillation for Long-Context Multimodal Models" AI Post Transformers — "Dario Amodei: Machines of Loving Grace" https://podcast.do-not-panic.com/episodes/dario-amodei-machines-of-loving-grace/ AI Post Transformers — "Dario Amodei: The Adolescence of Technology" https://podcast.do-not-panic.com/episodes/dario-amodei-the-adolescence-of-technology/ AI Post Transformers — "Trace Rewriting Against Unauthorized LLM Distillation" (covers arXiv 2602.15143 / Xinhang Ma et al. WashU, with the watermark-radioactivity literature as comparison) Interactive Visualization: Air Force One, Jensen Huang, and Anthropic's 2028 Memo

  3. 2 DAYS AGO

    Causal-JEPA for Object-Level World Models

    This episode explores Causal-JEPA, a world-modeling approach that masks whole object trajectories rather than image patches to force a model to reason about interactions between entities. It explains how the method combines object-centric representations with JEPA-style latent prediction, asking the model to reconstruct hidden objects from scene context and then predict future dynamics, instead of relying on pixel reconstruction or simple autoregressive rollouts. The discussion highlights the paper’s core argument that this training setup makes counterfactual and causal reasoning more necessary by blocking shortcut strategies like temporal interpolation and self-contained single-object motion prediction. Listeners would find it interesting for its sharp comparison between patch-based scaling and object-centric structure, and for its claim that better world models may come from making interaction reasoning unavoidable rather than merely possible. Sources: 1. Causal-JEPA: Learning World Models through Object-Level Latent Interventions — Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, Randall Balestriero, 2026 http://arxiv.org/abs/2602.11389 2. MONet: Unsupervised Scene Decomposition and Representation — Christopher P. Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, Alexander Lerchner, 2019 https://scholar.google.com/scholar?q=MONet:+Unsupervised+Scene+Decomposition+and+Representation 3. Multi-Object Representation Learning with Iterative Variational Inference — Klaus Greff, Raphael Lopez Kaufman, Rishabh Kabra, Nick Watters, Chris Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, Alexander Lerchner, 2019 https://scholar.google.com/scholar?q=Multi-Object+Representation+Learning+with+Iterative+Variational+Inference 4. Object-Centric Learning with Slot Attention — Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, Thomas Kipf, 2020 https://scholar.google.com/scholar?q=Object-Centric+Learning+with+Slot+Attention 5. Bridging the Gap to Real-World Object-Centric Learning — Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, Dominik Zietlow, Tianjun Xiao, Carl-Johann Simon-Gabriel, Tong He, Zheng Zhang, Bernhard Schölkopf, Thomas Brox, Francesco Locatello, 2023 https://scholar.google.com/scholar?q=Bridging+the+Gap+to+Real-World+Object-Centric+Learning 6. A Path Towards Autonomous Machine Intelligence — Yann LeCun, 2022 https://scholar.google.com/scholar?q=A+Path+Towards+Autonomous+Machine+Intelligence 7. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture — Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas, 2023 https://scholar.google.com/scholar?q=Self-Supervised+Learning+from+Images+with+a+Joint-Embedding+Predictive+Architecture 8. Revisiting Feature Prediction for Learning Visual Representations from Video — Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas, 2024 https://scholar.google.com/scholar?q=Revisiting+Feature+Prediction+for+Learning+Visual+Representations+from+Video 9. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning — Mido Assran, Adrien Bardes, David Fan and many others including Yann LeCun, Michael Rabbat, Nicolas Ballas, 2025 https://scholar.google.com/scholar?q=V-JEPA+2:+Self-Supervised+Video+Models+Enable+Understanding,+Prediction+and+Planning 10. 
CLEVRER: CoLlision Events for Video REpresentation and Reasoning — Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, Joshua B. Tenenbaum, 2020 https://scholar.google.com/scholar?q=CLEVRER:+CoLlision+Events+for+Video+REpresentation+and+Reasoning 11. Counterfactual VQA: A Cause-Effect Look at Language Bias — Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, Ji-Rong Wen, 2021 https://scholar.google.com/scholar?q=Counterfactual+VQA:+A+Cause-Effect+Look+at+Language+Bias 12. What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models — Letian Zhang, Xiaotong Zhai, Zhongkai Zhao, Yongshuo Zong, Xin Wen, Bingchen Zhao, 2024 https://scholar.google.com/scholar?q=What+If+the+TV+Was+Off?+Examining+Counterfactual+Reasoning+Abilities+of+Multi-modal+Language+Models 13. ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos — Te-Lin Wu, Zi-Yi Dou, Qingyuan Hu, Yu Hou, Nischal Reddy Chandra, Marjorie Freedman, Ralph M. Weischedel, Nanyun Peng, 2023 https://scholar.google.com/scholar?q=ACQUIRED:+A+Dataset+for+Answering+Counterfactual+Questions+In+Real-Life+Videos 14. Towards Causal Representation Learning — Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, Yoshua Bengio, 2021 https://scholar.google.com/scholar?q=Towards+Causal+Representation+Learning 15. Interventional Causal Representation Learning — Kartik Ahuja, Divyat Mahajan, Yixin Wang, Yoshua Bengio, 2023 https://scholar.google.com/scholar?q=Interventional+Causal+Representation+Learning 16. Desiderata for Representation Learning: A Causal Perspective — Yixin Wang, Michael I. Jordan, 2024 https://scholar.google.com/scholar?q=Desiderata+for+Representation+Learning:+A+Causal+Perspective 17. Provably Learning Object-Centric Representations — Stefan Bauer, Bernhard Schölkopf and collaborators, 2023 https://scholar.google.com/scholar?q=Provably+Learning+Object-Centric+Representations 18. SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models — Yuhang Wu, Yueting Zhuang, Francesco Locatello, et al., 2022 https://scholar.google.com/scholar?q=SlotFormer:+Unsupervised+Visual+Dynamics+Simulation+with+Object-Centric+Models 19. Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions — Angel Villar-Corrales, Ismail Wahdan, Sven Behnke, 2023 https://scholar.google.com/scholar?q=Object-Centric+Video+Prediction+via+Decoupling+of+Object+Dynamics+and+Interactions 20. DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning — Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto, 2024 https://scholar.google.com/scholar?q=DINO-WM:+World+Models+on+Pre-trained+Visual+Features+enable+Zero-shot+Planning 21. Conditional Object-Centric Learning from Video — Thomas Kipf, Gamaleldin F. Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy, Klaus Greff, 2022 https://scholar.google.com/scholar?q=Conditional+Object-Centric+Learning+from+Video 22. Attention over Learned Object Embeddings Enables Complex Visual Reasoning — David Ding, Felix Hill, Adam Santoro, Malcolm Reynolds, Matt Botvinick, 2021 https://scholar.google.com/scholar?q=Attention+over+Learned+Object+Embeddings+Enables+Complex+Visual+Reasoning 23. 
Dyn-O: Building Structured World Models with Object-Centric Representations — Zizhao Wang, Kaixin Wang, Li Zhao, Peter Stone, Jiang Bian, 2025 https://scholar.google.com/scholar?q=Dyn-O:+Building+Structured+World+Models+with+Object-Centric+Representations 24. Learning Interactive World Model for Object-Centric Reinforcement Learning — Fan Feng, Phillip Lippe, Sara Magliacane, 2025 https://scholar.google.com/scholar?q=Learning+Interactive+World+Model+for+Object-Centric+Reinforcement+Learning 25. Object-Centric World Model for Language-Guided Manipulation — Youngjoon Jeong, Junha Chun, Soonwoo Cha, Taesup Kim, 2025 https://scholar.google.com/scholar?q=Object-Centric+World+Model+for+Language-Guided+Manipulation 26. Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model — Dongwon Kim, Gawon Seo, Jinsung Lee, Minsu Cho, Suha Kwak, 2026 https://scholar.google.com/scholar?q=Planning+in+8+Tokens:+A+Compact+Discrete+Tokenizer+for+Latent+World+Model 27. Learning nonparametric latent causal graphs with unknown interventions — Yibo Jiang, Bryon Aragam, 2023 https://scholar.google.com/scholar?q=Learning+nonparametric+latent+causal+graphs+with+unknown+interventions 28. Learning Linear Causal Representations from Interventions under General Nonlinear Mixing — Simon Buchholz, Goutham Rajendran, Elan Rosenfeld, Bryon Aragam, Bernhard Schölkopf, Pradeep Ravikumar, 2023 https://scholar.google.com/scholar?q=Learning+Linear+Causal+Representations+from+Interventions+under+General+Nonlinear+Mixing 29. AI Post Transformers: LeWorldModel: Stable Joint-Embedding World Models from Pixels — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-leworldmodel-stable-joint-embedding-worl-650f9f.mp3 30. AI Post Transformers: Learning Latent Action World Models from Video — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-learning-latent-action-world-models-from-1570a4.mp3 31. AI Post Transformers: DreamerV3 World Models Across 150 Tasks — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-20-dreamerv3-world-models-across-150-tasks-af5edb.mp3 Interactive Visualization: Causal-JEPA for Object-Level World Models
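
    As a rough illustration of the training setup described above (a hypothetical sketch, not the authors' code; the shapes and the predictor are invented), the snippet below hides one object's slot features across every time step and asks a small predictor to reconstruct them in latent space from the remaining objects, so single-object motion shortcuts cannot solve the task.

        import torch
        import torch.nn as nn

        B, T, N, D = 2, 8, 4, 32           # batch, time steps, object slots, latent dim
        slots = torch.randn(B, T, N, D)    # stand-in for an object-centric encoder's output

        masked = 1                                              # hide this object's whole trajectory
        context = torch.cat([slots[:, :, :masked], slots[:, :, masked + 1:]], dim=2)
        target = slots[:, :, masked]                            # (B, T, D) latents to recover

        predictor = nn.Sequential(nn.Linear((N - 1) * D, 128), nn.GELU(), nn.Linear(128, D))
        pred = predictor(context.reshape(B, T, (N - 1) * D))    # per-step prediction from other objects

        loss = nn.functional.mse_loss(pred, target)             # latent-space loss, no pixel decoding
        loss.backward()
        print(float(loss))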

  4. 2 DAYS AGO

    Deep Kernel Fusion for Transformer Decoding

    This episode explores a systems paper on speeding up Transformer decoding by tightly fusing the SwiGLU MLP path, rather than focusing only on attention or long-context tricks. It explains why long output generation becomes memory-bandwidth bound, clarifying concepts like kernel fusion, HBM traffic, prefill versus autoregressive decode, and why repeated token-by-token inference exposes the MLP as a real bottleneck. The discussion walks through the paper’s main design choice: a disciplined fusion of the up-projection, gate projection, SiLU activation, and elementwise multiply into a single decode-stage kernel, while leaving the down projection separate to avoid worse scheduling and register-pressure tradeoffs. It also highlights the paper’s practical argument for profiler-driven runtime scheduling across row-major and column-major kernel variants, making the result interesting to listeners who care about how large-model serving performance is won through careful hardware-aware engineering rather than headline-grabbing algorithm changes. Sources: 1. Deep Kernel Fusion for Transformer Decoding https://arxiv.org/pdf/2602.11808 2. GLU Variants Improve Transformer — Noam Shazeer, 2020 https://scholar.google.com/scholar?q=GLU+Variants+Improve+Transformer 3. PaLM: Scaling Language Modeling with Pathways — Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Noam Shazeer and many others, 2022 https://scholar.google.com/scholar?q=PaLM:+Scaling+Language+Modeling+with+Pathways 4. DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale — Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li and others, 2022 https://scholar.google.com/scholar?q=DeepSpeed+Inference:+Enabling+Efficient+Inference+of+Transformer+Models+at+Unprecedented+Scale 5. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Tri Dao, 2023 https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning 6. SGLang: Efficient Execution of Structured Language Model Programs — Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng, 2024 https://scholar.google.com/scholar?q=SGLang:+Efficient+Execution+of+Structured+Language+Model+Programs 7. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica, 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 8. Welder: Scheduling Deep Learning Memory Access via Tile-Graph — Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou, 2023 https://scholar.google.com/scholar?q=Welder:+Scheduling+Deep+Learning+Memory+Access+via+Tile-Graph 9. FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving — Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze, 2025 https://scholar.google.com/scholar?q=FlashInfer:+Efficient+and+Customizable+Attention+Engine+for+LLM+Inference+Serving 10. Masked Gated Linear Unit — unknown from snippet, likely 2024 or 2025 https://scholar.google.com/scholar?q=Masked+Gated+Linear+Unit 11. 
SCBench: A KV Cache-Centric Analysis of Long-Context Methods — unknown from snippet, likely 2025 https://scholar.google.com/scholar?q=SCBench:+A+KV+Cache-Centric+Analysis+of+Long-Context+Methods 12. Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks — unknown from snippet, likely 2025 https://scholar.google.com/scholar?q=Model+Tells+You+Where+to+Merge:+Adaptive+KV+Cache+Merging+for+LLMs+on+Long-Context+Tasks 13. MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference — unknown from snippet, likely 2025 https://scholar.google.com/scholar?q=MEDA:+Dynamic+KV+Cache+Allocation+for+Efficient+Multimodal+Long-Context+Inference 14. Efficient LLM Inference Using Dynamic Input Pruning and Cache-Aware Masking — unknown from snippet, likely 2025 https://scholar.google.com/scholar?q=Efficient+LLM+Inference+Using+Dynamic+Input+Pruning+and+Cache-Aware+Masking 15. Enhancing Transformer Performance and Portability Through Auto-Tuning Frameworks — P. Siwinska et al. (approx.), unknown, likely recent https://scholar.google.com/scholar?q=Enhancing+Transformer+Performance+and+Portability+Through+Auto-Tuning+Frameworks 16. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3 17. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3 18. AI Post Transformers: PackKV Lossy Compression for KV Caches — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-04-packkv-lossy-compression-for-kv-caches-b37bce.mp3 19. AI Post Transformers: SGLang for Faster Structured LLM Programs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-06-sglang-for-faster-structured-llm-program-c59f1c.mp3 20. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3 21. AI Post Transformers: LAPS for Length-Aware LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3 22. AI Post Transformers: CacheFlow and 3D-Parallel KV Cache Restoration — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-01-cacheflow-and-3d-parallel-kv-cache-resto-8db883.mp3 23. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3 Interactive Visualization: Deep Kernel Fusion for Transformer Decoding
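
    For readers who want the math being fused spelled out, here is a small PyTorch sketch of the decode-stage SwiGLU path (dimensions shrunk and purely illustrative; the actual contribution is a hand-written GPU kernel, which Python cannot show): the gate projection, up projection, SiLU, and elementwise multiply form one region, while the down projection stays separate, mirroring the split described above.

        import torch
        import torch.nn.functional as F

        d_model, d_ff = 1024, 2816          # shrunk illustrative sizes
        x = torch.randn(1, d_model)         # a single decode-step token activation
        W_gate = torch.randn(d_model, d_ff)
        W_up = torch.randn(d_model, d_ff)
        W_down = torch.randn(d_ff, d_model)

        # Unfused reference: gate, up, multiply, and down run as separate kernels,
        # with the (1 x d_ff) intermediate bouncing through memory between them.
        y_ref = (F.silu(x @ W_gate) * (x @ W_up)) @ W_down

        def gate_up_silu_mul(x, W_gate, W_up):
            # The region fused into one decode kernel; in a real kernel the token
            # activation and the intermediate stay in on-chip memory throughout.
            return F.silu(x @ W_gate) * (x @ W_up)

        y = gate_up_silu_mul(x, W_gate, W_up) @ W_down   # down projection left separate
        print(torch.allclose(y, y_ref))

    Keeping the down projection out of the fused region is the design choice the episode highlights: folding it in would worsen scheduling and register pressure for little bandwidth gain.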

  5. 2 DAYS AGO

    FlashFuser and Hopper-Era FFN Kernel Fusion

    This episode explores how the FlashFuser paper uses Hopper GPU inter-core communication to push kernel fusion beyond the usual single-SM memory limits, especially for transformer feed-forward networks and gated FFNs. It explains why this matters now: H100-class GPUs have gained compute far faster than memory bandwidth, making activation spills to HBM an increasingly painful bottleneck for workloads that can consume 40 to 60 percent of inference time. The discussion walks through Hopper’s distributed shared memory model and FlashFuser’s core idea of coordinating reduce, shuffle, and multiply patterns across SM clusters so large intermediate activations can stay on chip longer. Listeners would find it interesting because it connects compiler techniques, GPU architecture, and real transformer inference bottlenecks into a concrete argument about when newer hardware may finally make more aggressive fusion worthwhile. Sources: 1. FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection — Ziyu Huang, Yangjie Zhou, Zihan Liu, Xinhao Luo, Yijia Diao, Minyi Guo, Jidong Zhai, Yu Feng, Chen Zhang, Anbang Wu, Jingwen Leng, 2025 http://arxiv.org/abs/2512.12949 2. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning — Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy, 2018 https://scholar.google.com/scholar?q=TVM:+An+Automated+End-to-End+Optimizing+Compiler+for+Deep+Learning 3. FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads — Zhen Zheng, Pengzhan Zhao, Guoping Long, Feiwen Zhu, Kai Zhu, Wenyi Zhao, Lansong Diao, Jun Yang, Wei Lin, 2020 https://scholar.google.com/scholar?q=FusionStitching:+Boosting+Memory+Intensive+Computations+for+Deep+Learning+Workloads 4. Operator Fusion in XLA: Analysis and Evaluation — Daniel Snider, Ruofan Liang, 2023 https://scholar.google.com/scholar?q=Operator+Fusion+in+XLA:+Analysis+and+Evaluation 5. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Re, 2022 https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness 6. Benchmarking and Dissecting the Nvidia Hopper GPU Architecture — Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, Xiaowen Chu, 2024 https://scholar.google.com/scholar?q=Benchmarking+and+Dissecting+the+Nvidia+Hopper+GPU+Architecture 7. A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library — Ganesh Bikshandi, Jay Shah, 2023 https://scholar.google.com/scholar?q=A+Case+Study+in+CUDA+Kernel+Fusion:+Implementing+FlashAttention-2+on+NVIDIA+Hopper+Architecture+using+the+CUTLASS+Library 8. Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10 — Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang, 2024 https://scholar.google.com/scholar?q=Scaling+Deep+Learning+Computation+over+the+Inter-Core+Connected+Intelligence+Processor+with+T10 9. 
FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection — Ziyu Huang, Yangjie Zhou, Zihan Liu, Xinhao Luo, Yijia Diao, Minyi Guo, Jidong Zhai, Yu Feng, Chen Zhang, Anbang Wu, Jingwen Leng, 2025 https://scholar.google.com/scholar?q=FlashFuser:+Expanding+the+Scale+of+Kernel+Fusion+for+Compute-Intensive+Operators+via+Inter-Core+Connection 10. Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion — Size Zheng, Siyuan Chen, Peidi Song, Renze Chen, Xiuhong Li, Shengen Yan, Dahua Lin, Jingwen Leng, Yun Liang, 2023 https://scholar.google.com/scholar?q=Chimera:+An+Analytical+Optimizing+Framework+for+Effective+Compute-intensive+Operators+Fusion 11. BOLT: Bridging the Gap between Auto-tuners and Hardware-native Performance — Jiarong Xing, Leyuan Wang, Shang Zhang, Jack Chen, Ang Chen, Yibo Zhu, 2022 https://scholar.google.com/scholar?q=BOLT:+Bridging+the+Gap+between+Auto-tuners+and+Hardware-native+Performance 12. MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators — Zheng Zhang, Donglin Yang, Xiaobo Zhou, Dazhao Cheng, 2024 https://scholar.google.com/scholar?q=MCFuser:+High-Performance+and+Rapid+Fusion+of+Memory-Bound+Compute-Intensive+Operators 13. Deep Kernel Fusion for Transformers — Zixi Zhang, Zhiwen Mo, Yiren Zhao, Robert Mullins, 2026 https://scholar.google.com/scholar?q=Deep+Kernel+Fusion+for+Transformers 14. Benchmarking thread block cluster — approximate; unclear from snippet, 2023-2026 https://scholar.google.com/scholar?q=Benchmarking+thread+block+cluster 15. ClusterSim: modeling thread block clusters in Hopper GPUs — approximate; unclear from snippet, 2023-2026 https://scholar.google.com/scholar?q=ClusterSim:+modeling+thread+block+clusters+in+Hopper+GPUs 16. Analysing and Reducing Costs of Deep Learning Compiler Auto-tuning — approximate; unclear from snippet, 2023-2026 https://scholar.google.com/scholar?q=Analysing+and+Reducing+Costs+of+Deep+Learning+Compiler+Auto-tuning 17. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3 18. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3 19. AI Post Transformers: LAPS for Length-Aware LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3 20. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3 21. AI Post Transformers: Caffeine: A Unified FPGA for CNNs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-06-caffeine-a-unified-fpga-for-cnns-e8acbe.mp3 22. AI Post Transformers: Gated Linear Attention for Efficient Long Sequences — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-18-gated-linear-attention-for-efficient-lon-c858ab.mp3 Interactive Visualization: FlashFuser and Hopper-Era FFN Kernel Fusion
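
    A back-of-envelope sketch of why on-chip intermediates matter (the traffic model and all numbers are invented for illustration, not taken from the paper): it compares the HBM bytes a gated FFN moves when the gate-times-up intermediate spills to memory versus when it stays on chip across an SM cluster.

        def ffn_hbm_bytes(tokens, d_model, d_ff, bytes_per_elem=2, keep_on_chip=False):
            # Toy traffic model: activations read/written once, weights read once,
            # and the gate*up intermediate written then re-read unless kept on chip.
            act_in = tokens * d_model * bytes_per_elem
            act_out = tokens * d_model * bytes_per_elem
            weights = 3 * d_model * d_ff * bytes_per_elem      # gate, up, down matrices
            intermediate = tokens * d_ff * bytes_per_elem
            spilled = 0 if keep_on_chip else 2 * intermediate  # write + re-read from HBM
            return act_in + act_out + weights + spilled

        cfg = dict(tokens=8192, d_model=4096, d_ff=14336)
        print(round(ffn_hbm_bytes(**cfg) / 1e9, 2), "GB with the intermediate spilled")
        print(round(ffn_hbm_bytes(**cfg, keep_on_chip=True) / 1e9, 2), "GB kept on chip")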

  6. 2 DAYS AGO

    JANUS for Scalable MoE Inference

    This episode explores JANUS, a systems approach to serving mixture-of-experts transformers efficiently by separating attention layers from expert layers instead of deploying the whole model as a single monolithic unit. It explains why MoE models can still be expensive and latency-prone in practice: even if only a few experts activate per token, the system must still manage large expert memory footprints, skewed expert demand, and strict token-level latency targets such as time per output token. The discussion focuses on JANUS’s core ideas, including separate GPU pools for attention and expert computation, an adaptive two-phase communication scheme that reduces cross-node messaging overhead, and SLO-aware scaling that adjusts attention and expert capacity independently. Listeners would find it interesting because it turns MoE inference from a simple “sparse compute saves money” story into a deeper argument about distributed systems design, load balancing, and the real bottlenecks that determine whether advanced models feel fast in production. Sources: 1. Janus: Disaggregating Attention and Experts for Scalable MoE Inference — Zhexiang Zhang, Ye Wang, Yumiao Zhao, Jiayu Xiao, Qianjing Yang, Xiangyu Wang, Jingzhe Jiang, Qizhen Weng, Ruichuan Chen, Shaohuai Shi, Adel N. Toosi, Yin Chen, Minchen Yu, 2025 http://arxiv.org/abs/2512.13525 2. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — William Fedus, Barret Zoph, Noam Shazeer, 2021 https://scholar.google.com/scholar?q=Switch+Transformers:+Scaling+to+Trillion+Parameter+Models+with+Simple+and+Efficient+Sparsity 3. FastMoE: A Fast Mixture-of-Expert Training System — Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Jidong Zhai, Jie Tang, 2021 https://scholar.google.com/scholar?q=FastMoE:+A+Fast+Mixture-of-Expert+Training+System 4. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024 https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving 5. MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism — Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang and others, 2025 https://scholar.google.com/scholar?q=MegaScale-Infer:+Serving+Mixture-of-Experts+at+Scale+with+Disaggregated+Expert+Parallelism 6. eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference — Suraiya Tairin, Shohaib Mahmud, Haiying Shen, Anand Iyer, 2025 https://scholar.google.com/scholar?q=eMoE:+Task-aware+Memory+Efficient+Mixture-of-Experts-Based+(MoE)+Model+Inference 7. SLO-Aware Compute Resource Allocation for Prefill-Decode Disaggregated LLM Inference — Luchang Li, Dongfang Li, Bozhao Gong, Yu Zhang, 2026 https://scholar.google.com/scholar?q=SLO-Aware+Compute+Resource+Allocation+for+Prefill-Decode+Disaggregated+LLM+Inference 8. MoEless: Efficient MoE LLM Serving via Serverless Computing — Hanfei Yu, Bei Ouyang, Shwai He, Ang Li, Hao Wang, 2026 https://scholar.google.com/scholar?q=MoEless:+Efficient+MoE+LLM+Serving+via+Serverless+Computing 9. xDeepServe: Model-as-a-Service on Huawei CloudMatrix384 — Ao Xiao et al., 2025 https://scholar.google.com/scholar?q=xDeepServe:+Model-as-a-Service+on+Huawei+CloudMatrix384 10. 
Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling — Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, Pengfei Zheng, 2025 https://scholar.google.com/scholar?q=Semantic+Parallelism:+Redefining+Efficient+MoE+Inference+via+Model-Data+Co-Scheduling 11. GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference — Yu Han, Lehan Pan, Jie Peng, Ziyang Tao, Hanqi Zhu, Wuyang Zhang, Yanyong Zhang, 2025 https://scholar.google.com/scholar?q=GRACE-MoE:+Grouping+and+Replication+with+Locality-Aware+Routing+for+Efficient+Distributed+MoE+Inference 12. BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems — Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, Xiaowen Chu, 2024 https://scholar.google.com/scholar?q=BurstGPT:+A+Real-world+Workload+Dataset+to+Optimize+LLM+Serving+Systems 13. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model — DeepSeek-AI et al., 2024 https://scholar.google.com/scholar?q=DeepSeek-V2:+A+Strong,+Economical,+and+Efficient+Mixture-of-Experts+Language+Model 14. Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference — Ranggi Hwang et al., 2023/2024 https://arxiv.org/abs/2308.12066 15. HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference — Peng Tang et al., 2024 https://arxiv.org/abs/2411.01433 16. DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference — Yujie Zhang, Shivam Aggarwal, Tulika Mitra, 2025 https://arxiv.org/abs/2501.10375 17. AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding — Zikun Li et al., 2025 https://arxiv.org/abs/2501.12162 18. SLOs-Serve: Optimized Serving of Multi-SLO LLMs — Siyuan Chen et al., 2025 https://arxiv.org/abs/2504.08784 19. Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement — Tian Wu et al., 2025 https://arxiv.org/abs/2508.12851 20. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3 21. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3 22. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3 23. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3 24. AI Post Transformers: LAPS for Length-Aware LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3 25. AI Post Transformers: DeepSeek-V4 and Practical Million-Token Context — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-25-deepseek-v4-and-practical-million-token-6f4de1.mp3 26. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3 Interactive Visualization: JANUS for Scalable MoE Inference
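
    To illustrate the SLO-aware scaling idea in the simplest possible terms (a toy model with invented numbers, not JANUS's actual planner), the sketch below grows the attention pool and the expert pool independently until a time-per-output-token target is met.

        def pool_ms(base_ms, replicas):
            # Toy latency model: a pool's per-token time shrinks with replication,
            # down to a fixed floor; real behavior depends on batching and routing.
            return max(base_ms / replicas, 0.5)

        def plan(slo_ms, attn_base_ms, expert_base_ms, comm_ms, max_gpus=64):
            attn, experts = 1, 1
            while pool_ms(attn_base_ms, attn) + pool_ms(expert_base_ms, experts) + comm_ms > slo_ms:
                # Grow whichever pool currently dominates token latency.
                if pool_ms(attn_base_ms, attn) >= pool_ms(expert_base_ms, experts):
                    attn += 1
                else:
                    experts += 1
                if attn + experts >= max_gpus:
                    break
            return attn, experts

        # Invented numbers: expert computation starts out much slower than attention.
        print(plan(slo_ms=20.0, attn_base_ms=30.0, expert_base_ms=90.0, comm_ms=2.0))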

  7. 2 DAYS AGO

    Lossless Sparse Deltas for RL Networks

    This episode explores a systems paper on making reinforcement-learning post-training for large language models practical over ordinary Ethernet and even WAN links, rather than requiring expensive RDMA clusters. It explains why trainer-actor RL creates a synchronization bottleneck, how full policy refreshes can dominate runtime on 1 to 10 gigabit networks, and why that turns bandwidth into a hidden limiter of who can run serious RL workloads. The discussion centers on the paper’s proposed solution: lossless sparse delta checkpoints that send only changed parameters, along with carefully encoded indices, streamed in parallel with rollout generation so actors can reconstruct the exact updated model without quantization or approximation. Listeners would find it interesting because it connects low-level systems design to the economics and accessibility of modern LLM training, asking whether better synchronization methods could open RL post-training to labs and startups outside elite infrastructure environments. Sources: 1. RL over Commodity Networks: Overcoming the Bandwidth Barrier with Lossless Sparse Deltas — Chaoyi Ruan, Geng Luo, Xinyi Wan, Long Zhao, Qinghe Wang, Jiaan Zhu, Duling Xu, Guanbin Xu, Dehui Wei, Xiang Liu, Cheng Li, Haifeng Sun, Congcong Miao, Jialin Li, 2026 http://arxiv.org/abs/2602.11456 2. Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL — Erfan Miahi, Eugene Belilovsky, 2026 https://scholar.google.com/scholar?q=Understanding+and+Exploiting+Weight+Update+Sparsity+for+Communication-Efficient+Distributed+RL 3. StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation — Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, Daxin Jiang, 2025 https://scholar.google.com/scholar?q=StreamRL:+Scalable,+Heterogeneous,+and+Elastic+RL+for+LLMs+with+Disaggregated+Stream+Generation 4. HybridFlow: A Flexible and Efficient RLHF Framework — Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan Wu, 2024 https://scholar.google.com/scholar?q=HybridFlow:+A+Flexible+and+Efficient+RLHF+Framework 5. OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework — Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Weikai Fang, Xianyu, Yu Cao, Haotian Xu, Yiming Liu, 2024 https://scholar.google.com/scholar?q=OpenRLHF:+An+Easy-to-use,+Scalable+and+High-performance+RLHF+Framework 6. How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study — Alexander Erben, Ruben Mayer, Hans-Arno Jacobsen, 2023 https://scholar.google.com/scholar?q=How+Can+We+Train+Deep+Learning+Models+Across+Clouds+and+Continents?+An+Experimental+Study 7. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 8. Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models — authors not confirmed from snippet, recent, unconfirmed from snippet https://scholar.google.com/scholar?q=Asynchronous+RLHF:+Faster+and+More+Efficient+Off-Policy+RL+for+Language+Models 9. 
Faster, More Efficient RLHF through Off-Policy Asynchronous Learning — authors not confirmed from snippet, recent, unconfirmed from snippet https://scholar.google.com/scholar?q=Faster,+More+Efficient+RLHF+through+Off-Policy+Asynchronous+Learning 10. Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs — authors not confirmed from snippet, recent, unconfirmed from snippet https://scholar.google.com/scholar?q=Stable+Asynchrony:+Variance-Controlled+Off-Policy+RL+for+LLMs 11. Efficient Online RFT with Plug-and-Play LLM Judges: Unlocking State-of-the-Art Performance — authors not confirmed from snippet, recent, unconfirmed from snippet https://scholar.google.com/scholar?q=Efficient+Online+RFT+with+Plug-and-Play+LLM+Judges:+Unlocking+State-of-the-Art+Performance 12. Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding — authors not confirmed from snippet, recent, unconfirmed from snippet https://scholar.google.com/scholar?q=Accelerating+RL+Post-Training+Rollouts+via+System-Integrated+Speculative+Decoding 13. Beat the Long Tail: Distribution-Aware Speculative Decoding for RL Training — authors not confirmed from snippet, recent, unconfirmed from snippet https://scholar.google.com/scholar?q=Beat+the+Long+Tail:+Distribution-Aware+Speculative+Decoding+for+RL+Training 14. AI Post Transformers: HALoS: Hierarchical Asynchronous LLM Training over Slow Networks — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/halos-hierarchical-asynchronous-llm-training-over-slow-networks/ 15. AI Post Transformers: TensorFlow for Distributed Machine Learning Systems — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-06-tensorflow-for-distributed-machine-learn-b7fa52.mp3 16. AI Post Transformers: AgenticQwen and Small Industrial Tool Agents — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-27-agenticqwen-and-small-industrial-tool-ag-dc676d.mp3 Interactive Visualization: Lossless Sparse Deltas for RL Networks
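
    The core delta mechanism is easy to sketch (a minimal NumPy illustration; the paper's actual index encoding and streaming pipeline are more involved): only parameters that changed since the last sync are shipped as index-value pairs, and the actor reconstructs the updated weights exactly, with no quantization or approximation.

        import numpy as np

        def make_delta(old, new):
            idx = np.flatnonzero(old != new)              # positions that actually changed
            return idx.astype(np.uint32), new[idx]

        def apply_delta(old, idx, vals):
            out = old.copy()
            out[idx] = vals                               # exact values, lossless rebuild
            return out

        rng = np.random.default_rng(0)
        old = rng.standard_normal(1_000_000).astype(np.float32)
        new = old.copy()
        touched = rng.choice(old.size, size=50_000, replace=False)
        new[touched] += 1e-3 * rng.standard_normal(touched.size).astype(np.float32)

        idx, vals = make_delta(old, new)
        print("changed fraction:", idx.size / old.size)
        print("full bytes:", new.nbytes, " delta bytes:", idx.nbytes + vals.nbytes)
        assert np.array_equal(apply_delta(old, idx, vals), new)   # bit-exact reconstruction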

  8. 2 DAYS AGO

    Ministral 3: Cascade Distillation for Long-Context Multimodal Models

    This episode explores Ministral 3, a family of 3B, 8B, and 14B long-context multimodal models built from a 24B parent through structured pruning and cascade distillation rather than separate full-scale training runs. It explains how the method works step by step, from teacher-student distillation and capacity-gap concerns to the staged pruning pipeline that extends each child model to 256k context windows while preserving useful capabilities. The discussion places the paper in context with earlier distillation and pruning work such as Hinton’s original distillation paper, DistilBERT, teacher-assistant distillation, and NVIDIA’s Minitron, arguing that the contribution is a practical model-family construction recipe rather than a brand-new paradigm. Listeners would find it interesting because it gets at a central 2026 question in AI deployment: whether smaller, cheaper models can stay competitive on long-context and multimodal tasks by amortizing one expensive parent run across several deployable descendants. Sources: 1. Ministral 3 — Alexander H. Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, Alexandre Sablayrolles, Amélie Héliou, Amos You, Andy Ehrenberg, Andy Lo, Anton Eliseev, Antonia Calvi, Avinash Sooriyarachchi, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Clémence Lanfranchi, Corentin Barreau, Cyprien Courtot, Daniele Grattarola, Darius Dabert, Diego de las Casas, Elliot Chane-Sane, Faruk Ahmed, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Georgii Novikov, Guillaume Kunsch, Guillaume Lample, Guillaume Martin, Gunshi Gupta, Jan Ludziejewski, Jason Rute, Joachim Studnia, Jonas Amar, Joséphine Delas, Josselin Somerville Roberts, Karmesh Yadav, Khyathi Chandu, Kush Jain, Laurence Aitchison, Laurent Fainsin, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Maarten Buyl, Margaret Jennings, Marie Pellat, Mark Prins, Mathieu Poirée, Mathilde Guillaumin, Matthieu Dinot, Matthieu Futeral, Maxime Darrin, Maximilian Augustin, Mia Chiquier, Michel Schimpf, Nathan Grinsztajn, Neha Gupta, Nikhil Raghuraman, Olivier Bousquet, Olivier Duchenne, Patricia Wang, Patrick von Platen, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Pavankumar Reddy Muddireddy, Philomène Chagniot, Pierre Stock, Pravesh Agrawal, Quentin Torroba, Romain Sauvestre, Roman Soletskyi, Rupert Menneer, Sagar Vaze, Samuel Barry, Sanchit Gandhi, Siddhant Waghjale, Siddharth Gandhi, Soham Ghosh, Srijan Mishra, Sumukh Aithal, Szymon Antoniak, Teven Le Scao, Théo Cachet, Theo Simon Sorg, Thibaut Lavril, Thiziri Nait Saada, Thomas Chabal, Thomas Foubert, Thomas Robert, Thomas Wang, Tim Lawson, Tom Bewley, Tom Bewley, Tom Edwards, Umar Jamil, Umberto Tomasini, Valeriia Nemychnikova, Van Phung, Vincent Maladière, Virgile Richard, Wassim Bouaziz, Wen-Ding Li, William Marshall, Xinghui Li, Xinyu Yang, Yassine El Ouahidi, Yihan Wang, Yunhao Tang, Zaccharie Ramzi, 2026 http://arxiv.org/abs/2601.08584 2. Distilling the Knowledge in a Neural Network — Geoffrey Hinton, Oriol Vinyals, Jeff Dean, 2015 https://scholar.google.com/scholar?q=Distilling+the+Knowledge+in+a+Neural+Network 3. Improved Knowledge Distillation via Teacher Assistant — Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, Hassan Ghasemzadeh, 2019 https://scholar.google.com/scholar?q=Improved+Knowledge+Distillation+via+Teacher+Assistant 4. 
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter — Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf, 2019 https://scholar.google.com/scholar?q=DistilBERT,+a+distilled+version+of+BERT:+smaller,+faster,+cheaper+and+lighter 5. Compact Language Models via Pruning and Knowledge Distillation — Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov, 2024 https://scholar.google.com/scholar?q=Compact+Language+Models+via+Pruning+and+Knowledge+Distillation 6. LLM Pruning and Distillation in Practice: The Minitron Approach — S. T. Sreenivas, S. Muralidharan, R. Joshi, M. Chochowski, A. S. Mahabaleshwarkar, G. Shen, J. Zeng, Z. Chen, Y. Suhara, S. Diao, C. Yu, W. Chen, H. Ross, O. Olabiyi, A. Aithal, O. Kuchaiev, D. Korzekwa, P. Molchanov, M. Patwary, M. Shoeybi, J. Kautz, and B. Catanzaro, 2024 https://scholar.google.com/scholar?q=LLM+Pruning+and+Distillation+in+Practice:+The+Minitron+Approach 7. Distillation Scaling Laws — D. Busbridge, A. Shidani, F. Weers, J. Ramapuram, E. Littwin, and R. Webb, 2025 https://scholar.google.com/scholar?q=Distillation+Scaling+Laws 8. Distilled Pretraining: A Modern Lens of Data, In-Context Learning and Test-Time Scaling — S. Goyal, D. Lopez-Paz, and K. Ahuja, 2025 https://scholar.google.com/scholar?q=Distilled+Pretraining:+A+Modern+Lens+of+Data,+In-Context+Learning+and+Test-Time+Scaling 9. Pixtral 12B — P. Agrawal, S. Antoniak, E. B. Hanna, B. Bout, D. Chaplot, J. Chudnovsky, D. Costa, B. De Monicault, S. Garg, T. Gervet, et al., 2024 https://scholar.google.com/scholar?q=Pixtral+12B 10. Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs — Lu Yin et al., 2024 https://scholar.google.com/scholar?q=Junk+DNA+Hypothesis:+Pruning+Small+Pre-Trained+Weights+Irreversibly+and+Monotonically+Impairs+"Difficult"+Downstream+Tasks+in+LLMs 11. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot — Elias Frantar and Dan Alistarh, 2023 https://scholar.google.com/scholar?q=SparseGPT:+Massive+Language+Models+Can+Be+Accurately+Pruned+in+One-Shot 12. Fast and Effective Weight Update for Pruned Large Language Models — Vladimir Boza, 2024 https://scholar.google.com/scholar?q=Fast+and+Effective+Weight+Update+for+Pruned+Large+Language+Models 13. Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs — Ruihan Jin et al., 2026 https://scholar.google.com/scholar?q=Exploring+Knowledge+Purification+in+Multi-Teacher+Knowledge+Distillation+for+LLMs 14. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models — Siyan Zhao et al., 2026 https://scholar.google.com/scholar?q=Self-Distilled+Reasoner:+On-Policy+Self-Distillation+for+Large+Language+Models 15. Data Engineering for Scaling Language Models to 128K Context — Yao Fu et al., 2024 https://scholar.google.com/scholar?q=Data+Engineering+for+Scaling+Language+Models+to+128K+Context 16. How to Train Long-Context Language Models (Effectively) — Tianyu Gao et al., 2025 https://scholar.google.com/scholar?q=How+to+Train+Long-Context+Language+Models+(Effectively) 17. Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models — Jun Zhang et al., 2025 https://scholar.google.com/scholar?q=Train+Small,+Infer+Large:+Memory-Efficient+LoRA+Training+for+Large+Language+Models 18. 
AI Post Transformers: DeepSeek-V4 and Practical Million-Token Context — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-25-deepseek-v4-and-practical-million-token-6f4de1.mp3 19. AI Post Transformers: Muon Is Scalable for LLM Training — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-25-muon-is-scalable-for-llm-training-587ed8.mp3 20. AI Post Transformers: Learning to Reason with 13 Parameters — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-14-learning-to-reason-with-13-parameters-54c87f.mp3 21. AI Post Transformers: AgenticQwen and Small Industrial Tool Agents — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-27-agenticqwen-and-small-industrial-tool-ag-dc676d.mp3 Interactive Visualization: Ministral 3: Cascade Distillation for Long-Context Multimodal Models
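
    As a reference point for the distillation machinery being discussed, here is a generic knowledge-distillation loss in PyTorch (a standard textbook formulation, not Mistral's training code; the temperature and weighting are placeholders): the pruned child matches the parent's softened output distribution while also fitting the ordinary next-token targets, and the cascade idea simply reuses each trained child as the teacher for the next smaller sibling.

        import torch
        import torch.nn.functional as F

        def distill_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
            # Soft-target term: match the teacher's temperature-softened distribution.
            kd = F.kl_div(
                F.log_softmax(student_logits / T, dim=-1),
                F.softmax(teacher_logits / T, dim=-1),
                reduction="batchmean",
            ) * (T * T)
            # Hard-target term: ordinary next-token cross-entropy.
            ce = F.cross_entropy(student_logits, targets)
            return alpha * kd + (1 - alpha) * ce

        vocab, batch = 100, 4
        teacher_logits = torch.randn(batch, vocab)                    # frozen parent outputs
        student_logits = torch.randn(batch, vocab, requires_grad=True)
        targets = torch.randint(0, vocab, (batch,))
        loss = distill_loss(student_logits, teacher_logits, targets)
        loss.backward()
        print(float(loss))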

