AI Post Transformers

mcgrof

Technology
Updated daily

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

1 day ago

InfiniGen for Efficient Long-Context LLM Inference

This episode explores InfiniGen, a systems approach to speeding up long-context language model inference by treating KV cache management, not raw compute, as the central bottleneck. It explains why decoding slows down when large caches have to shuttle between CPU and GPU memory, and contrasts InfiniGen with FlexGen, H2O, and PagedAttention to show how different serving setups create different memory problems. The discussion focuses on InfiniGen’s core idea: use a lightweight preview from the previous layer, along with offline-skewed query and key weights, to predict which exact cache entries will matter next and prefetch only those instead of moving the whole history. Listeners would find it interesting because the paper reports large practical gains, including up to 3x speedups and major accuracy improvements over weaker cache-selection methods, making it a concrete example of systems engineering reshaping how large models are served. Sources: 1. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management — Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 2024 http://arxiv.org/abs/2406.19707 2. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Ying Sheng et al., 2023 https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU 3. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhenyu Zhang et al., 2023 https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models 4. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon et al., 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 5. 
Efficient Streaming Language Models with Attention Sinks — Guangxuan Xiao et al., 2023 https://scholar.google.com/scholar?q=Efficient+Streaming+Language+Models+with+Attention+Sinks 6. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads — Guangxuan Xiao et al., 2024 https://scholar.google.com/scholar?q=DuoAttention:+Efficient+Long-Context+LLM+Inference+with+Retrieval+and+Streaming+Heads 7. SCBench: A KV Cache-Centric Analysis of Long-Context Methods — Yucheng Li et al., 2024 https://scholar.google.com/scholar?q=SCBench:+A+KV+Cache-Centric+Analysis+of+Long-Context+Methods 8. IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference — Xintong Yang et al., 2026 https://arxiv.org/abs/2605.25475 9. Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks — Zheng Wang et al., 2024 https://arxiv.org/abs/2407.08454 10. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization — Coleman Hooper et al., 2024 https://arxiv.org/abs/2401.18079 11. TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization — Dingyu Yao et al., 2025 https://arxiv.org/abs/2505.19586 12. LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management — Yi Xiong et al., 2024 https://arxiv.org/abs/2410.00428 13. SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget — Zihao Wang et al., 2024 https://arxiv.org/abs/2404.04793 14. KeDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments — Junyoung Park et al., 2025 https://arxiv.org/abs/2504.15364 15. AI Post Transformers: IndexMem: Learned KV-Cache Eviction for Long-Context LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-12-indexmem-learned-kv-cache-eviction-for-l-132c2a.mp3 16. 
AI Post Transformers: PackKV Lossy Compression for KV Caches — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-04-packkv-lossy-compression-for-kv-caches-b37bce.mp3 17. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3 18. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3 19. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3 20. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3 21. AI Post Transformers: QuantSpec: Hierarchical KV Cache for Self-Speculative Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/quantspec-hierarchical-kv-cache-for-self-speculative-decoding/
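
The prefetching idea in the InfiniGen description above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `dims` subset standing in for the skewed query and key weights, the fixed `budget`, and the toy numbers are all hypothetical. It assumes a speculated query for the next layer is already available, scores cached keys using only a few dimensions, and fetches just the highest-scoring entries.

```python
# Hypothetical sketch: choose which KV cache entries to prefetch from CPU memory.
# `query` stands in for a query speculated from the previous layer; `dims` stands in
# for the few dimensions that carry most of the score after offline skewing.

def approx_scores(query, keys, dims):
    """Approximate attention logits using only the selected dimensions."""
    return [sum(query[d] * key[d] for d in dims) for key in keys]


def select_prefetch(query, keys, dims, budget):
    """Return the indices of the `budget` cached tokens with the highest approximate score."""
    scores = approx_scores(query, keys, dims)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy cache with five tokens and four-dimensional keys (made-up values).
keys = [
    [0.1, 0.0, 0.2, 0.0],
    [2.0, 0.1, 1.5, 0.0],
    [0.0, 0.3, 0.1, 0.1],
    [1.8, 0.0, 1.2, 0.2],
    [0.2, 0.1, 0.0, 0.0],
]
query = [1.0, 0.1, 1.0, 0.0]

chosen = select_prefetch(query, keys, dims=[0, 2], budget=2)
fraction_moved = len(chosen) / len(keys)
print(chosen, fraction_moved)
```

In this toy case only two of the five cached entries would cross the CPU-GPU link, which is the kind of saving the episode attributes to selective prefetching. A real system would also need to decide the budget per layer and handle mispredictions.
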
1 day ago

Ling and Ring 2.6 for Trillion-Scale Agents

This episode explores Inclusion AI’s Ling and Ring 2.6 technical report, which asks how a trillion-parameter model can stay fast, handle very long contexts, and remain dependable in multi-step agent workflows. It explains why agentic AI makes latency and token costs much more painful than in ordinary chat, especially when models must carry long instruction traces, tool outputs, and large working contexts through repeated reasoning loops. The discussion breaks down the report’s core architectural changes, including a Lightning Attention and MLA hybrid with a 7:1 layer mix, designed to reduce attention cost and KV-cache memory without sacrificing model quality. It also examines the practical significance of retrofitting an existing trillion-scale checkpoint through continued pretraining and staged migration techniques, making the episode especially interesting for listeners who want a concrete look at how frontier model design is shifting from raw scale toward deployable systems engineering. Sources: 1. 
Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale — Ang Li, Ben Liu, Bin Han, Bin Hu, Bin Jing, Binbin Hu, Bing Li, Cai Chen, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, Chen Liang, Chen Qian, Chengfu Tang, Chengyao Wen, Chilin Fu, Chunwei Wu, Cong Zhang, Cunyin Peng, Daixin Wang, Dalong Zhang, Deng Zhao, Dingnan Jin, Dingyuan Zhu, Donghao Zhang, Fan Yuan, Fangzheng Zhao, Fanzhuang Meng, Feifan Wu, Feng Xu, Fengbin Fang, Gangshan Wang, Guodong Yang, Hailin Zhao, Haitao Wang, Haitao Zhang, Hanxiao Zhang, Hanzi Wang, Hao Dai, Hao Liu, Hao Qian, Hao Wu, Haoxiong Liu, Haoyu Xu, Heng Zhang, Hong Liu, Hongliang Zhang, Hongrui Liu, Hongxun Li, Hongzhi Ruan, Huaidong Xiong, Huihuang Zheng, Huikang Tang, Jia Guo, Jia Li, Jia Liu, Jiameng Wang, Jiaming Liu, Jiannan Shi, Jianping Wei, Jiaolong Yang, Jiapeng Wang, Jie Gao, Jie Wang, Jiewei Wu, Jin Yang, Jinjin Li, Jinjing Huang, Jinquan Sun, Jinyao Chen, Juanhui Tu, Jun Liu, Jun Mei, Jun Xu, Jun Zhou, Junjie Ou, Junnan Sipan, Junpeng Fang, Kaihong Zhang, Kaiqin Hu, Ke Shi, Kuan Xu, Kun Tang, Kunlong Chen, Lanyin Mei, Lei Chen, Lei Liang, Lei Xu, Li Tang, Liang Jiang, Liangcheng Fu, Lihui Zhang, Linfeng Shi, Lintao Ma, Liyuan Liu, Longfei Li, Longfei Zheng, Lu Liu, Lu Yu, Man Li, Meiqi Zhu, Meng Li, Mengjie Gao, Mengshu Sun, Mingming Yin, Mingyang Zhang, Mingyuan Fan, Nuo Xu, Pan Tang, Peijie Jiang, Peilong Zhao, Peng Lin, Pingping Liu, Qi Zuo, Qian Zhao, Qiang Cheng, Qianggang Cao, Qiaoben Bao, Qing Cui, Qingyuan Yang, Qitao Shi, Qiyin Huang, Qizheng Zhou, Quan Wan, Runyuan Zhao, Shaomian Zheng, Shaowei Wei, Shengnan Zhang, Shuaicheng Li, Shujie Li, Shuo Zhang, Sikang Bian, Tianchu Yao, Tiange Xu, Tianshu Wang, Ting Guo, Tinghao Wang, Tingwei Huang, Tong Zhao, Tongkai Yang, Wang Hong, Wanli Gu, Wei Lu, Weichang Wu, Weiguang Han, Weiquan Li, Wenbo Shen, Wenjing Fang, Wenzhi Tang, Xiang Shu, Xiao Shi, Xiaodong Yan, Xiaolu Zhang, Xiaopei Wan, Xiaqing Sun, Xin Zhao, 
Xingyu Lu, Xinxing Yang, Xinyao Tang, Xinyu Kong, Xinyu Liu, Xiong Xu, Xuan Sun, Xudong Han, Xudong Wang, Xujie Shen, Yalin Zhang, Yangyang Hou, Yankun Ren, Yao Zhao, Ye Chen, Yeyang Chen, Yibo Cao, Yifan Zuo, Yijie Chen, Ying Li, Yingjie Song, Yingxue Li, Yiqi Wang, Yixuan Sun, Yizhu Xiao, Yongfei Xu, Yu Liu, Yuchen Fang, Yue Gao, Yue Yu, Yue Zhang, Yuqi Zhang, Yuxiao He, Yuxiao Lu, Yuxin Tian, Yuxuan Li, Yuzhuo Fu, Zhankai Xu, Zhaoxin Huan, Zhenduo Zhang, Zhengke Gui, Zhengyu Huang, Zhenjun Ma, Zhenxuan Pan, Zheping Qu, Zhibo Zhu, Zhidong Fan, Zhigang Huangfu, Zhihao Wang, Zhiqiang Zhang, Zhizhen Liu, Zhuyan Zhou, Zibin Lin, Zihang Zeng, Zihao Wang, Zilong Wang, Ziqi Liu, Zitao Xuan, Zixuan Cheng, Zujie Wen, Zuoli Tang, 2026 http://arxiv.org/abs/2606.15079 2. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou, 2022 https://scholar.google.com/scholar?q=Chain-of-Thought+Prompting+Elicits+Reasoning+in+Large+Language+Models 3. Let's Verify Step by Step — Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe, 2023 https://scholar.google.com/scholar?q=Let's+Verify+Step+by+Step 4. CoT-Evo: Evolutionary Distillation of Chain-of-Thought for Scientific Reasoning — Kehua Feng, Keyan Ding, Zhihui Zhu, Lei Liang, Qiang Zhang, Huajun Chen, 2025 https://scholar.google.com/scholar?q=CoT-Evo:+Evolutionary+Distillation+of+Chain-of-Thought+for+Scientific+Reasoning 5. Chain Of Thought Compression: A Theoretical Analysis — Juncai Li, Ru Li, Yuxiang Zhou, Boxiang Ma, Jeff Z. Pan, 2026 https://scholar.google.com/scholar?q=Chain+Of+Thought+Compression:+A+Theoretical+Analysis 6. 
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning (https://arxiv.org/abs/2510.19338) — Ling Team et al., 2025 https://scholar.google.com/scholar?q=Every+Attention+Matters:+An+Efficient+Hybrid+Architecture+for+Long-Context+Reasoning+(https://arxiv.org/abs/2510.19338) 7. Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention (https://arxiv.org/abs/2405.17381) — Zhen Qin et al., 2024 https://scholar.google.com/scholar?q=Various+Lengths,+Constant+Speed:+Efficient+Language+Modeling+with+Lightning+Attention+(https://arxiv.org/abs/2405.17381) 8. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (https://arxiv.org/abs/2405.04434) — DeepSeek-AI et al., 2024 https://scholar.google.com/scholar?q=DeepSeek-V2:+A+Strong,+Economical,+and+Efficient+Mixture-of-Experts+Language+Model+(https://arxiv.org/abs/2405.04434) 9. Holistic Capability Preservation: Towards Compact Yet Comprehensive Reasoning Models (https://arxiv.org/abs/2504.07158) — Ling Team et al., 2025 https://scholar.google.com/scholar?q=Holistic+Capability+Preservation:+Towards+Compact+Yet+Comprehensive+Reasoning+Models+(https://arxiv.org/abs/2504.07158) 10. Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model (https://arxiv.org/abs/2510.18855) — Ling Team et al., 2025 https://scholar.google.com/scholar?q=Every+Step+Evolves:+Scaling+Reinforcement+Learning+for+Trillion-Scale+Thinking+Model+(https://arxiv.org/abs/2510.18855) 11. Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries (https://arxiv.org/abs/2409.12640) — Kiran Vodrahalli et al., 2024 https://scholar.google.com/scholar?q=Michelangelo:+Long+Context+Evaluations+Beyond+Haystacks+via+Latent+Structure+Queries+(https://arxiv.org/abs/2409.12640) 12. HyperAttention: Long-context Attention in Near-Linear Time — Insu Han et al., 2023 https://arxiv.org/abs/2310.05869 13. 
MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling — MiniCPM Team / Wenhao An et al., 2026 https://arxiv.org/abs/2602.11761 14. Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning — Wenkai Yang et al., 2025 https://arxiv.org/abs/2502.18080 15. Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs — Mohammad Ali Alomrani et al., 2025 https://arxiv.org/abs/2507.02076 16. Anyprefer: An Agentic Framework for Preference Data Synthesis — Yiyang Zhou et al., 2025 https://arxiv.org/abs/2504.19276 17. Towards Comprehensive Preference Data Collection for Reward Modeling — Yulan Hu et al., 2024 https://arxiv.org/abs/2406.16486 18. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3 19. AI Post Transformers: DeepSeek-V4 and Practical Million-Token Context — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-25-deepseek-v4-and-practical-million-token-6f4de1.mp3 20. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3 21. AI Post Transformers: Self-Improving Pretraining With Post-Trained Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-02-self-improving-pretraining-with-post-tra-e37460.mp3 22. AI Post Transformers: Agentic Discovery for Test-Time Scaling — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-12-agentic-discovery-for-test-time-scaling-f9a81f.mp3 23. AI Post Transformers: Nemotron 3 Super Hybrid Mamba-Transformer MoE — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-19-nemotron-3-super-hybrid-mamba-transforme-31ac75.mp3 24. AI Post Transformers: Kimi K2.5 and Visual Agent Swarms — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-24-kimi-k25-and-visual-agent-swarms-7d04d7.mp3 Interactive Visualization: Ling and Ring 2.6 for Trillion-Scale Agents
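
To make the 7:1 layer mix mentioned above concrete, here is a small sketch. It assumes, without confirmation from the report, that the ratio means seven Lightning Attention layers for every MLA layer and that the MLA layer closes each group of eight; the 32-layer depth and the function names are hypothetical. The point it illustrates is general: linear-attention layers keep a fixed-size state, while MLA layers still store a (compressed) cache entry per token, so only a fraction of layers have a cache that grows with context length.

```python
# Hypothetical sketch of a 7:1 hybrid layer schedule (ordering is an assumption).

def layer_schedule(num_layers, linear_per_full=7):
    """Label each layer, placing one MLA layer after every `linear_per_full` linear layers."""
    group = linear_per_full + 1
    return ["mla" if (i + 1) % group == 0 else "lightning" for i in range(num_layers)]


def growing_cache_fraction(schedule):
    """Fraction of layers whose cache grows with the number of tokens."""
    return schedule.count("mla") / len(schedule)


schedule = layer_schedule(32)  # 32 layers is an illustrative depth, not the model's
print(schedule[:8])
print(growing_cache_fraction(schedule))
```

Under these assumptions one layer in eight carries a per-token cache, which is why such hybrids are described as reducing KV-cache memory at long context lengths.
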
1 day ago

Nemotron 3 Ultra for Long-Horizon Agents

This episode explores NVIDIA’s Nemotron 3 Ultra, an open 550-billion-parameter mixture-of-experts model with only 55 billion parameters active per token, a hybrid Mamba-transformer backbone, and a 1 million token context window aimed at long-running agentic reasoning. It explains how sparse MoE routing, Mamba-style sequence layers, and low-precision NVFP4 training are used to cut KV-cache pressure, memory bandwidth costs, and decode-time latency for workloads like extended coding, tool use, and document-heavy planning. The discussion also breaks down the model’s full training and post-training stack, including LatentMoE, multi-token prediction, RL for reasoning and tool use, specialist-teacher distillation, and user-facing reasoning budget control. Listeners would find it interesting because the episode goes beyond benchmark headlines to examine the real argument of the paper: that long-horizon AI agents depend as much on serving economics and systems design as on raw model intelligence. Sources: 1. 
Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning — NVIDIA: Aaron Blakeman, Aaron Thomas, Aastha Jhunjhunwala, Abhibha Gupta, Abhinav Khattar, Adam Rajfer, Adi Renduchintala, Adil Asif, Aditya Vavre, Adriana Flores Miranda, Ahmad Bilal, Aileen Zaman, Ajay Hotchandani, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Alex Gronskiy, Alex Kondratenko, Alex Steiner, Alex Ye, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alice Gatti, Alisa Liu, Alok Kumar, Amar Phanishayee, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Anahita Bhiwandiwalla, Ananth Subramaniam, Andrea Santilli, Andrew Fulks, Andrew McHarg, Andrew Tao, Andrii Skliar, Anjulie Agrusa, Ankur Srivastava, Ankur Verma, Anna Shors, Anna Warno, Antoni-Joan Solergibert I Llaquet, Arham Mehta, Arkadiusz Nowaczynski, Arti Jain, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asit Mishra, Asma Kuriparambil Thekkumpate, Atefeh Sohrabizadeh, Avinash Kaur, Avinash Vem, Ayush Dattagupta, Barath Subramaniam Anandan, Bardiya Sadeghi, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bill Thiede, Bita Darvish Rouhani, Bo Deng, Bob Schatz, Boris Ginsburg, Boxin Wang, Brad Nemire, Brandon Norick, Brian Dang, Brian Westphal, Brian Yu, Brucek Khailany, Bryan Catanzaro, Carlo del Mundo, Caryln Aarish, Chankyu Lee, Chantal Hwang, Charbel Sakr, Charles Wang, Charlie Truong, Chen Cui, Cheng Cheng, Cheng-Ping Hsieh, Chenghao Zhang, Chenhui Deng, Chintan Patel, Chris Alexiuk, Christian Cosgrove, Christian Munley, Christine Harvey, Christopher Parisien, Chunyang Shen, Coco Li, Collin Neale, Cynthia Gao, Cyril Meurillon, Dan Gil, Dan Su, Dan Zhao, Dane Corneil, Daniel Afrimi, Daniel Egert, Daniel Korzekwa, Daniel Lo, Daniel Machlab, Daniel Serebrenik, Daniil Sorokin, Daria Gitman, Daria Levy, Darko Stosic, David Mosallanezhad, David Yu, Davit Karamyan, Deena Donia, Deep Debroy, Deepak Narayanan, Devin O'Kelly, Dheeraj Peri, Dhruv 
Nathawani, Di, Wu, Dima Rekesh, Divyanshu Kakwani, Donald Plummer, Dong Anh, Dongfeng Yu, Dongfu Jiang, Donnie Kim, Dorrin Poorkay, Duncan Riach, Dusan Stosic, Dustin VanStee, Eavan Meng, Edgar Minasyan, Edward Lin, Eileen Margaret Peters Long, Elad Sarafin, Elad Segal, Elena Lantz, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Pham-Hung, Eric Tramel, Eric Yang, Erick Galinkin, Erik Pounds, Erika Goncalves Goncalves, Evan Briones, Evan Wu, Evelina Bakhturina, Evgeny Tsykunov, Ewa Dobrowolska, Faisal Ladhak, Farzan Memarian, Fay Wang, Fei Jia, Felipe Soares, Felipe Vieira Frujeri, Feng Chen, Fengguang Lin, Ferenc Galko, Frank Sun, Frankie Siino, Frida Hou, Gal Hubara Agam, Gal Kaplun, Gantavya Bhatt, Gargi Prasad, Garvit Kulshreshtha, George Armstrong, Gerald Shen, Giulio Borghesi, Gordana Neskovic, Gorkem Batmaz, Grace Lam, Greg Mason, Greg Pauloski, Grigor Nalbandyan, Grzegorz Chlebus, Grzegorz Karch, Guan-Ting Liu, Guoming Zhang, Guyue Huang, Haggai Maron, Haifeng Qian, Haim Elisha, Haoxing Ren, Haran Kumar Shiv Kumar, Haribhau Hud, Harris Nover, Harrison Saturley Hall, Hayate Iso, Helen Ngo, Herbert Hum, Herman Sahota, Hexin Wang, Himanshu Soni, Hovhannes Tamoyan, Hua Li, Huanhuan Chen, Hui Li, Hui Wang, Huy Nguyen, Ian Chiles, Ido Galil, Ido Shahaf, Igor Gitman, Igor Shovkun, Ilya Loshchilov, Ingo Guehring, Itamar Schen, Itay Levy, Itay Neeman, Ivan Moshkov, Izik Golan, Izzy Putterman, Jaemin Choi, Jakub Slowikowski, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jatin Mitra, Jeffrey Glick, Jenny Chen, Jesse Oliver, Jiacheng Xu, Jiafan Zhu, Jialin Song, Jian Zhang, Jiantao Jiao, Jiaqi Zeng, Jie Lou, Jim King, Jimmy Zhang, Jingquan Wang, Jinhang Choi, Jinju Chu, Joey Conway, Joey Guman, Johan Jatko, Johannes Rausch, John Kamalu, John Roberts, Johnny Greco, Johnny Mensel, Jonah Alben, Jonas Yang, Jonathan Cohen, Jonathan Raiman, Joseph Jennings, Joshua Mabry, Joshua Pierce, Joyjit Daw, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kajal Jain, Kan 
Zhu, Kari Briski, Katherine Cheung, Katherine Luna, Keith Willowhawk, Keith Wyss, Keshav Santhanam, Kevin Shih, Kezhi Kong, Khanh Nguyen, Khushi Bhardwaj, Kirthi Shankar Sivamani, Konstantinos Krommydas, Krishna C. Puvvada, Krzysztof Pawelec, Kumar Anik, Kyle Keprios, Kylie Day, Lawrence McAfee, Leo Du, Leon Derczynski, Li Ding, Linda Liu, Lingjie Wu, Lior Kadoch, Lizzie Wei, Luis Vega, Luke Robison, Lun Su, Maarten Van Segbroeck, Maciej Jakub Mikulski, Maer Rodrigues de Melo, Magda Sypula, Mahan Fathi, Makesh Narsimhan Sreedhar, Makesh Tarun Chandran, Manoj Kilaru, Maor Ashkenazi, Marc Cuevas, Marc Romeijn, Marcin Chochowski, Mark Cai, Mark Mozolewski, Markus Kliegl, Marta Stepniewska-Dziubinska, Martyna Patelka, Mattei Machczynski, Matvei Novikov, Mauricio Ferrato, Maximilian Golub, Mehrzad Samadi, Melissa Corpuz, Mengru Wang, Mengxi Wu, Meredith Price, Meriem Boubdir, Micah Schaffer, Michael Andersch, Michael Boone, Michael Gschwind, Michael Lightstone, Michael Loh, Michal Bien, Michal Zawalski, Michelle Gill, Miguel Martinez, Mikail Khona, Mike Chrzanowski, Mike Houston, Mingyuan Ma, Minseok Lee, Mohamed Fawzy, Mohammad Dabbah, Mohammad Shoeybi, Mostofa Patwary, Nabin Mulepati, Najeeb Nabwani, Namit Dhameja, Narimane Hennouni, Natalie Hereth, Nathaniel Pinckney, Nave Algarici, Nave Assaf, Netanel Haber, Nicholas Knight, Nick Reamaroon, Nickson Quak, Nidhi Bhatia, Nikhil Desai, Nikolai Ludwig, Nima Tajbakhsh, Ning Xu, Nir Ailon, Nirmal Juluru, Nitin Nitin, Ofri Masad, Oleg Rybakov, Oleksii Hrinchuk, Oleksii Kuchaiev, Olivia Viessmann, Olivier Delalleau, Oluwatobi Olabiyi, Omer Ullman Argov, Omri Puny, Oren Tropp, Pablo Ribalta, Pallab Bhattacharya, Panos Lampropoulos, Parth Mannan, Pasha Shamis, Patrick Legresley, Paul Gibbons, Pavlo Molchanov, Pawel Morkisz, Peter Dykas, Peter Jin, Pierre-Yves Aquilanti, Pinky Xu, Piotr Januszewski, Piotr Laskiewicz, Pooya Jannaty, Prakash Gurumurthy, Pranav Prashant Thombre, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, 
Puhui Meng, Qiyu Wan, Rabeeh Karimi Mahabadi, Rachel Oberman, Rachit Garg, Radha Sri-Tharan, Rahul Kandu, Rakshit Sanadhya, Ran El-Yaniv, Ran Zilberstein, Rasoul Shafipour, Ray Macalisang, Rayen Tian, Reka Kovacs, Renjie Pi, Rick Izzo, Rima Shahbazyan, Rishabh Garg, Rishi Puri, Rita Fernandes Neves, Ritchie Zhao, Ritika Borkar, Ritu Gala, Riyad Islam, Robert Clark, Robert Hesse, Robert Kirby, Roger Waleffe, Rohit Watve, Roi Koren, Ron Banner, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Ryan Stewart, Ryota Egashira, Sadegh Mahdavi, Saee Paliwal, Sagar Singh, Sahil Modi, Salika Dave, Samantha Shinagawa, Samuel Kriman, Sandip Bhaskar, Sangkug Lym, Sanjay Kariyappa, Sanjeev Satheesh, Saran Vikas Murari, Satish Pasumarthi, Saurabh Mishra, Saurav Muralidharan, Scott Hara, Sean Narentharen, Selvaraj Anandaraj, Seonjin Na, Seonmeyong Bak, Seonmyeong Bak, Sepehr Sameni, Seph Mard, Serge Panev, Seth Henneman, Seth Poulos, Shahar Mor, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere Sreenivas, Sharon Mendelson, Shaun Kotek, Shawn Wang, Shay Aharon, Shaya Gharghabi, Sheng-Chieh Lin, Shi Chen, Shiqing Fan, Shirish Baskaran, Shreya Gopa, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shuoyang Ding, Shwetha Krishnamurthy, Siddharth Singh, Simeng Sun, Sirshak Das, Sivakumar Arayandi Thottakara, Smita Ithape, Somshubra Majumdar, Soumye Singhal, Sri Harsha Singudasu, Sridhar Bhuvanapalli, Srimukh Veccham, Stas Sergienko, Stefania Alborghetti, Stephen Ge, Su Rong, Sugam Dipak Devare, Sukrit Rao, Sumeet Kumar Barua, Sungsoo Ha, Sunny Gai, Suriya Gunasekar, Suseella Panguluri, Suyog Gupta, Sviataslau Hinzburh, Sweta Priyadarshi, Syeda Nahida Akter, Talor Abramovich, Tan Bui, Tanay Varshney, Tatevik Ter-Hovhannisyan, Teodor-Dumitru Ene, Terry Kong, Thanh Do, Tianhe Zhang, Tiffany Moore, Tijmen Blankevoort, Tim Moon, Tiyasa Mitra, Tom Balough, Tomasz Grzegorzek, Tomasz Hliwiak, Tomer Asida, Tomer Bar Natan, Tomer Keren, Tomer Ronen, Tony Salim, Tony Wang, Traian Rebedea, 
Tugrul Konuk, Twinkle Vashishth, Udi Karpas, Ushnish De, Vahid Noorozi, Venkat Srinivasan, Venmugil Elango, Vibhor Agrawal, Victor Cui, Vijay Korthikanti, Vikas Mehta, Vinay Rao, Virginia Wu, Vitaly Kurin, Vitaly Lavrukhin, Vladimir Anisimov, Vu Pham, Wanli Jiang, Wasi Uddin Ahmad, Wataru Ishihara, Wei Du, Wei Ping, Weiheng Chai, Wenliang Dai, Wesley Helmholz, Will Jennings, Will Zhu, Wojciech Prazuch, Xiaowei Ren, Xiwen Yu, Yan Breek, Yang Chen, Yang Yu, Yangyi Chen, Yaniv Galron, Yashaswi Karnati, Yejin Choi, Yev Meyer, Yi-Fu Wu, Yian Zhang, Ying Lin, Yonatan Geifman, Yonggan Fu, Youngeun Kwon, Yu Yao, Yugi Guvvla, Yuki Huang, Yunsheng Liu, Zach Moshe, Zachary Newell, Zhilin Wang, Zhiyu Li, Zhongbo Zhu, Zhuolin Yang, Zihan Liu, Zijie Yan, Zsolt-Alon Wertheimer, 2026 http://arxiv.org/abs/2606.15007 2. Distilling the Knowledge in a Neural Network — Geoffrey Hinton, Oriol Vinyals, Jeff Dean, 2015 https://arxiv.org/abs/1503.02531 3. On-Policy Distilla
1 day ago

OpenSkill for Open-World Self-Evolution in LLM Agents

This episode explores OpenSkill, a framework for LLM agents that tries to improve behavior after deployment by building durable, reusable skills from public evidence rather than retraining model weights. It explains how the paper separates ordinary tool use from open-world self-evolution, arguing that the key challenge is not just acting with browsers and code, but turning documentation, repositories, papers, and tutorials into explicit procedures and verification checks. The discussion focuses on the paper’s central claim that agents can create their own proxy tests through grounded verification anchors without leaking hidden benchmark answers, and compares that approach with earlier systems like Reflexion, Voyager, ExpeL, AutoSkill, and Memento-Skills. Listeners would find it interesting because it gets at a practical industry problem: whether agents can stay useful as APIs, websites, and workflows change, or whether the verifier remains the real bottleneck to genuine self-improvement. Sources: 1. OpenSkill: Open-World Self-Evolution for LLM Agents — Zhiling Yan, Dingjie Song, Hanrong Zhang, Wei Liang, Yuxuan Zhang, Yutong Dai, Lifang He, Philip S. Yu, Ran Xu, Xiang Li, Lichao Sun, 2026 http://arxiv.org/abs/2606.06741 2. Reflexion: Language Agents with Verbal Reinforcement Learning — Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao, 2023 https://arxiv.org/abs/2303.11366 3. Voyager: An Open-Ended Embodied Agent with Large Language Models — Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar, 2023 https://arxiv.org/abs/2305.16291 4. ExpeL: LLM Agents Are Experiential Learners — Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, Gao Huang, 2023 https://arxiv.org/abs/2308.10144 5. 
Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration — Qifan Zhang, Dongyang Ma, Tianqing Fang, Jia Li, Jing Tang, Nuo Chen, Haitao Mi, Yan Wang, 2026 https://arxiv.org/abs/2604.18131 6. Constitutional AI: Harmlessness from AI Feedback — Yuntao Bai, Saurav Kadavath, Amanda Askell, Ethan Perez, Jared Kaplan, Dario Amodei, Tom Brown and collaborators, 2022 https://arxiv.org/abs/2212.08073 7. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback — Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, Sushant Prakash, 2023 https://arxiv.org/abs/2309.00267 8. Self-Rewarding Language Models — Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston, 2024 https://arxiv.org/abs/2401.10020 9. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models — Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu, 2023 https://arxiv.org/abs/2309.12284 10. SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks — Xiangyi Li et al., 2026 https://scholar.google.com/scholar?q=SkillsBench:+Benchmarking+How+Well+Agent+Skills+Work+Across+Diverse+Tasks 11. AutoSkill: Experience-Driven Lifelong Learning via Skill Self-Evolution — Yutao Yang et al., 2026 https://scholar.google.com/scholar?q=AutoSkill:+Experience-Driven+Lifelong+Learning+via+Skill+Self-Evolution 12. Memento-Skills: Let Agents Design Agents — Huichi Zhou et al., 2026 https://scholar.google.com/scholar?q=Memento-Skills:+Let+Agents+Design+Agents 13. SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces — Chang Jin et al., 2026 https://scholar.google.com/scholar?q=SkillSafetyBench:+Evaluating+Agent+Safety+under+Skill-Facing+Attack+Surfaces 14. 
EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction — Siyu Yuan et al., 2024 https://scholar.google.com/scholar?q=EASYTOOL:+Enhancing+LLM-based+Agents+with+Concise+Tool+Instruction 15. AutoRefine: From Trajectories to Reusable Expertise for Continual LLM Agent Refinement — Libin Qiu et al., 2026 https://scholar.google.com/scholar?q=AutoRefine:+From+Trajectories+to+Reusable+Expertise+for+Continual+LLM+Agent+Refinement 16. SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents — Jiaye Lin et al., 2025 https://scholar.google.com/scholar?q=SE-Agent:+Self-Evolution+Trajectory+Optimization+in+Multi-Step+Reasoning+with+LLM-Based+Agents 17. When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs — Fangyi Yu, 2025 https://scholar.google.com/scholar?q=When+AIs+Judge+AIs:+The+Rise+of+Agent-as-a-Judge+Evaluation+for+LLMs 18. Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments — Yuran Li et al., 2025 https://scholar.google.com/scholar?q=Leveraging+LLMs+as+Meta-Judges:+A+Multi-Agent+Framework+for+Evaluating+LLM+Judgments 19. AI Post Transformers: The Endless Gym: Training Terminal Agents — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/the-endless-gym-training-terminal-agents/ 20. AI Post Transformers: When AI Builds Itself and Recursive Self-Improvement — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-05-when-ai-builds-itself-and-recursive-self-8bbf9e.mp3 21. AI Post Transformers: Self-Improving Pretraining With Post-Trained Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-02-self-improving-pretraining-with-post-tra-e37460.mp3 22. AI Post Transformers: Split Personality Training Reveals Latent Knowledge — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-08-split-personality-training-reveals-laten-c84616.mp3 23. 
AI Post Transformers: In-Place Test-Time Training for Transformers — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3 Interactive Visualization: OpenSkill for Open-World Self-Evolution in LLM Agents
1 day ago

SageAttention2 and Fast Exact INT4 Attention

This episode explores SageAttention2, an ICML 2025 paper on making exact transformer attention faster without changing the underlying computation, focusing on why long-context models still pay a steep quadratic cost and why exact kernels remain important despite sparse and linear alternatives. It explains the paper’s central claim that aggressive low-precision attention can work only with careful numerical repair: queries and keys are pushed to INT4, attention-weight and value computation moves toward FP8, and outlier-smoothing ideas inspired by SmoothQuant are used to keep softmax-sensitive logits from collapsing. The discussion highlights the paper’s most concrete systems contribution, per-thread INT4 quantization aligned to GPU thread fragments and PTX `mma` execution, which aims to get fine-grained scaling without losing the performance win to dequantization overhead. A listener would find it interesting because the episode turns a seemingly narrow kernel optimization into a broader argument about hardware-software co-design, showing how much engineering is required to make lower-bit attention practical rather than just theoretically faster. Sources: 1. SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization — Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen, 2024 http://arxiv.org/abs/2411.10958 2. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Tri Dao, 2023 https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning 3. INT-FlashAttention: Enabling Flash Attention for INT8 Quantization — Shimao Chen, Zirui Liu, Zhiying Wu, et al., 2024 https://scholar.google.com/scholar?q=INT-FlashAttention:+Enabling+Flash+Attention+for+INT8+Quantization 4. 
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration — Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, Jianfei Chen, 2025 https://scholar.google.com/scholar?q=SageAttention:+Accurate+8-Bit+Attention+for+Plug-and-play+Inference+Acceleration 5. SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization — Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen, 2025 https://scholar.google.com/scholar?q=SageAttention2:+Efficient+Attention+with+Thorough+Outlier+Smoothing+and+Per-thread+INT4+Quantization 6. Understanding and Overcoming the Challenges of Efficient Transformer Quantization — Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort, 2021 https://scholar.google.com/scholar?q=Understanding+and+Overcoming+the+Challenges+of+Efficient+Transformer+Quantization 7. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models — Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han, 2023 https://scholar.google.com/scholar?q=SmoothQuant:+Accurate+and+Efficient+Post-Training+Quantization+for+Large+Language+Models 8. Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling — Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, Xianglong Liu, 2023 https://scholar.google.com/scholar?q=Outlier+Suppression+:+Accurate+quantization+of+large+language+models+by+equivalent+and+optimal+shifting+and+scaling 9. QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs — Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, et al., 2024 https://scholar.google.com/scholar?q=QuaRot:+Outlier-Free+4-Bit+Inference+in+Rotated+LLMs 10. 
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision — Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, 2024 https://scholar.google.com/scholar?q=FlashAttention-3:+Fast+and+Accurate+Attention+with+Asynchrony+and+Low-precision 11. QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving — Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han, 2024 https://scholar.google.com/scholar?q=QServe:+W4A8KV4+Quantization+and+System+Co-design+for+Efficient+LLM+Serving 12. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention — Huiqiang Jiang et al., 2024 https://arxiv.org/abs/2407.02490 13. SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention — Qianchao Zhu et al., 2024 https://arxiv.org/abs/2406.15486 14. FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference — Xunhao Lai et al., 2025 https://arxiv.org/abs/2502.20766 15. Activation Outliers in Transformer Quantization: Reproduction, Statistical Analysis, and Deployment Tradeoffs — Pranav Kumar Kaliaperumal, 2026 https://arxiv.org/abs/2603.04308 16. BAPS: A Fine-Grained Low-Precision Scheme for Softmax in Attention via Block-Aware Precision reScaling — Zisheng Ye et al., 2026 https://arxiv.org/abs/2602.02071 17. Softpick: No Attention Sink, No Massive Activations with Rectified Softmax — Zayd M. K. Zuhri et al., 2025 https://arxiv.org/abs/2504.20966 18. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3 19. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3 20. AI Post Transformers: NanoFlow and the Future of LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-nanoflow-and-the-future-of-llm-serving-7429c9.mp3 Interactive Visualization: SageAttention2 and Fast Exact INT4 Attention
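The fine-grained scaling point in the SageAttention2 notes can be illustrated with a toy sketch. This is not the paper's per-thread kernel: it is plain symmetric INT4 rounding in pure Python, and the function names, the sample values, and the group size of 4 are all assumptions chosen for illustration. It only shows why giving small groups their own scale can reduce error when a few large outliers share a tensor with small values.

```python
def quantize_int4(values):
    """Symmetric INT4 quantization with one scale for the whole input list.

    Hypothetical helper: codes are clamped to [-7, 7].
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def mean_abs_error(values, group_size):
    """Quantize each group with its own scale and report the mean absolute error."""
    total = 0.0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        codes, scale = quantize_int4(group)
        restored = dequantize(codes, scale)
        total += sum(abs(a - b) for a, b in zip(group, restored))
    return total / len(values)


# Illustrative data (an assumption): four small values followed by four outliers.
values = [0.1, 0.25, -0.3, 0.4, 8.0, -7.0, 6.0, 5.0]

coarse_error = mean_abs_error(values, group_size=len(values))  # one shared scale
fine_error = mean_abs_error(values, group_size=4)              # one scale per group of 4

print(f"one scale for all values: {coarse_error:.4f}")
print(f"one scale per group of 4: {fine_error:.4f}")
```

With a single shared scale, the outliers set the step size and the small values all round to zero. With a scale per group, the small values keep most of their resolution, which is the motivation for finer-grained scales in low-bit attention.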
1 day ago

When Quantization Hurts Reasoning Models

This episode explores how quantization affects reasoning models, asking how much weights, activations, and KV caches can be compressed before multi-step reasoning starts to fail. It explains the main quantization strategies in practical serving terms, from weight-only methods like AWQ and GPTQ to weight-activation schemes such as W8A8 and W4A4, and KV cache compression for long decoding traces. The discussion argues that reasoning models are unusually fragile because small numerical errors can compound across long solution paths, making calibration quality and benchmark choice far more important than they are for ordinary chat models. Listeners would find it interesting for its concrete look at the tradeoff between cheaper inference and reliable reasoning, grounded in evaluations across model families from 1.5B to 70B and difficult benchmarks in math, science, and code. Sources: 1. Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models — Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, Lu Hou, 2025 http://arxiv.org/abs/2504.04823 2. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models — Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han, 2023 https://scholar.google.com/scholar?q=SmoothQuant:+Accurate+and+Efficient+Post-Training+Quantization+for+Large+Language+Models 3. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han, 2024 https://scholar.google.com/scholar?q=AWQ:+Activation-aware+Weight+Quantization+for+LLM+Compression+and+Acceleration 4. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization — Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. 
Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami, 2024 https://scholar.google.com/scholar?q=KVQuant:+Towards+10+Million+Context+Length+LLM+Inference+with+KV+Cache+Quantization 5. Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization Degradation for Mathematical Reasoning — Zhen Li, Yupeng Su, Runming Yang, Congkai Xie, Zheng Wang, Zhongwei Xie, Ngai Wong, Hongxia Yang, 2025 https://scholar.google.com/scholar?q=Quantization+Meets+Reasoning:+Exploring+LLM+Low-Bit+Quantization+Degradation+for+Mathematical+Reasoning 6. Evaluating Quantized Large Language Models — Shiyao Li et al., 2024 https://scholar.google.com/scholar?q=Evaluating+Quantized+Large+Language+Models 7. FlatQuant: Flatness Matters for LLM Quantization — Yuxuan Sun et al., 2024 https://scholar.google.com/scholar?q=FlatQuant:+Flatness+Matters+for+LLM+Quantization 8. s1: Simple Test-Time Scaling — Niklas Muennighoff et al., 2025 https://scholar.google.com/scholar?q=s1:+Simple+Test-Time+Scaling 9. What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study — Keyu Lv et al., 2026 https://arxiv.org/abs/2601.14888 10. Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation — Richard J. Young, 2026 https://arxiv.org/abs/2603.20172 11. On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models — Sree Harsha Tanneru et al., 2024 https://arxiv.org/abs/2406.10625 12. Activation Outliers in Transformer Quantization: Reproduction, Statistical Analysis, and Deployment Tradeoffs — Pranav Kumar Kaliaperumal, 2026 https://arxiv.org/abs/2603.04308 13. KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems — Hancheng Ye et al., 2025 https://arxiv.org/abs/2510.12872 14. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3 15. AI Post Transformers: Mooncake for KV Cache-Centric LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-05-mooncake-for-kv-cache-centric-llm-servin-1086d0.mp3 16. AI Post Transformers: IndexMem: Learned KV-Cache Eviction for Long-Context LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-12-indexmem-learned-kv-cache-eviction-for-l-132c2a.mp3 17. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
2 days ago

DafnyBench and LLMs for Formal Verification

This episode explores DafnyBench, a benchmark for testing whether large language models can help with one of formal verification’s hardest practical bottlenecks: reconstructing the missing assertions and loop invariants that make Dafny programs verifiable. It explains how formal verification differs from ordinary testing and from theorem proving, and why the paper deliberately frames the task as restoring proof hints in existing verified programs rather than synthesizing correct software from scratch. The discussion digs into benchmark design, including the dataset of 782 single-file Dafny programs, the rule that models must infer both the content and placement of missing hints, and the importance of excluding shortcut tricks like disabling verification. It also highlights a crucial result nuance: 208 files already verify after hint removal, so the reported top score of about 67.8% is more informative when translated into genuine recovery performance on the subset that actually needs new annotations. Sources: 1. DafnyBench: A Benchmark for Formal Software Verification — Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, Max Tegmark, 2024 http://arxiv.org/abs/2406.08467 2. Clover: Closed-Loop Verifiable Code Generation — Chuyue Sun, Ying Sheng, Oded Padon, Clark Barrett, 2024 https://scholar.google.com/scholar?q=Clover:+Closed-Loop+Verifiable+Code+Generation 3. Towards AI-Assisted Synthesis of Verified Dafny Methods — Md Rakib Hossain Misu, Cristina V. Lopes, Iris Ma, James Noble, 2024 https://scholar.google.com/scholar?q=Towards+AI-Assisted+Synthesis+of+Verified+Dafny+Methods 4. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models — Kaiyu Yang et al., 2023 https://scholar.google.com/scholar?q=LeanDojo:+Theorem+Proving+with+Retrieval-Augmented+Language+Models 5. 
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code — Naman Jain et al., 2024 https://scholar.google.com/scholar?q=LiveCodeBench:+Holistic+and+Contamination+Free+Evaluation+of+Large+Language+Models+for+Code 6. Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification — Xu Xu et al., 2025 https://scholar.google.com/scholar?q=Local+Success+Does+Not+Compose:+Benchmarking+Large+Language+Models+for+Compositional+Formal+Verification 7. A New Era in Software Security: Towards Self-Healing Software via Large Language Models and Formal Verification — Norbert Tihanyi et al., 2023 https://scholar.google.com/scholar?q=A+New+Era+in+Software+Security:+Towards+Self-Healing+Software+via+Large+Language+Models+and+Formal+Verification 8. AI Post Transformers: LLM Agents Reason About Code Without Running It — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-15-llm-agents-reason-about-code-without-run-2a1876.mp3 9. AI Post Transformers: SkillsBench for Evaluating Agent Skills — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-14-skillsbench-for-evaluating-agent-skills-58bb1e.mp3
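The adjustment mentioned in the DafnyBench notes can be worked through as a rough calculation. It uses only the figures quoted above (782 files, 208 that verify with no hints, a top score of about 67.8%) and rests on two assumptions that the notes do not confirm: that the reported score is measured over all 782 files, and that every file that verifies without hints is counted as a success. The function name is hypothetical.

```python
# Figures quoted in the episode notes.
TOTAL_FILES = 782
VERIFY_WITHOUT_HINTS = 208
REPORTED_TOP_SCORE = 0.678  # "about 67.8%", so the result is approximate


def adjusted_success_rate(total, trivial, reported_rate):
    """Success rate restricted to files that actually need new annotations.

    Assumption: every trivially verifying file is included in the reported
    successes, so it is removed from both numerator and denominator.
    """
    solved = round(total * reported_rate)  # approximate count of solved files
    recovered = solved - trivial           # solved files that needed real hints
    nontrivial = total - trivial           # files that needed real hints
    return recovered / nontrivial


rate = adjusted_success_rate(TOTAL_FILES, VERIFY_WITHOUT_HINTS, REPORTED_TOP_SCORE)
print(f"{rate:.1%}")  # prints 56.1%
```

Under those assumptions, roughly 322 of the 574 files that need annotations are recovered, so the headline score overstates hint-recovery performance by about 12 percentage points.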
2 days ago

DafnyPro for LLM-Assisted Dafny Verification

This episode explores DafnyPro, a system for using large language models to help verify Dafny programs while keeping the original executable logic unchanged. It explains the basics of formal verification in Dafny, including preconditions, postconditions, loop invariants, decreases clauses, ghost code, and why writing correct proof annotations is much harder than generating plausible code. The discussion compares DafnyPro with earlier efforts such as Clover, DafnyBench, and Laurel, then focuses on DafnyPro’s main contribution: a parser-backed safeguard that rejects any LLM attempt that alters program behavior and a verifier-guided loop that can also prune bad invariants instead of blindly adding more. A listener would find it interesting because it gets at a real trust problem in AI coding tools: whether a model can genuinely help prove software correct rather than quietly rewriting the task into something easier to verify. Sources: 1. DafnyPro: LLM-Assisted Automated Verification for Dafny Programs — Debangshu Banerjee, Olivier Bouissou, Stefan Zetzsche, 2026 http://arxiv.org/abs/2601.05385 2. Clover: Closed-Loop Verifiable Code Generation — Chuyue Sun, Ying Sheng, Oded Padon, Clark Barrett, 2023 https://arxiv.org/abs/2310.17807 3. DafnyBench: A Benchmark for Formal Software Verification — Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, Max Tegmark, 2024 https://arxiv.org/abs/2406.08467 4. Laurel: Unblocking Automated Verification with Large Language Models — Eric Mugnier, Emmanuel Anaya Gonzalez, Ranjit Jhala, Nadia Polikarpova, Yuanyuan Zhou, 2024 (rev. 2025) https://arxiv.org/abs/2405.16792 5. DafnyPro: LLM-Assisted Automated Verification for Dafny Programs — Debangshu Banerjee, Olivier Bouissou, Stefan Zetzsche, 2026 https://arxiv.org/abs/2601.05385 6. 
Laurel: Generating Dafny Assertions Using Large Language Models — Eric Mugnier, Emmanuel Anaya Gonzalez, Ranjit Jhala, Nadia Polikarpova, Yuanyuan Zhou, 2024 https://arxiv.org/abs/2405.16792 7. dafny-annotator: AI-Assisted Verification of Dafny Programs — Gabriel Poesia, Chloe Loughridge, Nada Amin, 2024 https://arxiv.org/abs/2411.15143 8. Towards AI-Assisted Synthesis of Verified Dafny Methods — Md Rakib Hossain Misu, Cristina V. Lopes, Iris Ma, James Noble, 2024 https://arxiv.org/abs/2402.00247 9. Inferring multiple helper Dafny assertions with LLMs — Alvaro Silva, Alexandra Mendes, Ruben Martins, 2025 https://arxiv.org/abs/2511.00125 10. Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification — Kyle Thompson et al., 2024 https://scholar.google.com/scholar?q=Rango:+Adaptive+Retrieval-Augmented+Proving+for+Automated+Software+Verification 11. Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming — Saikat Chakraborty et al., 2024 https://scholar.google.com/scholar?q=Towards+Neural+Synthesis+for+SMT-Assisted+Proof-Oriented+Programming 12. Finding Inductive Loop Invariants using Large Language Models — Adharsh Kamath et al., 2023 https://scholar.google.com/scholar?q=Finding+Inductive+Loop+Invariants+using+Large+Language+Models 13. A New Era in Software Security: Towards Self-Healing Software via Large Language Models and Formal Verification — Norbert Tihanyi et al., 2023 https://scholar.google.com/scholar?q=A+New+Era+in+Software+Security:+Towards+Self-Healing+Software+via+Large+Language+Models+and+Formal+Verification 14. Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling — Hao Mark Chen et al., 2025 https://scholar.google.com/scholar?q=Rethinking+Optimal+Verification+Granularity+for+Compute-Efficient+Test-Time+Scaling 15. 
Heimdall: test-time scaling on the generative verification — Wenlei Shi and Xing Jin, 2025 https://scholar.google.com/scholar?q=Heimdall:+test-time+scaling+on+the+generative+verification 16. AI Post Transformers: From Natural Language to Verified Dafny Code — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-14-from-natural-language-to-verified-dafny-8abed9.mp3 17. AI Post Transformers: DeepVerifier: Self-Evolving Research Agents via Rubric-Guided Verification — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/deepverifier-self-evolving-research-agents-via-rubric-guided-verification/ 18. AI Post Transformers: Agentic Discovery for Test-Time Scaling — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-12-agentic-discovery-for-test-time-scaling-f9a81f.mp3 19. AI Post Transformers: Trajectory Summaries for Long-Horizon Coding Agents — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-24-trajectory-summaries-for-long-horizon-co-0194be.mp3 Interactive Visualization: DafnyPro for LLM-Assisted Dafny Verification


Creator

mcgrof
Years Active

2025 - 2026
Episodes

711
Rating

Clean
Show Website

AI Post Transformers

