AI Post Transformers

mcgrof

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

  1. 4D AGO

    Accelerating LLM Cold Starts with Programmable Page Cache

    This episode explores a USENIX FAST'26 paper that addresses the infrastructure bottleneck of loading massive language model weights from storage into accelerator memory during inference deployments. The authors present a programmable page cache framework that achieves 2-4× faster cold start times by exploiting predictable sequential access patterns and XPU affinity, while maintaining full compatibility with existing model formats, inference frameworks, and hardware, unlike prior approaches such as ServerlessLLM and BlitzScale that require custom formats or specific interconnects. The discussion examines why the standard kernel page cache underutilizes modern SSD bandwidth through conservative prefetching and inappropriate LRU eviction policies designed for general workloads, and how a userspace-programmable caching layer can optimize for the specific characteristics of model loading without intrusive kernel modifications. Listeners interested in production ML infrastructure, storage systems optimization, or the operational challenges of deploying large models at scale will find concrete insights into how I/O dominates cold start latency and emerging solutions that bridge the three-orders-of-magnitude gap between SSD and GPU memory bandwidth.

    Sources:
    1. https://www.usenix.org/system/files/fast26-liu-yubo.pdf
    2. Orca: A Distributed Serving System for Transformer-Based Generative Models — Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun, 2022 https://scholar.google.com/scholar?q=Orca:+A+Distributed+Serving+System+for+Transformer-Based+Generative+Models
    3. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving — Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, 2023 https://scholar.google.com/scholar?q=AlpaServe:+Statistical+Multiplexing+with+Model+Parallelism+for+Deep+Learning+Serving
    4. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang, 2023 https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU
    5. ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models — Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai, 2024 https://scholar.google.com/scholar?q=ServerlessLLM:+Locality-Enhanced+Serverless+Inference+for+Large+Language+Models
    6. DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale — Aminabadi et al., 2022 https://scholar.google.com/scholar?q=DeepSpeed-Inference:+Enabling+Efficient+Inference+of+Transformer+Models+at+Unprecedented+Scale
    7. ZeRO-Offload: Democratizing Billion-Scale Model Training — Ren et al., 2021 https://scholar.google.com/scholar?q=ZeRO-Offload:+Democratizing+Billion-Scale+Model+Training
    8. Safetensors: Simple, safe way to store and distribute tensors — HuggingFace, 2022 https://scholar.google.com/scholar?q=Safetensors:+Simple,+safe+way+to+store+and+distribute+tensors
    9. AI Post Transformers: LLM Cold Starts: Fixing Linux Page Cache for Model Loading — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-llm-cold-starts-fixing-linux-page-cache-a9f9a9.mp3
    10. AI Post Transformers: SolidAttention: Efficient SSD-based KV Cache Offloading for Long-Context LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-solidattention-efficient-ssd-based-kv-ca-336b79.mp3
    11. AI Post Transformers: Bidaw: Computation-Storage Aware KV Caching for LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-bidaw-computation-storage-aware-kv-cachi-9d89fb.mp3
    12. AI Post Transformers: xLLM: Co-Locating Online and Offline LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-xllm-co-locating-online-and-offline-llm-10bb81.mp3

    Interactive Visualization: Accelerating LLM Cold Starts with Programmable Page Cache
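    The page-cache critique above can be made concrete. Below is a minimal, hypothetical userspace loader (our sketch, not the paper's framework) doing by hand what a programmable page cache would automate: it hints the kernel that access is sequential via posix_fadvise and streams the file in large chunks, so the SSD sees deep sequential I/O instead of the default conservative readahead.

```python
import os

def load_weights_sequential(path, chunk_size=8 << 20):
    """Hypothetical cold-start loader: advise the kernel that reads
    will be sequential, then stream the file in large chunks so the
    storage stack can issue deep, sequential I/O."""
    buf = bytearray()
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):  # Unix-only readahead hint
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            buf += chunk
    finally:
        os.close(fd)
    return bytes(buf)
```

    Real frameworks layer eviction policy and XPU-affinity placement on top of this; the sketch only shows why large sequential reads are the starting point.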

  2. 4D AGO

    Generative File Systems: Replacing Code with Formal Specifications

    This episode explores a 2026 USENIX FAST paper that proposes replacing hand-written file system code with LLM-generated implementations derived from formal specifications. The authors demonstrate SYSSPEC, a system that uses three types of formal specifications—Hoare logic for functionality, rely-guarantee conditions for modularity, and explicit concurrency protocols—to guide code generation while using validation agents to catch hallucinations and ensure correctness. Analysis of Ext4's commit history reveals that 82.4% of changes are bug fixes and maintenance, suggesting traditional file system development wastes enormous effort on code upkeep rather than innovation. The researchers show that their approach can generate a working file system (SPECFS) and evolve it by patching specifications rather than code, potentially transforming how systems software is developed and maintained.

    Sources:
    1. https://www.usenix.org/system/files/fast26-liu-qingyuan.pdf
    2. Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale — Fabrice Popineau, Artem Vysogorets, et al., 2020 https://scholar.google.com/scholar?q=Yggdrasil:+An+Optimized+System+for+Training+Deep+Decision+Trees+at+Scale
    3. Hyperkernel: Push-Button Verification of an OS Kernel — Luke Nelson, Helgi Sigurbjarnarson, Kaiyuan Zhang, et al., 2017 https://scholar.google.com/scholar?q=Hyperkernel:+Push-Button+Verification+of+an+OS+Kernel
    4. Program Synthesis from Natural Language Using Recurrent Neural Networks — Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, Michael D. Ernst, 2017 https://scholar.google.com/scholar?q=Program+Synthesis+from+Natural+Language+Using+Recurrent+Neural+Networks
    5. Crash Hoare Logic — Tej Chajed, Frans Kaashoek, Butler Lampson, Nickolai Zeldovich, 2018 https://scholar.google.com/scholar?q=Crash+Hoare+Logic
    6. FSCQ: A Verified File System — Haogang Chen et al., 2015 https://scholar.google.com/scholar?q=FSCQ:+A+Verified+File+System
    7. Yxv6: An Educational File System with Formal Specifications — Helgi Sigurbjarnarson et al., 2016 https://scholar.google.com/scholar?q=Yxv6:+An+Educational+File+System+with+Formal+Specifications
    8. Crash Consistency in Database Systems — Goetz Graefe, 2009 https://scholar.google.com/scholar?q=Crash+Consistency+in+Database+Systems
    9. Using Crash Hoare Logic for Certifying the FSCQ File System — Haogang Chen et al., 2015 https://scholar.google.com/scholar?q=Using+Crash+Hoare+Logic+for+Certifying+the+FSCQ+File+System
    10. Jitk: A Trustworthy In-Kernel Interpreter Infrastructure — Xi Wang et al., 2014 https://scholar.google.com/scholar?q=Jitk:+A+Trustworthy+In-Kernel+Interpreter+Infrastructure
    11. AI Post Transformers: LLM Agents Reason About Code Without Running It — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-15-llm-agents-reason-about-code-without-run-2a1876.mp3
    12. AI Post Transformers: SYSSPEC: LLM-Generated File Systems from Formal Specifications — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-sysspec-llm-generated-file-systems-from-02f5a9.mp3
    13. AI Post Transformers: Generative File Systems from Formal Specifications with SysSpec — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-generative-file-systems-from-formal-spec-ff240b.mp3
    14. AI Post Transformers: Sharpen the Spec, Cut the Code: LLM-Generated File Systems — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-sharpen-the-spec-cut-the-code-llm-genera-8eb6b1.mp3

    Interactive Visualization: Generative File Systems: Replacing Code with Formal Specifications
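    To make the spec-then-validate loop tangible, here is a toy sketch (names and structure are ours, not SYSSPEC's) of checking a candidate implementation against a Hoare triple: a precondition, the generated body, and a postcondition over input/output pairs, the role the paper assigns to its validation agents.

```python
def check_hoare_triple(pre, body, post, inputs):
    """Test a candidate `body` against the Hoare triple {pre} body {post}:
    for every input satisfying the precondition, the postcondition must
    hold on the (input, output) pair. Inputs outside the precondition
    are skipped, since the triple says nothing about them."""
    for x in inputs:
        if not pre(x):
            continue
        if not post(x, body(x)):
            return False
    return True

# Hypothetical spec for a `truncate(size)`: given size >= 0, the
# returned size equals the request and is never negative.
ok = check_hoare_triple(
    pre=lambda size: size >= 0,
    body=lambda size: max(size, 0),
    post=lambda size, r: r == size and r >= 0,
    inputs=range(-3, 10),
)
```

    A real validation agent would also exercise crash and concurrency protocols; this only illustrates the functional (Hoare logic) layer.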

  3. 4D AGO

    Optimizing Mixture of Block Attention Through Statistical Theory

    This episode examines the statistical foundations of Mixture of Block Attention (MoBA), a sparse attention mechanism that divides key-value sequences into blocks and routes queries only to the most relevant ones. The paper derives a signal-to-noise ratio showing that retrieval accuracy depends on the square root of head dimension divided by block size, revealing why smaller blocks improve a router's ability to distinguish relevant from irrelevant content despite increasing computational overhead. The authors introduce FlashMoBA, a hardware-optimized CUDA kernel that makes small block sizes practical on GPUs, and demonstrate how depthwise convolutions on keys can cluster related signals to further boost routing performance. The work provides theoretical grounding for why routing-based sparse attention succeeds at reducing quadratic attention costs to near-linear scaling in long-context language models.

    Sources:
    1. https://arxiv.org/pdf/2511.11571v2
    2. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Dao et al., 2022 https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
    3. Mixture of Experts: A Survey — Various (MoE literature), 2020-2024 https://scholar.google.com/scholar?q=Mixture+of+Experts:+A+Survey
    4. Sparse Attention Mechanisms (Zaheer et al., Guo et al., Xu et al.) — Cited in paper, 2020-2025 https://scholar.google.com/scholar?q=Sparse+Attention+Mechanisms+(Zaheer+et+al.,+Guo+et+al.,+Xu+et+al.)
    5. AI Post Transformers: Optimizing Mixture of Block Attention for Long-Context Transformers — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-optimizing-mixture-of-block-attention-fo-ea4612.mp3
    6. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3
    7. AI Post Transformers: Bidaw: Bidirectional Awareness for Interactive LLM KV Caching — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-bidaw-bidirectional-awareness-for-intera-87c311.mp3

    Interactive Visualization: Optimizing Mixture of Block Attention Through Statistical Theory
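    A minimal NumPy sketch of the routing idea (ours, not the FlashMoBA kernel): chunk the keys into blocks, summarize each block by its mean key, and let the query attend only to the top-scoring blocks. Smaller blocks give the router finer-grained summaries, which is where the paper's signal-to-noise analysis comes in.

```python
import numpy as np

def moba_route(q, K, block_size, top_k):
    """Route a query to its top_k key blocks. Each block of `block_size`
    consecutive keys is summarized by its mean key (centroid), and the
    routing score is the dot product of the query with each centroid."""
    n, d = K.shape
    n_blocks = n // block_size
    blocks = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    centroids = blocks.mean(axis=1)           # (n_blocks, d) summaries
    scores = centroids @ q                    # one routing logit per block
    return np.argsort(scores)[-top_k:][::-1]  # selected block indices, best first
```

    Full attention would then be computed only over the selected blocks' keys and values, turning quadratic cost into cost proportional to top_k × block_size per query.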

  4. 4D AGO

    SolidAttention: Co-Designing Sparse Attention and SSD I/O

    This episode explores SolidAttention, a system that enables large language models to run on memory-constrained consumer PCs by offloading the KV cache to SSD storage. The paper addresses a fundamental mismatch: sparse attention patterns create random I/O access that kills SSD performance, while previous offloading solutions like FlexGen only work well with high request concurrency unavailable on local machines. The researchers co-designed sparse attention algorithms with SSD storage management to enable coarse-grained sequential reads instead of fine-grained random access, achieving practical local LLM inference on systems with just 8-16GB of RAM. The discussion covers why KV caches consume four times the memory of model weights, the trade-offs of quantization versus offloading, and why treating attention sparsity and storage optimization as separate problems fails on consumer hardware.

    Sources:
    1. https://www.usenix.org/system/files/fast26-zheng.pdf
    2. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Sheng et al., 2023 https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU
    3. Efficient Streaming Language Models with Attention Sinks — Xiao et al., 2024 https://scholar.google.com/scholar?q=Efficient+Streaming+Language+Models+with+Attention+Sinks
    4. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhang et al., 2023 https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models
    5. SSD I/O Characteristics: Impacts of Request Size, Access Pattern, and Parallelism — Chen et al., 2016 https://scholar.google.com/scholar?q=SSD+I/O+Characteristics:+Impacts+of+Request+Size,+Access+Pattern,+and+Parallelism
    6. vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al., 2023 https://scholar.google.com/scholar?q=vLLM:+Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
    7. AI Post Transformers: SolidAttention: Efficient SSD-based KV Cache Offloading for Long-Context LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-solidattention-efficient-ssd-based-kv-ca-336b79.mp3
    8. AI Post Transformers: SolidAttention: Fast SSD-Based Serving on Memory-Constrained PCs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-solidattention-fast-ssd-based-serving-on-1c305d.mp3
    9. AI Post Transformers: SolidAttention: Low-Latency SSD-based Serving on Memory-Constrained PCs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-solidattention-low-latency-ssd-based-ser-e22a0d.mp3
    10. AI Post Transformers: Bidaw: Bidirectional Awareness for Interactive LLM KV Caching — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-bidaw-bidirectional-awareness-for-intera-87c311.mp3
    11. AI Post Transformers: Bidaw: Reducing LLM KV Cache Latency with Two-Tier Storage — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-bidaw-reducing-llm-kv-cache-latency-with-15dd25.mp3
    12. AI Post Transformers: Bidaw: Computation-Storage Aware KV Caching for LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-bidaw-computation-storage-aware-kv-cachi-9d89fb.mp3
    13. AI Post Transformers: CacheSlide: Unlocking Cross Position-Aware KV Cache Reuse for LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-cacheslide-unlocking-cross-position-awar-487b2b.mp3
    14. AI Post Transformers: Efficient KV Cache Reuse in Dynamic Agent Workflows — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-efficient-kv-cache-reuse-in-dynamic-agen-558f19.mp3
    15. AI Post Transformers: Accelerating LLM Cold Starts with Programmable Page Cache — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-accelerating-llm-cold-starts-with-progra-0912d1.mp3
    16. AI Post Transformers: LLM Cold Starts: Fixing Linux Page Cache for Model Loading — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-llm-cold-starts-fixing-linux-page-cache-a9f9a9.mp3

    Interactive Visualization: SolidAttention: Co-Designing Sparse Attention and SSD I/O
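    The coarse-grained-read idea can be illustrated with a small helper (hypothetical, not the paper's code): sparse attention selects scattered KV-cache blocks, and nearby block indices are coalesced into contiguous SSD read ranges, trading a few wasted gap blocks for sequential instead of random I/O.

```python
def coalesce_blocks(block_ids, max_gap=2):
    """Merge nearby KV-cache block indices into contiguous read ranges.
    Two selected blocks at most `max_gap` apart are read as one
    sequential range, including the unneeded blocks between them.
    `max_gap` is an assumed tuning knob, not a value from the paper."""
    runs = []
    for b in sorted(set(block_ids)):
        if runs and b - runs[-1][1] <= max_gap:
            runs[-1][1] = b       # extend the current sequential run
        else:
            runs.append([b, b])   # start a new run
    return [(lo, hi) for lo, hi in runs]
```

    Raising max_gap wastes more bandwidth on gap blocks but issues fewer, longer reads, which is exactly the trade-off that favors SSDs.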

  5. 4D AGO

    Xerxes: CXL 3.0 Simulation for Scalable Memory Systems

    This episode explores Xerxes, a new open-source simulator designed to model CXL 3.0 features before the hardware exists. The hosts explain how CXL adds cache coherence to PCIe to solve memory access bottlenecks in AI and HPC workloads, then dive into the two major architectural changes in CXL 3.0: Port-Based Routing, which enables arbitrary fabric topologies beyond rigid trees, and Device-Managed Coherence, which lets devices handle coherence protocols peer-to-peer without routing every transaction through the host CPU. The discussion highlights why this simulator matters for designing next-generation rack-scale memory pools and accelerator fabrics, addressing the chicken-and-egg problem of validating designs before physical hardware ships. The hosts question how validation works without reference hardware and preview a deeper look at Xerxes' architecture and methodology.

    Sources:
    1. https://www.usenix.org/system/files/fast26-an.pdf
    2. CXL Memory Disaggregation: Opportunities and Challenges — Guz et al. (Intel), 2023 https://scholar.google.com/scholar?q=CXL+Memory+Disaggregation:+Opportunities+and+Challenges
    3. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms — Li et al., 2023 https://scholar.google.com/scholar?q=Pond:+CXL-Based+Memory+Pooling+Systems+for+Cloud+Platforms
    4. TPP: Transparent Page Placement for CXL-Enabled Tiered Memory — Maruf et al., 2023 https://scholar.google.com/scholar?q=TPP:+Transparent+Page+Placement+for+CXL-Enabled+Tiered+Memory
    5. The CXL Memory Expander: Performance and Cost Analysis — Gouk et al. (SK hynix), 2023 https://scholar.google.com/scholar?q=The+CXL+Memory+Expander:+Performance+and+Cost+Analysis
    6. Exploring CXL 3.0 Port-Based Routing for Scalable Memory Systems — Pan et al., 2024 https://scholar.google.com/scholar?q=Exploring+CXL+3.0+Port-Based+Routing+for+Scalable+Memory+Systems
    7. SMART: Scalable Memory Architecture with Port-Based Routing — Kim et al., 2024 https://scholar.google.com/scholar?q=SMART:+Scalable+Memory+Architecture+with+Port-Based+Routing
    8. Deadlock-Free Routing for CXL Fabrics — Zhang et al., 2024 https://scholar.google.com/scholar?q=Deadlock-Free+Routing+for+CXL+Fabrics
    9. DMC: Distributed Cache Coherence for CXL Memory Systems — Lee et al., 2024 https://scholar.google.com/scholar?q=DMC:+Distributed+Cache+Coherence+for+CXL+Memory+Systems
    10. Scaling Cache Coherence to Thousands of Devices with CXL DMC — Wang et al., 2024 https://scholar.google.com/scholar?q=Scaling+Cache+Coherence+to+Thousands+of+Devices+with+CXL+DMC
    11. Coherence Protocol Verification for CXL Device-Managed Coherence — Chen et al., 2024 https://scholar.google.com/scholar?q=Coherence+Protocol+Verification+for+CXL+Device-Managed+Coherence
    12. gem5: A Multiple-ISA Full-System Simulator — Binkert et al., 2011 https://scholar.google.com/scholar?q=gem5:+A+Multiple-ISA+Full-System+Simulator
    13. The ZSim Simulator: Fast and Accurate Multicore Simulation — Sanchez and Kozyrakis, 2013 https://scholar.google.com/scholar?q=The+ZSim+Simulator:+Fast+and+Accurate+Multicore+Simulation
    14. Simulating Multi-Core Systems with Shared Memory Coherence — Martin et al. (Wisconsin Multifacet group), 2005 https://scholar.google.com/scholar?q=Simulating+Multi-Core+Systems+with+Shared+Memory+Coherence
    15. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectures — Fuchs et al., 2020 https://scholar.google.com/scholar?q=PARADE:+A+Cycle-Accurate+Full-System+Simulation+Platform+for+Accelerator-Rich+Architectures
    16. A Primer on Memory Consistency and Cache Coherence — Sorin, Hill, and Wood, 2011 https://scholar.google.com/scholar?q=A+Primer+on+Memory+Consistency+and+Cache+Coherence
    17. Coherence and Consistency Models in Shared-Memory Multiprocessors — Adve and Gharachorloo, 1996 https://scholar.google.com/scholar?q=Coherence+and+Consistency+Models+in+Shared-Memory+Multiprocessors
    18. DASH: A Scalable Directory-Based Multiprocessor — Lenoski et al. (Stanford DASH project), 1992 https://scholar.google.com/scholar?q=DASH:+A+Scalable+Directory-Based+Multiprocessor
    19. Directory-Based Cache Coherence in Large-Scale Multiprocessors — Chaiken et al. (Alewife project), 1991 https://scholar.google.com/scholar?q=Directory-Based+Cache+Coherence+in+Large-Scale+Multiprocessors
    20. Enabling Rack-Scale Confidential Computing using Heterogeneous Trusted Execution Environment — Jianping Zhu, Hang Yin, Yuekai Jia, Wenhao Wang, Chunhui Li, Jiashuo Liang, Shoumeng Yan, Zhengyu He, Qingkui Liu, Alex X. Liu, 2024 https://scholar.google.com/scholar?q=Enabling+Rack-Scale+Confidential+Computing+using+Heterogeneous+Trusted+Execution+Environment
    21. Understanding the Overheads of Hardware Memory Coherence — Lena E. Olson, Joseph Izraelevitz, Mark D. Hill, 2015 https://scholar.google.com/scholar?q=Understanding+the+Overheads+of+Hardware+Memory+Coherence
    22. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3
    23. AI Post Transformers: Accelerating LLM Cold Starts with Programmable Page Cache — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-accelerating-llm-cold-starts-with-progra-0912d1.mp3
    24. AI Post Transformers: xLLM: Co-Locating Online and Offline LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-xllm-co-locating-online-and-offline-llm-10bb81.mp3

    Interactive Visualization: Xerxes: CXL 3.0 Simulation for Scalable Memory Systems
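    A toy model of what Port-Based Routing enables (our sketch; Xerxes' internals are far richer): the fabric is an arbitrary graph of switches rather than a tree, and a route is any hop-by-hop path through it. Here BFS stands in for whatever route computation a real simulator performs.

```python
from collections import deque

def pbr_route(fabric, src, dst):
    """Find a shortest hop path from src to dst through a switch fabric
    given as an adjacency dict {node: [neighbors]}. Multi-path graphs
    like this are exactly what CXL 3.0 PBR allows beyond rigid trees.
    Returns the node list, or None if dst is unreachable."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in fabric.get(node, ()):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None
```

    A production fabric would also need deadlock-free route selection (see source 8 above); shortest-path alone does not guarantee that.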

  6. 6D AGO

    Bidaw: Computation-Storage Aware KV Caching for LLMs

    This episode explores a new system called Bidaw that dramatically improves the performance of long, multi-turn AI chatbot conversations by solving a critical caching problem. The paper reveals that existing approaches waste over 93% of computation redundantly recalculating conversation history, and that naive two-tier storage systems (using both RAM and SSD) increase latency by 3.8× because the GPU scheduler and storage system don't coordinate. Bidaw introduces "bidirectional awareness" where the scheduler prioritizes requests whose data is already in fast memory while background-loading slower SSD data, and the storage system uses conversation flow patterns to predict which cached data to keep hot. Listeners interested in LLM infrastructure, production ML systems, or the practical challenges of deploying interactive AI services will learn how clever coordination between compute and storage layers can unlock major performance gains without requiring more expensive hardware.

    Sources:
    1. https://www.usenix.org/system/files/fast26-hu-shipeng.pdf
    2. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang, 2023 https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU
    3. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siddharth Devadas, Ion Stoica, Joseph E. Gonzalez, 2023 https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention
    4. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU — Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen, 2023 https://scholar.google.com/scholar?q=PowerInfer:+Fast+Large+Language+Model+Serving+with+a+Consumer-grade+GPU
    5. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024 https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving
    6. LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism — Jongyul Kim, Sangwoo Kang, Juhyeong Ryu, Jaehyeong Im, Seongyeop Jeong, Jin-Soo Kim, 2021 https://scholar.google.com/scholar?q=LineFS:+Efficient+SmartNIC+Offload+of+a+Distributed+File+System+with+Pipeline+Parallelism
    7. Flashield: a Hybrid Key-value Cache that Controls Flash Write Amplification — Yiwen Zhang, Xin Chen, Zhuo Chang, Huanchen Zhang, 2019 https://scholar.google.com/scholar?q=Flashield:+a+Hybrid+Key-value+Cache+that+Controls+Flash+Write+Amplification
    8. Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis — Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, Ravi Sundaram, 2019 https://scholar.google.com/scholar?q=Nexus:+A+GPU+Cluster+Engine+for+Accelerating+DNN-Based+Video+Analysis
    9. Clockwork: A Scheduler for GPU-Accelerated Deep Learning Serving — Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, Jonathan Mace, 2020 https://scholar.google.com/scholar?q=Clockwork:+A+Scheduler+for+GPU-Accelerated+Deep+Learning+Serving
    10. Learning to Cache: Neural Adaptive Caching Policies — Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels, 2018 https://scholar.google.com/scholar?q=Learning+to+Cache:+Neural+Adaptive+Caching+Policies
    11. Semantic Caching for Large Language Models — Zheng Gao, Peiyuan Liu, Junwei Cao, Xin Li, 2023 https://scholar.google.com/scholar?q=Semantic+Caching+for+Large+Language+Models
    12. Predicting User Behavior in Multi-Turn Dialogue Systems — Yun-Nung Chen, Dilek Hakkani-Tür, Gökhan Tür, Jianfeng Gao, Li Deng, 2016 https://scholar.google.com/scholar?q=Predicting+User+Behavior+in+Multi-Turn+Dialogue+Systems
    13. Machine Learning for Storage Systems: A Comprehensive Survey — Jianliang Zhang, Zeke Wang, Tong Zhang, 2023 https://scholar.google.com/scholar?q=Machine+Learning+for+Storage+Systems:+A+Comprehensive+Survey
    14. PagedAttention: Efficient Memory Management for LLM Serving — Kwon et al. (vLLM), 2023 https://scholar.google.com/scholar?q=PagedAttention:+Efficient+Memory+Management+for+LLM+Serving
    15. Adaptive Replacement Cache (ARC) — Megiddo and Modha, 2003 https://scholar.google.com/scholar?q=Adaptive+Replacement+Cache+(ARC)
    16. Learned Cache Replacement Policies — Vietri et al., 2020 https://scholar.google.com/scholar?q=Learned+Cache+Replacement+Policies
    17. LoRA: Low-Rank Adaptation of Large Language Models — Hu et al., 2021 https://scholar.google.com/scholar?q=LoRA:+Low-Rank+Adaptation+of+Large+Language+Models
    18. No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization — MiKV authors, 2024-2025 https://scholar.google.com/scholar?q=No+Token+Left+Behind:+Reliable+KV+Cache+Compression+via+Importance-Aware+Mixed+Precision+Quantization
    19. CommVQ: Commutative Vector Quantization for KV Cache Compression — CommVQ authors, 2024-2025 https://scholar.google.com/scholar?q=CommVQ:+Commutative+Vector+Quantization+for+KV+Cache+Compression
    20. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — KVLink authors, 2024-2025 https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
    21. Compute or Load KV Cache? Why Not Both? — Unknown, 2024-2025 https://scholar.google.com/scholar?q=Compute+or+Load+KV+Cache?+Why+Not+Both?
    22. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention — MInference authors, 2024-2025 https://scholar.google.com/scholar?q=MInference+1.0:+Accelerating+Pre-filling+for+Long-Context+LLMs+via+Dynamic+Sparse+Attention
    23. KVCache Cache in the Wild: Characterizing and Optimizing KVCache at a Large Cloud Provider — Cloud provider study authors, 2024-2025 https://scholar.google.com/scholar?q=KVCache+Cache+in+the+Wild:+Characterizing+and+Optimizing+KVCache+at+a+Large+Cloud+Provider
    24. Efficient KV Cache Reuse in Dynamic Agent Workflows — https://podcast.do-not-panic.com/episodes/2026-03-16-efficient-kv-cache-reuse-in-dynamic-agen-558f19.mp3
    25. 50x KV Cache Compression in Seconds via Attention Matching — https://podcast.do-not-panic.com/episodes/2026-03-09-50x-kv-cache-compression-in-seconds-via-9402c1.mp3
    26. Statistical Routing Theory in CARTRIDGE Block Attention — https://podcast.do-not-panic.com/episodes/2026-03-16-statistical-routing-theory-in-cartridge-2083f4.mp3
    27. xLLM: Co-Locating Online and Offline LLM Inference — https://podcast.do-not-panic.com/episodes/2026-03-16-xllm-co-locating-online-and-offline-llm-10bb81.mp3

    Interactive Visualization: Bidaw: Computation-Storage Aware KV Caching for LLMs
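    The scheduler half of "bidirectional awareness" reduces to a simple rule, sketched here with invented names (not Bidaw's API): run requests whose KV cache is fully resident in RAM immediately, and order the rest as SSD prefetch candidates by how close to resident they are, so background loads overlap with GPU compute.

```python
def schedule(requests, resident_frac):
    """Split requests into runnable vs. prefetch sets based on how much
    of each request's KV cache is already in fast memory.
    resident_frac maps request id -> fraction of its KV cache in RAM."""
    ready = [r for r in requests if resident_frac.get(r, 0.0) >= 1.0]
    pending = [r for r in requests if resident_frac.get(r, 0.0) < 1.0]
    # Prefetch the nearly-resident requests first: they become runnable
    # soonest, keeping the GPU fed while the SSD loads the rest.
    pending.sort(key=lambda r: resident_frac.get(r, 0.0), reverse=True)
    return ready, pending
```

    The other direction of the coordination, where the storage tier uses conversation-flow patterns to decide what to keep hot, would sit behind resident_frac in this sketch.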

Ratings & Reviews

3.7 out of 5 (3 Ratings)
