AI Post Transformers

mcgrof

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

  1. 4D AGO

    Accelerating LLM Cold Starts with Programmable Page Cache

    This episode explores a USENIX FAST'26 paper that addresses the infrastructure bottleneck of loading massive language model weights from storage into accelerator memory during inference deployments. The authors present a programmable page cache framework that achieves 2-4× faster cold start times by exploiting predictable sequential access patterns and XPU affinity, while maintaining full compatibility with existing model formats, inference frameworks, and hardware, unlike prior approaches such as ServerlessLLM and BlitzScale that require custom formats or specific interconnects. The discussion examines why the standard kernel page cache underutilizes modern SSD bandwidth through conservative prefetching and inappropriate LRU eviction policies designed for general workloads, and how a userspace-programmable caching layer can optimize for the specific characteristics of model loading without intrusive kernel modifications. Listeners interested in production ML infrastructure, storage systems optimization, or the operational challenges of deploying large models at scale will find concrete insights into how I/O dominates cold start latency and emerging solutions that bridge the three-orders-of-magnitude gap between SSD and GPU memory bandwidth.

    Sources:
    1. https://www.usenix.org/system/files/fast26-liu-yubo.pdf
    2. Orca: A Distributed Serving System for Transformer-Based Generative Models — Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun, 2022 https://scholar.google.com/scholar?q=Orca:+A+Distributed+Serving+System+for+Transformer-Based+Generative+Models
    3. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving — Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, 2023 https://scholar.google.com/scholar?q=AlpaServe:+Statistical+Multiplexing+with+Model+Parallelism+for+Deep+Learning+Serving
    4. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang, 2023 https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU
    5. ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models — Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai, 2024 https://scholar.google.com/scholar?q=ServerlessLLM:+Locality-Enhanced+Serverless+Inference+for+Large+Language+Models
    6. DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale — Aminabadi et al., 2022 https://scholar.google.com/scholar?q=DeepSpeed-Inference:+Enabling+Efficient+Inference+of+Transformer+Models+at+Unprecedented+Scale
    7. ZeRO-Offload: Democratizing Billion-Scale Model Training — Ren et al., 2021 https://scholar.google.com/scholar?q=ZeRO-Offload:+Democratizing+Billion-Scale+Model+Training
    8. Safetensors: Simple, safe way to store and distribute tensors — HuggingFace, 2022 https://scholar.google.com/scholar?q=Safetensors:+Simple,+safe+way+to+store+and+distribute+tensors
    9. AI Post Transformers: LLM Cold Starts: Fixing Linux Page Cache for Model Loading — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-llm-cold-starts-fixing-linux-page-cache-a9f9a9.mp3
    10. AI Post Transformers: SolidAttention: Efficient SSD-based KV Cache Offloading for Long-Context LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-solidattention-efficient-ssd-based-kv-ca-336b79.mp3
    11. AI Post Transformers: Bidaw: Computation-Storage Aware KV Caching for LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-bidaw-computation-storage-aware-kv-cachi-9d89fb.mp3
    12. AI Post Transformers: xLLM: Co-Locating Online and Offline LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-xllm-co-locating-online-and-offline-llm-10bb81.mp3

    Interactive Visualization: Accelerating LLM Cold Starts with Programmable Page Cache
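    The page-cache critique above can be made concrete. Below is a minimal, hypothetical userspace loader (our sketch, not the paper's framework) doing by hand what a programmable page cache would automate: it hints the kernel that access is sequential via posix_fadvise and streams the file in large chunks, so the SSD sees deep sequential I/O instead of the default conservative readahead.

```python
import os

def load_weights_sequential(path, chunk_size=8 << 20):
    """Hypothetical cold-start loader: advise the kernel that reads
    will be sequential, then stream the file in large chunks so the
    storage stack can issue deep, sequential I/O."""
    buf = bytearray()
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):  # Unix-only readahead hint
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            buf += chunk
    finally:
        os.close(fd)
    return bytes(buf)
```

    Real frameworks layer eviction policy and XPU-affinity placement on top of this; the sketch only shows why large sequential reads are the starting point.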

  2. 4D AGO

    Generative File Systems: Replacing Code with Formal Specifications

    This episode explores a 2026 USENIX FAST paper that proposes replacing hand-written file system code with LLM-generated implementations derived from formal specifications. The authors demonstrate SYSSPEC, a system that uses three types of formal specifications—Hoare logic for functionality, rely-guarantee conditions for modularity, and explicit concurrency protocols—to guide code generation while using validation agents to catch hallucinations and ensure correctness. Analysis of Ext4's commit history reveals that 82.4% of changes are bug fixes and maintenance, suggesting traditional file system development wastes enormous effort on code upkeep rather than innovation. The researchers show that their approach can generate a working file system (SPECFS) and evolve it by patching specifications rather than code, potentially transforming how systems software is developed and maintained.

    Sources:
    1. https://www.usenix.org/system/files/fast26-liu-qingyuan.pdf
    2. Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale — Fabrice Popineau, Artem Vysogorets, et al., 2020 https://scholar.google.com/scholar?q=Yggdrasil:+An+Optimized+System+for+Training+Deep+Decision+Trees+at+Scale
    3. Hyperkernel: Push-Button Verification of an OS Kernel — Luke Nelson, Helgi Sigurbjarnarson, Kaiyuan Zhang, et al., 2017 https://scholar.google.com/scholar?q=Hyperkernel:+Push-Button+Verification+of+an+OS+Kernel
    4. Program Synthesis from Natural Language Using Recurrent Neural Networks — Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, Michael D. Ernst, 2017 https://scholar.google.com/scholar?q=Program+Synthesis+from+Natural+Language+Using+Recurrent+Neural+Networks
    5. Crash Hoare Logic — Tej Chajed, Frans Kaashoek, Butler Lampson, Nickolai Zeldovich, 2018 https://scholar.google.com/scholar?q=Crash+Hoare+Logic
    6. FSCQ: A Verified File System — Haogang Chen et al., 2015 https://scholar.google.com/scholar?q=FSCQ:+A+Verified+File+System
    7. Yxv6: An Educational File System with Formal Specifications — Helgi Sigurbjarnarson et al., 2016 https://scholar.google.com/scholar?q=Yxv6:+An+Educational+File+System+with+Formal+Specifications
    8. Crash Consistency in Database Systems — Goetz Graefe, 2009 https://scholar.google.com/scholar?q=Crash+Consistency+in+Database+Systems
    9. Using Crash Hoare Logic for Certifying the FSCQ File System — Haogang Chen et al., 2015 https://scholar.google.com/scholar?q=Using+Crash+Hoare+Logic+for+Certifying+the+FSCQ+File+System
    10. Jitk: A Trustworthy In-Kernel Interpreter Infrastructure — Xi Wang et al., 2014 https://scholar.google.com/scholar?q=Jitk:+A+Trustworthy+In-Kernel+Interpreter+Infrastructure
    11. AI Post Transformers: LLM Agents Reason About Code Without Running It — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-15-llm-agents-reason-about-code-without-run-2a1876.mp3
    12. AI Post Transformers: SYSSPEC: LLM-Generated File Systems from Formal Specifications — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-sysspec-llm-generated-file-systems-from-02f5a9.mp3
    13. AI Post Transformers: Generative File Systems from Formal Specifications with SysSpec — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-generative-file-systems-from-formal-spec-ff240b.mp3
    14. AI Post Transformers: Sharpen the Spec, Cut the Code: LLM-Generated File Systems — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-sharpen-the-spec-cut-the-code-llm-genera-8eb6b1.mp3

    Interactive Visualization: Generative File Systems: Replacing Code with Formal Specifications
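    To make the spec-then-validate loop tangible, here is a toy sketch (names and structure are ours, not SYSSPEC's) of checking a candidate implementation against a Hoare triple: a precondition, the generated body, and a postcondition over input/output pairs, the role the paper assigns to its validation agents.

```python
def check_hoare_triple(pre, body, post, inputs):
    """Test a candidate `body` against the Hoare triple {pre} body {post}:
    for every input satisfying the precondition, the postcondition must
    hold on the (input, output) pair. Inputs outside the precondition
    are skipped, since the triple says nothing about them."""
    for x in inputs:
        if not pre(x):
            continue
        if not post(x, body(x)):
            return False
    return True

# Hypothetical spec for a `truncate(size)`: given size >= 0, the
# returned size equals the request and is never negative.
ok = check_hoare_triple(
    pre=lambda size: size >= 0,
    body=lambda size: max(size, 0),
    post=lambda size, r: r == size and r >= 0,
    inputs=range(-3, 10),
)
```

    A real validation agent would also exercise crash and concurrency protocols; this only illustrates the functional (Hoare logic) layer.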

  3. 4D AGO

    Optimizing Mixture of Block Attention Through Statistical Theory

    This episode examines the statistical foundations of Mixture of Block Attention (MoBA), a sparse attention mechanism that divides key-value sequences into blocks and routes queries only to the most relevant ones. The paper derives a signal-to-noise ratio showing that retrieval accuracy depends on the square root of head dimension divided by block size, revealing why smaller blocks improve a router's ability to distinguish relevant from irrelevant content despite increasing computational overhead. The authors introduce FlashMoBA, a hardware-optimized CUDA kernel that makes small block sizes practical on GPUs, and demonstrate how depthwise convolutions on keys can cluster related signals to further boost routing performance. The work provides theoretical grounding for why routing-based sparse attention succeeds at reducing quadratic attention costs to near-linear scaling in long-context language models.

    Sources:
    1. https://arxiv.org/pdf/2511.11571v2
    2. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Dao et al., 2022 https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness
    3. Mixture of Experts: A Survey — Various (MoE literature), 2020-2024 https://scholar.google.com/scholar?q=Mixture+of+Experts:+A+Survey
    4. Sparse Attention Mechanisms (Zaheer et al., Guo et al., Xu et al.) — Cited in paper, 2020-2025 https://scholar.google.com/scholar?q=Sparse+Attention+Mechanisms+(Zaheer+et+al.,+Guo+et+al.,+Xu+et+al.)
    5. AI Post Transformers: Optimizing Mixture of Block Attention for Long-Context Transformers — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-optimizing-mixture-of-block-attention-fo-ea4612.mp3
    6. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3
    7. AI Post Transformers: Bidaw: Bidirectional Awareness for Interactive LLM KV Caching — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-bidaw-bidirectional-awareness-for-intera-87c311.mp3

    Interactive Visualization: Optimizing Mixture of Block Attention Through Statistical Theory
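    A minimal NumPy sketch of the routing idea (ours, not the FlashMoBA kernel): chunk the keys into blocks, summarize each block by its mean key, and let the query attend only to the top-scoring blocks. Smaller blocks give the router finer-grained summaries, which is where the paper's signal-to-noise analysis comes in.

```python
import numpy as np

def moba_route(q, K, block_size, top_k):
    """Route a query to its top_k key blocks. Each block of `block_size`
    consecutive keys is summarized by its mean key (centroid), and the
    routing score is the dot product of the query with each centroid."""
    n, d = K.shape
    n_blocks = n // block_size
    blocks = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    centroids = blocks.mean(axis=1)           # (n_blocks, d) summaries
    scores = centroids @ q                    # one routing logit per block
    return np.argsort(scores)[-top_k:][::-1]  # selected block indices, best first
```

    Full attention would then be computed only over the selected blocks' keys and values, turning quadratic cost into cost proportional to top_k × block_size per query.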

  4. 4D AGO

    SolidAttention: Co-Designing Sparse Attention and SSD I/O

    This episode explores SolidAttention, a system that enables large language models to run on memory-constrained consumer PCs by offloading the KV cache to SSD storage. The paper addresses a fundamental mismatch: sparse attention patterns create random I/O access that kills SSD performance, while previous offloading solutions like FlexGen only work well with high request concurrency unavailable on local machines. The researchers co-designed sparse attention algorithms with SSD storage management to enable coarse-grained sequential reads instead of fine-grained random access, achieving practical local LLM inference on systems with just 8-16GB of RAM. The discussion covers why KV caches consume four times the memory of model weights, the trade-offs of quantization versus offloading, and why treating attention sparsity and storage optimization as separate problems fails on consumer hardware.

    Sources:
    1. https://www.usenix.org/system/files/fast26-zheng.pdf
    2. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Sheng et al., 2023 https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU
    3. Efficient Streaming Language Models with Attention Sinks — Xiao et al., 2024 https://scholar.google.com/scholar?q=Efficient+Streaming+Language+Models+with+Attention+Sinks
    4. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhang et al., 2023 https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models
    5. SSD I/O Characteristics: Impacts of Request Size, Access Pattern, and Parallelism — Chen et al., 2016 https://scholar.google.com/scholar?q=SSD+I/O+Characteristics:+Impacts+of+Request+Size,+Access+Pattern,+and+Parallelism
    6. vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al., 2023 https://scholar.google.com/scholar?q=vLLM:+Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
    7. AI Post Transformers: SolidAttention: Efficient SSD-based KV Cache Offloading for Long-Context LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-solidattention-efficient-ssd-based-kv-ca-336b79.mp3
    8. AI Post Transformers: SolidAttention: Fast SSD-Based Serving on Memory-Constrained PCs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-solidattention-fast-ssd-based-serving-on-1c305d.mp3
    9. AI Post Transformers: SolidAttention: Low-Latency SSD-based Serving on Memory-Constrained PCs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-solidattention-low-latency-ssd-based-ser-e22a0d.mp3
    10. AI Post Transformers: Bidaw: Bidirectional Awareness for Interactive LLM KV Caching — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-bidaw-bidirectional-awareness-for-intera-87c311.mp3
    11. AI Post Transformers: Bidaw: Reducing LLM KV Cache Latency with Two-Tier Storage — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-bidaw-reducing-llm-kv-cache-latency-with-15dd25.mp3
    12. AI Post Transformers: Bidaw: Computation-Storage Aware KV Caching for LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-bidaw-computation-storage-aware-kv-cachi-9d89fb.mp3
    13. AI Post Transformers: CacheSlide: Unlocking Cross Position-Aware KV Cache Reuse for LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-cacheslide-unlocking-cross-position-awar-487b2b.mp3
    14. AI Post Transformers: Efficient KV Cache Reuse in Dynamic Agent Workflows — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-efficient-kv-cache-reuse-in-dynamic-agen-558f19.mp3
    15. AI Post Transformers: Accelerating LLM Cold Starts with Programmable Page Cache — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-accelerating-llm-cold-starts-with-progra-0912d1.mp3
    16. AI Post Transformers: LLM Cold Starts: Fixing Linux Page Cache for Model Loading — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-llm-cold-starts-fixing-linux-page-cache-a9f9a9.mp3

    Interactive Visualization: SolidAttention: Co-Designing Sparse Attention and SSD I/O
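    The coarse-grained-read idea can be illustrated with a small helper (hypothetical, not the paper's code): sparse attention selects scattered KV-cache blocks, and nearby block indices are coalesced into contiguous SSD read ranges, trading a few wasted gap blocks for sequential instead of random I/O.

```python
def coalesce_blocks(block_ids, max_gap=2):
    """Merge nearby KV-cache block indices into contiguous read ranges.
    Two selected blocks at most `max_gap` apart are read as one
    sequential range, including the unneeded blocks between them.
    `max_gap` is an assumed tuning knob, not a value from the paper."""
    runs = []
    for b in sorted(set(block_ids)):
        if runs and b - runs[-1][1] <= max_gap:
            runs[-1][1] = b       # extend the current sequential run
        else:
            runs.append([b, b])   # start a new run
    return [(lo, hi) for lo, hi in runs]
```

    Raising max_gap wastes more bandwidth on gap blocks but issues fewer, longer reads, which is exactly the trade-off that favors SSDs.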

  5. 4D AGO

    Xerxes: CXL 3.0 Simulation for Scalable Memory Systems

    This episode explores Xerxes, a new open-source simulator designed to model CXL 3.0 features before the hardware exists. The hosts explain how CXL adds cache coherence to PCIe to solve memory access bottlenecks in AI and HPC workloads, then dive into the two major architectural changes in CXL 3.0: Port-Based Routing, which enables arbitrary fabric topologies beyond rigid trees, and Device-Managed Coherence, which lets devices handle coherence protocols peer-to-peer without routing every transaction through the host CPU. The discussion highlights why this simulator matters for designing next-generation rack-scale memory pools and accelerator fabrics, addressing the chicken-and-egg problem of validating designs before physical hardware ships. The hosts question how validation works without reference hardware and preview a deeper look at Xerxes' architecture and methodology.

    Sources:
    1. https://www.usenix.org/system/files/fast26-an.pdf
    2. CXL Memory Disaggregation: Opportunities and Challenges — Guz et al. (Intel), 2023 https://scholar.google.com/scholar?q=CXL+Memory+Disaggregation:+Opportunities+and+Challenges
    3. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms — Li et al., 2023 https://scholar.google.com/scholar?q=Pond:+CXL-Based+Memory+Pooling+Systems+for+Cloud+Platforms
    4. TPP: Transparent Page Placement for CXL-Enabled Tiered Memory — Maruf et al., 2023 https://scholar.google.com/scholar?q=TPP:+Transparent+Page+Placement+for+CXL-Enabled+Tiered+Memory
    5. The CXL Memory Expander: Performance and Cost Analysis — Gouk et al. (SK hynix), 2023 https://scholar.google.com/scholar?q=The+CXL+Memory+Expander:+Performance+and+Cost+Analysis
    6. Exploring CXL 3.0 Port-Based Routing for Scalable Memory Systems — Pan et al., 2024 https://scholar.google.com/scholar?q=Exploring+CXL+3.0+Port-Based+Routing+for+Scalable+Memory+Systems
    7. SMART: Scalable Memory Architecture with Port-Based Routing — Kim et al., 2024 https://scholar.google.com/scholar?q=SMART:+Scalable+Memory+Architecture+with+Port-Based+Routing
    8. Deadlock-Free Routing for CXL Fabrics — Zhang et al., 2024 https://scholar.google.com/scholar?q=Deadlock-Free+Routing+for+CXL+Fabrics
    9. DMC: Distributed Cache Coherence for CXL Memory Systems — Lee et al., 2024 https://scholar.google.com/scholar?q=DMC:+Distributed+Cache+Coherence+for+CXL+Memory+Systems
    10. Scaling Cache Coherence to Thousands of Devices with CXL DMC — Wang et al., 2024 https://scholar.google.com/scholar?q=Scaling+Cache+Coherence+to+Thousands+of+Devices+with+CXL+DMC
    11. Coherence Protocol Verification for CXL Device-Managed Coherence — Chen et al., 2024 https://scholar.google.com/scholar?q=Coherence+Protocol+Verification+for+CXL+Device-Managed+Coherence
    12. gem5: A Multiple-ISA Full-System Simulator — Binkert et al., 2011 https://scholar.google.com/scholar?q=gem5:+A+Multiple-ISA+Full-System+Simulator
    13. The ZSim Simulator: Fast and Accurate Multicore Simulation — Sanchez and Kozyrakis, 2013 https://scholar.google.com/scholar?q=The+ZSim+Simulator:+Fast+and+Accurate+Multicore+Simulation
    14. Simulating Multi-Core Systems with Shared Memory Coherence — Martin et al. (Wisconsin Multifacet group), 2005 https://scholar.google.com/scholar?q=Simulating+Multi-Core+Systems+with+Shared+Memory+Coherence
    15. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectures — Fuchs et al., 2020 https://scholar.google.com/scholar?q=PARADE:+A+Cycle-Accurate+Full-System+Simulation+Platform+for+Accelerator-Rich+Architectures
    16. A Primer on Memory Consistency and Cache Coherence — Sorin, Hill, and Wood, 2011 https://scholar.google.com/scholar?q=A+Primer+on+Memory+Consistency+and+Cache+Coherence
    17. Coherence and Consistency Models in Shared-Memory Multiprocessors — Adve and Gharachorloo, 1996 https://scholar.google.com/scholar?q=Coherence+and+Consistency+Models+in+Shared-Memory+Multiprocessors
    18. DASH: A Scalable Directory-Based Multiprocessor — Lenoski et al. (Stanford DASH project), 1992 https://scholar.google.com/scholar?q=DASH:+A+Scalable+Directory-Based+Multiprocessor
    19. Directory-Based Cache Coherence in Large-Scale Multiprocessors — Chaiken et al. (Alewife project), 1991 https://scholar.google.com/scholar?q=Directory-Based+Cache+Coherence+in+Large-Scale+Multiprocessors
    20. Enabling Rack-Scale Confidential Computing using Heterogeneous Trusted Execution Environment — Jianping Zhu, Hang Yin, Yuekai Jia, Wenhao Wang, Chunhui Li, Jiashuo Liang, Shoumeng Yan, Zhengyu He, Qingkui Liu, Alex X. Liu, 2024 https://scholar.google.com/scholar?q=Enabling+Rack-Scale+Confidential+Computing+using+Heterogeneous+Trusted+Execution+Environment
    21. Understanding the Overheads of Hardware Memory Coherence — Lena E. Olson, Joseph Izraelevitz, Mark D. Hill, 2015 https://scholar.google.com/scholar?q=Understanding+the+Overheads+of+Hardware+Memory+Coherence
    22. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3
    23. AI Post Transformers: Accelerating LLM Cold Starts with Programmable Page Cache — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-accelerating-llm-cold-starts-with-progra-0912d1.mp3
    24. AI Post Transformers: xLLM: Co-Locating Online and Offline LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-xllm-co-locating-online-and-offline-llm-10bb81.mp3

    Interactive Visualization: Xerxes: CXL 3.0 Simulation for Scalable Memory Systems
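    A toy model of what Port-Based Routing enables (our sketch; Xerxes' internals are far richer): the fabric is an arbitrary graph of switches rather than a tree, and a route is any hop-by-hop path through it. Here BFS stands in for whatever route computation a real simulator performs.

```python
from collections import deque

def pbr_route(fabric, src, dst):
    """Find a shortest hop path from src to dst through a switch fabric
    given as an adjacency dict {node: [neighbors]}. Multi-path graphs
    like this are exactly what CXL 3.0 PBR allows beyond rigid trees.
    Returns the node list, or None if dst is unreachable."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in fabric.get(node, ()):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None
```

    A production fabric would also need deadlock-free route selection (see source 8 above); shortest-path alone does not guarantee that.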

  6. 6D AGO

    Bidaw: Computation-Storage Aware KV Caching for LLMs

    This episode explores a new system called Bidaw that dramatically improves the performance of long, multi-turn AI chatbot conversations by solving a critical caching problem. The paper reveals that existing approaches waste over 93% of computation redundantly recalculating conversation history, and that naive two-tier storage systems (using both RAM and SSD) increase latency by 3.8× because the GPU scheduler and storage system don't coordinate. Bidaw introduces "bidirectional awareness" where the scheduler prioritizes requests whose data is already in fast memory while background-loading slower SSD data, and the storage system uses conversation flow patterns to predict which cached data to keep hot. Listeners interested in LLM infrastructure, production ML systems, or the practical challenges of deploying interactive AI services will learn how clever coordination between compute and storage layers can unlock major performance gains without requiring more expensive hardware.

    Sources:
    1. https://www.usenix.org/system/files/fast26-hu-shipeng.pdf
    2. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang, 2023 https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU
    3. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siddharth Devadas, Ion Stoica, Joseph E. Gonzalez, 2023 https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention
    4. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU — Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen, 2023 https://scholar.google.com/scholar?q=PowerInfer:+Fast+Large+Language+Model+Serving+with+a+Consumer-grade+GPU
    5. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024 https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving
    6. LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism — Jongyul Kim, Sangwoo Kang, Juhyeong Ryu, Jaehyeong Im, Seongyeop Jeong, Jin-Soo Kim, 2021 https://scholar.google.com/scholar?q=LineFS:+Efficient+SmartNIC+Offload+of+a+Distributed+File+System+with+Pipeline+Parallelism
    7. Flashield: a Hybrid Key-value Cache that Controls Flash Write Amplification — Yiwen Zhang, Xin Chen, Zhuo Chang, Huanchen Zhang, 2019 https://scholar.google.com/scholar?q=Flashield:+a+Hybrid+Key-value+Cache+that+Controls+Flash+Write+Amplification
    8. Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis — Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, Ravi Sundaram, 2019 https://scholar.google.com/scholar?q=Nexus:+A+GPU+Cluster+Engine+for+Accelerating+DNN-Based+Video+Analysis
    9. Clockwork: A Scheduler for GPU-Accelerated Deep Learning Serving — Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, Jonathan Mace, 2020 https://scholar.google.com/scholar?q=Clockwork:+A+Scheduler+for+GPU-Accelerated+Deep+Learning+Serving
    10. Learning to Cache: Neural Adaptive Caching Policies — Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels, 2018 https://scholar.google.com/scholar?q=Learning+to+Cache:+Neural+Adaptive+Caching+Policies
    11. Semantic Caching for Large Language Models — Zheng Gao, Peiyuan Liu, Junwei Cao, Xin Li, 2023 https://scholar.google.com/scholar?q=Semantic+Caching+for+Large+Language+Models
    12. Predicting User Behavior in Multi-Turn Dialogue Systems — Yun-Nung Chen, Dilek Hakkani-Tür, Gökhan Tür, Jianfeng Gao, Li Deng, 2016 https://scholar.google.com/scholar?q=Predicting+User+Behavior+in+Multi-Turn+Dialogue+Systems
    13. Machine Learning for Storage Systems: A Comprehensive Survey — Jianliang Zhang, Zeke Wang, Tong Zhang, 2023 https://scholar.google.com/scholar?q=Machine+Learning+for+Storage+Systems:+A+Comprehensive+Survey
    14. PagedAttention: Efficient Memory Management for LLM Serving — Kwon et al. (vLLM), 2023 https://scholar.google.com/scholar?q=PagedAttention:+Efficient+Memory+Management+for+LLM+Serving
    15. Adaptive Replacement Cache (ARC) — Megiddo and Modha, 2003 https://scholar.google.com/scholar?q=Adaptive+Replacement+Cache+(ARC)
    16. Learned Cache Replacement Policies — Vietri et al., 2020 https://scholar.google.com/scholar?q=Learned+Cache+Replacement+Policies
    17. LoRA: Low-Rank Adaptation of Large Language Models — Hu et al., 2021 https://scholar.google.com/scholar?q=LoRA:+Low-Rank+Adaptation+of+Large+Language+Models
    18. No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization — MiKV authors, 2024-2025 https://scholar.google.com/scholar?q=No+Token+Left+Behind:+Reliable+KV+Cache+Compression+via+Importance-Aware+Mixed+Precision+Quantization
    19. CommVQ: Commutative Vector Quantization for KV Cache Compression — CommVQ authors, 2024-2025 https://scholar.google.com/scholar?q=CommVQ:+Commutative+Vector+Quantization+for+KV+Cache+Compression
    20. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — KVLink authors, 2024-2025 https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
    21. Compute or Load KV Cache? Why Not Both? — Unknown, 2024-2025 https://scholar.google.com/scholar?q=Compute+or+Load+KV+Cache?+Why+Not+Both?
    22. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention — MInference authors, 2024-2025 https://scholar.google.com/scholar?q=MInference+1.0:+Accelerating+Pre-filling+for+Long-Context+LLMs+via+Dynamic+Sparse+Attention
    23. KVCache Cache in the Wild: Characterizing and Optimizing KVCache at a Large Cloud Provider — Cloud provider study authors, 2024-2025 https://scholar.google.com/scholar?q=KVCache+Cache+in+the+Wild:+Characterizing+and+Optimizing+KVCache+at+a+Large+Cloud+Provider
    24. Efficient KV Cache Reuse in Dynamic Agent Workflows — https://podcast.do-not-panic.com/episodes/2026-03-16-efficient-kv-cache-reuse-in-dynamic-agen-558f19.mp3
    25. 50x KV Cache Compression in Seconds via Attention Matching — https://podcast.do-not-panic.com/episodes/2026-03-09-50x-kv-cache-compression-in-seconds-via-9402c1.mp3
    26. Statistical Routing Theory in CARTRIDGE Block Attention — https://podcast.do-not-panic.com/episodes/2026-03-16-statistical-routing-theory-in-cartridge-2083f4.mp3
    27. xLLM: Co-Locating Online and Offline LLM Inference — https://podcast.do-not-panic.com/episodes/2026-03-16-xllm-co-locating-online-and-offline-llm-10bb81.mp3

    Interactive Visualization: Bidaw: Computation-Storage Aware KV Caching for LLMs
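    The scheduler half of "bidirectional awareness" reduces to a simple rule, sketched here with invented names (not Bidaw's API): run requests whose KV cache is fully resident in RAM immediately, and order the rest as SSD prefetch candidates by how close to resident they are, so background loads overlap with GPU compute.

```python
def schedule(requests, resident_frac):
    """Split requests into runnable vs. prefetch sets based on how much
    of each request's KV cache is already in fast memory.
    resident_frac maps request id -> fraction of its KV cache in RAM."""
    ready = [r for r in requests if resident_frac.get(r, 0.0) >= 1.0]
    pending = [r for r in requests if resident_frac.get(r, 0.0) < 1.0]
    # Prefetch the nearly-resident requests first: they become runnable
    # soonest, keeping the GPU fed while the SSD loads the rest.
    pending.sort(key=lambda r: resident_frac.get(r, 0.0), reverse=True)
    return ready, pending
```

    The other direction of the coordination, where the storage tier uses conversation-flow patterns to decide what to keep hot, would sit behind resident_frac in this sketch.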

Ratings & Reviews

3.7 out of 5 (3 Ratings)
