This episode explores DAK, a Cornell systems paper arguing that LLM inference on tiered-memory machines can be faster when offloaded weights and KV-cache blocks are fetched directly into on-chip shared memory instead of being prefetched and staged through GPU HBM. It breaks down the tradeoffs among HBM capacity, HBM bandwidth, KV-cache growth during decoding, and prior approaches such as FlexGen, vLLM’s PagedAttention, and emerging KV offload systems like LMCache. The discussion focuses on DAK’s core technical idea: using Hopper’s Tensor Memory Accelerator inside custom GEMM and FlashAttention kernels so data movement and computation are co-designed, reducing bounce buffers, HBM contention, and pipeline bubbles while aggregating bandwidth from multiple memory tiers. Listeners would find it interesting because it turns a low-level memory-path decision into a concrete argument about when offloading is merely a fallback and when it becomes a real performance advantage for serving larger models, longer contexts, or bigger batches.

Sources:
1. DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference. Shouxu Lin, Zhiyuan Guo, Jiaxin Lin, 2026. http://arxiv.org/abs/2604.26074
2. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He, 2021. https://scholar.google.com/scholar?q=ZeRO-Infinity:+Breaking+the+GPU+Memory+Wall+for+Extreme+Scale+Deep+Learning
3. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Ion Stoica, Percy Liang, Ce Zhang, and colleagues, 2023. https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU
4. Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
5. DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference. Shouxu Lin, Zhiyuan Guo, Jiaxin Lin, 2026. https://scholar.google.com/scholar?q=DAK:+Direct-Access-Enabled+GPU+Memory+Offloading+with+Optimal+Efficiency+for+LLM+Inference
6. PIE: Pooling CPU Memory for LLM Inference. Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, Ion Stoica, 2024. https://scholar.google.com/scholar?q=PIE:+Pooling+CPU+Memory+for+LLM+Inference
7. NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference. Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu, 2025. https://scholar.google.com/scholar?q=NEO:+Saving+GPU+Memory+Crisis+with+CPU+Offloading+for+Online+LLM+Inference
8. Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip. Luigi Fusco, Mikhail Khalilov, Marcin Chrapek, Giridhar Chukkapalli, Thomas Schulthess, Torsten Hoefler, 2024. https://scholar.google.com/scholar?q=Understanding+Data+Movement+in+Tightly+Coupled+Heterogeneous+Systems:+A+Case+Study+with+the+Grace+Hopper+Superchip
9. FengHuang: Next-Generation Memory Orchestration for AI Inferencing. Jiamin Li, Lei Qu, Tao Zhang, Grigory Chirkov, Shuotao Xu, Peng Cheng, Lidong Zhou, 2025. https://scholar.google.com/scholar?q=FengHuang:+Next-Generation+Memory+Orchestration+for+AI+Inferencing
10. Reducing Transformer Key-Value Cache Size with Cross-Layer Attention. William Brandon et al., 2024. https://scholar.google.com/scholar?q=Reducing+Transformer+Key-Value+Cache+Size+with+Cross-Layer+Attention
11. xKV: Cross-Layer SVD for KV-Cache Compression. Chi-Chih Chang et al., 2025. https://scholar.google.com/scholar?q=xKV:+Cross-Layer+SVD+for+KV-Cache+Compression
12. XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression. Haoqi Yang et al., 2025. https://scholar.google.com/scholar?q=XQuant:+Achieving+Ultra-Low+Bit+KV+Cache+Quantization+with+Cross-Layer+Compression
13. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. Wonbeom Lee et al., 2024. https://scholar.google.com/scholar?q=InfiniGen:+Efficient+Generative+Inference+of+Large+Language+Models+with+Dynamic+KV+Cache+Management
14. Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching. Yanhao Dong et al., 2025. https://scholar.google.com/scholar?q=Accelerating+LLM+Inference+Throughput+via+Asynchronous+KV+Cache+Prefetching
15. KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference. Huan Yang et al., 2025. https://scholar.google.com/scholar?q=KVShare:+Semantic-Aware+Key-Value+Cache+Sharing+for+Efficient+Large+Language+Model+Inference
16. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. Ji Lin et al., 2023. https://scholar.google.com/scholar?q=AWQ:+Activation-aware+Weight+Quantization+for+LLM+Compression+and+Acceleration
17. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. Tim Dettmers et al., 2023. https://scholar.google.com/scholar?q=SpQR:+A+Sparse-Quantized+Representation+for+Near-Lossless+LLM+Weight+Compression
18. AI Post Transformers: Beluga: CXL Memory Pooling for LLM KV Cache. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-27-beluga-cxl-memory-pooling-for-llm-kv-cac-b6142f.mp3
19. AI Post Transformers: ScoutAttention for Efficient KV Cache Offloading. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-24-scoutattention-for-efficient-kv-cache-of-b26699.mp3
20. AI Post Transformers: InfiniGen for Efficient Long-Context LLM Inference. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-06-18-infinigen-for-efficient-long-context-llm-143d77.mp3
21. AI Post Transformers: Mooncake for KV Cache-Centric LLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-06-05-mooncake-for-kv-cache-centric-llm-servin-1086d0.mp3
22. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3
23. AI Post Transformers: LAPS for Length-Aware LLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3