This episode explores TTKV, a temporal-tiered key-value cache design for long-context LLM inference. In that setting decode speed degrades because growing KV state turns generation into a memory-bandwidth problem rather than a compute problem. The episode explains how the method keeps recent cache blocks in fast GPU HBM, evicts older blocks to slower host DRAM, and uses asymmetric quantization in the slow tier, preserving keys at higher precision while compressing values more aggressively. The discussion also breaks down the runtime mechanics behind block-wise streaming attention, including query-conditioned block ranking, top-k prefetching, decompression, and overlapping data transfer with attention computation. The episode treats TTKV less as a new model idea and more as a systems design proposal, and it critically questions whether recency is a reliable proxy for importance and whether the paper fully specifies the cost of its block-selection function.

Sources:
1. TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference. Gradwell Dzikanyanga, Weihao Yang, Hao Huang, Donglei Wu, Shihao Wang, Wen Xia, Sanjeeb K C, 2026. http://arxiv.org/abs/2604.19769
2. Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
3. Efficient Streaming Language Models with Attention Sinks. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, 2023. https://scholar.google.com/scholar?q=Efficient+Streaming+Language+Models+with+Attention+Sinks
4. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu, 2024. https://scholar.google.com/scholar?q=KIVI:+A+Tuning-Free+Asymmetric+2bit+Quantization+for+KV+Cache
5. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference. Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen, 2024. https://scholar.google.com/scholar?q=ShadowKV:+KV+Cache+in+Shadows+for+High-Throughput+Long-Context+LLM+Inference
6. FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference. Guangda Liu et al., 2025. https://scholar.google.com/scholar?q=FreeKV:+Boosting+KV+Cache+Retrieval+for+Efficient+LLM+Inference
7. FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference. Dongwei Wang et al., 2025. https://scholar.google.com/scholar?q=FIER:+Fine-Grained+and+Efficient+KV+Cache+Retrieval+for+Long-context+LLM+Inference
8. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. Yuhan Liu et al., 2024. https://scholar.google.com/scholar?q=CacheGen:+KV+Cache+Compression+and+Streaming+for+Fast+Large+Language+Model+Serving
9. Retrieval Head Mechanistically Explains Long-Context Factuality. Wenhao Wu et al., 2024. https://arxiv.org/abs/2404.15574
10. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. Guangxuan Xiao et al., 2024. https://arxiv.org/abs/2410.10819
11. Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking. Wuwei Zhang et al., 2025. https://arxiv.org/abs/2506.09944
12. LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction. Enshuai Zhou et al., 2026. https://arxiv.org/abs/2605.06676
13. IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference. Xintong Yang et al., 2026. https://arxiv.org/abs/2605.25475
14. KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference. Xing Li et al., 2025. https://arxiv.org/abs/2502.04420
15. AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations. Qian Tao et al., 2024. https://arxiv.org/abs/2410.13212
16. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention. Ramya Prabhu et al., 2024. https://arxiv.org/abs/2405.04437
17. AI Post Transformers: ScoutAttention for Efficient KV Cache Offloading. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-24-scoutattention-for-efficient-kv-cache-of-b26699.mp3
18. AI Post Transformers: MiniMax Sparse Attention at Million-Token Scale. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-06-13-minimax-sparse-attention-at-million-toke-300108.mp3
19. AI Post Transformers: IndexMem: Learned KV-Cache Eviction for Long-Context LLMs. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-06-12-indexmem-learned-kv-cache-eviction-for-l-132c2a.mp3
20. AI Post Transformers: Mooncake for KV Cache-Centric LLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-06-05-mooncake-for-kv-cache-centric-llm-servin-1086d0.mp3
21. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3
22. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3
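
Illustrative sketch: for listeners who want a concrete picture of two ideas from the episode summary, the following toy Python shows (a) keys and values quantized at different precisions in the slow tier and (b) query-conditioned top-k block selection. It is not the TTKV implementation. The bit widths (8 for keys, 4 for values), the dot-product score against a per-block summary vector, and all function names are assumptions made for illustration; the episode itself notes that the paper leaves the cost of its block-selection function underspecified.

```python
from typing import List, Tuple

Vector = List[float]


def quantize(xs: Vector, bits: int) -> Tuple[List[int], float, float]:
    """Uniform min/max quantization of one vector to `bits` bits."""
    lo, hi = min(xs), max(xs)
    levels = (1 << bits) - 1
    # Guard against a constant vector, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in xs]
    return codes, scale, lo


def dequantize(codes: List[int], scale: float, lo: float) -> Vector:
    """Map integer codes back to approximate floats."""
    return [c * scale + lo for c in codes]


def roundtrip_error(xs: Vector, bits: int) -> float:
    """Largest absolute error after quantizing and dequantizing."""
    codes, scale, lo = quantize(xs, bits)
    approx = dequantize(codes, scale, lo)
    return max(abs(a - b) for a, b in zip(xs, approx))


def top_k_blocks(query: Vector, block_summaries: List[Vector], k: int) -> List[int]:
    """Rank slow-tier blocks by query . summary and return the k best indices.

    The summary vector (for example a mean key per block) is a hypothetical
    stand-in for whatever selection signal a real system uses.
    """
    scores = [sum(q * s for q, s in zip(query, summ)) for summ in block_summaries]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # ascending block order for sequential prefetch


# Assumed bit widths: keys kept at higher precision than values.
KEY_BITS, VALUE_BITS = 8, 4

sample = [0.0, 0.1, 0.5, 1.0]
key_err = roundtrip_error(sample, KEY_BITS)
value_err = roundtrip_error(sample, VALUE_BITS)

# Pick the 2 blocks most aligned with the current query for prefetching.
prefetch = top_k_blocks(
    [1.0, 0.0],
    [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.9, 0.0]],
    k=2,
)
```

In this toy setup the higher-precision key path loses less information per element than the value path, and only the selected blocks would need to be transferred from host DRAM and decompressed before attention runs over them.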