This episode explores Bidaw, a new system that dramatically improves the performance of long, multi-turn AI chatbot conversations by solving a critical caching problem. The paper reports that existing approaches waste over 93% of computation redundantly recalculating conversation history, and that naive two-tier storage systems (spanning RAM and SSD) increase latency by 3.8x because the GPU scheduler and the storage system don't coordinate. Bidaw introduces "bidirectional awareness": the scheduler prioritizes requests whose cached data is already in fast memory while background-loading data still on the slower SSD, and the storage system uses conversation-flow patterns to predict which cached data to keep hot. Listeners interested in LLM infrastructure, production ML systems, or the practical challenges of deploying interactive AI services will learn how coordination between the compute and storage layers can unlock major performance gains without more expensive hardware.

Sources:

1. https://www.usenix.org/system/files/fast26-hu-shipeng.pdf
2. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang, 2023. https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU
3. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siddharth Devadas, Ion Stoica, Joseph E. Gonzalez, 2023. https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention
4. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU — Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen, 2023. https://scholar.google.com/scholar?q=PowerInfer:+Fast+Large+Language+Model+Serving+with+a+Consumer-grade+GPU
5. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024. https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving
6. LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism — Jongyul Kim, Sangwoo Kang, Juhyeong Ryu, Jaehyeong Im, Seongyeop Jeong, Jin-Soo Kim, 2021. https://scholar.google.com/scholar?q=LineFS:+Efficient+SmartNIC+Offload+of+a+Distributed+File+System+with+Pipeline+Parallelism
7. Flashield: a Hybrid Key-value Cache that Controls Flash Write Amplification — Yiwen Zhang, Xin Chen, Zhuo Chang, Huanchen Zhang, 2019. https://scholar.google.com/scholar?q=Flashield:+a+Hybrid+Key-value+Cache+that+Controls+Flash+Write+Amplification
8. Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis — Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, Arvind Krishnamurthy, Ravi Sundaram, 2019. https://scholar.google.com/scholar?q=Nexus:+A+GPU+Cluster+Engine+for+Accelerating+DNN-Based+Video+Analysis
9. Clockwork: A Scheduler for GPU-Accelerated Deep Learning Serving — Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, Jonathan Mace, 2020. https://scholar.google.com/scholar?q=Clockwork:+A+Scheduler+for+GPU-Accelerated+Deep+Learning+Serving
10. Learning to Cache: Neural Adaptive Caching Policies — Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels, 2018. https://scholar.google.com/scholar?q=Learning+to+Cache:+Neural+Adaptive+Caching+Policies
11. Semantic Caching for Large Language Models — Zheng Gao, Peiyuan Liu, Junwei Cao, Xin Li, 2023. https://scholar.google.com/scholar?q=Semantic+Caching+for+Large+Language+Models
12. Predicting User Behavior in Multi-Turn Dialogue Systems — Yun-Nung Chen, Dilek Hakkani-Tür, Gökhan Tür, Jianfeng Gao, Li Deng, 2016. https://scholar.google.com/scholar?q=Predicting+User+Behavior+in+Multi-Turn+Dialogue+Systems
13. Machine Learning for Storage Systems: A Comprehensive Survey — Jianliang Zhang, Zeke Wang, Tong Zhang, 2023. https://scholar.google.com/scholar?q=Machine+Learning+for+Storage+Systems:+A+Comprehensive+Survey
14. PagedAttention: Efficient Memory Management for LLM Serving — Kwon et al. (vLLM), 2023. https://scholar.google.com/scholar?q=PagedAttention:+Efficient+Memory+Management+for+LLM+Serving
15. Adaptive Replacement Cache (ARC) — Megiddo and Modha, 2003. https://scholar.google.com/scholar?q=Adaptive+Replacement+Cache+(ARC)
16. Learned Cache Replacement Policies — Vietri et al., 2020. https://scholar.google.com/scholar?q=Learned+Cache+Replacement+Policies
17. LoRA: Low-Rank Adaptation of Large Language Models — Hu et al., 2021. https://scholar.google.com/scholar?q=LoRA:+Low-Rank+Adaptation+of+Large+Language+Models
18. No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization — MiKV authors, 2024-2025. https://scholar.google.com/scholar?q=No+Token+Left+Behind:+Reliable+KV+Cache+Compression+via+Importance-Aware+Mixed+Precision+Quantization
19. CommVQ: Commutative Vector Quantization for KV Cache Compression — CommVQ authors, 2024-2025. https://scholar.google.com/scholar?q=CommVQ:+Commutative+Vector+Quantization+for+KV+Cache+Compression
20. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — KVLink authors, 2024-2025. https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
21. Compute or Load KV Cache? Why Not Both? — Unknown, 2024-2025. https://scholar.google.com/scholar?q=Compute+or+Load+KV+Cache?+Why+Not+Both?
22. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention — MInference authors, 2024-2025. https://scholar.google.com/scholar?q=MInference+1.0:+Accelerating+Pre-filling+for+Long-Context+LLMs+via+Dynamic+Sparse+Attention
23. KVCache Cache in the Wild: Characterizing and Optimizing KVCache at a Large Cloud Provider — Cloud provider study authors, 2024-2025. https://scholar.google.com/scholar?q=KVCache+Cache+in+the+Wild:+Characterizing+and+Optimizing+KVCache+at+a+Large+Cloud+Provider
24. Efficient KV Cache Reuse in Dynamic Agent Workflows. https://podcast.do-not-panic.com/episodes/2026-03-16-efficient-kv-cache-reuse-in-dynamic-agen-558f19.mp3
25. 50x KV Cache Compression in Seconds via Attention Matching. https://podcast.do-not-panic.com/episodes/2026-03-09-50x-kv-cache-compression-in-seconds-via-9402c1.mp3
26. Statistical Routing Theory in CARTRIDGE Block Attention. https://podcast.do-not-panic.com/episodes/2026-03-16-statistical-routing-theory-in-cartridge-2083f4.mp3
27. xLLM: Co-Locating Online and Offline LLM Inference. https://podcast.do-not-panic.com/episodes/2026-03-16-xllm-co-locating-online-and-offline-llm-10bb81.mp3

Interactive Visualization: Bidaw: Computation-Storage Aware KV Caching for LLMs
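The "bidirectional awareness" idea the episode describes — serve requests whose KV cache is already in fast memory first, while overlapping background SSD loads for the rest — can be sketched as a simple scheduling policy. This is a minimal illustration under assumptions; all names (`Request`, `schedule`, `prefetch`) are hypothetical, not the Bidaw paper's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical sketch of a bidirectionally aware scheduler: requests with a
# RAM-resident KV cache run first (avoiding the multi-x latency hit of an
# on-demand SSD load), while SSD-resident requests are prefetched in the
# background so their cache is hot by the time they are served.

@dataclass
class Request:
    conv_id: str        # conversation this request continues
    kv_in_ram: bool     # is this conversation's KV cache already in RAM?
    arrival: float      # arrival timestamp, used for FIFO tie-breaking

def schedule(pending: List[Request],
             prefetch: Callable[[str], None]) -> List[Request]:
    """Return the serving order: hot (RAM) requests before cold (SSD) ones.

    `prefetch` stands in for an asynchronous SSD -> RAM load that the
    storage layer overlaps with GPU compute on the hot requests.
    """
    hot = sorted((r for r in pending if r.kv_in_ram), key=lambda r: r.arrival)
    cold = sorted((r for r in pending if not r.kv_in_ram), key=lambda r: r.arrival)
    for r in cold:
        prefetch(r.conv_id)  # kick off background load; do not block on it
    return hot + cold

# Usage: 'b' and 'c' are hot and run first; 'a' is prefetched meanwhile.
loads: List[str] = []
reqs = [Request("a", False, 0.0), Request("b", True, 1.0), Request("c", True, 2.0)]
order = schedule(reqs, loads.append)
# order serves "b", "c", then "a"; loads records the prefetch for "a"
```

A real system would bound how long a cold request can be deferred (to avoid starvation) and coordinate with the storage layer's eviction predictions, but the core inversion — scheduling around cache residency instead of pure arrival order — is what the overlap buys.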