This episode explores a 2026 paper on cache-resident LLM inference, asking whether modern CPUs with gigabyte-scale last-level caches can cut decoding latency by keeping model weights on-chip instead of repeatedly fetching them from DRAM. It explains why autoregressive decoding is often memory-bound rather than compute-bound, then breaks down the paper’s main design ideas: separating weight-heavy projections and feed-forward work from attention and KV-cache handling, and using fine-grained static scheduling to reduce synchronization overhead. The discussion gets concrete about the system architecture on AMD EPYC 9684X machines, including dual-socket role separation, INT8 weights and KV caches, and locality-aware placement of weight shards and activations. The episode takes a sharp, skeptical look at where CPU-based LLM serving might genuinely improve throughput and time per output token, while arguing that this is a targeted systems win rather than a replacement for GPU-first inference.

Sources:
1. Cache-Resident LLM Inference in GB-Scale Last-Level Caches — Wanning Zhang, Tongzhou Gu, Marco Canini, Ceyu Xu, Jian Weng, 2026 http://arxiv.org/abs/2606.25353
2. LLM Inference Serving: Survey of Recent Advances and Opportunities — Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 2024 https://scholar.google.com/scholar?q=LLM+Inference+Serving:+Survey+of+Recent+Advances+and+Opportunities
3. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Hao Zhang, Ion Stoica, et al., 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
4. Splitwise: Efficient generative LLM inference using phase splitting — Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, Ricardo Bianchini, 2023 https://scholar.google.com/scholar?q=Splitwise:+Efficient+generative+LLM+inference+using+phase+splitting
5. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Joseph E. Gonzalez, Percy Liang, Christopher Re, Ion Stoica, Ce Zhang, et al., 2023 https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU
6. Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks — Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das, 2018 https://scholar.google.com/scholar?q=Neural+Cache:+Bit-Serial+In-Cache+Acceleration+of+Deep+Neural+Networks
7. Proximu$: Efficiently Scaling DNN Inference in Multi-core CPUs through Near-Cache Compute — Anant V. Nori, Rahul Bera, Shankar Balachandran, Joydeep Rakshit, Om J. Omer, Avishaii Abuhatzera, Belliappa Kuttanna, Sreenivas Subramoney, 2020 https://scholar.google.com/scholar?q=Proximu$:+Efficiently+Scaling+DNN+Inference+in+Multi-core+CPUs+through+Near-Cache+Compute
8. Inference Performance Optimization for Large Language Models on CPUs — Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie, 2024 https://scholar.google.com/scholar?q=Inference+Performance+Optimization+for+Large+Language+Models+on+CPUs
9. ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs — Yuzhuang Xu, Xu Han, Yuxuan Li, Wanxiang Che, 2026 https://scholar.google.com/scholar?q=ArcLight:+A+Lightweight+LLM+Inference+Architecture+for+Many-Core+CPUs
10. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention — Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar, 2025 https://scholar.google.com/scholar?q=vAttention:+Dynamic+Memory+Management+for+Serving+LLMs+without+PagedAttention
11. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve — Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, Ramachandran Ramjee, 2024 https://scholar.google.com/scholar?q=Taming+Throughput-Latency+Tradeoff+in+LLM+Inference+with+Sarathi-Serve
12. DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024 https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-Optimized+Large+Language+Model+Serving
13. WaferLLM: Large Language Model Inference at Wafer Scale — Congjie He, Yeqi Huang, Pei Mu, Ziming Miao, Jilong Xue, Lingxiao Ma, Fan Yang, Luo Mai, 2025 https://scholar.google.com/scholar?q=WaferLLM:+Large+Language+Model+Inference+at+Wafer+Scale
14. T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge — Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang, 2025 https://scholar.google.com/scholar?q=T-MAC:+CPU+Renaissance+via+Table+Lookup+for+Low-Bit+LLM+Deployment+on+Edge
15. Compute Or Load KV Cache? Why Not Both? — Shuowei Jin et al., 2024 https://arxiv.org/abs/2410.03065
16. KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows — Zaifeng Pan et al., 2025 https://arxiv.org/abs/2507.07400
17. Reducing Transformer Key-Value Cache Size with Cross-Layer Attention — William Brandon et al., 2024 https://arxiv.org/abs/2405.12981
18. QCQA: Quality and Capacity-aware Grouped Query Attention — Vinay Joshi et al., 2024 https://arxiv.org/abs/2406.10247
19. Beyond KV Caching: Shared Attention for Efficient LLMs — Bingli Liao and Danilo Vasconcellos Vargas, 2024 https://arxiv.org/abs/2407.12866
20. ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive — Xinhao Luo et al., 2025 https://arxiv.org/abs/2508.18850
21. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3
22. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
23. AI Post Transformers: ScoutAttention for Efficient KV Cache Offloading — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-24-scoutattention-for-efficient-kv-cache-of-b26699.mp3
24. AI Post Transformers: CacheFlow and 3D-Parallel KV Cache Restoration — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-01-cacheflow-and-3d-parallel-kv-cache-resto-8db883.mp3
25. AI Post Transformers: VeriCache: Lossless LLM Inference from Lossy KV Caches — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-vericache-lossless-llm-inference-from-lo-df9daf.mp3
26. AI Post Transformers: Harvest: Borrowing Peer GPU Memory for LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-harvest-borrowing-peer-gpu-memory-for-ll-e9e54f.mp3