AI Post Transformers

Why LLM Serving Needs Mathematical Optimization

This episode explores a position paper arguing that modern LLM serving has outgrown simple heuristics such as FIFO scheduling, shortest-queue routing, and LRU eviction. It explains why transformer inference poses harder control problems than serving traditional, stateless models, focusing on continuous batching, KV-cache growth, and the tension between compute-bound prefill and memory-bound decode. The discussion highlights the paper's central claim that serving systems need explicit, objective-driven optimization for routing, admission control, scheduling, and cache management, while also questioning where formal methods would genuinely outperform today's stronger heuristic baselines such as vLLM and PagedAttention-inspired designs. Listeners will find it interesting because it connects low-level serving mechanics to real product tradeoffs in latency, throughput, and cache churn, showing why infrastructure choices increasingly shape LLM performance.
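
To make the KV-cache growth and prefill/decode tension described above concrete, here is a minimal back-of-envelope sketch in Python. The model and workload figures (a generic 7B-class dense transformer served in fp16 with 32 layers, 32 KV heads of dimension 128, a 2048-token prompt, a decode batch of 8 requests) are illustrative assumptions, not numbers taken from the paper or the episode.

def kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128, dtype_bytes=2):
    """KV-cache bytes appended per token: one K and one V vector per layer and head."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def arithmetic_intensity(tokens_in_step, n_params=7e9, dtype_bytes=2):
    """Rough FLOPs-per-byte for one forward step over `tokens_in_step` tokens.

    FLOPs ~= 2 * n_params per token; weight traffic ~= n_params * dtype_bytes
    per step, since weights are read once and reused by every token in the step.
    KV-cache traffic is ignored, so decode intensity is slightly overstated.
    """
    return (2 * n_params * tokens_in_step) / (n_params * dtype_bytes)


if __name__ == "__main__":
    per_tok = kv_cache_bytes_per_token()
    print(f"KV cache per token: {per_tok / 2**20:.2f} MiB")
    print(f"64 requests x 4k context: {64 * 4096 * per_tok / 2**30:.0f} GiB of KV cache")
    # A long prompt keeps the GPU compute-bound; a small decode batch leaves it memory-bound.
    print(f"prefill (2048-token prompt): ~{arithmetic_intensity(2048):.0f} FLOPs/byte")
    print(f"decode  (batch of 8):        ~{arithmetic_intensity(8):.0f} FLOPs/byte")

Under these assumptions the KV cache grows by roughly 0.5 MiB per token, so 64 concurrent 4k-token requests already hold about 128 GiB of cache, and a prefill step reuses each weight across thousands of prompt tokens while a decode step reuses it only across the batch, which is the roofline intuition behind the compute-bound versus memory-bound split.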

Sources:

1. Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics — Zijie Zhou, 2026. http://arxiv.org/abs/2605.01280
2. PREBLE: Efficient Distributed Prompt Scheduling for LLM Serving — Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang, 2024. https://scholar.google.com/scholar?q=PREBLE:+Efficient+Distributed+Prompt+Scheduling+for+LLM+Serving
3. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024. https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving
4. Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation — Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C. S. Lui, Wei Chen, Carlee Joe-Wong, 2026. https://scholar.google.com/scholar?q=Semantic+Caching+for+Low-Cost+LLM+Serving:+From+Offline+Learning+to+Online+Adaptation
5. POLAR: Online Learning for LoRA Adapter Caching and Routing in Edge LLM Serving — Shaoang Li, Jian Li, 2026. https://scholar.google.com/scholar?q=POLAR:+Online+Learning+for+LoRA+Adapter+Caching+and+Routing+in+Edge+LLM+Serving
6. Orca: A Distributed Serving System for Transformer-Based Generative Models — Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun, 2022. https://scholar.google.com/scholar?q=Orca:+A+Distributed+Serving+System+for+Transformer-Based+Generative+Models
7. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
8. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve — Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, Ramachandran Ramjee, 2024. https://scholar.google.com/scholar?q=Taming+Throughput-Latency+Tradeoff+in+LLM+Inference+with+Sarathi-Serve
9. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills — Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, 2023. https://scholar.google.com/scholar?q=SARATHI:+Efficient+LLM+Inference+by+Piggybacking+Decodes+with+Chunked+Prefills
10. Faster LLM Inference using DBMS-Inspired Preemption and Cache Replacement Policies — Kyoungmin Kim, Jiacheng Li, Kijae Hong, Anastasia Ailamaki, 2024. https://scholar.google.com/scholar?q=Faster+LLM+Inference+using+DBMS-Inspired+Preemption+and+Cache+Replacement+Policies
11. DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving — Ying Yuan et al., 2026. https://scholar.google.com/scholar?q=DualMap:+Enabling+Both+Cache+Affinity+and+Load+Balancing+for+Distributed+LLM+Serving
12. Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving — Ke Cheng et al., 2024. https://scholar.google.com/scholar?q=Slice-Level+Scheduling+for+High+Throughput+and+Load+Balanced+LLM+Serving
13. A Predictive and Synergistic Two-Layer Scheduling Framework for LLM Serving — Yue Zhang et al., 2025. https://scholar.google.com/scholar?q=A+Predictive+and+Synergistic+Two-Layer+Scheduling+Framework+for+LLM+Serving
14. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang et al., 2025. https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
15. HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse — Yuwei An et al., 2025. https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse
16. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation — Shihao Wang et al., 2026. https://scholar.google.com/scholar?q=ProphetKV:+User-Query-Driven+Selective+Recomputation+for+Efficient+KV+Cache+Reuse+in+Retrieval-Augmented+Generation
17. dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving — Bingyang Wu et al., 2024. https://scholar.google.com/scholar?q=dLoRA:+Dynamically+Orchestrating+Requests+and+Adapters+for+LoRA+LLM+Serving
18. SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference — Hengrui Zhang et al., 2025. https://scholar.google.com/scholar?q=SPAD:+Specialized+Prefill+and+Decode+Hardware+for+Disaggregated+LLM+Inference
19. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
20. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
21. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3
22. AI Post Transformers: KV Cache TTL for Multi-Turn Agent Scheduling — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-09-kv-cache-ttl-for-multi-turn-agent-schedu-996bf1.mp3
23. AI Post Transformers: CacheFlow and 3D-Parallel KV Cache Restoration — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-01-cacheflow-and-3d-parallel-kv-cache-resto-8db883.mp3
24. AI Post Transformers: ContiguousKV for Faster LLM Prefill KV Reuse — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-20-contiguouskv-for-faster-llm-prefill-kv-r-59f545.mp3
25. AI Post Transformers: Breaking the Prefix Barrier with Shared KV Cache — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-24-breaking-the-prefix-barrier-with-shared-a5e5a6.mp3
26. AI Post Transformers: TokenDance for Multi-Agent KV Cache Sharing — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-22-tokendance-for-multi-agent-kv-cache-shar-aa9b99.mp3
27. AI Post Transformers: RetrievalAttention for Long-Context LLM Inference — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-17-retrievalattention-for-long-context-llm-ddf566.mp3
28. AI Post Transformers: KVSwap for Disk-Aware Long-Context On-Device Inference — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-16-kvswap-for-disk-aware-long-context-on-de-f3c15e.mp3

Interactive Visualization: Why LLM Serving Needs Mathematical Optimization