AI Post Transformers

Splitwise: Phase-Split LLM Inference

This episode examines Splitwise: Efficient Generative LLM Inference Using Phase Splitting, a 2024 systems paper from researchers at the University of Washington and Microsoft, and centers the discussion on a simple claim with large deployment consequences: prompt prefill and token decode are different enough that they should not necessarily run on the same hardware. The hosts walk through the basic mechanics of generative inference, explaining prefill as the parallel, compute-heavy stage that processes the prompt, and decode as the sequential, KV-cache-driven stage that generates tokens one by one. That distinction sets up the paper’s core argument that modern serving stacks are paying a penalty by treating inference as a uniform workload when its phases are constrained by very different resources.
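To make the two phases concrete, here is a minimal numpy sketch of prefill versus decode; the toy width, identity projection weights, single attention head, and the shortcut of feeding the attention output straight back in as the next input are illustrative simplifications, not the paper’s implementation.

```python
# Toy sketch of the two inference phases (single head, numpy only).
import numpy as np

D = 64                      # toy model width
Wq = Wk = Wv = np.eye(D)    # stand-in projection weights

def attend(q, K, V):
    # q: (D,), K/V: (t, D) -> one attention output vector
    scores = K @ q / np.sqrt(D)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

def prefill(prompt_embeds):
    """Compute-heavy phase: all prompt tokens in one parallel pass.
    Returns the populated KV cache and the last hidden state."""
    K = prompt_embeds @ Wk
    V = prompt_embeds @ Wv
    q_last = prompt_embeds[-1] @ Wq
    return (K, V), attend(q_last, K, V)

def decode_step(x, kv_cache):
    """Sequential phase: one new token per step, re-reading the whole cache."""
    K, V = kv_cache
    K = np.vstack([K, x @ Wk])          # cache grows by one row per step
    V = np.vstack([V, x @ Wv])
    return (K, V), attend(x @ Wq, K, V)

prompt = np.random.randn(512, D)        # 512 prompt tokens
kv, h = prefill(prompt)                 # one large matrix-matrix pass
for _ in range(8):                      # token-by-token generation
    kv, h = decode_step(h, kv)          # small compute, large memory traffic
```

In the sketch, prefill turns the entire prompt into one large matrix product, while each decode step performs only a small amount of arithmetic yet must re-read the growing KV cache.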
The conversation stays focused on why that split matters in practice. It unpacks phase heterogeneity in terms of throughput, latency, utilization, memory pressure, and power draw, and explains why decode can remain bottlenecked by memory bandwidth and capacity even on newer accelerators with far more raw FLOPs. From there, the episode explores Splitwise’s broader systems framing: if compute is scaling faster than memory, then assigning prefill to high-throughput hardware and decode to cheaper or lower-power machines may be a more realistic datacenter strategy than continuing to push everything through one homogeneous GPU fleet. The hosts also emphasize power-normalized evaluation as a more honest lens for operators than simple box-for-box performance comparisons.
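The bandwidth argument is easy to sanity-check with rough arithmetic; the model size, bandwidth, FLOP, and power figures below are illustrative assumptions, not numbers taken from the paper.

```python
# Back-of-envelope: why decode tends to be memory-bandwidth bound.
# All numbers are illustrative assumptions, not figures from the paper.

params          = 70e9          # model parameters (e.g., a ~70B model)
bytes_per_param = 2             # fp16/bf16 weights
hbm_bw          = 2.0e12        # ~2 TB/s HBM bandwidth (rough A100-class figure)
peak_flops      = 312e12        # ~312 TFLOP/s dense bf16 (rough A100-class figure)

# Each decode step streams essentially all weights (plus the KV cache) from
# HBM to emit a single token, while doing roughly 2 FLOPs per parameter.
bytes_per_token = params * bytes_per_param
flops_per_token = 2 * params

t_memory  = bytes_per_token / hbm_bw      # time if bandwidth-limited
t_compute = flops_per_token / peak_flops  # time if compute-limited

print(f"bandwidth-limited: {t_memory * 1e3:.1f} ms/token")   # ~70 ms
print(f"compute-limited:   {t_compute * 1e3:.2f} ms/token")  # ~0.45 ms

# Power-normalized view: tokens per joule rather than tokens per box.
board_power_w = 400                        # illustrative accelerator power draw
tokens_per_s  = 1 / t_memory
print(f"tokens/joule: {tokens_per_s / board_power_w:.4f}")
```

At batch size one the bandwidth-limited time dominates by roughly two orders of magnitude, which is why decode throughput tracks memory bandwidth rather than peak FLOPs, and why a tokens-per-joule view can favor cheaper or lower-power hardware for that phase.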
Along the way, the episode places Splitwise in public context alongside Orca, PagedAttention, and SARATHI without losing its anchor. Those earlier systems are used to clarify what Splitwise does and does not claim: continuous batching, KV-cache-aware memory management, and batch reshaping all improve serving efficiency, but they do not eliminate the underlying asymmetry between prefill and decode. The result is a grounded discussion of phase splitting as a deployment decision rather than a purely algorithmic trick, with particular attention to where prefill-decode disaggregation looks compelling, where it depends on the realities of cluster design, and where the limits of PD disaggregation still leave open systems questions.

Sources:

1. Splitwise: Efficient generative LLM inference using phase splitting — Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini, 2023. http://arxiv.org/abs/2311.18677
2. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024. http://arxiv.org/abs/2401.09670
3. DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference — Yongtong Wu, Shaoyuan Chen, Yinmin Zhong, Rilin Huang, Yixuan Tan, Wentao Zhang, Liyue Zhang, Shangyan Zhou, Yuxuan Liu, Shunfeng Zhou, Mingxing Zhang, Xin Jin, Panpan Huang, 2026. http://arxiv.org/abs/2602.21548
4. Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving — Zongze Li, Jingyu Liu, Zach Xu, Yineng Zhang, Tahseen Rabbani, Ce Zhang, 2026. http://arxiv.org/abs/2603.13358
5. Orca: A Distributed Serving System for Transformer-Based Generative Models — Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun, 2022. https://scholar.google.com/scholar?q=Orca:+A+Distributed+Serving+System+for+Transformer-Based+Generative+Models
6. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
7. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills — Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, 2023. https://scholar.google.com/scholar?q=SARATHI:+Efficient+LLM+Inference+by+Piggybacking+Decodes+with+Chunked+Prefills
8. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024. https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving
9. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving — Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu, 2024. https://scholar.google.com/scholar?q=Mooncake:+A+KVCache-centric+Disaggregated+Architecture+for+LLM+Serving
10. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang, 2025. https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
11. HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse — Yuwei An, Yihua Cheng, Seo Jin Park, Junchen Jiang, 2025. https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse
12. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation — Shihao Wang, Jiahao Chen, Yanqi Pan, Hao Huang and colleagues, 2026. https://scholar.google.com/scholar?q=ProphetKV:+User-Query-Driven+Selective+Recomputation+for+Efficient+KV+Cache+Reuse+in+Retrieval-Augmented+Generation
13. Accelerating LLM Inference with Staged Speculative Decoding — Benjamin Spector, Chris Re, 2023. https://scholar.google.com/scholar?q=Accelerating+LLM+Inference+with+Staged+Speculative+Decoding
14. SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices — Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin, 2024. https://scholar.google.com/scholar?q=SpecExec:+Massively+Parallel+Speculative+Decoding+for+Interactive+LLM+Inference+on+Consumer+Devices
15. KVDirect: Distributed Disaggregated LLM Inference — Shiyang Chen, Rain Jiang, Dezhi Yu, Jinlai Xu, Mengyuan Chao and colleagues, 2025. https://scholar.google.com/scholar?q=KVDirect:+Distributed+Disaggregated+LLM+Inference
16. Arrow: Adaptive Scheduling Mechanisms for Disaggregated LLM Inference Architecture — Yu Wu, Tongxuan Liu, Yuting Zeng, Siyu Wu, Jun Xiong, Xianzhe Dong and colleagues, 2025. https://scholar.google.com/scholar?q=Arrow:+Adaptive+Scheduling+Mechanisms+for+Disaggregated+LLM+Inference+Architecture
17. WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-based Dynamic Scheduling — Jingqi Feng, Yukai Huang, Rui Zhang, Sicheng Liang, Ming Yan, Jie Wu, 2025. https://scholar.google.com/scholar?q=WindServe:+Efficient+Phase-Disaggregated+LLM+Serving+with+Stream-based+Dynamic+Scheduling
18. AI Post Transformers: CacheSlide: Position-Aware KV Cache Reuse for Agent LLMs — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-16-cacheslide-position-aware-kv-cache-reuse-cd59c7.mp3
19. AI Post Transformers: SGLang: Efficient Language Model Program Execution — Hal Turing & Dr. Ada Shannon. https://podcast.do-not-panic.com/episodes/sglang-efficient-language-model-program-execution/
20. AI Post Transformers: Episode: Speculative Speculative Decoding — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-25-speculative-speculative-decoding-1b7a10.mp3
21. AI Post Transformers: xLLM: Co-Locating Online and Offline LLM Inference — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-16-xllm-co-locating-online-and-offline-llm-10bb81.mp3
22. AI Post Transformers: FAST26: Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional — Hal Turing & Dr. Ada Shannon. https://podcast.do-not-panic.com/episodes/fast26-bidaw-enhancing-key-value-caching-for-interactive-llm-serving-via-bidirec/
23. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3

Interactive Visualization: Splitwise: Phase-Split LLM Inference