This episode explores SGLang, a system that speeds up complex language model workflows by treating them as full programs rather than single prompt-response calls. It explains how modern LLM applications involve branching, tool use, retries, and structured outputs, then examines SGLang's co-design of a Python-embedded frontend language with a specialized runtime that can optimize those patterns directly. The discussion highlights ideas like KV-cache reuse through RadixAttention and grammar-constrained decoding for reliable JSON output, and why these systems techniques matter more than nicer prompt scripting alone. Listeners will find it interesting because it connects practical agent-style LLM engineering to deeper questions about compilers, serving infrastructure, and whether headline speedups really hold across real workloads.

Sources:

1. SGLang: Efficient Execution of Structured Language Model Programs — Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, 2023. http://arxiv.org/abs/2312.07104
2. Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning — Saibo Geng, Martin Josifoski, Maxime Peyrard, Robert West, 2023. https://scholar.google.com/scholar?q=Grammar-Constrained+Decoding+for+Structured+NLP+Tasks+without+Finetuning
3. SGLang: Efficient Execution of Structured Language Model Programs — Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, 2024. https://scholar.google.com/scholar?q=SGLang:+Efficient+Execution+of+Structured+Language+Model+Programs
4. XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models — Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, Tianqi Chen, 2024. https://scholar.google.com/scholar?q=XGrammar:+Flexible+and+Efficient+Structured+Generation+Engine+for+Large+Language+Models
5. Generating Structured Outputs from Language Models: Benchmark and Studies — Saibo Geng, Hudson Cooper, Michał Moskal, Samuel Jenkins, Julian Berman, Nathan Ranchin, Robert West, Eric Horvitz, Harsha Nori, 2025. https://scholar.google.com/scholar?q=Generating+Structured+Outputs+from+Language+Models:+Benchmark+and+Studies
6. Language Model Cascades — David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A. Saurous, Jascha Sohl-Dickstein, Kevin Murphy, Charles Sutton, 2022. https://scholar.google.com/scholar?q=Language+Model+Cascades
7. Prompting Is Programming: A Query Language for Large Language Models — Luca Beurer-Kellner, Marc Fischer, Martin Vechev, 2023. https://scholar.google.com/scholar?q=Prompting+Is+Programming:+A+Query+Language+for+Large+Language+Models
8. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines — Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, Christopher Potts, 2023. https://scholar.google.com/scholar?q=DSPy:+Compiling+Declarative+Language+Model+Calls+into+Self-Improving+Pipelines
9. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, et al., 2023. https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention
10. Guidance: A Guidance Language for Controlling Large Language Models — Microsoft Research and collaborators, 2023. https://scholar.google.com/scholar?q=Guidance:+A+Guidance+Language+for+Controlling+Large+Language+Models
11. LMQL: A Programming Language for Large Language Models — Luca Beurer-Kellner, Marc Fischer, Martin Vechev, 2023. https://scholar.google.com/scholar?q=LMQL:+A+Programming+Language+for+Large+Language+Models
12. Outlines — Thibault Glaunec and contributors, 2023. https://scholar.google.com/scholar?q=Outlines
13. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang, 2025. https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
14. Key, Value, Compress: A Systematic Exploration of KV Cache Compression Techniques — Neusha Javidnia, Bita Rouhani, Farinaz Koushanfar, 2025. https://scholar.google.com/scholar?q=Key,+Value,+Compress:+A+Systematic+Exploration+of+KV+Cache+Compression+Techniques
15. KV-CAR: KV Cache Compression using Autoencoders and KV Reuse in Large Language Models — Sourjya Roy, Shrihari Sridharan, Surya Selvam, Anand Raghunathan, 2025. https://scholar.google.com/scholar?q=KV-CAR:+KV+Cache+Compression+using+Autoencoders+and+KV+Reuse+in+Large+Language+Models
16. Grammar-Aligned Decoding — Kanghee Park, Jiayu Wang, Taylor Berg-Kirkpatrick, Nadia Polikarpova, Loris D'Antoni, 2024. https://scholar.google.com/scholar?q=Grammar-Aligned+Decoding
17. Grammar-Constrained Decoding Makes Large Language Models Better Logical Parsers — Federico Raspanti, Tanir Ozcelebi, Mike J. Holenderski, 2025. https://scholar.google.com/scholar?q=Grammar-Constrained+Decoding+Makes+Large+Language+Models+Better+Logical+Parsers
18. Marconi: Prefix Caching for the Era of Hybrid LLMs — Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, Ravi Netravali, 2025. https://scholar.google.com/scholar?q=Marconi:+Prefix+Caching+for+the+Era+of+Hybrid+LLMs
19. Towards Efficient Agents: A Co-Design of Inference Architecture and System — Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu, Hanting Chen, Wangze Zhang, Chuansai Zhou, Yiming Li, Chen Chen, Xing Li, Zhiyuan Yang, Xiaosong Li, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan, Yunhe Wang, 2025. https://scholar.google.com/scholar?q=Towards+Efficient+Agents:+A+Co-Design+of+Inference+Architecture+and+System
20. Optimizing Agentic Language Model Inference via Speculative Tool Calls — Daniel Nichols, Prajwal Singhania, Charles Jekel, Abhinav Bhatele, Harshitha Menon, 2025. https://scholar.google.com/scholar?q=Optimizing+Agentic+Language+Model+Inference+via+Speculative+Tool+Calls
21. AI Post Transformers: SGLang: Efficient Language Model Program Execution — Hal Turing & Dr. Ada Shannon. https://podcast.do-not-panic.com/episodes/sglang-efficient-language-model-program-execution/
22. AI Post Transformers: Breaking the Prefix Barrier with Shared KV Cache — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-24-breaking-the-prefix-barrier-with-shared-a5e5a6.mp3
23. AI Post Transformers: Continuous Batching for LLM Inference: Throughput and Latency Gains — Hal Turing & Dr. Ada Shannon. https://podcast.do-not-panic.com/episodes/continuous-batching-for-llm-inference-throughput-and-latency-gains/
24. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
25. AI Post Transformers: KV Cache TTL for Multi-Turn Agent Scheduling — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-09-kv-cache-ttl-for-multi-turn-agent-schedu-996bf1.mp3
26. AI Post Transformers: TokenDance for Multi-Agent KV Cache Sharing — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-22-tokendance-for-multi-agent-kv-cache-shar-aa9b99.mp3

Interactive Visualization: SGLang for Faster Structured LLM Programs