This episode explores a systems paper on speeding up Transformer decoding by tightly fusing the SwiGLU MLP path, rather than focusing only on attention or long-context tricks. It explains why long output generation becomes memory-bandwidth bound, clarifying concepts like kernel fusion, HBM traffic, prefill versus autoregressive decode, and why repeated token-by-token inference exposes the MLP as a real bottleneck. The discussion walks through the paper’s main design choice: a disciplined fusion of the up-projection, gate projection, SiLU activation, and elementwise multiply into a single decode-stage kernel, while leaving the down projection separate to avoid worse scheduling and register-pressure tradeoffs. It also highlights the paper’s practical argument for profiler-driven runtime scheduling across row-major and column-major kernel variants, making the result interesting to listeners who care about how large-model serving performance is won through careful hardware-aware engineering rather than headline-grabbing algorithm changes.

Sources:
1. Deep Kernel Fusion for Transformer Decoding. https://arxiv.org/pdf/2602.11808
2. GLU Variants Improve Transformer. Noam Shazeer, 2020. https://scholar.google.com/scholar?q=GLU+Variants+Improve+Transformer
3. PaLM: Scaling Language Modeling with Pathways. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Noam Shazeer and many others, 2022. https://scholar.google.com/scholar?q=PaLM:+Scaling+Language+Modeling+with+Pathways
4. DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li and others, 2022. https://scholar.google.com/scholar?q=DeepSpeed+Inference:+Enabling+Efficient+Inference+of+Transformer+Models+at+Unprecedented+Scale
5. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. Tri Dao, 2023. https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning
6. SGLang: Efficient Execution of Structured Language Model Programs. Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng, 2024. https://scholar.google.com/scholar?q=SGLang:+Efficient+Execution+of+Structured+Language+Model+Programs
7. Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica, 2023. https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
8. Welder: Scheduling Deep Learning Memory Access via Tile-Graph. Yining Shi, Zhi Yang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou, 2023. https://scholar.google.com/scholar?q=Welder:+Scheduling+Deep+Learning+Memory+Access+via+Tile-Graph
9. FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving. Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze, 2025. https://scholar.google.com/scholar?q=FlashInfer:+Efficient+and+Customizable+Attention+Engine+for+LLM+Inference+Serving
10. Masked Gated Linear Unit. Unknown from snippet, likely 2024 or 2025. https://scholar.google.com/scholar?q=Masked+Gated+Linear+Unit
11. SCBench: A KV Cache-Centric Analysis of Long-Context Methods. Unknown from snippet, likely 2025. https://scholar.google.com/scholar?q=SCBench:+A+KV+Cache-Centric+Analysis+of+Long-Context+Methods
12. Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks. Unknown from snippet, likely 2025. https://scholar.google.com/scholar?q=Model+Tells+You+Where+to+Merge:+Adaptive+KV+Cache+Merging+for+LLMs+on+Long-Context+Tasks
13. MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference. Unknown from snippet, likely 2025. https://scholar.google.com/scholar?q=MEDA:+Dynamic+KV+Cache+Allocation+for+Efficient+Multimodal+Long-Context+Inference
14. Efficient LLM Inference Using Dynamic Input Pruning and Cache-Aware Masking. Unknown from snippet, likely 2025. https://scholar.google.com/scholar?q=Efficient+LLM+Inference+Using+Dynamic+Input+Pruning+and+Cache-Aware+Masking
15. Enhancing Transformer Performance and Portability Through Auto-Tuning Frameworks. P. Siwinska et al. (approx.), unknown, likely recent. https://scholar.google.com/scholar?q=Enhancing+Transformer+Performance+and+Portability+Through+Auto-Tuning+Frameworks
16. AI Post Transformers: Splitwise: Phase-Split LLM Inference. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
17. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3
18. AI Post Transformers: PackKV Lossy Compression for KV Caches. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-04-packkv-lossy-compression-for-kv-caches-b37bce.mp3
19. AI Post Transformers: SGLang for Faster Structured LLM Programs. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-06-sglang-for-faster-structured-llm-program-c59f1c.mp3
20. AI Post Transformers: Speculative Decoding in Real vLLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
21. AI Post Transformers: LAPS for Length-Aware LLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3
22. AI Post Transformers: CacheFlow and 3D-Parallel KV Cache Restoration. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-01-cacheflow-and-3d-parallel-kv-cache-resto-8db883.mp3
23. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3

Interactive Visualization: Deep Kernel Fusion for Transformer Decoding
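
For listeners who want a concrete picture of the fusion described in the episode summary, here is a minimal pure-Python sketch. It is not the paper's kernel: a real implementation would be a GPU kernel working on tiles of the weight matrices. The function names (`swiglu_unfused`, `swiglu_fused`, `down_projection`) and the weight layout (one row of weights per hidden unit) are assumptions made for illustration only. The unfused version writes out full intermediate vectors for the gate projection, the up-projection, and the activation, which is the kind of extra memory traffic fusion is meant to avoid. The fused version computes both dot products, the SiLU, and the multiply for each hidden unit in one pass, and the down projection stays a separate step.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_unfused(x, w_gate, w_up):
    """Baseline: four separate passes, each producing a full intermediate vector."""
    gate = [sum(w * xi for w, xi in zip(row, x)) for row in w_gate]  # gate projection
    up = [sum(w * xi for w, xi in zip(row, x)) for row in w_up]      # up-projection
    act = [silu(g) for g in gate]                                    # SiLU activation
    return [a * u for a, u in zip(act, up)]                          # elementwise multiply


def swiglu_fused(x, w_gate, w_up):
    """Fused sketch: one pass per hidden unit, no intermediate vectors stored."""
    hidden = []
    for gate_row, up_row in zip(w_gate, w_up):
        g = 0.0
        u = 0.0
        # Read the input once and accumulate both projections together.
        for wg, wu, xi in zip(gate_row, up_row, x):
            g += wg * xi
            u += wu * xi
        # Activation and multiply happen while g and u are still local values.
        hidden.append(silu(g) * u)
    return hidden


def down_projection(hidden, w_down):
    """Kept as a separate step, mirroring the design choice described above."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_down]
```

Both versions return the same hidden vector for a single decoded token; the difference is only in how many intermediate results are written out and read back between steps.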
Information
- Podcast
- Frequency: Daily
- Released: May 15, 2026 at 00:00 UTC
- Rating: Suitable for all audiences
