AI Post Transformers

mcgrof

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

  1. 1 day ago

    SuperInfer: SLO-Aware LLM Inference on Superchips

    This episode explores SuperInfer, a system for serving large language models on GH200-style superchips by treating memory management as the key lever for meeting latency targets rather than just maximizing compute use. It explains why KV cache growth, HBM pressure, and head-of-line blocking often hurt responsiveness first, then breaks down how the paper’s RotaSched policy proactively rotates request state out of fast memory to protect time-to-first-token deadlines. It also covers DuplexKV, the transfer mechanism that makes this practical by batching fragmented KV data, using bidirectional movement across NVLink-C2C, and overlapping transfers with model execution instead of stalling the whole system. Listeners would find it interesting because the discussion ties concrete serving pain points to a specific systems design that reportedly boosts TTFT SLO attainment by up to 74.7 percent while keeping throughput and token pacing roughly stable. Sources: 1. SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips — Jiahuan Yu, Mingtao Hu, Zichao Lin, Minjia Zhang, 2026 http://arxiv.org/abs/2601.20309 2. Pie: Pooling CPU Memory for LLM Inference — Y. Xu, Z. Mao, X. Mo, S. Liu, I. Stoica, 2024 https://scholar.google.com/scholar?q=Pie:+Pooling+CPU+Memory+for+LLM+Inference 3. Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip — L. Fusco, M. Khalilov, M. Chrapek, G. Chukkapalli, T. Schulthess, T. Hoefler, 2024 https://scholar.google.com/scholar?q=Understanding+Data+Movement+in+Tightly+Coupled+Heterogeneous+Systems:+A+Case+Study+with+the+Grace+Hopper+Superchip 4. Memory Offloading for Large Language Model Inference with Latency SLO Guarantees — C. Ma, Z. Ye, H. Zhao, Z. Yang, T. Fu, J. Han, J. Zhang, Y. Luo, X. Wang, Z. Wang, et al., 2025 https://scholar.google.com/scholar?q=Memory+Offloading+for+Large+Language+Model+Inference+with+Latency+SLO+Guarantees 5. 
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve — A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. Gulavani, A. Tumanov, R. Ramjee, 2024 https://scholar.google.com/scholar?q=Taming+Throughput-Latency+Tradeoff+in+LLM+Inference+with+Sarathi-Serve 6. Mooncake: Trading More Storage for Less Computation - a KVCache-centric Architecture for Serving LLM Chatbot — R. Qin, Z. Li, W. He, J. Cui, F. Ren, M. Zhang, Y. Wu, W. Zheng, X. Xu, 2025 https://scholar.google.com/scholar?q=Mooncake:+Trading+More+Storage+for+Less+Computation+-+a+KVCache-centric+Architecture+for+Serving+LLM+Chatbot 7. TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving — Bingyang Wu et al., 2025 https://scholar.google.com/scholar?q=TokenLake:+A+Unified+Segment-level+Prefix+Cache+Pool+for+Fine-grained+Elastic+Long-Context+LLM+Serving 8. KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows — Zaifeng Pan et al., 2025 https://scholar.google.com/scholar?q=KVFlow:+Efficient+Prefix+Caching+for+Accelerating+LLM-Based+Multi-Agent+Workflows 9. SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving — Jinda Jia et al., 2026 https://scholar.google.com/scholar?q=SAW-INT4:+System-Aware+4-Bit+KV-Cache+Quantization+for+Real-World+LLM+Serving 10. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization — Coleman Hooper et al., 2024 https://scholar.google.com/scholar?q=KVQuant:+Towards+10+Million+Context+Length+LLM+Inference+with+KV+Cache+Quantization 11. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong et al., 2024 https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving 12. Prefill-Decode Aggregation or Disaggregation? 
Unifying Both for Goodput-Optimized LLM Serving — Chao Wang et al., 2025 https://scholar.google.com/scholar?q=Prefill-Decode+Aggregation+or+Disaggregation?+Unifying+Both+for+Goodput-Optimized+LLM+Serving 13. Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference — Hao Zhang et al., 2025 https://scholar.google.com/scholar?q=Enhancing+LLM+Efficiency:+Targeted+Pruning+for+Prefill-Decode+Disaggregation+in+Inference 14. AI Post Transformers: Mooncake for KV Cache-Centric LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-05-mooncake-for-kv-cache-centric-llm-servin-1086d0.mp3 15. AI Post Transformers: LAPS for Length-Aware LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3 16. AI Post Transformers: AI+HW 2035: Co-Designing Efficient AI Systems — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-24-aihw-2035-co-designing-efficient-ai-syst-95c11e.mp3 17. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3 18. AI Post Transformers: KVSwap for Disk-Aware Long-Context On-Device Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-16-kvswap-for-disk-aware-long-context-on-de-f3c15e.mp3 19. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3
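
To make the KV cache growth and HBM pressure mentioned in the SuperInfer description concrete, here is a back-of-the-envelope sketch of the standard per-request KV footprint calculation. The model shape used below (32 layers, 8 KV heads, head dimension 128, FP16) is an illustrative assumption, not a figure from the paper.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache held by one request.

    The leading 2 accounts for storing both keys and values in every layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def max_resident_requests(hbm_budget_bytes, seq_len, **model_shape):
    """How many requests of a given length fit in a fixed HBM budget."""
    return hbm_budget_bytes // kv_cache_bytes(seq_len, **model_shape)


# Hypothetical model shape, assumed purely for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **shape)
request_8k = kv_cache_bytes(8192, **shape)
print(per_token, "bytes per token")
print(request_8k / 2**30, "GiB for one 8192-token request")
print(max_resident_requests(40 * 2**30, 8192, **shape), "requests fit in 40 GiB")
```

Because the footprint grows linearly with sequence length, a handful of long requests can fill fast memory, which is the pressure that makes moving request state out of HBM attractive.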

  2. 1 day ago

    Temporal-Tiered KV Cache for Long Context

    This episode explores TTKV, a temporal-tiered key-value cache design for long-context LLM inference, where decode speed degrades because growing KV state turns generation into a memory-bandwidth problem rather than a compute problem. It explains how the method keeps recent cache blocks in fast GPU HBM, evicts older blocks to slower host DRAM, and uses asymmetric quantization in the slow tier, preserving keys at higher precision while compressing values more aggressively. The discussion also breaks down the runtime mechanics behind block-wise streaming attention, including query-conditioned block ranking, top-k prefetching, decompression, and overlapping data transfer with attention computation. What makes the episode interesting is that it treats TTKV less as a new model idea and more as a systems design proposal, while critically questioning whether recency is a reliable proxy for importance and whether the paper fully specifies the cost of its block-selection function. Sources: 1. TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference — Gradwell Dzikanyanga, Weihao Yang, Hao Huang, Donglei Wu, Shihao Wang, Wen Xia, Sanjeeb K C, 2026 http://arxiv.org/abs/2604.19769 2. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 3. Efficient Streaming Language Models with Attention Sinks — Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, 2023 https://scholar.google.com/scholar?q=Efficient+Streaming+Language+Models+with+Attention+Sinks 4. 
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache — Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu, 2024 https://scholar.google.com/scholar?q=KIVI:+A+Tuning-Free+Asymmetric+2bit+Quantization+for+KV+Cache 5. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference — Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen, 2024 https://scholar.google.com/scholar?q=ShadowKV:+KV+Cache+in+Shadows+for+High-Throughput+Long-Context+LLM+Inference 6. FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference — Guangda Liu et al., 2025 https://scholar.google.com/scholar?q=FreeKV:+Boosting+KV+Cache+Retrieval+for+Efficient+LLM+Inference 7. FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference — Dongwei Wang et al., 2025 https://scholar.google.com/scholar?q=FIER:+Fine-Grained+and+Efficient+KV+Cache+Retrieval+for+Long-context+LLM+Inference 8. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving — Yuhan Liu et al., 2024 https://scholar.google.com/scholar?q=CacheGen:+KV+Cache+Compression+and+Streaming+for+Fast+Large+Language+Model+Serving 9. Retrieval Head Mechanistically Explains Long-Context Factuality — Wenhao Wu et al., 2024 https://arxiv.org/abs/2404.15574 10. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads — Guangxuan Xiao et al., 2024 https://arxiv.org/abs/2410.10819 11. Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking — Wuwei Zhang et al., 2025 https://arxiv.org/abs/2506.09944 12. LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction — Enshuai Zhou et al., 2026 https://arxiv.org/abs/2605.06676 13. IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference — Xintong Yang et al., 2026 https://arxiv.org/abs/2605.25475 14. 
KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference — Xing Li et al., 2025 https://arxiv.org/abs/2502.04420 15. AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations — Qian Tao et al., 2024 https://arxiv.org/abs/2410.13212 16. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention — Ramya Prabhu et al., 2024 https://arxiv.org/abs/2405.04437 17. AI Post Transformers: ScoutAttention for Efficient KV Cache Offloading — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-24-scoutattention-for-efficient-kv-cache-of-b26699.mp3 18. AI Post Transformers: MiniMax Sparse Attention at Million-Token Scale — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-13-minimax-sparse-attention-at-million-toke-300108.mp3 19. AI Post Transformers: IndexMem: Learned KV-Cache Eviction for Long-Context LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-12-indexmem-learned-kv-cache-eviction-for-l-132c2a.mp3 20. AI Post Transformers: Mooncake for KV Cache-Centric LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-05-mooncake-for-kv-cache-centric-llm-servin-1086d0.mp3 21. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3 22. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3
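
The tiering idea in the TTKV description (recent blocks stay in a fast tier, older blocks move to a slow tier where keys keep more precision than values) can be sketched in a few lines. This is a toy under stated assumptions, not the paper's implementation: the class name `TieredKVCache`, the 8-bit and 4-bit widths, and the min-max quantizer are all hypothetical choices, and plain Python lists stand in for GPU and host buffers.

```python
from collections import deque


def quantize(values, bits):
    """Uniform min-max quantization; returns (codes, lo, scale)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((v - lo) / scale) for v in values], lo, scale


def dequantize(codes, lo, scale):
    return [lo + c * scale for c in codes]


class TieredKVCache:
    """Keeps the newest blocks at full precision, quantizes evicted ones."""

    def __init__(self, fast_blocks, key_bits=8, value_bits=4):
        self.fast = deque()  # (block_id, keys, values), newest last
        self.slow = {}       # block_id -> (quantized keys, quantized values)
        self.fast_blocks = fast_blocks
        self.key_bits, self.value_bits = key_bits, value_bits

    def append(self, block_id, keys, values):
        self.fast.append((block_id, keys, values))
        while len(self.fast) > self.fast_blocks:
            old_id, k, v = self.fast.popleft()  # oldest block leaves the fast tier
            # Asymmetric precision: keys get more bits than values (assumed widths).
            self.slow[old_id] = (quantize(k, self.key_bits),
                                 quantize(v, self.value_bits))

    def read(self, block_id):
        for bid, k, v in self.fast:
            if bid == block_id:
                return k, v
        qk, qv = self.slow[block_id]
        return dequantize(*qk), dequantize(*qv)


cache = TieredKVCache(fast_blocks=2)
for block_id in range(3):
    cache.append(block_id, keys=[0.0, 0.5, 1.0], values=[0.0, 0.5, 1.0])
keys, values = cache.read(0)  # block 0 was evicted and comes back dequantized
```

The sketch evicts purely by recency, which is exactly the assumption the episode questions: a block that is old but still important would be stored at reduced precision here.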

  3. 2 days ago

    DSpark Improves Speculative Decoding Acceptance Rates

    This episode explores DSpark, a DeepSeek-AI paper on improving speculative decoding by starting from a DFlash-style block-parallel draft model and increasing how often a larger verifier accepts its proposed tokens. It explains the mechanics of speculative decoding in plain language, situates DSpark within earlier blockwise and multi-token prediction work, and notes that the technique is already used in serving stacks such as vLLM, TensorRT-LLM, and SGLang. The discussion focuses on DSpark’s concrete additions: a Markov head that feeds previous-token information into draft logits, a confidence head that estimates whether drafted tokens will survive verification, and a training recipe centered on knowledge distillation. It is interesting because it treats inference speed as an operational systems problem, arguing that higher acceptance matters but only alongside draft latency, verifier cost, batching, and scheduler behavior. Sources: 1. DSpark Improves Speculative Decoding Acceptance Rates https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf 2. DFlash: Block Diffusion for Flash Speculative Decoding — Jian Chen, Yesheng Liang, Zhijian Liu, 2026 https://scholar.google.com/scholar?q=DFlash:+Block+Diffusion+for+Flash+Speculative+Decoding 3. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2025 https://scholar.google.com/scholar?q=EAGLE-3:+Scaling+up+Inference+Acceleration+of+Large+Language+Models+via+Training-Time+Test 4. Decoding Speculative Decoding — Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman, 2024 https://scholar.google.com/scholar?q=Decoding+Speculative+Decoding 5. Speculative Decoding with a Speculative Vocabulary — Miles Williams, Young D. Kwon, Rui Li, Alexandros Kouris, Stylianos I. Venieris, 2026 https://scholar.google.com/scholar?q=Speculative+Decoding+with+a+Speculative+Vocabulary 6. 
DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding — Jiebin Zhang, Zhenghan Yu, Song Liu, Eugene J. Yu, Zheng Li, Dawei Zhu, Jiangshan Duo, Weimin Xiong, Yifan Song, Guanghua Yu, Jianchen Zhu, Sujian Li, 2026 https://scholar.google.com/scholar?q=DFlare:+Scaling+Up+Draft+Capacity+for+Block+Diffusion+Speculative+Decoding 7. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3 8. AI Post Transformers: Adaptive Control for Batched Speculative Decoding in LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/adaptive-control-for-batched-speculative-decoding-in-llm-serving/ 9. AI Post Transformers: JETSPEC and Parallel Tree Speculative Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-27-jetspec-and-parallel-tree-speculative-de-3d144c.mp3
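
For readers who want the mechanics of speculative decoding in code, the following is a minimal sketch of one greedy draft-and-verify step. It shows the generic technique only, not DSpark's block-parallel drafter, Markov head, or confidence head. The `draft_next` and `target_next` callables are hypothetical stand-ins for a small draft model and a large verifier, and the toy models at the bottom are invented for illustration.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One greedy speculative-decoding step; returns the tokens to append.

    draft_next and target_next each map a token list to the next token.
    """
    # The cheap draft model proposes k tokens.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # The verifier checks each proposed position. A real system does this in
    # one batched forward pass; the loop here is only for readability.
    accepted = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the verifier's token
            return accepted
        accepted.append(tok)

    # Every draft token survived, so the verifier contributes one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy models (assumed): the target counts upward mod 10, and the draft agrees
# except right after token 3, where it wrongly proposes 0.
def target(seq):
    return (seq[-1] + 1) % 10


def draft(seq):
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


print(speculative_step([1], draft, target, k=4))
print(speculative_step([1], target, target, k=4))
```

The output always matches what the verifier alone would have produced greedily; a higher acceptance rate only changes how many tokens each verifier pass yields, which is why draft latency and verifier cost still matter.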

  4. 2 days ago

    LLMServingSim 2.0 for Disaggregated LLM Serving

    This episode explores LLMServingSim 2.0, a simulator designed to model how large language models behave when they are served on mixed hardware fleets with separated compute, memory, and networking resources rather than a uniform GPU cluster. It explains the practical serving concepts that shape user experience, including prefill versus decode, time to first token, time per output token, prefix caching, KV-cache movement, and why latency problems emerge from interactions among batching, routing, placement, and interconnect contention rather than a single bottleneck. The discussion highlights the paper’s core idea of a Model Serving Group, which combines queueing, scheduling, operation mapping, memory modeling, and power modeling into one runtime-style unit driven by measured hardware profiles instead of purely theoretical kernel estimates. Listeners would find it interesting because it shows how modern AI performance depends not just on better models, but on the messy systems engineering tradeoffs that determine speed, efficiency, and scalability in real deployments. Sources: 1. LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure — Jaehong Cho, Hyunmin Choi, Guseul Heo, Jongse Park, 2026 http://arxiv.org/abs/2602.23036 2. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong et al., 2024 https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving 3. P/D-Serve: Serving Disaggregated Large Language Model at Scale — Yibo Jin et al., 2024 https://scholar.google.com/scholar?q=P/D-Serve:+Serving+Disaggregated+Large+Language+Model+at+Scale 4. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference — Yuhan Liu, Yihua Cheng, Jiayi Yao, et al., 2025 https://scholar.google.com/scholar?q=LMCache:+An+Efficient+KV+Cache+Layer+for+Enterprise-Scale+LLM+Inference 5. 
Mooncake: Trading More Storage for Less Computation - A KVCache-centric Architecture for Serving LLM Chatbot — Ruoyu Qin et al., 2025 https://scholar.google.com/scholar?q=Mooncake:+Trading+More+Storage+for+Less+Computation+-+A+KVCache-centric+Architecture+for+Serving+LLM+Chatbot 6. NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing — Guseul Heo et al., 2024 https://scholar.google.com/scholar?q=NeuPIMs:+NPU-PIM+Heterogeneous+Acceleration+for+Batched+LLM+Inferencing 7. Frontier: Towards Comprehensive and Accurate LLM Inference Simulation — Yicheng Feng et al., 2026 https://scholar.google.com/scholar?q=Frontier:+Towards+Comprehensive+and+Accurate+LLM+Inference+Simulation 8. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang et al., 2025 https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse 9. Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference — Kexin Chu et al., 2025 https://scholar.google.com/scholar?q=Selective+KV-Cache+Sharing+to+Mitigate+Timing+Side-Channels+in+LLM+Inference 10. Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving — Chao Wang et al., 2025 https://scholar.google.com/scholar?q=Prefill-Decode+Aggregation+or+Disaggregation?+Unifying+Both+for+Goodput-Optimized+LLM+Serving 11. Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference — Hao Zhang et al., 2025 https://scholar.google.com/scholar?q=Enhancing+LLM+Efficiency:+Targeted+Pruning+for+Prefill-Decode+Disaggregation+in+Inference 12. AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding — Zikun Li et al., 2025 https://scholar.google.com/scholar?q=AdaServe:+SLO-Customized+LLM+Serving+with+Fine-Grained+Speculative+Decoding 13. 
SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding — Ziyi Zhang et al., 2025 https://scholar.google.com/scholar?q=SwiftSpec:+Ultra-Low+Latency+LLM+Decoding+by+Scaling+Asynchronous+Speculative+Decoding 14. Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving — Rui Li et al., 2025 https://scholar.google.com/scholar?q=Nightjar:+Dynamic+Adaptive+Speculative+Decoding+for+Large+Language+Models+Serving 15. Hetis: Serving LLMs in Heterogeneous GPU Clusters with Fine-grained and Dynamic Parallelism — Zizhao Mo et al., 2025 https://scholar.google.com/scholar?q=Hetis:+Serving+LLMs+in+Heterogeneous+GPU+Clusters+with+Fine-grained+and+Dynamic+Parallelism 16. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3 17. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3 18. AI Post Transformers: Vistara Brings CXL Memory to Hyperscale — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-11-vistara-brings-cxl-memory-to-hyperscale-b5199e.mp3 19. AI Post Transformers: Characterizing LLM KV Cache Workloads in Production — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/characterizing-llm-kv-cache-workloads-in-production/ 20. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3 21. AI Post Transformers: LPU Chip for Low-Latency LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-20-lpu-chip-for-low-latency-llm-inference-be13c3.mp3 22. 
AI Post Transformers: TokenDance for Multi-Agent KV Cache Sharing — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-22-tokendance-for-multi-agent-kv-cache-shar-aa9b99.mp3
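
The latency metrics named in the LLMServingSim 2.0 description can be computed directly from per-token timestamps. The sketch below uses the common definitions of time to first token (TTFT) and time per output token (TPOT); the function names, the millisecond units, and the sample trace are assumptions for illustration and are not taken from the simulator.

```python
def ttft_ms(arrival_ms, token_times_ms):
    """Time to first token: delay from request arrival to the first output token."""
    return token_times_ms[0] - arrival_ms


def tpot_ms(token_times_ms):
    """Time per output token: mean gap between consecutive tokens after the first."""
    if len(token_times_ms) < 2:
        return 0.0
    return (token_times_ms[-1] - token_times_ms[0]) / (len(token_times_ms) - 1)


def slo_attainment(ttfts_ms, target_ms):
    """Fraction of requests whose TTFT met the target."""
    return sum(t <= target_ms for t in ttfts_ms) / len(ttfts_ms)


# Hypothetical trace: request arrives at t=0, tokens emitted at these times.
trace = [500, 600, 700, 800]
print(ttft_ms(0, trace))                            # prefill-dominated delay
print(tpot_ms(trace))                               # decode pacing
print(slo_attainment([500, 900, 300, 1200], 1000))  # share of requests under 1 s
```

TTFT is mostly shaped by queueing and prefill, while TPOT reflects decode pacing, so a scheduling or placement change can improve one while hurting the other.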

  5. 2 days ago

    Moebius: Seamless Parallelism Switching for MoE Serving

    This episode explores Moebius, a serving system for mixture-of-experts transformers that can switch at runtime between tensor parallelism and expert parallelism without restarting or draining live requests. It explains why tensor parallelism tends to give lower latency at low concurrency, while expert parallelism delivers better throughput at high concurrency, making bursty online traffic and RL rollouts natural settings where the best strategy changes over time. The discussion focuses on the hard systems problems behind that switch, including migrating in-flight requests, preserving paged KV caches, coping with CUDA graph address constraints, and handling KV-head mismatches that can waste cache capacity under tensor parallelism. It argues that the paper’s key contribution is treating the switch as a change in ownership and memory layout over one resident model and KV state, offering a concrete blueprint for serving large sparse models more efficiently. Sources: 1. Moebius: Serving Mixture-of-Expert Models with Seamless Runtime Parallelism Switch — Shaoyu Wang, Yizhuo Liang, Jaeyong Song, Chong Li, Seo Jin Park, 2026 http://arxiv.org/abs/2606.26607 2. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding — Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, et al., 2020 https://scholar.google.com/scholar?q=GShard:+Scaling+Giant+Models+with+Conditional+Computation+and+Automatic+Sharding 3. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — William Fedus, Barret Zoph, Noam Shazeer, 2021 https://scholar.google.com/scholar?q=Switch+Transformers:+Scaling+to+Trillion+Parameter+Models+with+Simple+and+Efficient+Sparsity 4. 
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale — Samyam Rajbhandari, Conglong Li, Zhewei Yao, et al., 2022 https://scholar.google.com/scholar?q=DeepSpeed-MoE:+Advancing+Mixture-of-Experts+Inference+and+Training+to+Power+Next-Generation+AI+Scale 5. MegaBlocks: Efficient Sparse Training with Mixture-of-Experts — Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia, 2022 https://scholar.google.com/scholar?q=MegaBlocks:+Efficient+Sparse+Training+with+Mixture-of-Experts 6. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi, Mostofa Patwary, Raul Puri, et al., 2019 https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism 7. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM — Deepak Narayanan, Mohammad Shoeybi, Jared Casper, et al., 2021 https://scholar.google.com/scholar?q=Efficient+Large-Scale+Language+Model+Training+on+GPU+Clusters+Using+Megatron-LM 8. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al., 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 9. Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism — Vikranth Srivatsa, Zijian He, Pu Guo, et al., 2026 https://scholar.google.com/scholar?q=Nitsum:+Serving+Tiered+LLM+Requests+with+Adaptive+Tensor+Parallelism 10. HAP: Hybrid Adaptive Parallelism for Efficient Mixture-of-Experts Inference — Haoran Lin et al., 2025 https://scholar.google.com/scholar?q=HAP:+Hybrid+Adaptive+Parallelism+for+Efficient+Mixture-of-Experts+Inference 11. Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services — Haoyu Chen et al., 2026 https://scholar.google.com/scholar?q=Amoeba:+Runtime+Tensor+Parallel+Transformation+for+LLM+Inference+Services 12. 
UCCL-EP: Portable Expert-Parallel Communication — Ziming Mao et al., 2026 https://scholar.google.com/scholar?q=UCCL-EP:+Portable+Expert-Parallel+Communication 13. RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training — Wei Gao et al., 2026 https://scholar.google.com/scholar?q=RollPacker:+Mitigating+Long-Tail+Rollouts+for+Fast,+Synchronous+RL+Post-Training 14. HD-MoE: Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing — Haochen Huang et al., 2025 https://arxiv.org/abs/2509.09420 15. fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving — Hanfei Yu et al., 2025 https://arxiv.org/abs/2502.05370 16. HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference — Peng Tang et al., 2024 https://arxiv.org/abs/2411.01433 17. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving — Yuhan Liu et al., 2023 https://arxiv.org/abs/2310.07240 18. KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction — Jang-Hyun Kim et al., 2025 https://arxiv.org/abs/2505.23416 19. AI Post Transformers: JANUS for Scalable MoE Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-janus-for-scalable-moe-inference-78ae30.mp3 20. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3 21. AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3 22. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3 23. 
AI Post Transformers: Mooncake for KV Cache-Centric LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-05-mooncake-for-kv-cache-centric-llm-servin-1086d0.mp3 24. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3
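
One way to see how a KV-head mismatch can waste cache capacity under tensor parallelism: when a model has fewer KV heads than tensor-parallel ranks, a common approach is to replicate each KV head across several ranks, so the same cache entries are stored more than once. The sketch below assumes that replication scheme; whether Moebius handles the mismatch this way is not stated in the description above, and the function names are hypothetical.

```python
def kv_replication_factor(num_kv_heads, tp_degree):
    """Number of TP ranks holding a copy of each KV head (1 means no duplication)."""
    if tp_degree <= num_kv_heads:
        if num_kv_heads % tp_degree:
            raise ValueError("KV heads must divide evenly across ranks")
        return 1
    if tp_degree % num_kv_heads:
        raise ValueError("TP degree must be a multiple of the KV head count")
    return tp_degree // num_kv_heads


def usable_cache_fraction(num_kv_heads, tp_degree):
    """Share of aggregate KV memory that holds unique (non-duplicated) state."""
    return 1 / kv_replication_factor(num_kv_heads, tp_degree)


# Assumed example shapes: 8 KV heads fit TP=8 exactly, 4 KV heads do not.
print(usable_cache_fraction(num_kv_heads=8, tp_degree=8))
print(usable_cache_fraction(num_kv_heads=4, tp_degree=8))
```

In the second case half of the aggregate KV memory holds duplicates, which shows why memory layout and ownership matter when a system changes parallelism strategy.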

  6. 3 days ago

    Information-Aware KV Cache Compression for Long Reasoning

    This episode explores Information-Aware KV Cache Compression for Long Reasoning, a paper about making long-context inference cheaper and more reliable by deciding which KV-cache tokens to keep during extended reasoning. It explains why long prefilling and long decoding turn the cache into a major memory bottleneck, and why common heuristics such as sliding windows or recent-attention-based retention can discard tokens that only become important much later. The discussion centers on the paper’s claim that future usefulness is better captured by information-theoretic signals like predictive entropy and Forward Influence, with experiments showing that attention-ranked tokens help short-horizon predictions while entropy-ranked tokens matter more over long horizons. Listeners get a concrete account of how InfoKV blends recent attention with per-layer entropy-based scoring to improve the tradeoff between memory savings and long-range reasoning quality. Sources: 1. Information-Aware KV Cache Compression for Long Reasoning — Jushi Kai, Zhuiri Xiao, Alexandra Birch, Zhouhan Lin, 2026 http://arxiv.org/abs/2606.26875 2. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time — Zichang Liu, Aditya Desai, Fangshuo Liao, Anshumali Shrivastava, et al., 2023 https://scholar.google.com/scholar?q=Scissorhands:+Exploiting+the+Persistence+of+Importance+Hypothesis+for+LLM+KV+Cache+Compression+at+Test+Time 3. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Beidi Chen, Christopher Re, et al., 2023 https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models 4. 
SnapKV: LLM Knows What You are Looking for Before Generation — Yuhong Li, Yingbing Huang, Bowen Yang, Patrick Lewis, Deming Chen, et al., 2024 https://scholar.google.com/scholar?q=SnapKV:+LLM+Knows+What+You+are+Looking+for+Before+Generation 5. Information-Aware KV Cache Compression for Long Reasoning — Jushi Kai, Zhuiri Xiao, Alexandra Birch, Zhouhan Lin, 2026 https://scholar.google.com/scholar?q=Information-Aware+KV+Cache+Compression+for+Long+Reasoning 6. Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution — Alessio Devoto, Maximilian Jeblick, Simon Jegou, 2025 https://scholar.google.com/scholar?q=Expected+Attention:+KV+Cache+Compression+by+Estimating+Attention+from+Future+Queries+Distribution 7. Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning — Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim, 2025 https://scholar.google.com/scholar?q=Reasoning+Path+Compression:+Compressing+Generation+Trajectories+for+Efficient+LLM+Reasoning 8. Compressing Context to Enhance Inference Efficiency of Large Language Models — Yucheng Li, Bo Dong, Chenghua Lin, Frank Guerin, 2023 https://scholar.google.com/scholar?q=Compressing+Context+to+Enhance+Inference+Efficiency+of+Large+Language+Models 9. FreqKV: Key-Value Compression in Frequency Domain for Context Window Extension — Jushi Kai et al., 2026 https://scholar.google.com/scholar?q=FreqKV:+Key-Value+Compression+in+Frequency+Domain+for+Context+Window+Extension 10. LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion — Zhan Ling et al., 2025 https://scholar.google.com/scholar?q=LongReason:+A+Synthetic+Long-Context+Reasoning+Benchmark+via+Context+Expansion 11. 
Attention Reveals More Than Tokens: Training-Free Long-Context Reasoning with Attention-guided Retrieval — Yuwei Zhang et al., 2025 https://scholar.google.com/scholar?q=Attention+Reveals+More+Than+Tokens:+Training-Free+Long-Context+Reasoning+with+Attention-guided+Retrieval 12. Uncertainty Quantification for Hallucination Detection in Large Language Models: Foundations, Methodology, and Future Directions — Sungmin Kang et al., 2025 https://scholar.google.com/scholar?q=Uncertainty+Quantification+for+Hallucination+Detection+in+Large+Language+Models:+Foundations,+Methodology,+and+Future+Directions 13. Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations — Christian Tomani et al., 2024 https://scholar.google.com/scholar?q=Uncertainty-Based+Abstention+in+LLMs+Improves+Safety+and+Reduces+Hallucinations 14. KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction — Jang-Hyun Kim et al., 2025 https://scholar.google.com/scholar?q=KVzip:+Query-Agnostic+KV+Cache+Compression+with+Context+Reconstruction 15. Can LLMs Maintain Fundamental Abilities under KV Cache Compression? — Xiang Liu et al., 2025 https://scholar.google.com/scholar?q=Can+LLMs+Maintain+Fundamental+Abilities+under+KV+Cache+Compression? 16. KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches — Jiayi Yuan et al., 2024 https://scholar.google.com/scholar?q=KV+Cache+Compression,+But+What+Must+We+Give+in+Return?+A+Comprehensive+Benchmark+of+Long+Context+Capable+Approaches 17. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang et al., 2025 https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse 18. AI Post Transformers: IndexMem: Learned KV-Cache Eviction for Long-Context LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-12-indexmem-learned-kv-cache-eviction-for-l-132c2a.mp3 19. 
AI Post Transformers: When Quantization Hurts Reasoning Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-17-when-quantization-hurts-reasoning-models-eca9e7.mp3 20. AI Post Transformers: Hyper-Scaling LLM Inference with KV Cache Compression — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/hyper-scaling-llm-inference-with-kv-cache-compression/ 21. AI Post Transformers: Lattice: Fixed-Slot Compression for Transformer Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-11-lattice-fixed-slot-compression-for-trans-5509ea.mp3 22. AI Post Transformers: Adaptive Compression Techniques for Efficient LLM Inference — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/adaptive-compression-techniques-for-efficient-llm-inference/ 23. AI Post Transformers: Explicit Information Transmission for Context Compression — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-explicit-information-transmission-for-co-24e3c2.mp3 24. AI Post Transformers: When LoRA Helps Under KV Cache Compression — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-12-when-lora-helps-under-kv-cache-compressi-76dda6.mp3

  7. 3 days ago

    JETSPEC and Parallel Tree Speculative Decoding

    This episode explores JetSpec, a 2026 inference paper on speculative decoding that asks whether a language model can draft an entire tree of future tokens in parallel while preserving causal consistency and still reducing latency on long generations. It explains why autoregressive decoding remains a serving bottleneck for long proofs, code completions, and assistant replies, even when the underlying transformer model itself is unchanged. The discussion compares JetSpec’s approach with Medusa, EAGLE-3, and DFlash, focusing on the central tradeoff between stronger path-conditioned drafts that are slow to produce and cheaper parallel drafts that risk internally inconsistent branches. Listeners may find it interesting because it turns a practical systems problem (why powerful GPUs still feel slow at inference time) into a concrete debate about the next generation of real-world decoding optimizations. Sources: 1. JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting — Lanxiang Hu, Zhaoxiang Feng, Yulun Wu, Haoran Yuan, Yujie Zhao, Yu-Yang Qian, Bojun Wang, Peng Zhao, Daxin Jiang, Yibo Zhu, Tajana Rosing, Hao Zhang, 2026 http://arxiv.org/abs/2606.18394 2. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023 https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding 3. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao, 2024 https://scholar.google.com/scholar?q=Medusa:+Simple+LLM+Inference+Acceleration+Framework+with+Multiple+Decoding+Heads 4. EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024 https://scholar.google.com/scholar?q=EAGLE-2:+Faster+Inference+of+Language+Models+with+Dynamic+Draft+Trees 5. 
DFlash: Block Diffusion for Flash Speculative Decoding — Jian Chen, Yesheng Liang, Zhijian Liu, 2026 https://scholar.google.com/scholar?q=DFlash:+Block+Diffusion+for+Flash+Speculative+Decoding 6. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2025 https://scholar.google.com/scholar?q=EAGLE-3:+Scaling+up+Inference+Acceleration+of+Large+Language+Models+via+Training-Time+Test 7. SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification — Xupeng Miao et al., 2023 https://scholar.google.com/scholar?q=SpecInfer:+Accelerating+Generative+Large+Language+Model+Serving+with+Tree-based+Speculative+Inference+and+Verification 8. DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding — Jiebin Zhang et al., 2026 https://scholar.google.com/scholar?q=DFlare:+Scaling+Up+Draft+Capacity+for+Block+Diffusion+Speculative+Decoding 9. TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification — Haoyun Jiang et al., 2026 https://scholar.google.com/scholar?q=TriSpec:+Ternary+Speculative+Decoding+via+Lightweight+Proxy+Verification 10. ParallelSpec: Parallel Drafter for Efficient Speculative Decoding — Zilin Xiao et al., 2024 https://arxiv.org/abs/2410.05589 11. Mamba Drafters for Speculative Decoding — Daewon Choi et al., 2025 https://arxiv.org/abs/2506.01206 12. OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding — Ramchalam Kinattinkara Ramakrishnan et al., 2025 https://arxiv.org/abs/2507.02659 13. Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge — Bin Xiao et al., 2024 https://arxiv.org/abs/2405.00263 14. Make Every Draft Count: Hidden State based Speculative Decoding — Yuetao Chen et al., 2026 https://arxiv.org/abs/2602.21224 15. When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding? 
— Tianyu Liu et al., 2026 https://arxiv.org/abs/2604.26412 16. MoE-Spec: Expert Budgeting for Efficient Speculative Decoding — Bradley McDanel et al., 2026 https://arxiv.org/abs/2602.16052 17. Beat the long tail: Distribution-Aware Speculative Decoding for RL Training — Zelei Shao et al., 2025 https://arxiv.org/abs/2511.13841 18. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3 19. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3 20. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3

  8. 4 days ago

    DAK: Direct GPU Memory Offloading for LLMs

    This episode explores DAK, a Cornell systems paper arguing that LLM inference on tiered-memory machines can be faster when offloaded weights and KV-cache blocks are fetched directly into on-chip shared memory instead of being prefetched and staged through GPU HBM. It breaks down the tradeoffs among HBM capacity, HBM bandwidth, KV-cache growth during decoding, and prior approaches such as FlexGen, vLLM’s PagedAttention, and emerging KV offload systems like LMCache. The discussion focuses on DAK’s core technical idea: using Hopper’s Tensor Memory Accelerator inside custom GEMM and FlashAttention kernels so data movement and computation are co-designed, reducing bounce buffers, HBM contention, and pipeline bubbles while aggregating bandwidth from multiple memory tiers. Listeners may find it interesting because it turns a low-level memory-path decision into a concrete argument about when offloading is merely a fallback and when it becomes a real performance advantage for serving larger models, longer contexts, or bigger batches. Sources: 1. DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference — Shouxu Lin, Zhiyuan Guo, Jiaxin Lin, 2026 http://arxiv.org/abs/2604.26074 2. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning — Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He, 2021 https://scholar.google.com/scholar?q=ZeRO-Infinity:+Breaking+the+GPU+Memory+Wall+for+Extreme+Scale+Deep+Learning 3. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Ion Stoica, Percy Liang, Ce Zhang, and colleagues, 2023 https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU 4. 
Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 5. DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference — Shouxu Lin, Zhiyuan Guo, Jiaxin Lin, 2026 https://scholar.google.com/scholar?q=DAK:+Direct-Access-Enabled+GPU+Memory+Offloading+with+Optimal+Efficiency+for+LLM+Inference 6. PIE: Pooling CPU Memory for LLM Inference — Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, Ion Stoica, 2024 https://scholar.google.com/scholar?q=PIE:+Pooling+CPU+Memory+for+LLM+Inference 7. NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference — Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu, 2025 https://scholar.google.com/scholar?q=NEO:+Saving+GPU+Memory+Crisis+with+CPU+Offloading+for+Online+LLM+Inference 8. Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip — Luigi Fusco, Mikhail Khalilov, Marcin Chrapek, Giridhar Chukkapalli, Thomas Schulthess, Torsten Hoefler, 2024 https://scholar.google.com/scholar?q=Understanding+Data+Movement+in+Tightly+Coupled+Heterogeneous+Systems:+A+Case+Study+with+the+Grace+Hopper+Superchip 9. FengHuang: Next-Generation Memory Orchestration for AI Inferencing — Jiamin Li, Lei Qu, Tao Zhang, Grigory Chirkov, Shuotao Xu, Peng Cheng, Lidong Zhou, 2025 https://scholar.google.com/scholar?q=FengHuang:+Next-Generation+Memory+Orchestration+for+AI+Inferencing 10. Reducing Transformer Key-Value Cache Size with Cross-Layer Attention — William Brandon et al., 2024 https://scholar.google.com/scholar?q=Reducing+Transformer+Key-Value+Cache+Size+with+Cross-Layer+Attention 11. 
xKV: Cross-Layer SVD for KV-Cache Compression — Chi-Chih Chang et al., 2025 https://scholar.google.com/scholar?q=xKV:+Cross-Layer+SVD+for+KV-Cache+Compression 12. XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression — Haoqi Yang et al., 2025 https://scholar.google.com/scholar?q=XQuant:+Achieving+Ultra-Low+Bit+KV+Cache+Quantization+with+Cross-Layer+Compression 13. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management — Wonbeom Lee et al., 2024 https://scholar.google.com/scholar?q=InfiniGen:+Efficient+Generative+Inference+of+Large+Language+Models+with+Dynamic+KV+Cache+Management 14. Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching — Yanhao Dong et al., 2025 https://scholar.google.com/scholar?q=Accelerating+LLM+Inference+Throughput+via+Asynchronous+KV+Cache+Prefetching 15. KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference — Huan Yang et al., 2025 https://scholar.google.com/scholar?q=KVShare:+Semantic-Aware+Key-Value+Cache+Sharing+for+Efficient+Large+Language+Model+Inference 16. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Ji Lin et al., 2023 https://scholar.google.com/scholar?q=AWQ:+Activation-aware+Weight+Quantization+for+LLM+Compression+and+Acceleration 17. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression — Tim Dettmers et al., 2023 https://scholar.google.com/scholar?q=SpQR:+A+Sparse-Quantized+Representation+for+Near-Lossless+LLM+Weight+Compression 18. AI Post Transformers: Beluga: CXL Memory Pooling for LLM KV Cache — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-27-beluga-cxl-memory-pooling-for-llm-kv-cac-b6142f.mp3 19. AI Post Transformers: ScoutAttention for Efficient KV Cache Offloading — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-24-scoutattention-for-efficient-kv-cache-of-b26699.mp3 20. AI Post Transformers: InfiniGen for Efficient Long-Context LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-18-infinigen-for-efficient-long-context-llm-143d77.mp3 21. AI Post Transformers: Mooncake for KV Cache-Centric LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-05-mooncake-for-kv-cache-centric-llm-servin-1086d0.mp3 22. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3 23. AI Post Transformers: LAPS for Length-Aware LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3

Ratings & Reviews

3.7
/5
3 Ratings

