AI Post Transformers

mcgrof

3.7 (3 ratings)
Science & Technology
Updated daily

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

1 day ago

DART Speeds Up Speculative LLM Decoding

This episode explores the DART paper as a practical attempt to make speculative decoding deliver real end-to-end speedups for memory-bound LLM inference. It explains how exact draft-and-verify decoding works, why accepted chunk length only matters when the drafter is cheap enough, and how DART differs from Medusa and EAGLE by reusing target-model hidden states to predict several future tokens in parallel with a diffusion-inspired draft stage. The discussion focuses on DART’s mechanics, including multi-layer state reuse, masked future slots, N-gram-guided pruning, and a shifted-logit design that makes the first drafted token especially important because an early mistake invalidates the rest of the chunk. Listeners would find it interesting because it connects model architecture choices to real serving constraints like latency, batching, and GPU efficiency, showing where theoretical decoding gains do and do not survive in production. Sources: 1. DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference — Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, Chen Tian, 2026 http://arxiv.org/abs/2601.19278 2. Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding — Hemeng Xia, Zijian Wu, Chunxi Zhang, Yonggan Fu, Haoran Sun, Zhicong Liu, Ping Luo, 2024 https://arxiv.org/abs/2401.07851 3. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023 https://arxiv.org/abs/2211.17192 4. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024 https://arxiv.org/abs/2401.15077 5. Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion — Jacob K. Christopher, Brian R. Bartoldson, Tal Ben-Nun, Michael Cardei, Bhavya Kailkhura, Ferdinando Fioretto, 2024 https://arxiv.org/abs/2408.05636 6. 
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2025 https://scholar.google.com/scholar?q=EAGLE-3:+Scaling+up+Inference+Acceleration+of+Large+Language+Models+via+Training-Time+Test 7. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao, 2024 https://scholar.google.com/scholar?q=Medusa:+Simple+LLM+Inference+Acceleration+Framework+with+Multiple+Decoding+Heads 8. DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding — Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, Jun Wang, 2025 https://scholar.google.com/scholar?q=DiffuSpec:+Unlocking+Diffusion+Language+Models+for+Speculative+Decoding 9. SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding — Jameson Sandler, Jacob K. Christopher, Thomas Hartvigsen, Nando Fioretto, 2025 https://scholar.google.com/scholar?q=SpecDiff-2:+Scaling+Diffusion+Drafter+Alignment+For+Faster+Speculative+Decoding 10. Speculative Decoding: Performance or Illusion? — Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung, 2026 https://scholar.google.com/scholar?q=Speculative+Decoding:+Performance+or+Illusion? 11. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3 12. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3 13. AI Post Transformers: InfiniGen for Efficient Long-Context LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-18-infinigen-for-efficient-long-context-llm-143d77.mp3 14. 
AI Post Transformers: Batch-Aware Expert Routing for Faster MoE Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-batch-aware-expert-routing-for-faster-mo-683ab6.mp3
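
The draft-and-verify step described above can be sketched for the greedy case. This is a toy illustration only, not DART's implementation: `verify_chunk` and `toy_target` are hypothetical names, the "target model" is a stand-in function, and a real system would check the whole chunk in one batched forward pass rather than token by token.

```python
from typing import Callable, List, Sequence

Token = int
NextToken = Callable[[Sequence[Token]], Token]


def verify_chunk(prefix: List[Token], draft: List[Token], target_next: NextToken) -> List[Token]:
    """Return the tokens emitted by one greedy draft-and-verify step."""
    emitted: List[Token] = []
    for proposed in draft:
        expected = target_next(prefix + emitted)
        if proposed != expected:
            # First mismatch: keep the target's token and drop the rest of the draft.
            emitted.append(expected)
            return emitted
        emitted.append(proposed)
    # Whole draft accepted: the target still contributes one extra token.
    emitted.append(target_next(prefix + emitted))
    return emitted


def toy_target(tokens: Sequence[Token]) -> Token:
    # Hypothetical stand-in for the target model: counts upward modulo 10.
    return (tokens[-1] + 1) % 10


full_accept = verify_chunk([0], [1, 2, 3], toy_target)  # every drafted token matches
early_miss = verify_chunk([0], [9, 2, 3], toy_target)   # first drafted token is wrong
late_miss = verify_chunk([0], [1, 2, 7], toy_target)    # only the last drafted token is wrong
```

In this sketch a wrong first token yields a single emitted token no matter how good the rest of the draft is, which is why the first drafted position carries so much weight.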
1 day ago

SALCA for Sparse Long-Context Decoding

This episode explores why long-context LLM decoding becomes memory-bandwidth bound: once prompt prefill is done, each new token must repeatedly scan an ever-growing KV cache, making inference limited more by data movement than raw compute. It explains sparse attention as the idea that only a small fraction of prior tokens matter for each step, and uses top-k recall to frame the core challenge of preserving the right token ranking while cutting memory traffic. The discussion centers on Salca’s main argument: a sparsity-aware accelerator can make sparse decoding practical by combining dominant-channel feature selection with asymmetric ultra-low-bit query/key prediction, reducing predictor traffic to roughly one-eighth of a standard 4-bit filtering baseline. A listener would find it interesting because it connects transformer inference theory, serving-system bottlenecks, and custom chip design into a concrete case for faster, more energy-efficient long-context generation. Sources: 1. SALCA for Sparse Long-Context Decoding https://arxiv.org/pdf/2604.24820 2. Fast Transformer Decoding: One Write-Head is All You Need — Noam Shazeer, 2019 https://scholar.google.com/scholar?q=Fast+Transformer+Decoding:+One+Write-Head+is+All+You+Need 3. Efficiently Scaling Transformer Inference — Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jeff Dean, et al., 2022 https://scholar.google.com/scholar?q=Efficiently+Scaling+Transformer+Inference 4. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 5. 
QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference — Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han, 2024 https://scholar.google.com/scholar?q=QUEST:+Query-Aware+Sparsity+for+Efficient+Long-Context+LLM+Inference 6. SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning — Hanrui Wang, Zhekai Zhang, Song Han, 2020 https://scholar.google.com/scholar?q=SpAtten:+Efficient+Sparse+Attention+Architecture+with+Cascade+Token+and+Head+Pruning 7. Energon: Towards Efficient Acceleration of Transformers Using Dynamic Sparse Attention — Zhe Zhou, Junlin Liu, Zhenyu Gu, Guangyu Sun, 2021 https://scholar.google.com/scholar?q=Energon:+Towards+Efficient+Acceleration+of+Transformers+Using+Dynamic+Sparse+Attention 8. S2-Attention: Hardware-Aware Context Sharding Among Attention Heads — Xihui Lin, Yunan Zhang, Suyu Ge, Liliang Ren, Barun Patra, Vishrav Chaudhary, Hao Peng, Xia Song, 2024 https://scholar.google.com/scholar?q=S2-Attention:+Hardware-Aware+Context+Sharding+Among+Attention+Heads 9. SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining — Yifan Zhang, Zunhai Su, Shuhao Hu, Rui Yang, Wei Wu, Yulei Qian, Yuchen Xie, Xunliang Cai, 2026 https://scholar.google.com/scholar?q=SnapMLA:+Efficient+Long-Context+MLA+Decoding+via+Hardware-Aware+FP8+Quantized+Pipelining 10. ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs (https://arxiv.org/abs/2602.07721) — Yanlin Qi et al., 2026 https://scholar.google.com/scholar?q=ParisKV:+Fast+and+Drift-Robust+KV-Cache+Retrieval+for+Long-Context+LLMs+(https://arxiv.org/abs/2602.07721) 11. LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences (https://arxiv.org/abs/2510.11292) — Wenbo Wu et al., 2025 https://scholar.google.com/scholar?q=LouisKV:+Efficient+KV+Cache+Retrieval+for+Long+Input-Output+Sequences+(https://arxiv.org/abs/2510.11292) 12. 
Efficient Low Rank Attention for Long-Context Inference in Large Language Models (https://arxiv.org/abs/2510.23649) — Tenghui Li et al., 2025 https://scholar.google.com/scholar?q=Efficient+Low+Rank+Attention+for+Long-Context+Inference+in+Large+Language+Models+(https://arxiv.org/abs/2510.23649) 13. LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference (https://arxiv.org/abs/2503.08879) — Guangtao Wang et al., 2025 https://scholar.google.com/scholar?q=LLMs+Know+What+to+Drop:+Self-Attention+Guided+KV+Cache+Eviction+for+Efficient+Long-Context+Inference+(https://arxiv.org/abs/2503.08879) 14. Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs (https://arxiv.org/abs/2602.05191) — Wentao Ni et al., 2026 https://scholar.google.com/scholar?q=Double-P:+Hierarchical+Top-P+Sparse+Attention+for+Long-Context+LLMs+(https://arxiv.org/abs/2602.05191) 15. ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference (https://arxiv.org/abs/2502.00299) — Xiang Liu et al., 2025 https://scholar.google.com/scholar?q=ChunkKV:+Semantic-Preserving+KV+Cache+Compression+for+Efficient+Long-Context+LLM+Inference+(https://arxiv.org/abs/2502.00299) 16. ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification (https://arxiv.org/abs/2405.14256) — Yefei He et al., 2024 https://scholar.google.com/scholar?q=ZipCache:+Accurate+and+Efficient+KV+Cache+Quantization+with+Salient+Token+Identification+(https://arxiv.org/abs/2405.14256) 17. Accurate KV Cache Quantization with Outlier Tokens Tracing (https://arxiv.org/abs/2505.10938) — Yi Su et al., 2025 https://scholar.google.com/scholar?q=Accurate+KV+Cache+Quantization+with+Outlier+Tokens+Tracing+(https://arxiv.org/abs/2505.10938) 18. 
A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts (https://arxiv.org/abs/2410.01485) — Suyu Ge et al., 2024 https://scholar.google.com/scholar?q=A+Little+Goes+a+Long+Way:+Efficient+Long+Context+Training+and+Inference+with+Partial+Contexts+(https://arxiv.org/abs/2410.01485) 19. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3 20. AI Post Transformers: FlatAttention for Tile-Based Accelerator Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-flatattention-for-tile-based-accelerator-56e6ca.mp3 21. AI Post Transformers: MiniMax Sparse Attention at Million-Token Scale — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-13-minimax-sparse-attention-at-million-toke-300108.mp3 22. AI Post Transformers: ScoutAttention for Efficient KV Cache Offloading — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-24-scoutattention-for-efficient-kv-cache-of-b26699.mp3 23. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3 24. AI Post Transformers: Stochastic KV Routing for Cache Sharing — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-29-stochastic-kv-routing-for-cache-sharing-5fef63.mp3
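
Top-k recall, as used above to frame sparse attention, can be sketched as the overlap between the tokens a cheap predictor selects and the tokens exact attention scores would select. The channel-subset predictor below is a simple assumed stand-in, not SALCA's actual predictor, and all names and numbers are hypothetical.

```python
from typing import List, Sequence, Set


def top_k(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest-scoring tokens."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def top_k_recall(exact: Sequence[float], approx: Sequence[float], k: int) -> float:
    """Fraction of the true top-k tokens that the approximate scores also select."""
    return len(top_k(exact, k) & top_k(approx, k)) / k


def channel_scores(query: Sequence[float], keys: Sequence[Sequence[float]],
                   channels: Sequence[int]) -> List[float]:
    """Query-key dot products restricted to a subset of feature channels."""
    return [sum(query[c] * key[c] for c in channels) for key in keys]


# Hypothetical toy data: one query and five cached keys with four channels each.
query = [4.0, 0.1, -3.0, 0.2]
keys = [
    [1.0, 0.5, -1.0, 0.0],
    [0.5, 2.0, 0.5, 1.0],
    [-1.0, 0.0, 1.0, 3.0],
    [0.8, -1.0, -0.9, 0.5],
    [0.0, 1.0, 0.2, -2.0],
]

by_magnitude = sorted(range(len(query)), key=lambda c: abs(query[c]), reverse=True)
exact = channel_scores(query, keys, range(len(query)))
recall_dominant = top_k_recall(exact, channel_scores(query, keys, by_magnitude[:2]), k=2)
recall_weakest = top_k_recall(exact, channel_scores(query, keys, by_magnitude[-1:]), k=2)
```

Reading only the two largest-magnitude query channels halves the key data touched while preserving the top-2 set on this toy input; reading only the weakest channel does not.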
1 day ago

SERA: Repository Specialization for Open Coding Agents

This episode explores SERA, a method for specializing open-weight coding agents to individual repositories so they learn local APIs, naming conventions, refactor habits, and test idioms as model behavior rather than prompt context. It contrasts that idea with repository-aware retrieval, arguing that while RAG updates faster, weight adaptation could better capture the diffuse, codebase-specific patterns that matter for agentic tasks like searching, planning, editing, and validating changes. The discussion focuses on SERA’s soft-verification pipeline: a teacher model generates repository-grounded edit trajectories and synthetic pull request descriptions, then a second rollout regenerates the patch and keeps examples only when the two edits overlap enough at the line level. A listener would find it interesting because it gets into the practical tradeoff at the heart of coding agents: whether cheaper agreement-based filtering can make repo specialization useful without the heavy infrastructure cost of full execution-based verification. Sources: 1. SERA: Soft-Verified Efficient Repository Agents — Ethan Shen, Daniel Tormoen, Saurabh Shah, Ali Farhadi, Tim Dettmers, 2026 http://arxiv.org/abs/2601.20789 2. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation — Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, Weizhu Chen, 2023 https://scholar.google.com/scholar?q=RepoCoder:+Repository-Level+Code+Completion+Through+Iterative+Retrieval+and+Generation 3. RepoFusion: Training Code Models to Understand Your Repository — Disha Shrivastava, Denis Kocetkov, Harm de Vries, Dzmitry Bahdanau, Torsten Scholak, 2023 https://scholar.google.com/scholar?q=RepoFusion:+Training+Code+Models+to+Understand+Your+Repository 4. 
Customizing an LLM for Enterprise Software Engineering — Aditya Kini, Satish Chandra, Milad Hashemi, Saksham Thakur, Aditya Pandey, Vincent Nguyen, et al., 2026 https://scholar.google.com/scholar?q=Customizing+an+LLM+for+Enterprise+Software+Engineering 5. SERA: Soft-Verified Efficient Repository Agents — Ethan Shen, Daniel Tormoen, Saurabh Shah, Ali Farhadi, Tim Dettmers, 2026 https://scholar.google.com/scholar?q=SERA:+Soft-Verified+Efficient+Repository+Agents 6. CodeT: Code Generation with Generated Tests — Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, Weizhu Chen, 2022 https://scholar.google.com/scholar?q=CodeT:+Code+Generation+with+Generated+Tests 7. LEVER: Learning to Verify Language-to-Code Generation with Execution — Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I. Wang, Xi Victoria Lin, 2023 https://scholar.google.com/scholar?q=LEVER:+Learning+to+Verify+Language-to-Code+Generation+with+Execution 8. SWE-smith: Scaling Data for Software Engineering Agents — John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, et al., 2025 https://scholar.google.com/scholar?q=SWE-smith:+Scaling+Data+for+Software+Engineering+Agents 9. R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents — N. Jain, J. Singh, M. Shetty, L. Zheng, K. Sen, and I. Stoica, 2025 https://scholar.google.com/scholar?q=R2E-Gym:+Procedural+Environments+and+Hybrid+Verifiers+for+Scaling+Open-Weights+SWE+Agents 10. RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing — Y. Xie, A. Xie, D. Sheth, P. Liu, D. Fried, and C. P. Rosé, 2025 https://scholar.google.com/scholar?q=RepoST:+Scalable+Repository-Level+Coding+Environment+Construction+with+Sandbox+Testing 11. SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents — I. Badertdinov, A. Golubev, M. Nekrashevich, A. Shevtsov, S. Karasik, A. 
Andriushchenko, M. Trofimova, D. Litvintseva, and B. Yangel, 2025 https://scholar.google.com/scholar?q=SWE-rebench:+An+Automated+Pipeline+for+Task+Collection+and+Decontaminated+Evaluation+of+Software+Engineering+Agents 12. CodeRAG-Bench: Can Retrieval Augment Code Generation? — Z. Z. Wang, A. Asai, X. V. Yu, F. F. Xu, Y. Xie, G. Neubig, and D. Fried, 2024 https://scholar.google.com/scholar?q=CodeRAG-Bench:+Can+Retrieval+Augment+Code+Generation? 13. Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces — M. A. Merrill et al., 2026 https://scholar.google.com/scholar?q=Terminal-Bench:+Benchmarking+Agents+on+Hard,+Realistic+Tasks+in+Command+Line+Interfaces 14. Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks — A. Chandra, A. Agrawal, A. Hosseini, S. Fischmeister, R. Agarwal, N. Goyal, and A. Courville, 2026 https://scholar.google.com/scholar?q=Shape+of+Thought:+When+Distribution+Matters+More+than+Correctness+in+Reasoning+Tasks 15. GenX: Mastering Code and Test Generation with Execution Feedback — Nan Wang et al., 2024 https://scholar.google.com/scholar?q=GenX:+Mastering+Code+and+Test+Generation+with+Execution+Feedback 16. Enhancing LLM-Based Code Translation with Verified Multi-Semantic Representations — Yufu Wang et al., 2026 https://scholar.google.com/scholar?q=Enhancing+LLM-Based+Code+Translation+with+Verified+Multi-Semantic+Representations 17. StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback — Shihan Dou et al., 2024 https://scholar.google.com/scholar?q=StepCoder:+Improve+Code+Generation+with+Reinforcement+Learning+from+Compiler+Feedback 18. Execution-based Code Generation using Deep Reinforcement Learning — Parshin Shojaee et al., 2023 https://scholar.google.com/scholar?q=Execution-based+Code+Generation+using+Deep+Reinforcement+Learning 19. 
Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering — Yoseph Berhanu Alebachew et al., 2026 https://scholar.google.com/scholar?q=Beyond+Code+Snippets:+Benchmarking+LLMs+on+Repository-Level+Question+Answering 20. AI Post Transformers: AgenticQwen and Small Industrial Tool Agents — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-27-agenticqwen-and-small-industrial-tool-ag-dc676d.mp3 21. AI Post Transformers: Experimental Comparison of Agentic and Enhanced RAG — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-14-experimental-comparison-of-agentic-and-e-37d8bc.mp3 22. AI Post Transformers: Trace Rewriting Against Unauthorized LLM Distillation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-trace-rewriting-against-unauthorized-llm-306357.mp3 23. AI Post Transformers: Learning Facts at Scale with Active Reading — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-25-learning-facts-at-scale-with-active-read-161bea.mp3
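
The agreement filter described above can be sketched as a line-level overlap check between two patches. The description does not give SERA's exact metric or threshold, so the ratio and the 0.5 cutoff below are assumptions, and `changed_lines`, `line_overlap`, and `keep_example` are hypothetical names.

```python
from typing import Set


def changed_lines(patch: str) -> Set[str]:
    """Collect added and removed lines from a unified diff, ignoring file headers."""
    lines = set()
    for line in patch.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header, not an edit (simplistic: also skips edits starting with "--")
        if line.startswith(("+", "-")):
            lines.add(line[0] + line[1:].strip())
    return lines


def line_overlap(first: str, second: str) -> float:
    """Share of the first patch's changed lines that the second patch reproduces."""
    a, b = changed_lines(first), changed_lines(second)
    return len(a & b) / len(a) if a else 0.0


def keep_example(first: str, second: str, threshold: float = 0.5) -> bool:
    # The threshold is an assumed value for illustration.
    return line_overlap(first, second) >= threshold


patch_a = """--- a/util.py
+++ b/util.py
@@ -1,2 +1,2 @@
-def area(w, h): return w + h
+def area(w, h): return w * h
"""
patch_b = patch_a + "+# fixed operator\n"  # same fix plus an extra comment line
patch_c = """--- a/util.py
+++ b/util.py
@@ -5,1 +5,1 @@
-x = 1
+x = 2
"""
```

No tests or sandboxes are run here, which is the cost saving the episode weighs against execution-based verification.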
2 days ago

Cache-Resident LLM Inference in GB-Scale Caches

This episode explores a 2026 paper on cache-resident LLM inference, asking whether modern CPUs with gigabyte-scale last-level caches can cut decoding latency by keeping model weights on-chip instead of repeatedly fetching them from DRAM. It explains why autoregressive decoding is often memory-bound rather than compute-bound, then breaks down the paper’s main design ideas: separating weight-heavy projections and feed-forward work from attention and KV-cache handling, and using fine-grained static scheduling to reduce synchronization overhead. The discussion gets concrete about the system architecture on AMD EPYC 9684X machines, including dual-socket role separation, INT8 weights and KV caches, and locality-aware placement of weight shards and activations. A listener would find it interesting because it gives a sharp, skeptical look at where CPU-based LLM serving might genuinely improve throughput and time-per-output-token, while also arguing that this is a targeted systems win rather than a replacement for GPU-first inference. Sources: 1. Cache-Resident LLM Inference in GB-Scale Last-Level Caches — Wanning Zhang, Tongzhou Gu, Marco Canini, Ceyu Xu, Jian Weng, 2026 http://arxiv.org/abs/2606.25353 2. LLM Inference Serving: Survey of Recent Advances and Opportunities — Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari, 2024 https://scholar.google.com/scholar?q=LLM+Inference+Serving:+Survey+of+Recent+Advances+and+Opportunities 3. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Hao Zhang, Ion Stoica, et al., 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 4. 
Splitwise: Efficient generative LLM inference using phase splitting — Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, Ricardo Bianchini, 2023 https://scholar.google.com/scholar?q=Splitwise:+Efficient+generative+LLM+inference+using+phase+splitting 5. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Joseph E. Gonzalez, Percy Liang, Christopher Re, Ion Stoica, Ce Zhang, et al., 2023 https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU 6. Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks — Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das, 2018 https://scholar.google.com/scholar?q=Neural+Cache:+Bit-Serial+In-Cache+Acceleration+of+Deep+Neural+Networks 7. Proximu$: Efficiently Scaling DNN Inference in Multi-core CPUs through Near-Cache Compute — Anant V. Nori, Rahul Bera, Shankar Balachandran, Joydeep Rakshit, Om J. Omer, Avishaii Abuhatzera, Belliappa Kuttanna, Sreenivas Subramoney, 2020 https://scholar.google.com/scholar?q=Proximu$:+Efficiently+Scaling+DNN+Inference+in+Multi-core+CPUs+through+Near-Cache+Compute 8. Inference Performance Optimization for Large Language Models on CPUs — Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie, 2024 https://scholar.google.com/scholar?q=Inference+Performance+Optimization+for+Large+Language+Models+on+CPUs 9. ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs — Yuzhuang Xu, Xu Han, Yuxuan Li, Wanxiang Che, 2026 https://scholar.google.com/scholar?q=ArcLight:+A+Lightweight+LLM+Inference+Architecture+for+Many-Core+CPUs 10. 
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention — Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar, 2025 https://scholar.google.com/scholar?q=vAttention:+Dynamic+Memory+Management+for+Serving+LLMs+without+PagedAttention 11. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve — Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, Ramachandran Ramjee, 2024 https://scholar.google.com/scholar?q=Taming+Throughput-Latency+Tradeoff+in+LLM+Inference+with+Sarathi-Serve 12. DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024 https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-Optimized+Large+Language+Model+Serving 13. WaferLLM: Large Language Model Inference at Wafer Scale — Congjie He, Yeqi Huang, Pei Mu, Ziming Miao, Jilong Xue, Lingxiao Ma, Fan Yang, Luo Mai, 2025 https://scholar.google.com/scholar?q=WaferLLM:+Large+Language+Model+Inference+at+Wafer+Scale 14. T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge — Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang, 2025 https://scholar.google.com/scholar?q=T-MAC:+CPU+Renaissance+via+Table+Lookup+for+Low-Bit+LLM+Deployment+on+Edge 15. Compute Or Load KV Cache? Why Not Both? — Shuowei Jin et al., 2024 https://arxiv.org/abs/2410.03065 16. KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows — Zaifeng Pan et al., 2025 https://arxiv.org/abs/2507.07400 17. Reducing Transformer Key-Value Cache Size with Cross-Layer Attention — William Brandon et al., 2024 https://arxiv.org/abs/2405.12981 18. QCQA: Quality and Capacity-aware Grouped Query Attention — Vinay Joshi et al., 2024 https://arxiv.org/abs/2406.10247 19. 
Beyond KV Caching: Shared Attention for Efficient LLMs — Bingli Liao and Danilo Vasconcellos Vargas, 2024 https://arxiv.org/abs/2407.12866 20. ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive — Xinhao Luo et al., 2025 https://arxiv.org/abs/2508.18850 21. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3 22. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3 23. AI Post Transformers: ScoutAttention for Efficient KV Cache Offloading — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-24-scoutattention-for-efficient-kv-cache-of-b26699.mp3 24. AI Post Transformers: CacheFlow and 3D-Parallel KV Cache Restoration — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-01-cacheflow-and-3d-parallel-kv-cache-resto-8db883.mp3 25. AI Post Transformers: VeriCache: Lossless LLM Inference from Lossy KV Caches — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-vericache-lossless-llm-inference-from-lo-df9daf.mp3 26. AI Post Transformers: Harvest: Borrowing Peer GPU Memory for LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-harvest-borrowing-peer-gpu-memory-for-ll-e9e54f.mp3
2 days ago

Scaling Prompt Tuning for Frozen T5 Models

This episode explores Brian Lester et al.’s 2021 paper on prompt tuning, which asks whether a large frozen T5 model can be adapted to new tasks by learning only a tiny soft prompt instead of fine-tuning all model weights. It explains the difference between soft prompt tuning, full fine-tuning, prefix-tuning, and GPT-3-style few-shot prompting, and frames the paper as a test of whether scaling laws make lightweight adaptation dramatically more effective at large model sizes. The discussion highlights the key result that prompt tuning lags on smaller models but approaches full fine-tuning on very large T5 checkpoints, with longer prompts and vocabulary-based initialization helping, while a five-token prompt can shrink task-specific parameters from 11 billion to roughly 20,000. Listeners would find it interesting because it connects model-scaling theory to concrete engineering tradeoffs around storage, mixed-task serving, and why industry later gravitated toward PEFT methods like LoRA and adapters. Sources: 1. The Power of Scale for Parameter-Efficient Prompt Tuning — Brian Lester, Rami Al-Rfou, Noah Constant, 2021 http://arxiv.org/abs/2104.08691 2. Prefix-Tuning: Optimizing Continuous Prompts for Generation — Xiang Lisa Li, Percy Liang, 2021 https://arxiv.org/abs/2101.00190 3. Learning How to Ask: Querying LMs with Mixtures of Soft Prompts — Guanghui Qin, Jason Eisner, 2021 https://arxiv.org/abs/2104.06599 4. The Power of Scale for Parameter-Efficient Prompt Tuning — Brian Lester, Rami Al-Rfou, Noah Constant, 2021 https://arxiv.org/abs/2104.08691 5. Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning — Zhen Wang, Rameswar Panda, Leonid Karlinsky, Rogerio Feris, Huan Sun, Yoon Kim, 2023 https://arxiv.org/abs/2303.02861 6. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer — Colin Raffel et al., 2020 https://scholar.google.com/scholar?q=Exploring+the+Limits+of+Transfer+Learning+with+a+Unified+Text-to-Text+Transformer 7. 
Language Models are Few-Shot Learners — Tom B. Brown et al., 2020 https://scholar.google.com/scholar?q=Language+Models+are+Few-Shot+Learners 8. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts — Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, Sameer Singh, 2020 https://scholar.google.com/scholar?q=AutoPrompt:+Eliciting+Knowledge+from+Language+Models+with+Automatically+Generated+Prompts 9. WARP: Word-level Adversarial ReProgramming — Karen Hambardzumyan, Hrant Khachatrian, Jonathan May, 2021 https://scholar.google.com/scholar?q=WARP:+Word-level+Adversarial+ReProgramming 10. Parameter-Efficient Transfer Learning for NLP — Neil Houlsby et al., 2019 https://scholar.google.com/scholar?q=Parameter-Efficient+Transfer+Learning+for+NLP 11. MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension — Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, Danqi Chen, 2019 https://scholar.google.com/scholar?q=MRQA+2019+Shared+Task:+Evaluating+Generalization+in+Reading+Comprehension 12. Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer — Robert Belanec, Simon Ostermann, Ivan Srba, Maria Bielikova, 2024 https://scholar.google.com/scholar?q=Task+Prompt+Vectors:+Effective+Initialization+through+Multi-Task+Soft-Prompt+Transfer 13. Parameter-Efficient Fine-Tuning for Medical Text Summarization: A Comparative Study of LoRA, Prompt Tuning, and Full Fine-Tuning — Ulugbek Shernazarov et al., 2026 https://scholar.google.com/scholar?q=Parameter-Efficient+Fine-Tuning+for+Medical+Text+Summarization:+A+Comparative+Study+of+LoRA,+Prompt+Tuning,+and+Full+Fine-Tuning 14. MerA: Merging Pretrained Adapters For Few-Shot Learning — Shwai He et al., 2023 https://scholar.google.com/scholar?q=MerA:+Merging+Pretrained+Adapters+For+Few-Shot+Learning 15. 
Exploring the Relationship between In-Context Learning and Instruction Tuning — Hanyu Duan et al., 2023 https://scholar.google.com/scholar?q=Exploring+the+Relationship+between+In-Context+Learning+and+Instruction+Tuning 16. Is In-Context Learning Sufficient for Instruction Following in LLMs? — Hao Zhao et al., 2024 https://scholar.google.com/scholar?q=Is+In-Context+Learning+Sufficient+for+Instruction+Following+in+LLMs? 17. Symbol tuning improves in-context learning in language models — Jerry Wei et al., 2023 https://scholar.google.com/scholar?q=Symbol+tuning+improves+in-context+learning+in+language+models 18. Last One Standing: A Comparative Analysis of Security and Privacy of Soft Prompt Tuning, LoRA, and In-Context Learning — Rui Wen et al., 2023 https://scholar.google.com/scholar?q=Last+One+Standing:+A+Comparative+Analysis+of+Security+and+Privacy+of+Soft+Prompt+Tuning,+LoRA,+and+In-Context+Learning 19. AI Post Transformers: Benchmarking PEFT Techniques for Large Language Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-20-benchmarking-peft-techniques-for-large-l-41bbf5.mp3
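
The "11 billion to roughly 20,000" figure above follows from simple arithmetic: a soft prompt stores one embedding vector per prompt token. The sketch below assumes an embedding width of 4096 for the largest T5 checkpoint, which is consistent with the quoted figure; the function name is hypothetical.

```python
def prompt_params(prompt_len: int, embed_dim: int) -> int:
    """Trainable parameters in a soft prompt: one embedding vector per prompt token."""
    return prompt_len * embed_dim


FULL_MODEL_PARAMS = 11_000_000_000  # task-specific copy under full fine-tuning
EMBED_DIM = 4096                    # assumed embedding width of the largest checkpoint

five_token_prompt = prompt_params(5, EMBED_DIM)
reduction = FULL_MODEL_PARAMS / five_token_prompt

print(f"{five_token_prompt} parameters per task, about {reduction:,.0f}x fewer than a full copy")
```

Because the frozen model is shared, serving many tasks means storing one small prompt per task rather than one full checkpoint per task.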
2 days ago

When Finer Microscaling Hurts LLM Quantization

This episode explores the paper Is Finer Better? The Limits of Microscaling Formats in Large Language Models and examines why shrinking microscaling block sizes can unexpectedly make low-bit LLM quantization worse instead of better. It walks through how microscaling pairs FP4 weights or activations with shared local FP8 scales, contrasts that setup with coarser quantization schemes, and places the work in the broader move from BF16 and FP8 toward cheaper, more hardware-friendly inference. The central argument is that smaller blocks do reduce element quantization error, but once the shared scale is itself quantized into a limited format like FP8 UE4M3, scale error can dominate and degrade perplexity. Listeners would find it interesting because the discussion turns a seemingly obvious engineering intuition on its head and shows that the real bottleneck in low-bit inference may be the precision of the scaling rule, not just the precision of the values being scaled. Sources: 1. Is Finer Better? The Limits of Microscaling Formats in Large Language Models — Andrea Fasoli, Monodeep Kar, Chi-Chun Liu, Swagath Venkataramani, Viji Srinivasan, Leland Chang, Naigang Wang, 2026 http://arxiv.org/abs/2601.19026 2. FP8 Formats for Deep Learning — Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, and others, 2022 https://arxiv.org/abs/2209.05433 3. With Shared Microexponents, A Little Shifting Goes a Long Way — Bita Rouhani, Ritchie Zhao, Venmugil Elango, Rasoul Shafipour, Mathew Hall, Maral Mesmakhosroshahi, Ankit More, and others, 2023 https://arxiv.org/abs/2302.08007 4. Microscaling Data Formats for Deep Learning — Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, and others, 2023 https://arxiv.org/abs/2310.10537 5. 
Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization — Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh, 2025 https://arxiv.org/abs/2509.23202 6. AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference — Janghwan Lee et al., 2024 https://scholar.google.com/scholar?q=AMXFP4:+Taming+Activation+Outliers+with+Asymmetric+Microscaling+Floating-Point+for+4-bit+LLM+Inference 7. Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models — Yun-Chen Lo, Gu-Yeon Wei, David Brooks, 2024 https://scholar.google.com/scholar?q=Nanoscaling+Floating-Point+(NxFP):+NanoMantissa,+Adaptive+Microexponents,+and+Code+Recycling+for+Direct-Cast+Compression+of+Large+Language+Models 8. Elucidating the Design Space of FP4 Training — Robert Hu, Carlo Luschi, Paul Balanca, 2025 https://scholar.google.com/scholar?q=Elucidating+the+Design+Space+of+FP4+Training 9. Finer is Better (with the Right Scaling) — Clemens Schaefer, Gil Tabak, 2026 https://scholar.google.com/scholar?q=Finer+is+Better+(with+the+Right+Scaling) 10. Adaptive Block-Scaled Data Types — Jack Cook et al., 2026 https://arxiv.org/abs/2603.28765 11. Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4 — Musa Cim et al., 2026 https://arxiv.org/abs/2603.08747 12. Pretraining large language models with MXFP4 — Musa Cim et al., 2026 https://arxiv.org/abs/2605.09825 13. Dissecting Outlier Dynamics in LLM NVFP4 Pretraining — Peijie Dong et al., 2026 https://arxiv.org/abs/2602.02047 14. DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization — Haokun Lin et al., 2026 https://arxiv.org/abs/2604.17789 15. 
AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation — Seonggon Kim et al., 2026 https://arxiv.org/abs/2604.02525 16. AI Post Transformers: Nemotron 3 Ultra for Long-Horizon Agents — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-17-nemotron-3-ultra-for-long-horizon-agents-32e4a5.mp3 17. AI Post Transformers: PackKV Lossy Compression for KV Caches — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-04-packkv-lossy-compression-for-kv-caches-b37bce.mp3 18. AI Post Transformers: FlashAttention-4 Conquers Asymmetric GPU Hardware Scaling — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-06-flashattention-4-conquers-asymmetric-gpu-78839b.mp3
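
The trade-off described in this episode, element error shrinking with smaller blocks while error from a quantized shared scale grows in importance, can be explored with a small sketch. This is only an illustration under stated assumptions, not the paper's method: `round_mantissa`, `quantize_block`, and `block_mse` are hypothetical helper names, the FP4 grid is the E2M1 magnitude set, and the scale rounding keeps a chosen number of mantissa bits while ignoring the exponent range limits of a real FP8 format.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_mantissa(x, bits):
    """Round a non-negative x to `bits` explicit mantissa bits.

    Assumption: exponent range limits of a real scale format are ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= m < 1
    steps = 2 ** (bits + 1)       # implicit leading bit plus `bits` fraction bits
    return math.ldexp(round(m * steps) / steps, e)

def quantize_block(block, scale_bits=None):
    """Quantize then dequantize one block that shares a single scale.

    If `scale_bits` is given, the shared scale is itself rounded first.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / FP4_GRID[-1]   # map the largest magnitude onto 6.0
    if scale_bits is not None:
        scale = round_mantissa(scale, scale_bits)
    out = []
    for v in block:
        target = abs(v) / scale
        # Nearest grid point; values above 6.0 saturate at 6.0.
        mag = min(FP4_GRID, key=lambda g: abs(g - target))
        out.append(math.copysign(mag * scale, v))
    return out

def block_mse(values, block_size, scale_bits=None):
    """Mean squared error when `values` is split into blocks of `block_size`."""
    err = 0.0
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        deq = quantize_block(block, scale_bits)
        err += sum((a - b) ** 2 for a, b in zip(block, deq))
    return err / len(values)
```

Calling `block_mse` on the same data with several `block_size` values, once with `scale_bits=None` and once with `scale_bits=3`, is one way to see how the two error sources interact for a given distribution.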
3 days ago

SuperInfer: SLO-Aware LLM Inference on Superchips

This episode explores SuperInfer, a system for serving large language models on GH200-style superchips by treating memory management as the key lever for meeting latency targets rather than just maximizing compute use. It explains why KV cache growth, HBM pressure, and head-of-line blocking often hurt responsiveness first, then breaks down how the paper’s RotaSched policy proactively rotates request state out of fast memory to protect time-to-first-token deadlines. It also covers DuplexKV, the transfer mechanism that makes this practical by batching fragmented KV data, using bidirectional movement across NVLink-C2C, and overlapping transfers with model execution instead of stalling the whole system. Listeners would find it interesting because the discussion ties concrete serving pain points to a specific systems design that reportedly boosts TTFT SLO attainment by up to 74.7 percent while keeping throughput and token pacing roughly stable. Sources: 1. SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips — Jiahuan Yu, Mingtao Hu, Zichao Lin, Minjia Zhang, 2026 http://arxiv.org/abs/2601.20309 2. Pie: Pooling CPU Memory for LLM Inference — Y. Xu, Z. Mao, X. Mo, S. Liu, I. Stoica, 2024 https://scholar.google.com/scholar?q=Pie:+Pooling+CPU+Memory+for+LLM+Inference 3. Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip — L. Fusco, M. Khalilov, M. Chrapek, G. Chukkapalli, T. Schulthess, T. Hoefler, 2024 https://scholar.google.com/scholar?q=Understanding+Data+Movement+in+Tightly+Coupled+Heterogeneous+Systems:+A+Case+Study+with+the+Grace+Hopper+Superchip 4. Memory Offloading for Large Language Model Inference with Latency SLO Guarantees — C. Ma, Z. Ye, H. Zhao, Z. Yang, T. Fu, J. Han, J. Zhang, Y. Luo, X. Wang, Z. Wang, et al., 2025 https://scholar.google.com/scholar?q=Memory+Offloading+for+Large+Language+Model+Inference+with+Latency+SLO+Guarantees 5. 
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve — A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. Gulavani, A. Tumanov, R. Ramjee, 2024 https://scholar.google.com/scholar?q=Taming+Throughput-Latency+Tradeoff+in+LLM+Inference+with+Sarathi-Serve 6. Mooncake: Trading More Storage for Less Computation - a KVCache-centric Architecture for Serving LLM Chatbot — R. Qin, Z. Li, W. He, J. Cui, F. Ren, M. Zhang, Y. Wu, W. Zheng, X. Xu, 2025 https://scholar.google.com/scholar?q=Mooncake:+Trading+More+Storage+for+Less+Computation+-+a+KVCache-centric+Architecture+for+Serving+LLM+Chatbot 7. TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving — Bingyang Wu et al., 2025 https://scholar.google.com/scholar?q=TokenLake:+A+Unified+Segment-level+Prefix+Cache+Pool+for+Fine-grained+Elastic+Long-Context+LLM+Serving 8. KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows — Zaifeng Pan et al., 2025 https://scholar.google.com/scholar?q=KVFlow:+Efficient+Prefix+Caching+for+Accelerating+LLM-Based+Multi-Agent+Workflows 9. SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving — Jinda Jia et al., 2026 https://scholar.google.com/scholar?q=SAW-INT4:+System-Aware+4-Bit+KV-Cache+Quantization+for+Real-World+LLM+Serving 10. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization — Coleman Hooper et al., 2024 https://scholar.google.com/scholar?q=KVQuant:+Towards+10+Million+Context+Length+LLM+Inference+with+KV+Cache+Quantization 11. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong et al., 2024 https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving 12. Prefill-Decode Aggregation or Disaggregation? 
Unifying Both for Goodput-Optimized LLM Serving — Chao Wang et al., 2025 https://scholar.google.com/scholar?q=Prefill-Decode+Aggregation+or+Disaggregation?+Unifying+Both+for+Goodput-Optimized+LLM+Serving 13. Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference — Hao Zhang et al., 2025 https://scholar.google.com/scholar?q=Enhancing+LLM+Efficiency:+Targeted+Pruning+for+Prefill-Decode+Disaggregation+in+Inference 14. AI Post Transformers: Mooncake for KV Cache-Centric LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-05-mooncake-for-kv-cache-centric-llm-servin-1086d0.mp3 15. AI Post Transformers: LAPS for Length-Aware LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3 16. AI Post Transformers: AI+HW 2035: Co-Designing Efficient AI Systems — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-24-aihw-2035-co-designing-efficient-ai-syst-95c11e.mp3 17. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3 18. AI Post Transformers: KVSwap for Disk-Aware Long-Context On-Device Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-16-kvswap-for-disk-aware-long-context-on-de-f3c15e.mp3 19. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3
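
The idea of rotating request state out of fast memory to protect time-to-first-token deadlines can be pictured with a toy planner. This is not the paper's RotaSched policy, whose details are not given here. It is a hypothetical greedy rule, and `Request`, `plan_rotation`, and the block-count capacity model are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    kv_blocks: int        # KV cache blocks the request occupies in fast memory
    ttft_slack_ms: float  # time left before its first-token deadline

def plan_rotation(resident, waiting, capacity):
    """Return (swap_out, admit) name lists for one scheduling step.

    Hypothetical rule: serve the most urgent waiting request first and
    make room by pausing resident requests that have strictly more slack.
    """
    free = capacity - sum(r.kv_blocks for r in resident)
    queue = sorted(waiting, key=lambda r: r.ttft_slack_ms)
    victims = sorted(resident, key=lambda r: r.ttft_slack_ms, reverse=True)
    swap_out, admit = [], []
    for req in queue:
        chosen, gained = [], 0
        for v in victims:
            if free + gained >= req.kv_blocks:
                break
            if v.ttft_slack_ms > req.ttft_slack_ms:
                chosen.append(v)
                gained += v.kv_blocks
        # Commit only if the request actually fits, so nothing is paused in vain.
        if free + gained >= req.kv_blocks:
            for v in chosen:
                victims.remove(v)
                swap_out.append(v.name)
            free += gained - req.kv_blocks
            admit.append(req.name)
    return swap_out, admit

# A request that already emitted its first token has unlimited TTFT slack.
resident = [Request("a", 4, float("inf")), Request("b", 3, 500.0)]
waiting = [Request("c", 3, 50.0)]
plan = plan_rotation(resident, waiting, capacity=8)
```

A real system would also have to account for transfer cost and per-token pacing, which this sketch leaves out.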
3 days ago

Temporal-Tiered KV Cache for Long Context

This episode explores TTKV, a temporal-tiered key-value cache design for long-context LLM inference, where decode speed degrades because growing KV state turns generation into a memory-bandwidth problem rather than a compute problem. It explains how the method keeps recent cache blocks in fast GPU HBM, evicts older blocks to slower host DRAM, and uses asymmetric quantization in the slow tier, preserving keys at higher precision while compressing values more aggressively. The discussion also breaks down the runtime mechanics behind block-wise streaming attention, including query-conditioned block ranking, top-k prefetching, decompression, and overlapping data transfer with attention computation. What makes the episode interesting is that it treats TTKV less as a new model idea and more as a systems design proposal, while critically questioning whether recency is a reliable proxy for importance and whether the paper fully specifies the cost of its block-selection function. Sources: 1. TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference — Gradwell Dzikanyanga, Weihao Yang, Hao Huang, Donglei Wu, Shihao Wang, Wen Xia, Sanjeeb K C, 2026 http://arxiv.org/abs/2604.19769 2. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 3. Efficient Streaming Language Models with Attention Sinks — Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, 2023 https://scholar.google.com/scholar?q=Efficient+Streaming+Language+Models+with+Attention+Sinks 4. 
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache — Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu, 2024 https://scholar.google.com/scholar?q=KIVI:+A+Tuning-Free+Asymmetric+2bit+Quantization+for+KV+Cache 5. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference — Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen, 2024 https://scholar.google.com/scholar?q=ShadowKV:+KV+Cache+in+Shadows+for+High-Throughput+Long-Context+LLM+Inference 6. FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference — Guangda Liu et al., 2025 https://scholar.google.com/scholar?q=FreeKV:+Boosting+KV+Cache+Retrieval+for+Efficient+LLM+Inference 7. FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference — Dongwei Wang et al., 2025 https://scholar.google.com/scholar?q=FIER:+Fine-Grained+and+Efficient+KV+Cache+Retrieval+for+Long-context+LLM+Inference 8. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving — Yuhan Liu et al., 2024 https://scholar.google.com/scholar?q=CacheGen:+KV+Cache+Compression+and+Streaming+for+Fast+Large+Language+Model+Serving 9. Retrieval Head Mechanistically Explains Long-Context Factuality — Wenhao Wu et al., 2024 https://arxiv.org/abs/2404.15574 10. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads — Guangxuan Xiao et al., 2024 https://arxiv.org/abs/2410.10819 11. Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking — Wuwei Zhang et al., 2025 https://arxiv.org/abs/2506.09944 12. LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction — Enshuai Zhou et al., 2026 https://arxiv.org/abs/2605.06676 13. IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference — Xintong Yang et al., 2026 https://arxiv.org/abs/2605.25475 14. 
KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference — Xing Li et al., 2025 https://arxiv.org/abs/2502.04420 15. AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations — Qian Tao et al., 2024 https://arxiv.org/abs/2410.13212 16. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention — Ramya Prabhu et al., 2024 https://arxiv.org/abs/2405.04437 17. AI Post Transformers: ScoutAttention for Efficient KV Cache Offloading — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-24-scoutattention-for-efficient-kv-cache-of-b26699.mp3 18. AI Post Transformers: MiniMax Sparse Attention at Million-Token Scale — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-13-minimax-sparse-attention-at-million-toke-300108.mp3 19. AI Post Transformers: IndexMem: Learned KV-Cache Eviction for Long-Context LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-12-indexmem-learned-kv-cache-eviction-for-l-132c2a.mp3 20. AI Post Transformers: Mooncake for KV Cache-Centric LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-05-mooncake-for-kv-cache-centric-llm-servin-1086d0.mp3 21. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3 22. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3
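
The tiering described in this episode (recent blocks kept at full precision, older blocks evicted with keys stored more precisely than values, then ranked against the current query) can be sketched in a few lines. This is a rough illustration, not TTKV's implementation: the class and method names, the 8-bit and 4-bit widths, the uniform quantizer, and the max-dot-product block score are all assumptions, since the episode notes say the paper does not fully specify its block-selection function.

```python
from collections import deque

def quantize(vec, bits):
    """Uniform quantize-dequantize of a list of floats (assumed scheme)."""
    amax = max(abs(x) for x in vec)
    if amax == 0.0:
        return list(vec)
    levels = 2 ** (bits - 1) - 1
    step = amax / levels
    return [round(x / step) * step for x in vec]

class TieredKVCache:
    """Toy two-tier cache: a fast tier for recent blocks, a slow tier for old ones."""

    def __init__(self, fast_blocks, key_bits=8, value_bits=4):
        self.fast_blocks = fast_blocks
        self.key_bits = key_bits      # keys keep more precision (assumed width)
        self.value_bits = value_bits  # values are compressed harder (assumed width)
        self.fast = deque()           # stands in for GPU HBM
        self.slow = []                # stands in for host DRAM

    def append_block(self, keys, values):
        """Add the newest block and evict the oldest ones past the fast budget."""
        self.fast.append((keys, values))
        while len(self.fast) > self.fast_blocks:
            k, v = self.fast.popleft()
            self.slow.append((
                [quantize(row, self.key_bits) for row in k],
                [quantize(row, self.value_bits) for row in v],
            ))

    def top_k_slow(self, query, k):
        """Rank slow-tier blocks by their best query-key dot product."""
        def score(block):
            keys, _ = block
            return max(sum(q * x for q, x in zip(query, row)) for row in keys)
        order = sorted(range(len(self.slow)),
                       key=lambda i: score(self.slow[i]), reverse=True)
        return order[:k]

cache = TieredKVCache(fast_blocks=1)
cache.append_block([[1.0, 0.0]], [[0.5, 0.25]])
cache.append_block([[0.0, 1.0]], [[0.1, 0.9]])
cache.append_block([[1.0, 1.0]], [[0.3, 0.3]])
to_prefetch = cache.top_k_slow([0.0, 1.0], k=1)
```

The indices returned by `top_k_slow` are the blocks a real system would prefetch and decompress while attention over the fast tier is still running.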

See all (748)

3.7

out of 5

3 ratings

Creator

mcgrof
Years active

2025 - 2026
Episodes

748
Rating

All ages
Show website

AI Post Transformers

DART Speeds Up Speculative LLM Decoding

SALCA for Sparse Long-Context Decoding

SERA: Repository Specialization for Open Coding Agents

Cache-Resident LLM Inference in GB-Scale Caches

Scaling Prompt Tuning for Frozen T5 Models

When Finer Microscaling Hurts LLM Quantization

SuperInfer: SLO-Aware LLM Inference on Superchips

Temporal-Tiered KV Cache for Long Context

