AI Post Transformers

mcgrof

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

  1. 1D AGO

    Computation-Bandwidth-Memory Trade-offs for AI Infrastructure

    This episode explores a systems paper that argues AI infrastructure should treat computation, interconnect bandwidth, and memory as a single joint design space rather than three separate bottlenecks. It explains the paper’s “AI Trinity” framework and walks through the main trade-offs: using extra compute to reduce communication, using networked or disaggregated memory to ease local memory limits, and using caching or stored intermediates to avoid recomputation. The discussion connects that framing to real AI practice, from distributed training bottlenecked by all-reduce bandwidth to inference constrained by KV-cache memory, while grounding it in broader ideas like scaling laws, the “Bitter Lesson,” FlashAttention’s IO-aware design, and the roofline model. A listener would find it interesting because it translates familiar pains—GPU memory ceilings, gradient traffic, and hardware inefficiency—into a clearer systems-level way of thinking about how modern AI actually scales. Sources: 1. Computation-Bandwidth-Memory Trade-offs: A Unified Paradigm for AI Infrastructure — Yuankai Fan, Qizhen Weng, Xuelong Li, 2025 http://arxiv.org/abs/2601.11577 2. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models — Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He, 2020 https://scholar.google.com/scholar?q=ZeRO:+Memory+Optimizations+Toward+Training+Trillion+Parameter+Models 3. Training Deep Nets with Sublinear Memory Cost — Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin, 2016 https://scholar.google.com/scholar?q=Training+Deep+Nets+with+Sublinear+Memory+Cost 4. vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design — Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, Stephen W. Keckler, 2016 https://scholar.google.com/scholar?q=vDNN:+Virtualized+Deep+Neural+Networks+for+Scalable,+Memory-Efficient+Neural+Network+Design 5. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019 https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism 6. Reducing Activation Recomputation in Large Transformer Models — Benjamin L. Kirby, Jackson Kernion, et al., 2024 https://scholar.google.com/scholar?q=Reducing+Activation+Recomputation+in+Large+Transformer+Models 7. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022 https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness 8. Split Computing for Mobile Deep Inference: Survey and Research Directions — Yiping Kang, Johan Hauswald, et al. / related survey literature, 2020 https://scholar.google.com/scholar?q=Split+Computing+for+Mobile+Deep+Inference:+Survey+and+Research+Directions 9. Learning-Based Video Compression — Oren Rippel, Lubomir Bourdev, 2017 https://scholar.google.com/scholar?q=Learning-Based+Video+Compression 10. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training — Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally, 2018 https://scholar.google.com/scholar?q=Deep+Gradient+Compression:+Reducing+the+Communication+Bandwidth+for+Distributed+Training 11. 
Accelerating Diffusion Models with Cache-Based or Feature Reuse Methods (e.g., DeepCache / related 2024 diffusion caching work) — Various, 2024 https://scholar.google.com/scholar?q=Accelerating+Diffusion+Models+with+Cache-Based+or+Feature+Reuse+Methods+(e.g.,+DeepCache+/+related+2024+diffusion+caching+work) 12. RazorAttention: Efficient KV Cache Compression through Retrieval Heads — approx. recent LLM systems/serving authors, 2024/2025 https://scholar.google.com/scholar?q=RazorAttention:+Efficient+KV+Cache+Compression+through+Retrieval+Heads 13. StreamKV: Streaming Video Question-Answering with Segment-Based KV Cache Retrieval and Compression — approx. recent multimodal/LLM authors, 2024/2025 https://scholar.google.com/scholar?q=StreamKV:+Streaming+Video+Question-Answering+with+Segment-Based+KV+Cache+Retrieval+and+Compression 14. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — approx. recent LLM inference authors, 2024/2025 https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning 15. In-Network Aggregation with Transport Transparency for Distributed Training — approx. systems/networking authors, 2023/2024 https://scholar.google.com/scholar?q=In-Network+Aggregation+with+Transport+Transparency+for+Distributed+Training 16. GRID: Gradient Routing with In-Network Aggregation for Distributed Training — approx. systems/networking authors, 2024/2025 https://scholar.google.com/scholar?q=GRID:+Gradient+Routing+with+In-Network+Aggregation+for+Distributed+Training 17. InArt: In-Network Aggregation with Route Selection for Accelerating Distributed Training — approx. systems/networking authors, 2024/2025 https://scholar.google.com/scholar?q=InArt:+In-Network+Aggregation+with+Route+Selection+for+Accelerating+Distributed+Training 18. PrivyNAS: Privacy-Aware Neural Architecture Search for Split Computing in Edge-Cloud Systems — approx. edge AI / NAS authors, 2024/2025 https://scholar.google.com/scholar?q=PrivyNAS:+Privacy-Aware+Neural+Architecture+Search+for+Split+Computing+in+Edge-Cloud+Systems 19. Advancements and Challenges in Privacy-Preserving Split Learning: Experimental Findings and Future Directions — approx. survey/review authors, 2024/2025 https://scholar.google.com/scholar?q=Advancements+and+Challenges+in+Privacy-Preserving+Split+Learning:+Experimental+Findings+and+Future+Directions 20. Lightweight User-Personalization Method for Closed Split Computing — approx. split-computing authors, 2024/2025 https://scholar.google.com/scholar?q=Lightweight+User-Personalization+Method+for+Closed+Split+Computing 21. AI Post Transformers: FlatAttention for Tile-Based Accelerator Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-flatattention-for-tile-based-accelerator-56e6ca.mp3 22. AI Post Transformers: CXL Computational Memory Offloading for Lower Runtime — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-cxl-computational-memory-offloading-for-3b2124.mp3 23. AI Post Transformers: FAST26: Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/fast26-bidaw-enhancing-key-value-caching-for-interactive-llm-serving-via-bidirec/ 24. AI Post Transformers: Episode: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3 25. AI Post Transformers: Memory Sparse Attention for 100M-Token Scaling — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-07-memory-sparse-attention-for-100m-token-s-377cff.mp3 26. AI Post Transformers: Accelerating LLM Cold Starts with Programmable Page Cache — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-17-accelerating-llm-cold-starts-with-progra-0912d1.mp3 27. AI Post Transformers: Paris: Decentralized Open-Weight Diffusion Model — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/paris-decentralized-open-weight-diffusion-model/ Interactive Visualization: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure
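    As a companion to the roofline framing in the episode above, here is a minimal sketch in Python of how attainable throughput is capped either by peak compute or by memory bandwidth times arithmetic intensity. The hardware constants are illustrative assumptions, not figures from the paper.

        # Roofline sketch: attainable FLOP/s = min(peak compute, bandwidth * FLOPs-per-byte).
        PEAK_FLOPS = 312e12   # assumed peak compute, FLOP/s (roughly an A100-class GPU at FP16)
        MEM_BW = 2.0e12       # assumed HBM bandwidth, bytes/s

        def attainable_flops(arithmetic_intensity: float) -> float:
            """arithmetic_intensity is FLOPs performed per byte moved to or from memory."""
            return min(PEAK_FLOPS, MEM_BW * arithmetic_intensity)

        # On these numbers, ~4 FLOPs per byte is bandwidth-bound; ~400 is compute-bound.
        for ai in (4.0, 40.0, 400.0):
            print(f"AI={ai:5.0f} FLOP/B -> {attainable_flops(ai) / 1e12:6.1f} TFLOP/s")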

  2. 1D AGO

    DRAM-Free In-Flash Computing for LLM Inference

    This episode explores a 2025 arXiv paper proposing “KVNAND,” an on-device LLM inference system that stores both model weights and the attention KV cache in compute-enabled 3D NAND flash to reduce or eliminate reliance on external DRAM. The discussion explains why decode-time generation is often bottlenecked by memory movement rather than raw compute, and argues that the KV cache—not just model weights—has become a major systems problem for long-context inference. It also examines whether the paper’s “DRAM-free” claim is technically convincing, especially given how KV cache costs vary across attention designs like MHA, GQA, and MQA. A listener would find it interesting for its concrete look at hardware-software tradeoffs in local LLM deployment and its skepticism about whether flashy architectural claims hold up under realistic workloads. Sources: 1. KVNAND: Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing — Lishuo Deng, Shaojie Xu, Jinwu Chen, Changwei Yan, Jiajie Wang, Zhe Jiang, Weiwei Shan, 2025 http://arxiv.org/abs/2512.03608 2. A Survey of Processing-in-Memory: Techniques, Applications, and Challenges — Seyed H. N. Fatemi Langroudi and others, 2024 https://scholar.google.com/scholar?q=A+Survey+of+Processing-in-Memory:+Techniques,+Applications,+and+Challenges 3. Computational Storage: Where Are We Today? — Keith Townsend, Nils Bjerregaard, Javier Gonzalez and others, 2022 https://scholar.google.com/scholar?q=Computational+Storage:+Where+Are+We+Today? 4. Cambricon-LLM: Memory-Efficient Large Language Model Inference with Compute-Enabled Flash Memory — Main authors from the Cambricon research team, 2024 https://scholar.google.com/scholar?q=Cambricon-LLM:+Memory-Efficient+Large+Language+Model+Inference+with+Compute-Enabled+Flash+Memory 5. Lincoln: Accelerating Long-Context LLM Inference with In-Flash Computing — Main authors from the Lincoln research team, 2024 https://scholar.google.com/scholar?q=Lincoln:+Accelerating+Long-Context+LLM+Inference+with+In-Flash+Computing 6. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU — Ying Sheng, Yuzhang Wang, Beidi Chen and others, 2024 https://scholar.google.com/scholar?q=PowerInfer:+Fast+Large+Language+Model+Serving+with+a+Consumer-grade+GPU 7. LLM in a Flash: Efficient Large Language Model Inference with Limited Memory — Keiichi Yao, Siddharth Joshi, Priya Goyal and others, 2024 https://scholar.google.com/scholar?q=LLM+in+a+Flash:+Efficient+Large+Language+Model+Inference+with+Limited+Memory 8. Speculative Decoding for Accelerating Large Language Model Inference — Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023 https://scholar.google.com/scholar?q=Speculative+Decoding+for+Accelerating+Large+Language+Model+Inference 9. PagedAttention: Efficient Memory Management for Large Language Model Serving with Paged KV Cache — Woosuk Kwon, Zhihong Shen, Siyuan Zhuang and others, 2023 https://scholar.google.com/scholar?q=PagedAttention:+Efficient+Memory+Management+for+Large+Language+Model+Serving+with+Paged+KV+Cache 10. Cambricon-LLM — Not specified in the provided excerpt, Likely 2024-2025 https://scholar.google.com/scholar?q=Cambricon-LLM 11. Lincoln — Not specified in the provided excerpt, Likely 2024-2025 https://scholar.google.com/scholar?q=Lincoln 12. LLaMA 2 — Hugo Touvron et al., 2023 https://scholar.google.com/scholar?q=LLaMA+2 13. Llama 3.1 — Meta AI, 2024 https://scholar.google.com/scholar?q=Llama+3.1 14. 
PagedAttention / vLLM — Woosuk Kwon et al., 2023 https://scholar.google.com/scholar?q=PagedAttention+/+vLLM 15. FlashAttention — Tri Dao et al., 2022 https://scholar.google.com/scholar?q=FlashAttention 16. MQA/GQA transformer variants such as GQA in Llama-family models — Various, 2023-2024 https://scholar.google.com/scholar?q=MQA/GQA+transformer+variants+such+as+GQA+in+Llama-family+models 17. Computational storage / near-data processing in SSDs — Various, 2019-2024 https://scholar.google.com/scholar?q=Computational+storage+/+near-data+processing+in+SSDs 18. RazorAttention: Efficient KV Cache Compression through Retrieval Heads — approx. recent LLM systems/ML authors, 2024/2025 https://scholar.google.com/scholar?q=RazorAttention:+Efficient+KV+Cache+Compression+through+Retrieval+Heads 19. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — approx. recent LLM efficiency authors, 2024/2025 https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning 20. AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models — approx. recent LLM inference authors, 2024/2025 https://scholar.google.com/scholar?q=AhaKV:+Adaptive+Holistic+Attention-Driven+KV+Cache+Eviction+for+Efficient+Inference+of+Large+Language+Models 21. G-KV: Decoding-Time KV Cache Eviction with Global Attention — approx. recent LLM efficiency authors, 2024/2025 https://scholar.google.com/scholar?q=G-KV:+Decoding-Time+KV+Cache+Eviction+with+Global+Attention 22. LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference — approx. recent LLM inference authors, 2024/2025 https://scholar.google.com/scholar?q=LLMs+Know+What+to+Drop:+Self-Attention+Guided+KV+Cache+Eviction+for+Efficient+Long-Context+Inference 23. Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-Level Caching — approx. recent systems authors, 2024/2025 https://scholar.google.com/scholar?q=Harnessing+Your+DRAM+and+SSD+for+Sustainable+and+Accessible+LLM+Inference+with+Mixed-Precision+and+Multi-Level+Caching 24. Efficient LLM Inference Using Dynamic Input Pruning and Cache-Aware Masking — approx. recent mobile/edge inference authors, 2024/2025 https://scholar.google.com/scholar?q=Efficient+LLM+Inference+Using+Dynamic+Input+Pruning+and+Cache-Aware+Masking 25. SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving — approx. recent edge serving authors, 2024/2025 https://scholar.google.com/scholar?q=SLED:+A+Speculative+LLM+Decoding+Framework+for+Efficient+Edge+Serving 26. DSSD: Efficient Edge-Device LLM Deployment and Collaborative Inference via Distributed Split Speculative Decoding — approx. recent edge/cloud systems authors, 2024/2025 https://scholar.google.com/scholar?q=DSSD:+Efficient+Edge-Device+LLM+Deployment+and+Collaborative+Inference+via+Distributed+Split+Speculative+Decoding 27. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3 28. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3 29. 
AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3 30. AI Post Transformers: QuantSpec: Hierarchical KV Cache for Self-Speculative Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/quantspec-hierarchical-kv-cache-for-self-speculative-decoding/ 31. AI Post Transformers: CXL-SpecKV: Bridging the LLM Memory Wall with Speculative FPGA Disaggregation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/cxl-speckv-bridging-the-llm-memory-wall-with-speculative-fpga-disaggregation/ 32. AI Post Transformers: FAST26: Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/fast26-bidaw-enhancing-key-value-caching-for-interactive-llm-serving-via-bidirec/ 33. AI Post Transformers: FlexGen: High-Throughput LLM Inference on a Single GPU — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/flexgen-high-throughput-llm-inference-on-a-single-gpu/ 34. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3 Interactive Visualization: DRAM-Free In-Flash Computing for LLM Inference
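    To make the MHA/GQA/MQA comparison from the episode above concrete, here is a rough back-of-envelope sketch of KV-cache size. The model dimensions are illustrative assumptions (roughly a 7B-class model), not numbers from the KVNAND paper; only the number of KV heads changes across the three variants.

        # KV-cache bytes = 2 (K and V) * layers * tokens * kv_heads * head_dim * bytes per element.
        def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
            return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

        for name, kv_heads in [("MHA", 32), ("GQA, 8 KV heads", 8), ("MQA", 1)]:
            gib = kv_cache_bytes(seq_len=32_768, n_kv_heads=kv_heads) / 2**30
            print(f"{name:16s}: ~{gib:5.1f} GiB of KV cache at 32K tokens")

    Under these assumptions the cache runs from roughly 16 GiB (MHA) down to 0.5 GiB (MQA) at 32K tokens, which is why the episode treats the attention design, not just the weights, as central to any "DRAM-free" claim.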

  3. 3D AGO

    Cache Mechanism for Agent RAG Systems

    This episode explores a 2025 paper on cache management for agentic RAG systems, asking whether an annotation-free cache can preserve most of the value of a massive retrieval corpus while using far less storage and reducing latency. It explains how RAG, agent memory, vector databases, embeddings, and approximate nearest neighbor search fit together, arguing that retrieval performance is not just a modeling issue but a core systems constraint for real-world agents. The discussion situates the paper in the broader history of retrieval and agent research, from Word2Vec and BERT to Dense Passage Retrieval, ReAct, and FAISS, showing why externalized knowledge remains useful even as language models grow larger. Listeners would find it interesting because it focuses on a practical but consequential question: how to make retrieval-heavy AI agents cheaper, faster, and more deployable outside large cloud infrastructures. Sources: 1. Cache Mechanism for Agent RAG Systems — Shuhang Lin, Zhencan Peng, Lingyao Li, Xiao Lin, Xi Zhu, Yongfeng Zhang, 2025 http://arxiv.org/abs/2511.02919 2. PlanRAG — Lee et al., 2024 https://scholar.google.com/scholar?q=PlanRAG 3. Generate-then-Ground — Shi et al., 2024 https://scholar.google.com/scholar?q=Generate-then-Ground 4. RAP — Kagaya et al., 2024 https://scholar.google.com/scholar?q=RAP 5. RAT — Wang et al., 2024 https://scholar.google.com/scholar?q=RAT 6. Mei et al. (system engineering / large knowledge repositories) — Mei et al., 2025 https://scholar.google.com/scholar?q=Mei+et+al.+(system+engineering+/+large+knowledge+repositories) 7. Guo et al. on RAG-powered agent architectures — Guo et al., 2025 https://scholar.google.com/scholar?q=Guo+et+al.+on+RAG-powered+agent+architectures 8. Long Context vs. RAG for LLMs: An Evaluation and Revisits — approx. recent LLM/RAG evaluation authors, 2024/2025 https://scholar.google.com/scholar?q=Long+Context+vs.+RAG+for+LLMs:+An+Evaluation+and+Revisits 9. Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? — approx. recent long-context LLM systems authors, 2024/2025 https://scholar.google.com/scholar?q=Can+Long-Context+Language+Models+Subsume+Retrieval,+RAG,+SQL,+and+More? 10. Predicting Retrieval Utility and Answer Quality in Retrieval-Augmented Generation — approx. recent RAG evaluation/prediction authors, 2024/2025 https://scholar.google.com/scholar?q=Predicting+Retrieval+Utility+and+Answer+Quality+in+Retrieval-Augmented+Generation 11. Relevance Filtering for Embedding-Based Retrieval — approx. recent dense retrieval / IR authors, 2024/2025 https://scholar.google.com/scholar?q=Relevance+Filtering+for+Embedding-Based+Retrieval 12. Volatility-Driven Decay: Adaptive Memory Retention for RAG Systems Under Unknown Drift — approx. recent continual RAG / memory authors, 2025 https://scholar.google.com/scholar?q=Volatility-Driven+Decay:+Adaptive+Memory+Retention+for+RAG+Systems+Under+Unknown+Drift 13. On the Role of Long-Tail Knowledge in Retrieval Augmented Large Language Models — approx. recent RAG robustness authors, 2024/2025 https://scholar.google.com/scholar?q=On+the+Role+of+Long-Tail+Knowledge+in+Retrieval+Augmented+Large+Language+Models 14. Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge — approx. recent biomedical retrieval authors, 2024/2025 https://scholar.google.com/scholar?q=Graph-Based+Retriever+Captures+the+Long+Tail+of+Biomedical+Knowledge 15. FIT-RAG: Black-Box RAG with Factual Information and Token Reduction — approx. 
recent black-box RAG authors, 2024/2025 https://scholar.google.com/scholar?q=FIT-RAG:+Black-Box+RAG+with+Factual+Information+and+Token+Reduction 16. AI Post Transformers: QVCache for Semantic Caching in ANN Search — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-qvcache-for-semantic-caching-in-ann-sear-415304.mp3 17. AI Post Transformers: Episode: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3 18. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3 19. AI Post Transformers: Doc-to-LoRA: Internalizing Context as LoRA — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-29-doc-to-lora-internalizing-context-as-lor-8dd5ec.mp3 20. AI Post Transformers: ColBERT and ColBERT v2 — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/colbert-and-colbert-v2/ Interactive Visualization: Cache Mechanism for Agent RAG Systems
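    Purely as an illustration of the caching idea discussed in the episode above (not the paper's actual mechanism), here is a toy semantic cache that reuses previously retrieved passages when a new query embedding is close enough to a cached one, and otherwise signals a miss so the agent can fall back to the full vector index.

        import numpy as np

        class SemanticCache:
            """Reuse retrieved passages when a new query embedding is close to a cached one."""

            def __init__(self, threshold: float = 0.9):
                self.threshold = threshold
                self.entries = []                          # list of (unit query vector, passages)

            def lookup(self, q: np.ndarray):
                q = q / np.linalg.norm(q)
                for vec, passages in self.entries:
                    if float(vec @ q) >= self.threshold:   # cosine-similarity hit
                        return passages
                return None                                # miss: caller queries the full ANN index

            def insert(self, q: np.ndarray, passages):
                self.entries.append((q / np.linalg.norm(q), passages))

    On a miss the agent would query the full vector database and insert the result, so the bulky corpus and its index never have to live on the device serving the agent.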

  4. 3D AGO

    Memory Sparse Attention for 100M-Token Scaling

    This episode explores a paper proposing Memory Sparse Attention, an end-to-end trainable memory architecture designed to scale language models from ordinary long-context settings to 100 million tokens. The discussion explains why standard dense self-attention becomes infeasible at extreme lengths, distinguishes simple context-window extension from true “lifetime-scale” memory, and situates the approach among alternatives like parameter-based memory, recurrent compression, and external retrieval systems such as RAG. It argues that the paper’s core idea is selective, trainable access to a small set of relevant memory segments rather than treating all past tokens as one continuous stream, while also noting the authors’ ambitious systems claims around practical inference. A listener would find it interesting for its clear framing of what makes ultra-long-context modeling hard, and for its skeptical but concrete examination of whether this architecture meaningfully bridges the gap between long prompts and persistent memory. Sources: 1. MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens — Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li, Jianjin Zhang, Jun Sun, Chuanrui Hu, Yunyun Han, Lidong Bing, Yafeng Deng, Tianqiao Chen, 2026 http://arxiv.org/abs/2603.23516 2. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, et al., 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 3. Ring Attention with Blockwise Transformers for Near-Infinite Context — Aidan N. Gomez, Sean Dao, and collaborators, 2023 https://scholar.google.com/scholar?q=Ring+Attention+with+Blockwise+Transformers+for+Near-Infinite+Context 4. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019 https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism 5. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Tri Dao, 2023 https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning 6. RoFormer: Enhanced Transformer with Rotary Position Embedding — Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu, 2021 https://scholar.google.com/scholar?q=RoFormer:+Enhanced+Transformer+with+Rotary+Position+Embedding 7. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation — Ofir Press, Noah A. Smith, Mike Lewis, 2021 https://scholar.google.com/scholar?q=Train+Short,+Test+Long:+Attention+with+Linear+Biases+Enables+Input+Length+Extrapolation 8. Extending Context Window of Large Language Models via Positional Interpolation — Shouyuan Chen, Sherman Wong, Liangcheng Luo, et al., 2023 https://scholar.google.com/scholar?q=Extending+Context+Window+of+Large+Language+Models+via+Positional+Interpolation 9. YaRN: Efficient Context Window Extension of Large Language Models — Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enming Luo, 2023 https://scholar.google.com/scholar?q=YaRN:+Efficient+Context+Window+Extension+of+Large+Language+Models 10. Titans: Learning to Memorize at Test Time — Ali Behrouz, Peilin Zhong, Vahab Mirrokni, 2025 https://scholar.google.com/scholar?q=Titans:+Learning+to+Memorize+at+Test+Time 11. 
Infini-attention: Infinite Context for Efficient Transformers — Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel, 2024 https://scholar.google.com/scholar?q=Infini-attention:+Infinite+Context+for+Efficient+Transformers 12. LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens — Yucheng Ding, Li Dong, et al., 2024 https://scholar.google.com/scholar?q=LongRoPE:+Extending+LLM+Context+Window+Beyond+2+Million+Tokens 13. MemGPT: Towards LLMs as Operating Systems — Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, Joseph E. Gonzalez, 2024 https://scholar.google.com/scholar?q=MemGPT:+Towards+LLMs+as+Operating+Systems 14. RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al., 2020 https://scholar.google.com/scholar?q=RAG:+Retrieval-Augmented+Generation+for+Knowledge-Intensive+NLP+Tasks 15. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache — Zhenyu Liu, et al., 2024 https://scholar.google.com/scholar?q=KIVI:+A+Tuning-Free+Asymmetric+2bit+Quantization+for+KV+Cache 16. PagedAttention / vLLM: Efficient Memory Management for Large Language Model Serving — Woosuk Kwon, Zhuohan Li, et al., 2023 https://scholar.google.com/scholar?q=PagedAttention+/+vLLM:+Efficient+Memory+Management+for+Large+Language+Model+Serving 17. Memorizing Transformers — Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, Christian Szegedy, 2022 https://scholar.google.com/scholar?q=Memorizing+Transformers 18. TransformerFAM / Focused Attention Memory variants for long-context retrieval — Various 2024-2025 authors, 2024-2025 https://scholar.google.com/scholar?q=TransformerFAM+/+Focused+Attention+Memory+variants+for+long-context+retrieval 19. OpenRAG: Optimizing RAG End-to-End via In-Context Retrieval Learning — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=OpenRAG:+Optimizing+RAG+End-to-End+via+In-Context+Retrieval+Learning 20. Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=Beyond+RAG+for+Agent+Memory:+Retrieval+by+Decoupling+and+Aggregation 21. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention — approx. 2024/2025 authors unclear from snippet, 2024/2025 https://scholar.google.com/scholar?q=MInference+1.0:+Accelerating+Pre-filling+for+Long-Context+LLMs+via+Dynamic+Sparse+Attention 22. SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=SampleAttention:+Near-Lossless+Acceleration+of+Long+Context+LLM+Inference+with+Adaptive+Structured+Sparse+Attention 23. FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=FlexPrefill:+A+Context-Aware+Sparse+Attention+Mechanism+for+Efficient+Long-Sequence+Inference 24. Kvlink: Accelerating Large Language Models via Efficient KV Cache Reuse — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=Kvlink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse 25. HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse — approx.
2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse 26. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=ProphetKV:+User-Query-Driven+Selective+Recomputation+for+Efficient+KV+Cache+Reuse+in+Retrieval-Augmented+Generation 27. Hierarchical Local-Global Transformer With Dynamic Positional Encoding for Document-Level Machine Translation — approx. 2024/2025 authors unclear from snippet, 2024/2025 https://scholar.google.com/scholar?q=Hierarchical+Local-Global+Transformer+With+Dynamic+Positional+Encoding+for+Document-Level+Machine+Translation 28. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3 29. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/ 30. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3 31. AI Post Transformers: CacheSlide: Position-Aware KV Cache Reuse for Agent LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-cacheslide-position-aware-kv-cache-reuse-cd59c7.mp3 32. AI Post Transformers: Doc-to-LoRA: Internalizing Context as LoRA — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-29-doc-to-lora-internalizing-context-as-lor-8dd5ec.mp3 33. AI Post Transformers: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3 34. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3
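    A minimal sketch of the "select a few relevant memory segments instead of the whole token history" idea described in the episode above. The per-segment summary keys and the single-vector query are simplifying assumptions for illustration, not the paper's architecture.

        # Assumed shapes: q [d], segment_keys [n_seg, d], segment_tokens [n_seg, L, d].
        import torch
        import torch.nn.functional as F

        def sparse_memory_attention(q, segment_keys, segment_tokens, k=4):
            scores = segment_keys @ q                                   # coarse relevance per segment
            top = torch.topk(scores, k=min(k, scores.numel())).indices  # selected segment ids
            kv = segment_tokens[top].reshape(-1, q.shape[-1])           # tokens of selected segments only
            attn = F.softmax(kv @ q / q.shape[-1] ** 0.5, dim=0)        # attention over the sparse set
            return attn @ kv                                            # keys double as values for brevity

    Because attention is computed only over the tokens of the top-k segments, cost grows with how much memory is touched per step rather than with the full lifetime of stored tokens.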

  5. 3D AGO

    TriAttention for Efficient Long-Context KV Compression

    This episode explores TriAttention, a new method for reducing KV-cache memory during long-context inference by modeling how attention behaves under Rotary Positional Embeddings rather than relying on recent attention patterns alone. It explains why common compression methods can fail for long reasoning tasks: under RoPE, queries at different positions are rotated into different coordinate systems, so a small window of recent post-RoPE queries is a poor predictor of which earlier tokens will matter later. The discussion highlights the paper’s dual contribution as both a systems result for making 32K-token-style reasoning more practical and a mechanistic argument that transformer attention has analyzable structure rather than being purely empirical. Listeners interested in efficient LLM serving, long-context reasoning, or the inner geometry of attention will find it compelling because it connects deployment bottlenecks with a concrete theoretical explanation. Sources: 1. TriAttention: Efficient Long Reasoning with Trigonometric KV Compression — Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen, 2026 http://arxiv.org/abs/2604.04921 2. RoFormer: Enhanced Transformer with Rotary Position Embedding — Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu, 2021 https://scholar.google.com/scholar?q=RoFormer:+Enhanced+Transformer+with+Rotary+Position+Embedding 3. A Mathematical Framework for Transformer Circuits — Nelson Elhage, Nicholas Joseph, Ajeya Cotra, Kaidi Cao, Jared Kaplan, et al., 2021 https://scholar.google.com/scholar?q=A+Mathematical+Framework+for+Transformer+Circuits 4. StreamingLLM: Efficient Streaming Language Models with Attention Sinks — Guangxuan Xiao, Yao Fu, Kuanlun Guo, Xuefei Ning, et al., 2023 https://scholar.google.com/scholar?q=StreamingLLM:+Efficient+Streaming+Language+Models+with+Attention+Sinks 5. TriAttention: Efficient Long Reasoning with Trigonometric KV Compression — Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen, 2026 https://scholar.google.com/scholar?q=TriAttention:+Efficient+Long+Reasoning+with+Trigonometric+KV+Compression 6. What Makes Rotary Positional Encodings Useful? — Federico Barbero, et al., 2025 https://scholar.google.com/scholar?q=What+Makes+Rotary+Positional+Encodings+Useful? 7. Attention Sinks and Massive Activation Values in Transformers — Xiaozhi Xiao, et al., 2025 https://scholar.google.com/scholar?q=Attention+Sinks+and+Massive+Activation+Values+in+Transformers 8. Heavy Hitter Oracle for Efficient Generative Inference of Large Language Models — Zirui Liu, et al., 2023 https://scholar.google.com/scholar?q=Heavy+Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models 9. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zirui Liu, et al., 2023 https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models 10. PyramidKV — Zhang, et al., 2024 https://scholar.google.com/scholar?q=PyramidKV 11. SnapKV — Li, et al., 2024 https://scholar.google.com/scholar?q=SnapKV 12. R-KV — Zhang, et al., 2025 https://scholar.google.com/scholar?q=R-KV 13. Vision Transformer Interpretability via Attention Rollout — Samira Abnar, Willem Zuidema, 2020 https://scholar.google.com/scholar?q=Vision+Transformer+Interpretability+via+Attention+Rollout 14. An Analysis of Attention Weights as a Proxy for Explanation — Sarthak Jain, Byron C. 
Wallace, 2019 https://scholar.google.com/scholar?q=An+Analysis+of+Attention+Weights+as+a+Proxy+for+Explanation 15. RazorAttention: Efficient KV Cache Compression Through Retrieval Heads — approx. Tang et al., 2024/2025 https://scholar.google.com/scholar?q=RazorAttention:+Efficient+KV+Cache+Compression+Through+Retrieval+Heads 16. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — approx. 2025 head-aware KV compression paper, 2025 https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning 17. FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference — approx. 2025, 2025 https://scholar.google.com/scholar?q=FreeKV:+Boosting+KV+Cache+Retrieval+for+Efficient+LLM+Inference 18. RAP: KV-Cache Compression via RoPE-Aligned Pruning — approx. 2025, 2025 https://scholar.google.com/scholar?q=RAP:+KV-Cache+Compression+via+RoPE-Aligned+Pruning 19. EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection — approx. 2025, 2025 https://scholar.google.com/scholar?q=EliteKV:+Scalable+KV+Cache+Compression+via+RoPE+Frequency+Selection+and+Joint+Low-Rank+Projection 20. Asymmetric KV Cache Compression using State-Aware Sparsity and Quantization — approx. 2025, 2025 https://scholar.google.com/scholar?q=Asymmetric+KV+Cache+Compression+using+State-Aware+Sparsity+and+Quantization 21. Efficient Streaming Language Models with Attention Sinks — Xiao et al., 2023 https://scholar.google.com/scholar?q=Efficient+Streaming+Language+Models+with+Attention+Sinks 22. When Attention Sink Emerges in Language Models: An Empirical View — approx. 2024/2025, 2024/2025 https://scholar.google.com/scholar?q=When+Attention+Sink+Emerges+in+Language+Models:+An+Empirical+View 23. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3 24. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3 25. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3 26. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/ 27. AI Post Transformers: Real Context Size and Context Rot — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-07-real-context-size-and-context-rot-56cbb4.mp3 Interactive Visualization: TriAttention for Efficient Long-Context KV Compression
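    To ground the RoPE point from the episode above, here is a tiny numeric illustration of generic rotary embeddings (not TriAttention's compression method): each feature pair is rotated by a position-dependent angle, so query-key scores depend only on the relative offset, while the rotated query vectors themselves change with absolute position.

        import numpy as np

        def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
            half = x.shape[-1] // 2
            freqs = base ** (-np.arange(half) / half)    # per-pair rotation frequencies
            angles = pos * freqs
            x1, x2 = x[:half], x[half:]
            return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                                   x1 * np.sin(angles) + x2 * np.cos(angles)])

        q, k = np.random.randn(64), np.random.randn(64)
        # The same relative offset (10 positions) gives the same score at any absolute position...
        print(np.dot(rope(q, 100), rope(k, 90)), np.dot(rope(q, 1000), rope(k, 990)))
        # ...but the rotated queries themselves sit in different coordinate frames, which is
        # why a small window of recent post-RoPE queries can mispredict which old keys matter.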

  6. 3D AGO

    When Spectral Gradient Updates Help Deep Learning

    This episode explores a theory paper that asks when spectral matrix updates should outperform standard Euclidean gradient methods in deep networks and transformers. It explains how spectral updates replace a gradient matrix with its polar factor—preserving singular-vector directions while flattening singular values—and argues that this geometry can help when incoming activations have low stable rank while gradients have high nuclear-rank-like spread. The discussion connects this criterion to practical excitement around spectral-style optimizers such as Muon, while contrasting them with curvature-based methods like K-FAC and Shampoo. Listeners would find it interesting because the episode turns a seemingly niche optimizer trick into a concrete, testable claim about the hidden geometry of neural network training. Sources: 1. When do spectral gradient updates help in deep learning? — Damek Davis, Dmitriy Drusvyatskiy, 2025 http://arxiv.org/abs/2512.04299 2. Shampoo: Preconditioned Stochastic Tensor Optimization — Vineet Gupta, Tomer Koren, Yoram Singer and others, 2018 https://scholar.google.com/scholar?q=Shampoo:+Preconditioned+Stochastic+Tensor+Optimization 3. K-FAC: Kronecker-Factored Approximate Curvature for Neural Network Optimization — James Martens, Roger Grosse, 2015 https://scholar.google.com/scholar?q=K-FAC:+Kronecker-Factored+Approximate+Curvature+for+Neural+Network+Optimization 4. Muon: An optimizer for hidden layers in neural networks — Keller Jordan and collaborators, 2024 https://scholar.google.com/scholar?q=Muon:+An+optimizer+for+hidden+layers+in+neural+networks 5. When do spectral gradient updates help in deep learning? — Damek Davis, Dmitriy Drusvyatskiy, 2026 https://scholar.google.com/scholar?q=When+do+spectral+gradient+updates+help+in+deep+learning? 6. Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation — Anonymous/various authors depending on version; commonly cited in transformer dynamics discussions, 2021 https://scholar.google.com/scholar?q=Deep+Transformers+without+Shortcuts:+Modifying+Self-attention+for+Faithful+Signal+Propagation 7. On the Softmax Bottleneck of Recurrent Language Models — Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen, Yoshua Bengio, 2018 https://scholar.google.com/scholar?q=On+the+Softmax+Bottleneck+of+Recurrent+Language+Models 8. Representation Degeneration Problem in Training Natural Language Generation Models — Junxian He, Daniel Spokoyny, Graham Neubig, Taylor Berg-Kirkpatrick, 2020 https://scholar.google.com/scholar?q=Representation+Degeneration+Problem+in+Training+Natural+Language+Generation+Models 9. Neural Collapse: A Terminal Phase of Deep Learning Training — Vardan Papyan, X. Y. Han, David L. Donoho, 2020 https://scholar.google.com/scholar?q=Neural+Collapse:+A+Terminal+Phase+of+Deep+Learning+Training 10. Understanding Dimensional Collapse in Contrastive Self-supervised Learning — Tianyu Hua, Wenxiao Wang, Zihang Dai and others, 2021 https://scholar.google.com/scholar?q=Understanding+Dimensional+Collapse+in+Contrastive+Self-supervised+Learning 11. The Intrinsic Dimension of Objective Landscapes — Chunyuan Li, Heerad Farkhoor, Rosanne Liu, Jason Yosinski, 2018 https://scholar.google.com/scholar?q=The+Intrinsic+Dimension+of+Objective+Landscapes 12. Random Features for Large-Scale Kernel Machines — Ali Rahimi, Benjamin Recht, 2007 https://scholar.google.com/scholar?q=Random+Features+for+Large-Scale+Kernel+Machines 13. 
A Random Matrix Perspective on Random Features for Compositional Kernels — Florent Krzakala, Lenka Zdeborová, and collaborators in the random-features theory community, 2019 https://scholar.google.com/scholar?q=A+Random+Matrix+Perspective+on+Random+Features+for+Compositional+Kernels 14. The Surprising Effectiveness of Random Features for Structured Data — Various authors across theory and applied ML; representative random-feature comparison literature, 2010s-2020s https://scholar.google.com/scholar?q=The+Surprising+Effectiveness+of+Random+Features+for+Structured+Data 15. Spectral Gradient Descent — Yair Carmon, John C. Duchi, Oliver Hinder, Aaron Sidford, 2021 https://scholar.google.com/scholar?q=Spectral+Gradient+Descent 16. A Kronecker-factored approximate Fisher matrix for convolution layers — Roger Grosse, Jimmy Ba, et al., 2016 https://scholar.google.com/scholar?q=A+Kronecker-factored+approximate+Fisher+matrix+for+convolution+layers 17. Feature Learning in Infinite-Width Neural Networks — Mario Geiger, Stefano Spigler, Arthur Jacot, Matthieu Wyart, 2020 https://scholar.google.com/scholar?q=Feature+Learning+in+Infinite-Width+Neural+Networks 18. Neural Collapse: A Review and Synthesis — Vardan Papyan, X.Y. Han, David L. Donoho, 2023 https://scholar.google.com/scholar?q=Neural+Collapse:+A+Review+and+Synthesis 19. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning — Arora et al., 2021 https://scholar.google.com/scholar?q=Intrinsic+Dimensionality+Explains+the+Effectiveness+of+Language+Model+Fine-Tuning 20. Understanding transformers for time series: Rank structure, flow-of-ranks, and compressibility — approx. recent transformer interpretability / theory authors, recent https://scholar.google.com/scholar?q=Understanding+transformers+for+time+series:+Rank+structure,+flow-of-ranks,+and+compressibility 21. Tuning stable rank shrinkage: Aiming at the overlooked structural risk in fine-tuning — approx. recent fine-tuning / representation learning authors, recent https://scholar.google.com/scholar?q=Tuning+stable+rank+shrinkage:+Aiming+at+the+overlooked+structural+risk+in+fine-tuning 22. Unraveling the gradient descent dynamics of transformers — approx. recent optimization theory authors, recent https://scholar.google.com/scholar?q=Unraveling+the+gradient+descent+dynamics+of+transformers 23. AI Post Transformers: Adam: A Method for Stochastic Optimization — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/adam-a-method-for-stochastic-optimization/ 24. AI Post Transformers: AdamW: Decoupled Weight Decay Regularization for Adaptive Gradient Algorithms — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/adamw-decoupled-weight-decay-regularization-for-adaptive-gradient-algorithms/ 25. AI Post Transformers: In-Context Learning as Implicit Learning Algorithms — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/in-context-learning-as-implicit-learning-algorithms/ Interactive Visualization: When Spectral Gradient Updates Help Deep Learning
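    A generic sketch of the two quantities the episode above leans on, offered as an illustration rather than the paper's or Muon's exact algorithm: the polar factor of a gradient matrix, which keeps its singular vectors but flattens all singular values to one, and the stable rank ||A||_F^2 / ||A||_2^2 used as the low-versus-high-rank diagnostic.

        import numpy as np

        def polar_factor(G: np.ndarray) -> np.ndarray:
            U, _, Vt = np.linalg.svd(G, full_matrices=False)
            return U @ Vt                   # same singular vectors, singular values flattened to 1

        def stable_rank(A: np.ndarray) -> float:
            s = np.linalg.svd(A, compute_uv=False)
            return float((s ** 2).sum() / (s[0] ** 2))     # ||A||_F^2 / ||A||_2^2

        G = np.random.randn(256, 64) @ np.random.randn(64, 512)   # a deliberately low-rank "gradient"
        W = np.random.randn(256, 512)
        print("stable rank of G:", round(stable_rank(G), 2))
        W -= 1e-2 * polar_factor(G)         # spectral-style step in place of W -= lr * G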

  7. 4D AGO

    Real Context Size and Context Rot

    In this episode, Hal Turing and Dr. Ada Shannon return to a term they used in their Recursive Language Models conversation without fully defining it: context rot. Using Chroma Research’s 2025 write-up as the main anchor, they explain context rot as the degraded, uneven, and unreliable use of information as prompts get longer—even on simple tasks. The discussion makes the central distinction the industry often blurs: advertised context capacity is not the same as usable context. A model may accept 128K or even a million tokens without crashing, but that does not mean it can reliably retrieve, connect, and reason over what was placed inside that buffer. They pair Chroma’s failure analysis with RULER, the 2024 NVIDIA-led benchmark paper asking a more practical question: what is a model’s real context size, meaning the longest prompt length at which performance remains satisfactory? The episode walks through why older long-context tests, especially vanilla needle-in-a-haystack retrieval, were too flattering. Hal and Ada discuss how simple retrieval benchmarks mostly measure lexical lookup, while stronger evaluations must test reference tracing, aggregation across documents, resilience to distraction, and whether the model is actually using the supplied prompt rather than answering from parametric knowledge stored in its weights. They also briefly credit the Gemini 1.5 technical report for explicitly calling on the field to build harder long-context benchmarks, then situate RULER alongside the benchmark ecosystem that followed, including LongBench and InfiniteBench, with a dedicated RULER episode coming soon. The larger thesis is that a giant context window should not be mistaken for memory. For retrieval-augmented generation, document-grounded assistants, and agent systems, a long prompt is at best an unstructured buffer—a cluttered desk or overstuffed backpack—not a real memory architecture. As the hosts argue, once context rot sets in, simply adding more tokens stops helping and can actively degrade reliability. If the goal is AI systems that truly remember and reason across large bodies of information, then memory and storage have to become first-class design elements: managed, tiered, retrievable, structured, and persistent, rather than just a bigger pile of tokens shoved into the prompt. Sources: 1. RULER: What's the Real Context Size of Your Long-Context Language Models? — Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg, 2024 http://arxiv.org/abs/2404.06654 2. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding — Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 2023 http://arxiv.org/abs/2308.14508 3. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned — Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson and others, 2024 https://scholar.google.com/scholar?q=Red+Teaming+Language+Models+to+Reduce+Harms:+Methods,+Scaling+Behaviors,+and+Lessons+Learned 4. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models — A team including researchers from academia and industry; commonly cited under the JailbreakBench project authorship, 2024 https://scholar.google.com/scholar?q=JailbreakBench:+An+Open+Robustness+Benchmark+for+Jailbreaking+Large+Language+Models 5. 
Holistic Evaluation of Language Models — Percy Liang, Rishi Bommasani, Tony Lee, Dmitriy Ryaboy and many collaborators, 2022 https://scholar.google.com/scholar?q=Holistic+Evaluation+of+Language+Models 6. Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models — Researchers studying jailbreak prompt collections from public communities; commonly cited as a characterization study of DAN-style prompts, 2024 https://scholar.google.com/scholar?q=Do+Anything+Now:+Characterizing+and+Evaluating+In-The-Wild+Jailbreak+Prompts+on+Large+Language+Models 7. Lost in the Middle: How Language Models Use Long Contexts — Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, 2024 https://scholar.google.com/scholar?q=Lost+in+the+Middle:+How+Language+Models+Use+Long+Contexts 8. Needle In A Haystack - Pressure Testing LLMs — Greg Kamradt, 2023 https://scholar.google.com/scholar?q=Needle+In+A+Haystack+-+Pressure+Testing+LLMs 9. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding — Yucheng Bai, Xintong Lu, Lianghao Wang, Xiaoxuan Liu, Weisheng Wang, Bo Zheng, Hongting Lin, Xinyu Dai, Wayne Xin Zhao, Ruifeng Xu, 2024 https://scholar.google.com/scholar?q=LongBench:+A+Bilingual,+Multitask+Benchmark+for+Long+Context+Understanding 10. L-Eval: Instituting Standardized Evaluation for Long Context Language Models — Chenglong Su, Jiarui Fang, Haozhe Ji, et al., 2024 https://scholar.google.com/scholar?q=L-Eval:+Instituting+Standardized+Evaluation+for+Long+Context+Language+Models 11. InfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens — Yifan Zhang, Weizhi Wang, et al., 2024 https://scholar.google.com/scholar?q=InfiniteBench:+Extending+Long+Context+Evaluation+Beyond+100K+Tokens 12. BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models — Ying Sheng, et al., 2024 https://scholar.google.com/scholar?q=BAMBOO:+A+Comprehensive+Benchmark+for+Evaluating+Long+Text+Modeling+Capacities+of+Large+Language+Models 13. Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach — Tianle Cai, et al., 2024 https://scholar.google.com/scholar?q=Retrieval+Augmented+Generation+or+Long-Context+LLMs?+A+Comprehensive+Study+and+Hybrid+Approach 14. Rethinking the Role of Scaling Laws in the Long Context Performance of Large Language Models — Various 2024 long-context scaling studies cited around Liu et al./Young et al., 2024 https://scholar.google.com/scholar?q=Rethinking+the+Role+of+Scaling+Laws+in+the+Long+Context+Performance+of+Large+Language+Models 15. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-Context Multitasks — approx. Bai et al. / THUDM-affiliated LongBench follow-up team, 2024 https://scholar.google.com/scholar?q=LongBench+v2:+Towards+Deeper+Understanding+and+Reasoning+on+Realistic+Long-Context+Multitasks 16. LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark — approx. LongBench/THUDM-style benchmark authors, 2024 https://scholar.google.com/scholar?q=LongBench+Pro:+A+More+Realistic+and+Comprehensive+Bilingual+Long-Context+Evaluation+Benchmark 17. Why Does the Effective Context Length of LLMs Fall Short? — approx. unknown from snippet, 2024 https://scholar.google.com/scholar?q=Why+Does+the+Effective+Context+Length+of+LLMs+Fall+Short? 18. 
BABILong-ITA: A New Benchmark for Testing Large Language Models Effective Context Length and a Context Extension Method — approx. unknown from snippet, 2024 https://scholar.google.com/scholar?q=BABILong-ITA:+A+New+Benchmark+for+Testing+Large+Language+Models+Effective+Context+Length+and+a+Context+Extension+Method 19. Precursors, Proxies, and Predictive Models for Long-Horizon Tasks — approx. unknown from snippet, 2024 https://scholar.google.com/scholar?q=Precursors,+Proxies,+and+Predictive+Models+for+Long-Horizon+Tasks 20. The What, Why, and How of Context Length Extension Techniques in Large Language Models--A Detailed Survey — approx. unknown from snippet, 2024 https://scholar.google.com/scholar?q=The+What,+Why,+and+How+of+Context+Length+Extension+Techniques+in+Large+Language+Models--A+Detailed+Survey 21. AI Post Transformers: Doc-to-LoRA: Internalizing Context as LoRA — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-29-doc-to-lora-internalizing-context-as-lor-8dd5ec.mp3 22. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3 23. AI Post Transformers: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3 24. AI Post Transformers: AI Agent Traps and Prompt Injection — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-02-ai-agent-traps-and-prompt-injection-7ce4ba.mp3 25. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3 26. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3 27. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3 Interactive Visualization: Real Context Size and Context Rot
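    The RULER-style notion of "real context size" discussed above can be phrased as a simple search: the longest prompt length at which task accuracy stays above a chosen threshold. A schematic sketch follows, where evaluate_at_length stands in for any long-context task harness; the name, threshold, and monotonic-degradation assumption are ours, not RULER's code.

        # Schematic only: sweep prompt lengths and report the largest one where the
        # model still scores above a satisfaction threshold on the long-context task.
        def effective_context_size(evaluate_at_length, lengths, threshold=0.85):
            """evaluate_at_length(n_tokens) -> accuracy in [0, 1]; lengths is ascending."""
            best = 0
            for n in lengths:
                if evaluate_at_length(n) >= threshold:
                    best = n          # still satisfactory at this length
                else:
                    break             # context rot: longer prompts stop helping
            return best

        # e.g. effective_context_size(run_long_context_task, [4_096, 8_192, 16_384, 32_768, 131_072])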

  8. 4D AGO

    Speculative Decoding in Real vLLM Serving

    This episode explores whether speculative decoding’s widely cited inference speedups survive real deployment conditions, using a January 2026 UC Berkeley paper that evaluates the method inside vLLM rather than in idealized toy benchmarks. It explains the core mechanics of draft-and-verify decoding, then digs into why acceptance length, verification cost, scheduler behavior, batching, KV-cache management, and long generations can erase much of the theoretical advantage in production serving stacks. The discussion also clarifies the difference between speculative decoding and multi-token prediction, situating approaches like MEDUSA and EAGLE within the broader effort to reduce autoregressive bottlenecks. Listeners interested in LLM systems will find it compelling because it shifts the conversation from flashy benchmark bar charts to the practical question of what actually improves wall-clock latency for real workloads. Sources: 1. Speculative Decoding: Performance or Illusion? — Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung, 2025 http://arxiv.org/abs/2601.11580 2. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022 https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness 3. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, et al., 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 4. Ring Attention with Blockwise Transformers for Near-Infinite Context — William Bevington, et al., 2023 https://scholar.google.com/scholar?q=Ring+Attention+with+Blockwise+Transformers+for+Near-Infinite+Context 5. Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation — Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023 https://scholar.google.com/scholar?q=Speculative+Decoding:+Exploiting+Speculative+Execution+for+Accelerating+Seq2seq+Generation 6. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023 https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding 7. MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — Tianle Cai, Yuhong Li, Zhengxu Chen, et al., 2024 https://scholar.google.com/scholar?q=MEDUSA:+Simple+LLM+Inference+Acceleration+Framework+with+Multiple+Decoding+Heads 8. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Yuhui Li, et al., 2024 https://scholar.google.com/scholar?q=EAGLE:+Speculative+Sampling+Requires+Rethinking+Feature+Uncertainty 9. Speculative Decoding: Performance or Illusion? — Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung, 2026 https://scholar.google.com/scholar?q=Speculative+Decoding:+Performance+or+Illusion? 10. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al., 2023 https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention 11. EAGLE-3 — Authors as cited in the paper's related work, 2024 https://scholar.google.com/scholar?q=EAGLE-3 12. Multi-Token Prediction — Liu et al.; Zeng et al., 2025 https://scholar.google.com/scholar?q=Multi-Token+Prediction 13. 
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding — Xia et al., 2024 https://scholar.google.com/scholar?q=Unlocking+Efficiency+in+Large+Language+Model+Inference:+A+Comprehensive+Survey+of+Speculative+Decoding 14. A Systematic Study of Speculative Decoding in Computation-Bound Regimes — Liu et al., 2024 https://scholar.google.com/scholar?q=A+Systematic+Study+of+Speculative+Decoding+in+Computation-Bound+Regimes 15. N-Gram Speculative Decoding — Saxena; Somasundaram et al., 2023/2024 https://scholar.google.com/scholar?q=N-Gram+Speculative+Decoding 16. Determinism and Nondeterminism in LLM Inference — He, 2025 https://scholar.google.com/scholar?q=Determinism+and+Nondeterminism+in+LLM+Inference 17. Block Verification Accelerates Speculative Decoding — unknown from snippet, likely 2024-2025 https://scholar.google.com/scholar?q=Block+Verification+Accelerates+Speculative+Decoding 18. Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding — Zhang et al. / likely 2024, 2024 https://scholar.google.com/scholar?q=Draft+&+Verify:+Lossless+Large+Language+Model+Acceleration+via+Self-Speculative+Decoding 19. MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding — unknown from snippet, likely 2024-2025 https://scholar.google.com/scholar?q=MagicDec:+Breaking+the+Latency-Throughput+Tradeoff+for+Long+Context+Generation+with+Speculative+Decoding 20. SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths — unknown from snippet, likely 2024-2025 https://scholar.google.com/scholar?q=SpecDec++:+Boosting+Speculative+Decoding+via+Adaptive+Candidate+Lengths 21. Adaptive Speculative Decoding for Large Language Models — unknown from snippet, likely 2024 https://scholar.google.com/scholar?q=Adaptive+Speculative+Decoding+for+Large+Language+Models 22. Opt-Tree: Speculative Decoding with Adaptive Draft Tree Structure — unknown from snippet, likely 2024-2025 https://scholar.google.com/scholar?q=Opt-Tree:+Speculative+Decoding+with+Adaptive+Draft+Tree+Structure 23. Draft Model Knows When to Stop: Self-Verification Speculative Decoding for Long-Form Generation — unknown from snippet, likely 2024-2025 https://scholar.google.com/scholar?q=Draft+Model+Knows+When+to+Stop:+Self-Verification+Speculative+Decoding+for+Long-Form+Generation 24. Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding — unknown from snippet, likely 2025 https://scholar.google.com/scholar?q=Draft+Model+Knows+When+to+Stop:+A+Self-Verification+Length+Policy+for+Speculative+Decoding 25. AI Post Transformers: Adaptive Control for Batched Speculative Decoding in LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/adaptive-control-for-batched-speculative-decoding-in-llm-serving/ 26. AI Post Transformers: Building Production-Ready Speculative Decoding with TensorRT-LLM — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/building-production-ready-speculative-decoding-with-tensorrt-llm/ 27. AI Post Transformers: Apple's Speculative Streaming: Fast LLM Inference without Auxiliary Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/apples-speculative-streaming-fast-llm-inference-without-auxiliary-models/ 28. AI Post Transformers: Episode: Speculative Speculative Decoding — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-speculative-speculative-decoding-1b7a10.mp3 29. AI Post Transformers: Continuous Batching for LLM Inference: Throughput and Latency Gains — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/continuous-batching-for-llm-inference-throughput-and-latency-gains/ 30. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3 Interactive Visualization: Speculative Decoding in Real vLLM Serving
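    A back-of-envelope sketch of the acceptance-length arithmetic behind the episode's skepticism. All latencies and acceptance rates below are illustrative assumptions, not measurements from the paper: a step drafts k tokens cheaply, the target model verifies them in one pass, and speedup appears only when the accepted tokens outweigh the drafting overhead.

        def tokens_per_second(t_target=30e-3, t_draft=3e-3, k=5, accepted=3.2):
            """t_target: one target-model verification pass; t_draft: per drafted token;
            accepted: mean drafted tokens accepted per step (the target supplies one more)."""
            step_time = k * t_draft + t_target
            return (accepted + 1) / step_time

        baseline = 1 / 30e-3                      # plain autoregressive decoding, tokens/s
        for acc in (1.0, 2.0, 3.2, 4.5):
            print(f"accepted={acc}: {tokens_per_second(accepted=acc):5.1f} tok/s "
                  f"(baseline {baseline:.1f} tok/s)")

    Under these assumed numbers the method roughly breaks even near one accepted token per step, which is why scheduler behavior, batching, and long generations that depress acceptance length can erase the headline speedups in a real serving stack.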
