AI Post Transformers

mcgrof

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

  1. 1 day ago

    Cache Mechanism for Agent RAG Systems

    This episode explores a 2025 paper on cache management for agentic RAG systems, asking whether an annotation-free cache can preserve most of the value of a massive retrieval corpus while using far less storage and reducing latency. It explains how RAG, agent memory, vector databases, embeddings, and approximate nearest neighbor search fit together, arguing that retrieval performance is not just a modeling issue but a core systems constraint for real-world agents. The discussion situates the paper in the broader history of retrieval and agent research, from Word2Vec and BERT to Dense Passage Retrieval, ReAct, and FAISS, showing why externalized knowledge remains useful even as language models grow larger. Listeners would find it interesting because it focuses on a practical but consequential question: how to make retrieval-heavy AI agents cheaper, faster, and more deployable outside large cloud infrastructures.

    Sources:
    1. Cache Mechanism for Agent RAG Systems — Shuhang Lin, Zhencan Peng, Lingyao Li, Xiao Lin, Xi Zhu, Yongfeng Zhang, 2025 http://arxiv.org/abs/2511.02919
    2. PlanRAG — Lee et al., 2024 https://scholar.google.com/scholar?q=PlanRAG
    3. Generate-then-Ground — Shi et al., 2024 https://scholar.google.com/scholar?q=Generate-then-Ground
    4. RAP — Kagaya et al., 2024 https://scholar.google.com/scholar?q=RAP
    5. RAT — Wang et al., 2024 https://scholar.google.com/scholar?q=RAT
    6. Mei et al. (system engineering / large knowledge repositories) — Mei et al., 2025 https://scholar.google.com/scholar?q=Mei+et+al.+(system+engineering+/+large+knowledge+repositories)
    7. Guo et al. on RAG-powered agent architectures — Guo et al., 2025 https://scholar.google.com/scholar?q=Guo+et+al.+on+RAG-powered+agent+architectures
    8. Long Context vs. RAG for LLMs: An Evaluation and Revisits — approx. recent LLM/RAG evaluation authors, 2024/2025 https://scholar.google.com/scholar?q=Long+Context+vs.+RAG+for+LLMs:+An+Evaluation+and+Revisits
    9. Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? — approx. recent long-context LLM systems authors, 2024/2025 https://scholar.google.com/scholar?q=Can+Long-Context+Language+Models+Subsume+Retrieval,+RAG,+SQL,+and+More?
    10. Predicting Retrieval Utility and Answer Quality in Retrieval-Augmented Generation — approx. recent RAG evaluation/prediction authors, 2024/2025 https://scholar.google.com/scholar?q=Predicting+Retrieval+Utility+and+Answer+Quality+in+Retrieval-Augmented+Generation
    11. Relevance Filtering for Embedding-Based Retrieval — approx. recent dense retrieval / IR authors, 2024/2025 https://scholar.google.com/scholar?q=Relevance+Filtering+for+Embedding-Based+Retrieval
    12. Volatility-Driven Decay: Adaptive Memory Retention for RAG Systems Under Unknown Drift — approx. recent continual RAG / memory authors, 2025 https://scholar.google.com/scholar?q=Volatility-Driven+Decay:+Adaptive+Memory+Retention+for+RAG+Systems+Under+Unknown+Drift
    13. On the Role of Long-Tail Knowledge in Retrieval Augmented Large Language Models — approx. recent RAG robustness authors, 2024/2025 https://scholar.google.com/scholar?q=On+the+Role+of+Long-Tail+Knowledge+in+Retrieval+Augmented+Large+Language+Models
    14. Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge — approx. recent biomedical retrieval authors, 2024/2025 https://scholar.google.com/scholar?q=Graph-Based+Retriever+Captures+the+Long+Tail+of+Biomedical+Knowledge
    15. FIT-RAG: Black-Box RAG with Factual Information and Token Reduction — approx. recent black-box RAG authors, 2024/2025 https://scholar.google.com/scholar?q=FIT-RAG:+Black-Box+RAG+with+Factual+Information+and+Token+Reduction
    16. AI Post Transformers: QVCache for Semantic Caching in ANN Search — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-qvcache-for-semantic-caching-in-ann-sear-415304.mp3
    17. AI Post Transformers: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3
    18. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3
    19. AI Post Transformers: Doc-to-LoRA: Internalizing Context as LoRA — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-29-doc-to-lora-internalizing-context-as-lor-8dd5ec.mp3
    20. AI Post Transformers: ColBERT and ColBERT v2 — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/colbert-and-colbert-v2/

    Interactive Visualization: Cache Mechanism for Agent RAG Systems
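    The core caching idea the episode describes can be sketched in a few lines: reuse a stored passage when a new query's embedding is close enough to one already answered, and fall back to full retrieval otherwise. This is a toy illustration under invented assumptions (hand-made embedding vectors, a made-up similarity threshold, linear scan instead of ANN indexing), not the paper's actual mechanism.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

class SemanticCache:
    """Tiny annotation-free cache: answer from a stored passage when a
    new query embedding is near a previously served one."""
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (query_embedding, passage)

    def lookup(self, query_emb):
        best, best_sim = None, -1.0
        for emb, passage in self.entries:
            sim = cosine(query_emb, emb)
            if sim > best_sim:
                best, best_sim = passage, sim
        if best_sim >= self.threshold:
            return best   # cache hit: skip the full retrieval corpus
        return None       # cache miss: caller falls through to retrieval

    def insert(self, query_emb, passage):
        self.entries.append((query_emb, passage))

cache = SemanticCache(threshold=0.9)
cache.insert([1.0, 0.0, 0.2], "passage about KV caches")
hit = cache.lookup([0.95, 0.05, 0.18])   # near-duplicate query -> hit
miss = cache.lookup([0.0, 1.0, 0.0])     # unrelated query -> miss
```

    A real system would replace the linear scan with an ANN index (the FAISS lineage the episode mentions); the storage/latency trade-off the paper studies is exactly how much of the corpus's value such a small cache retains.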

  2. 1 day ago

    Memory Sparse Attention for 100M-Token Scaling

    This episode explores a paper proposing Memory Sparse Attention, an end-to-end trainable memory architecture designed to scale language models from ordinary long-context settings to 100 million tokens. The discussion explains why standard dense self-attention becomes infeasible at extreme lengths, distinguishes simple context-window extension from true “lifetime-scale” memory, and situates the approach among alternatives like parameter-based memory, recurrent compression, and external retrieval systems such as RAG. It argues that the paper’s core idea is selective, trainable access to a small set of relevant memory segments rather than treating all past tokens as one continuous stream, while also noting the authors’ ambitious systems claims around practical inference. A listener would find it interesting for its clear framing of what makes ultra-long-context modeling hard, and for its skeptical but concrete examination of whether this architecture meaningfully bridges the gap between long prompts and persistent memory.

    Sources:
    1. MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens — Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li, Jianjin Zhang, Jun Sun, Chuanrui Hu, Yunyun Han, Lidong Bing, Yafeng Deng, Tianqiao Chen, 2026 http://arxiv.org/abs/2603.23516
    2. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, et al., 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
    3. Ring Attention with Blockwise Transformers for Near-Infinite Context — Hao Liu, Matei Zaharia, Pieter Abbeel, 2023 https://scholar.google.com/scholar?q=Ring+Attention+with+Blockwise+Transformers+for+Near-Infinite+Context
    4. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019 https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism
    5. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Tri Dao, 2023 https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning
    6. RoFormer: Enhanced Transformer with Rotary Position Embedding — Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu, 2021 https://scholar.google.com/scholar?q=RoFormer:+Enhanced+Transformer+with+Rotary+Position+Embedding
    7. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation — Ofir Press, Noah A. Smith, Mike Lewis, 2021 https://scholar.google.com/scholar?q=Train+Short,+Test+Long:+Attention+with+Linear+Biases+Enables+Input+Length+Extrapolation
    8. Extending Context Window of Large Language Models via Positional Interpolation — Shouyuan Chen, Sherman Wong, Liangcheng Luo, et al., 2023 https://scholar.google.com/scholar?q=Extending+Context+Window+of+Large+Language+Models+via+Positional+Interpolation
    9. YaRN: Efficient Context Window Extension of Large Language Models — Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enming Luo, 2023 https://scholar.google.com/scholar?q=YaRN:+Efficient+Context+Window+Extension+of+Large+Language+Models
    10. Titans: Learning to Memorize at Test Time — Ali Behrouz, Peilin Zhong, Vahab Mirrokni, 2025 https://scholar.google.com/scholar?q=Titans:+Learning+to+Memorize+at+Test+Time
    11. Infini-attention: Infinite Context for Efficient Transformers — Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal, 2024 https://scholar.google.com/scholar?q=Infini-attention:+Infinite+Context+for+Efficient+Transformers
    12. LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens — Yiran Ding, Li Lyna Zhang, et al., 2024 https://scholar.google.com/scholar?q=LongRoPE:+Extending+LLM+Context+Window+Beyond+2+Million+Tokens
    13. MemGPT: Towards LLMs as Operating Systems — Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, Joseph E. Gonzalez, 2024 https://scholar.google.com/scholar?q=MemGPT:+Towards+LLMs+as+Operating+Systems
    14. RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al., 2020 https://scholar.google.com/scholar?q=RAG:+Retrieval-Augmented+Generation+for+Knowledge-Intensive+NLP+Tasks
    15. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache — Zirui Liu, et al., 2024 https://scholar.google.com/scholar?q=KIVI:+A+Tuning-Free+Asymmetric+2bit+Quantization+for+KV+Cache
    16. PagedAttention / vLLM: Efficient Memory Management for Large Language Model Serving — Woosuk Kwon, Zhuohan Li, et al., 2023 https://scholar.google.com/scholar?q=PagedAttention+/+vLLM:+Efficient+Memory+Management+for+Large+Language+Model+Serving
    17. Memorizing Transformers — Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, Christian Szegedy, 2022 https://scholar.google.com/scholar?q=Memorizing+Transformers
    18. TransformerFAM / Focused Attention Memory variants for long-context retrieval — various 2024-2025 authors, 2024-2025 https://scholar.google.com/scholar?q=TransformerFAM+/+Focused+Attention+Memory+variants+for+long-context+retrieval
    19. OpenRAG: Optimizing RAG End-to-End via In-Context Retrieval Learning — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=OpenRAG:+Optimizing+RAG+End-to-End+via+In-Context+Retrieval+Learning
    20. Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=Beyond+RAG+for+Agent+Memory:+Retrieval+by+Decoupling+and+Aggregation
    21. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention — approx. 2024/2025 authors unclear from snippet, 2024/2025 https://scholar.google.com/scholar?q=MInference+1.0:+Accelerating+Pre-filling+for+Long-Context+LLMs+via+Dynamic+Sparse+Attention
    22. SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=SampleAttention:+Near-Lossless+Acceleration+of+Long+Context+LLM+Inference+with+Adaptive+Structured+Sparse+Attention
    23. FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=FlexPrefill:+A+Context-Aware+Sparse+Attention+Mechanism+for+Efficient+Long-Sequence+Inference
    24. Kvlink: Accelerating Large Language Models via Efficient KV Cache Reuse — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=Kvlink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
    25. HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse
    26. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=ProphetKV:+User-Query-Driven+Selective+Recomputation+for+Efficient+KV+Cache+Reuse+in+Retrieval-Augmented+Generation
    27. Hierarchical Local-Global Transformer With Dynamic Positional Encoding for Document-Level Machine Translation — approx. 2024/2025 authors unclear from snippet, 2024/2025 https://scholar.google.com/scholar?q=Hierarchical+Local-Global+Transformer+With+Dynamic+Positional+Encoding+for+Document-Level+Machine+Translation
    28. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3
    29. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/
    30. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
    31. AI Post Transformers: CacheSlide: Position-Aware KV Cache Reuse for Agent LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-cacheslide-position-aware-kv-cache-reuse-cd59c7.mp3
    32. AI Post Transformers: Doc-to-LoRA: Internalizing Context as LoRA — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-29-doc-to-lora-internalizing-context-as-lor-8dd5ec.mp3
    33. AI Post Transformers: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3
    34. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3
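    "Selective access to a small set of relevant memory segments rather than treating all past tokens as one continuous stream" can be caricatured in plain Python: score each segment cheaply, keep the top-k, and attend only over those tokens. This gate-then-attend sketch is an illustrative assumption (mean-pooled segment summaries, dot-product gating), not the paper's trainable architecture.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sparse_memory_read(query, segments, k=2):
    """Gate, then attend: score each memory segment with one cheap
    summary vector, select the top-k segments, and run full attention
    only over the tokens of the selected segments."""
    summaries = []
    for seg in segments:
        dim = len(seg[0])
        summaries.append([sum(t[d] for t in seg) / len(seg) for d in range(dim)])
    scores = [dot(query, s) for s in summaries]
    top = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)[:k]
    tokens = [t for i in top for t in segments[i]]       # only selected tokens
    weights = softmax([dot(query, t) for t in tokens])   # full attention here
    dim = len(query)
    return top, [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]

segments = [
    [[1.0, 0.0], [0.9, 0.1]],   # segment about the queried topic
    [[0.0, 1.0], [0.1, 0.8]],   # unrelated segment
    [[0.2, 0.2], [0.3, 0.1]],   # weakly related segment
]
chosen, pooled = sparse_memory_read([1.0, 0.0], segments, k=1)
```

    The point of the design is that the expensive attention cost scales with the tokens in k segments, not with the full 100M-token history; what the paper adds over this sketch is making the selection itself trainable end to end.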

  3. 1 day ago

    TriAttention for Efficient Long-Context KV Compression

    This episode explores TriAttention, a new method for reducing KV-cache memory during long-context inference by modeling how attention behaves under Rotary Positional Embeddings rather than relying on recent attention patterns alone. It explains why common compression methods can fail for long reasoning tasks: under RoPE, queries at different positions are rotated into different coordinate systems, so a small window of recent post-RoPE queries is a poor predictor of which earlier tokens will matter later. The discussion highlights the paper’s dual contribution as both a systems result for making 32K-token-style reasoning more practical and a mechanistic argument that transformer attention has analyzable structure rather than being purely empirical. Listeners interested in efficient LLM serving, long-context reasoning, or the inner geometry of attention will find it compelling because it connects deployment bottlenecks with a concrete theoretical explanation.

    Sources:
    1. TriAttention: Efficient Long Reasoning with Trigonometric KV Compression — Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen, 2026 http://arxiv.org/abs/2604.04921
    2. RoFormer: Enhanced Transformer with Rotary Position Embedding — Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu, 2021 https://scholar.google.com/scholar?q=RoFormer:+Enhanced+Transformer+with+Rotary+Position+Embedding
    3. A Mathematical Framework for Transformer Circuits — Nelson Elhage, Neel Nanda, Catherine Olsson, et al., 2021 https://scholar.google.com/scholar?q=A+Mathematical+Framework+for+Transformer+Circuits
    4. StreamingLLM: Efficient Streaming Language Models with Attention Sinks — Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, 2023 https://scholar.google.com/scholar?q=StreamingLLM:+Efficient+Streaming+Language+Models+with+Attention+Sinks
    5. TriAttention: Efficient Long Reasoning with Trigonometric KV Compression — Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, Yukang Chen, 2026 https://scholar.google.com/scholar?q=TriAttention:+Efficient+Long+Reasoning+with+Trigonometric+KV+Compression
    6. What Makes Rotary Positional Encodings Useful? — Federico Barbero, et al., 2025 https://scholar.google.com/scholar?q=What+Makes+Rotary+Positional+Encodings+Useful?
    7. Attention Sinks and Massive Activation Values in Transformers — Xiaozhi Xiao, et al., 2025 https://scholar.google.com/scholar?q=Attention+Sinks+and+Massive+Activation+Values+in+Transformers
    8. Heavy Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhenyu Zhang, et al., 2023 https://scholar.google.com/scholar?q=Heavy+Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models
    9. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhenyu Zhang, et al., 2023 https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models
    10. PyramidKV — Zhang, et al., 2024 https://scholar.google.com/scholar?q=PyramidKV
    11. SnapKV — Li, et al., 2024 https://scholar.google.com/scholar?q=SnapKV
    12. R-KV — Zhang, et al., 2025 https://scholar.google.com/scholar?q=R-KV
    13. Vision Transformer Interpretability via Attention Rollout — Samira Abnar, Willem Zuidema, 2020 https://scholar.google.com/scholar?q=Vision+Transformer+Interpretability+via+Attention+Rollout
    14. An Analysis of Attention Weights as a Proxy for Explanation — Sarthak Jain, Byron C. Wallace, 2019 https://scholar.google.com/scholar?q=An+Analysis+of+Attention+Weights+as+a+Proxy+for+Explanation
    15. RazorAttention: Efficient KV Cache Compression Through Retrieval Heads — approx. Tang et al., 2024/2025 https://scholar.google.com/scholar?q=RazorAttention:+Efficient+KV+Cache+Compression+Through+Retrieval+Heads
    16. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — approx. 2025 head-aware KV compression paper, 2025 https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
    17. FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference — approx. 2025, 2025 https://scholar.google.com/scholar?q=FreeKV:+Boosting+KV+Cache+Retrieval+for+Efficient+LLM+Inference
    18. RAP: KV-Cache Compression via RoPE-Aligned Pruning — approx. 2025, 2025 https://scholar.google.com/scholar?q=RAP:+KV-Cache+Compression+via+RoPE-Aligned+Pruning
    19. EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection — approx. 2025, 2025 https://scholar.google.com/scholar?q=EliteKV:+Scalable+KV+Cache+Compression+via+RoPE+Frequency+Selection+and+Joint+Low-Rank+Projection
    20. Asymmetric KV Cache Compression using State-Aware Sparsity and Quantization — approx. 2025, 2025 https://scholar.google.com/scholar?q=Asymmetric+KV+Cache+Compression+using+State-Aware+Sparsity+and+Quantization
    21. Efficient Streaming Language Models with Attention Sinks — Xiao et al., 2023 https://scholar.google.com/scholar?q=Efficient+Streaming+Language+Models+with+Attention+Sinks
    22. When Attention Sink Emerges in Language Models: An Empirical View — approx. 2024/2025, 2024/2025 https://scholar.google.com/scholar?q=When+Attention+Sink+Emerges+in+Language+Models:+An+Empirical+View
    23. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
    24. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3
    25. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3
    26. AI Post Transformers: Kimi Linear: Efficient Expressive Attention Architecture — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/kimi-linear-efficient-expressive-attention-architecture/
    27. AI Post Transformers: Real Context Size and Context Rot — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-07-real-context-size-and-context-rot-56cbb4.mp3

    Interactive Visualization: TriAttention for Efficient Long-Context KV Compression
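    The RoPE point is easy to see numerically: RoPE rotates each query/key pair by an angle proportional to its position, so the same query content, read off at two different positions, lives in two different coordinate frames and scores a fixed key differently. A minimal single-frequency (2-d) sketch follows; the rotation is standard RoPE, but the vectors, positions, and the frequency 0.1 are made up for illustration, and this is not TriAttention itself.

```python
import math

def rope(vec, pos, theta=0.1):
    """Rotate a 2-d query/key vector by pos * theta: the core Rotary
    Positional Embedding operation on a single frequency pair."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

q = (1.0, 0.0)   # identical query content...
k = (1.0, 0.0)   # ...and identical key content
# For unit vectors the post-RoPE score is cos(relative position * theta):
# the score depends on where the query sits, not just on its content.
score_near = dot(rope(q, 10), rope(k, 0))   # query 10 positions after the key
score_far  = dot(rope(q, 60), rope(k, 0))   # same query, 60 positions after
```

    Because `score_near != score_far` even though the pre-RoPE query never changed, a small window of recent post-RoPE queries is a poor stand-in for the queries that will arrive later, which is exactly the failure mode the episode attributes to attention-window-based KV eviction.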

  4. 1 day ago

    When Spectral Gradient Updates Help Deep Learning

    This episode explores a theory paper that asks when spectral matrix updates should outperform standard Euclidean gradient methods in deep networks and transformers. It explains how spectral updates replace a gradient matrix with its polar factor—preserving singular-vector directions while flattening singular values—and argues that this geometry can help when incoming activations have low stable rank while gradients have high nuclear-rank-like spread. The discussion connects this criterion to practical excitement around spectral-style optimizers such as Muon, while contrasting them with curvature-based methods like K-FAC and Shampoo. Listeners would find it interesting because the episode turns a seemingly niche optimizer trick into a concrete, testable claim about the hidden geometry of neural network training.

    Sources:
    1. When do spectral gradient updates help in deep learning? — Damek Davis, Dmitriy Drusvyatskiy, 2025 http://arxiv.org/abs/2512.04299
    2. Shampoo: Preconditioned Stochastic Tensor Optimization — Vineet Gupta, Tomer Koren, Yoram Singer, 2018 https://scholar.google.com/scholar?q=Shampoo:+Preconditioned+Stochastic+Tensor+Optimization
    3. K-FAC: Kronecker-Factored Approximate Curvature for Neural Network Optimization — James Martens, Roger Grosse, 2015 https://scholar.google.com/scholar?q=K-FAC:+Kronecker-Factored+Approximate+Curvature+for+Neural+Network+Optimization
    4. Muon: An optimizer for hidden layers in neural networks — Keller Jordan and collaborators, 2024 https://scholar.google.com/scholar?q=Muon:+An+optimizer+for+hidden+layers+in+neural+networks
    5. When do spectral gradient updates help in deep learning? — Damek Davis, Dmitriy Drusvyatskiy, 2025 https://scholar.google.com/scholar?q=When+do+spectral+gradient+updates+help+in+deep+learning?
    6. Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation — Bobby He, James Martens, et al., 2023 https://scholar.google.com/scholar?q=Deep+Transformers+without+Shortcuts:+Modifying+Self-attention+for+Faithful+Signal+Propagation
    7. On the Softmax Bottleneck of Recurrent Language Models — Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen, 2018 https://scholar.google.com/scholar?q=On+the+Softmax+Bottleneck+of+Recurrent+Language+Models
    8. Representation Degeneration Problem in Training Natural Language Generation Models — Jun Gao, Di He, Xu Tan, et al., 2019 https://scholar.google.com/scholar?q=Representation+Degeneration+Problem+in+Training+Natural+Language+Generation+Models
    9. Neural Collapse: A Terminal Phase of Deep Learning Training — Vardan Papyan, X. Y. Han, David L. Donoho, 2020 https://scholar.google.com/scholar?q=Neural+Collapse:+A+Terminal+Phase+of+Deep+Learning+Training
    10. Understanding Dimensional Collapse in Contrastive Self-supervised Learning — Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian, 2021 https://scholar.google.com/scholar?q=Understanding+Dimensional+Collapse+in+Contrastive+Self-supervised+Learning
    11. The Intrinsic Dimension of Objective Landscapes — Chunyuan Li, Heerad Farkhoor, Rosanne Liu, Jason Yosinski, 2018 https://scholar.google.com/scholar?q=The+Intrinsic+Dimension+of+Objective+Landscapes
    12. Random Features for Large-Scale Kernel Machines — Ali Rahimi, Benjamin Recht, 2007 https://scholar.google.com/scholar?q=Random+Features+for+Large-Scale+Kernel+Machines
    13. A Random Matrix Perspective on Random Features for Compositional Kernels — Florent Krzakala, Lenka Zdeborová, and collaborators in the random-features theory community, 2019 https://scholar.google.com/scholar?q=A+Random+Matrix+Perspective+on+Random+Features+for+Compositional+Kernels
    14. The Surprising Effectiveness of Random Features for Structured Data — various authors across theory and applied ML; representative random-feature comparison literature, 2010s-2020s https://scholar.google.com/scholar?q=The+Surprising+Effectiveness+of+Random+Features+for+Structured+Data
    15. Spectral Gradient Descent — Yair Carmon, John C. Duchi, Oliver Hinder, Aaron Sidford, 2021 https://scholar.google.com/scholar?q=Spectral+Gradient+Descent
    16. A Kronecker-factored approximate Fisher matrix for convolution layers — Roger Grosse, James Martens, 2016 https://scholar.google.com/scholar?q=A+Kronecker-factored+approximate+Fisher+matrix+for+convolution+layers
    17. Feature Learning in Infinite-Width Neural Networks — Greg Yang, Edward J. Hu, 2021 https://scholar.google.com/scholar?q=Feature+Learning+in+Infinite-Width+Neural+Networks
    18. Neural Collapse: A Review and Synthesis — Vardan Papyan, X.Y. Han, David L. Donoho, 2023 https://scholar.google.com/scholar?q=Neural+Collapse:+A+Review+and+Synthesis
    19. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning — Armen Aghajanyan, Sonal Gupta, Luke Zettlemoyer, 2021 https://scholar.google.com/scholar?q=Intrinsic+Dimensionality+Explains+the+Effectiveness+of+Language+Model+Fine-Tuning
    20. Understanding transformers for time series: Rank structure, flow-of-ranks, and compressibility — approx. recent transformer interpretability / theory authors, recent https://scholar.google.com/scholar?q=Understanding+transformers+for+time+series:+Rank+structure,+flow-of-ranks,+and+compressibility
    21. Tuning stable rank shrinkage: Aiming at the overlooked structural risk in fine-tuning — approx. recent fine-tuning / representation learning authors, recent https://scholar.google.com/scholar?q=Tuning+stable+rank+shrinkage:+Aiming+at+the+overlooked+structural+risk+in+fine-tuning
    22. Unraveling the gradient descent dynamics of transformers — approx. recent optimization theory authors, recent https://scholar.google.com/scholar?q=Unraveling+the+gradient+descent+dynamics+of+transformers
    23. AI Post Transformers: Adam: A Method for Stochastic Optimization — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/adam-a-method-for-stochastic-optimization/
    24. AI Post Transformers: AdamW: Decoupled Weight Decay Regularization for Adaptive Gradient Algorithms — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/adamw-decoupled-weight-decay-regularization-for-adaptive-gradient-algorithms/
    25. AI Post Transformers: In-Context Learning as Implicit Learning Algorithms — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/in-context-learning-as-implicit-learning-algorithms/

    Interactive Visualization: When Spectral Gradient Updates Help Deep Learning
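    The polar-factor update can be demonstrated directly: keep a gradient matrix's singular vectors, flatten all its singular values to one. The 2x2 toy below uses Newton's classic iteration for the orthogonal polar factor, X ← (X + X⁻ᵀ)/2, which converges for nonsingular square matrices; it is a sketch of the geometry, not Muon's actual (inverse-free Newton-Schulz) implementation, and the example matrices are invented.

```python
def transpose(A):
    return [[A[j][i] for j in range(len(A))] for i in range(len(A[0]))]

def inv2(A):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def polar_factor(A, iters=30):
    """Orthogonal polar factor via Newton's iteration X <- (X + X^-T)/2.
    This is the spectral update: singular vectors are preserved while
    every singular value is flattened to 1."""
    X = [row[:] for row in A]
    for _ in range(iters):
        Y = transpose(inv2(X))
        X = [[(X[i][j] + Y[i][j]) / 2.0 for j in range(2)] for i in range(2)]
    return X

G1 = [[3.0, 0.0], [0.0, 0.5]]   # toy "gradient" with spread-out singular values
U1 = polar_factor(G1)           # -> identity: both directions stepped equally
G2 = [[0.0, -2.0], [1.0, 0.0]]  # a scaled rotation
U2 = polar_factor(G2)           # -> the pure rotation [[0, -1], [1, 0]]
```

    On `G1`, plain gradient descent would step six times harder along the first direction; the spectral update treats both directions equally, which is exactly the behavior the paper's stable-rank criterion tries to predict will help or hurt.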

  5. 2 days ago

    Real Context Size and Context Rot

    In this episode, Hal Turing and Dr. Ada Shannon return to a term they used in their Recursive Language Models conversation without fully defining it: context rot. Using Chroma Research’s 2025 write-up as the main anchor, they explain context rot as the degraded, uneven, and unreliable use of information as prompts get longer—even on simple tasks. The discussion makes the central distinction the industry often blurs: advertised context capacity is not the same as usable context. A model may accept 128K or even a million tokens without crashing, but that does not mean it can reliably retrieve, connect, and reason over what was placed inside that buffer. They pair Chroma’s failure analysis with RULER, the 2024 NVIDIA-led benchmark paper asking a more practical question: what is a model’s real context size, meaning the longest prompt length at which performance remains satisfactory? The episode walks through why older long-context tests, especially vanilla needle-in-a-haystack retrieval, were too flattering. Hal and Ada discuss how simple retrieval benchmarks mostly measure lexical lookup, while stronger evaluations must test reference tracing, aggregation across documents, resilience to distraction, and whether the model is actually using the supplied prompt rather than answering from parametric knowledge stored in its weights. They also briefly credit the Gemini 1.5 technical report for explicitly calling on the field to build harder long-context benchmarks, then situate RULER alongside the benchmark ecosystem that followed, including LongBench and InfiniteBench, with a dedicated RULER episode coming soon. The larger thesis is that a giant context window should not be mistaken for memory. For retrieval-augmented generation, document-grounded assistants, and agent systems, a long prompt is at best an unstructured buffer—a cluttered desk or overstuffed backpack—not a real memory architecture. 
As the hosts argue, once context rot sets in, simply adding more tokens stops helping and can actively degrade reliability. If the goal is AI systems that truly remember and reason across large bodies of information, then memory and storage have to become first-class design elements: managed, tiered, retrievable, structured, and persistent, rather than just a bigger pile of tokens shoved into the prompt. Sources: 1. RULER: What's the Real Context Size of Your Long-Context Language Models? — Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg, 2024 http://arxiv.org/abs/2404.06654 2. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding — Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 2023 http://arxiv.org/abs/2308.14508 3. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned — Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson and others, 2024 https://scholar.google.com/scholar?q=Red+Teaming+Language+Models+to+Reduce+Harms:+Methods,+Scaling+Behaviors,+and+Lessons+Learned 4. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models — A team including researchers from academia and industry; commonly cited under the JailbreakBench project authorship, 2024 https://scholar.google.com/scholar?q=JailbreakBench:+An+Open+Robustness+Benchmark+for+Jailbreaking+Large+Language+Models 5. Holistic Evaluation of Language Models — Percy Liang, Rishi Bommasani, Tony Lee, Dmitriy Ryaboy and many collaborators, 2022 https://scholar.google.com/scholar?q=Holistic+Evaluation+of+Language+Models 6. 
Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models — Researchers studying jailbreak prompt collections from public communities; commonly cited as a characterization study of DAN-style prompts, 2024 https://scholar.google.com/scholar?q=Do+Anything+Now:+Characterizing+and+Evaluating+In-The-Wild+Jailbreak+Prompts+on+Large+Language+Models 7. Lost in the Middle: How Language Models Use Long Contexts — Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, 2024 https://scholar.google.com/scholar?q=Lost+in+the+Middle:+How+Language+Models+Use+Long+Contexts 8. Needle In A Haystack - Pressure Testing LLMs — Greg Kamradt, 2023 https://scholar.google.com/scholar?q=Needle+In+A+Haystack+-+Pressure+Testing+LLMs 9. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding — Yucheng Bai, Xintong Lu, Lianghao Wang, Xiaoxuan Liu, Weisheng Wang, Bo Zheng, Hongting Lin, Xinyu Dai, Wayne Xin Zhao, Ruifeng Xu, 2024 https://scholar.google.com/scholar?q=LongBench:+A+Bilingual,+Multitask+Benchmark+for+Long+Context+Understanding 10. L-Eval: Instituting Standardized Evaluation for Long Context Language Models — Chenglong Su, Jiarui Fang, Haozhe Ji, et al., 2024 https://scholar.google.com/scholar?q=L-Eval:+Instituting+Standardized+Evaluation+for+Long+Context+Language+Models 11. InfiniteBench: Extending Long Context Evaluation Beyond 100K Tokens — Yifan Zhang, Weizhi Wang, et al., 2024 https://scholar.google.com/scholar?q=InfiniteBench:+Extending+Long+Context+Evaluation+Beyond+100K+Tokens 12. BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models — Ying Sheng, et al., 2024 https://scholar.google.com/scholar?q=BAMBOO:+A+Comprehensive+Benchmark+for+Evaluating+Long+Text+Modeling+Capacities+of+Large+Language+Models 13. Retrieval Augmented Generation or Long-Context LLMs? 
A Comprehensive Study and Hybrid Approach — Tianle Cai, et al., 2024 https://scholar.google.com/scholar?q=Retrieval+Augmented+Generation+or+Long-Context+LLMs?+A+Comprehensive+Study+and+Hybrid+Approach 14. Rethinking the Role of Scaling Laws in the Long Context Performance of Large Language Models — Various 2024 long-context scaling studies cited around Liu et al./Young et al., 2024 https://scholar.google.com/scholar?q=Rethinking+the+Role+of+Scaling+Laws+in+the+Long+Context+Performance+of+Large+Language+Models 15. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-Context Multitasks — approx. Bai et al. / THUDM-affiliated LongBench follow-up team, 2024 https://scholar.google.com/scholar?q=LongBench+v2:+Towards+Deeper+Understanding+and+Reasoning+on+Realistic+Long-Context+Multitasks 16. LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark — approx. LongBench/THUDM-style benchmark authors, 2024 https://scholar.google.com/scholar?q=LongBench+Pro:+A+More+Realistic+and+Comprehensive+Bilingual+Long-Context+Evaluation+Benchmark 17. Why Does the Effective Context Length of LLMs Fall Short? — approx. unknown from snippet, 2024 https://scholar.google.com/scholar?q=Why+Does+the+Effective+Context+Length+of+LLMs+Fall+Short? 18. BABILong-ITA: A New Benchmark for Testing Large Language Models Effective Context Length and a Context Extension Method — approx. unknown from snippet, 2024 https://scholar.google.com/scholar?q=BABILong-ITA:+A+New+Benchmark+for+Testing+Large+Language+Models+Effective+Context+Length+and+a+Context+Extension+Method 19. Precursors, Proxies, and Predictive Models for Long-Horizon Tasks — approx. unknown from snippet, 2024 https://scholar.google.com/scholar?q=Precursors,+Proxies,+and+Predictive+Models+for+Long-Horizon+Tasks 20. The What, Why, and How of Context Length Extension Techniques in Large Language Models--A Detailed Survey — approx. 
unknown from snippet, 2024 https://scholar.google.com/scholar?q=The+What,+Why,+and+How+of+Context+Length+Extension+Techniques+in+Large+Language+Models--A+Detailed+Survey 21. AI Post Transformers: Doc-to-LoRA: Internalizing Context as LoRA — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-29-doc-to-lora-internalizing-context-as-lor-8dd5ec.mp3 22. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3 23. AI Post Transformers: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3 24. AI Post Transformers: AI Agent Traps and Prompt Injection — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-02-ai-agent-traps-and-prompt-injection-7ce4ba.mp3 25. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3 26. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3 27. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3 Interactive Visualization: Real Context Size and Context Rot
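The episode's claim that agent memory should be "managed, tiered, retrievable" rather than one big prompt can be sketched in a few lines. This is an illustrative toy only, not the design from any cited paper: the class and method names are hypothetical, and the cold tier stands in for what would really be an embedding-indexed vector store.

```python
# Toy tiered agent memory: a small "hot" buffer that goes into the
# prompt, backed by a larger persistent "cold" store retrieved on
# demand. All names here are hypothetical, for illustration only.
from collections import OrderedDict

class TieredMemory:
    def __init__(self, hot_capacity=3):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()   # tier 1: serialized into the context window
        self.cold = {}             # tier 2: persistent store, fetched on demand

    def write(self, key, text):
        self.hot[key] = text
        self.hot.move_to_end(key)
        # Evict least-recently-used items to the cold tier instead of
        # letting the context window grow without bound.
        while len(self.hot) > self.hot_capacity:
            k, v = self.hot.popitem(last=False)
            self.cold[k] = v

    def context(self):
        # Only the hot tier reaches the prompt, capping token cost.
        return list(self.hot.values())

    def recall(self, key):
        # Cold items are promoted back into context on access (a toy
        # stand-in for embedding-based retrieval).
        if key in self.cold:
            self.write(key, self.cold.pop(key))
        return self.hot.get(key)
```

The point of the sketch is the shape, not the policy: the prompt stays bounded while older material remains retrievable instead of being silently truncated, which is exactly the failure mode the hosts call context rot.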

  6. 2 days ago
  6. 2 days ago

    Speculative Decoding in Real vLLM Serving

    This episode explores whether speculative decoding’s widely cited inference speedups survive real deployment conditions, using a January 2026 UC Berkeley paper that evaluates the method inside vLLM rather than in idealized toy benchmarks. It explains the core mechanics of draft-and-verify decoding, then digs into why acceptance length, verification cost, scheduler behavior, batching, KV-cache management, and long generations can erase much of the theoretical advantage in production serving stacks. The discussion also clarifies the difference between speculative decoding and multi-token prediction, situating approaches like MEDUSA and EAGLE within the broader effort to reduce autoregressive bottlenecks. Listeners interested in LLM systems will find it compelling because it shifts the conversation from flashy benchmark bar charts to the practical question of what actually improves wall-clock latency for real workloads. Sources: 1. Speculative Decoding: Performance or Illusion? — Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung, 2025 http://arxiv.org/abs/2601.11580 2. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022 https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness 3. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, et al., 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 4. Ring Attention with Blockwise Transformers for Near-Infinite Context — William Bevington, et al., 2023 https://scholar.google.com/scholar?q=Ring+Attention+with+Blockwise+Transformers+for+Near-Infinite+Context 5. 
Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation — Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023 https://scholar.google.com/scholar?q=Speculative+Decoding:+Exploiting+Speculative+Execution+for+Accelerating+Seq2seq+Generation 6. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023 https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding 7. MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — Tianle Cai, Yuhong Li, Zhengxu Chen, et al., 2024 https://scholar.google.com/scholar?q=MEDUSA:+Simple+LLM+Inference+Acceleration+Framework+with+Multiple+Decoding+Heads 8. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Yuhui Li, et al., 2024 https://scholar.google.com/scholar?q=EAGLE:+Speculative+Sampling+Requires+Rethinking+Feature+Uncertainty 9. Speculative Decoding: Performance or Illusion? — Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung, 2026 https://scholar.google.com/scholar?q=Speculative+Decoding:+Performance+or+Illusion? 10. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al., 2023 https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention 11. EAGLE-3 — Authors as cited in the paper's related work, 2024 https://scholar.google.com/scholar?q=EAGLE-3 12. Multi-Token Prediction — Liu et al.; Zeng et al., 2025 https://scholar.google.com/scholar?q=Multi-Token+Prediction 13. Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding — Xia et al., 2024 https://scholar.google.com/scholar?q=Unlocking+Efficiency+in+Large+Language+Model+Inference:+A+Comprehensive+Survey+of+Speculative+Decoding 14. 
A Systematic Study of Speculative Decoding in Computation-Bound Regimes — Liu et al., 2024 https://scholar.google.com/scholar?q=A+Systematic+Study+of+Speculative+Decoding+in+Computation-Bound+Regimes 15. N-Gram Speculative Decoding — Saxena; Somasundaram et al., 2023/2024 https://scholar.google.com/scholar?q=N-Gram+Speculative+Decoding 16. Determinism and Nondeterminism in LLM Inference — He, 2025 https://scholar.google.com/scholar?q=Determinism+and+Nondeterminism+in+LLM+Inference 17. Block Verification Accelerates Speculative Decoding — unknown from snippet, likely 2024-2025 https://scholar.google.com/scholar?q=Block+Verification+Accelerates+Speculative+Decoding 18. Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding — Zhang et al. / likely 2024, 2024 https://scholar.google.com/scholar?q=Draft+&+Verify:+Lossless+Large+Language+Model+Acceleration+via+Self-Speculative+Decoding 19. MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding — unknown from snippet, likely 2024-2025 https://scholar.google.com/scholar?q=MagicDec:+Breaking+the+Latency-Throughput+Tradeoff+for+Long+Context+Generation+with+Speculative+Decoding 20. SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths — unknown from snippet, likely 2024-2025 https://scholar.google.com/scholar?q=SpecDec++:+Boosting+Speculative+Decoding+via+Adaptive+Candidate+Lengths 21. Adaptive Speculative Decoding for Large Language Models — unknown from snippet, likely 2024 https://scholar.google.com/scholar?q=Adaptive+Speculative+Decoding+for+Large+Language+Models 22. Opt-Tree: Speculative Decoding with Adaptive Draft Tree Structure — unknown from snippet, likely 2024-2025 https://scholar.google.com/scholar?q=Opt-Tree:+Speculative+Decoding+with+Adaptive+Draft+Tree+Structure 23. 
Draft Model Knows When to Stop: Self-Verification Speculative Decoding for Long-Form Generation — unknown from snippet, likely 2024-2025 https://scholar.google.com/scholar?q=Draft+Model+Knows+When+to+Stop:+Self-Verification+Speculative+Decoding+for+Long-Form+Generation 24. Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding — unknown from snippet, likely 2025 https://scholar.google.com/scholar?q=Draft+Model+Knows+When+to+Stop:+A+Self-Verification+Length+Policy+for+Speculative+Decoding 25. AI Post Transformers: Adaptive Control for Batched Speculative Decoding in LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/adaptive-control-for-batched-speculative-decoding-in-llm-serving/ 26. AI Post Transformers: Building Production-Ready Speculative Decoding with TensorRT-LLM — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/building-production-ready-speculative-decoding-with-tensorrt-llm/ 27. AI Post Transformers: Apple's Speculative Streaming: Fast LLM Inference without Auxiliary Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/apples-speculative-streaming-fast-llm-inference-without-auxiliary-models/ 28. AI Post Transformers: Episode: Speculative Speculative Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-speculative-speculative-decoding-1b7a10.mp3 29. AI Post Transformers: Continuous Batching for LLM Inference: Throughput and Latency Gains — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/continuous-batching-for-llm-inference-throughput-and-latency-gains/ 30. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3 Interactive Visualization: Speculative Decoding in Real vLLM Serving
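The draft-and-verify mechanics the episode walks through can be shown with a minimal greedy loop. This is a simplification under stated assumptions: `draft` and `target` are toy next-token functions, the verification loop is written sequentially even though a real serving stack scores all drafted positions in one target forward pass (which is where the speedup comes from), and true speculative sampling accepts stochastically via a min(1, p/q) rule rather than by exact greedy match.

```python
# Greedy draft-and-verify sketch. `draft` is the cheap proposer,
# `target` the expensive verifier; both map a token list to the next
# token. Names and toy models are illustrative, not from the paper.
def speculate_step(prefix, draft, target, k=4):
    drafted, ctx = [], list(prefix)
    for _ in range(k):                 # cheap model proposes k tokens
        t = draft(ctx)
        drafted.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in drafted:                  # target checks each drafted position
        if target(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break                      # first mismatch ends the run
    # The target contributes one token at the mismatch point (or after
    # full acceptance), so every step emits at least one token.
    accepted.append(target(ctx))
    return accepted

# Toy models: the target continues "abcabc..."; the draft agrees on
# two of every three positions.
target = lambda ctx: "abc"[len(ctx) % 3]
draft  = lambda ctx: "abx"[len(ctx) % 3]
```

Note how the economics depend entirely on the acceptance length: if the draft disagrees early, each step still pays the drafting cost plus a full verification for roughly one emitted token, which is the production-side erosion of the theoretical speedup that the paper measures inside vLLM.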

  7. 5 days ago

    FlatAttention for Tile-Based Accelerator Inference

    This episode explores a 2026 paper on “FlatAttention,” which argues that attention inference should be co-designed with on-chip communication primitives to fully exploit tile-based accelerators rather than reusing GPU-style kernels. It explains how these accelerators differ from GPUs: computation is spread across many tiles with local SRAM and an on-chip network, making data placement, multicast, and reduction central to performance. The discussion highlights why attention has become a growing inference bottleneck—especially for long-context models and MoE systems—and contrasts prefill vs. decode behavior, KV-cache movement costs, and variants like MHA, MQA, GQA, and MLA. Listeners would find it interesting for its careful framing of both the promise and the fairness concerns of hardware-software co-design, especially in comparison to FlashAttention’s IO-aware optimization on GPUs. Sources: 1. FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Large Attention-Based Model Inference on Tile-Based Accelerators — Chi Zhang, Luca Colagrande, Renzo Andri, Luca Benini, 2026 http://arxiv.org/abs/2604.02110 2. In-Datacenter Performance Analysis of a Tensor Processing Unit — Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, and others, 2017 https://scholar.google.com/scholar?q=In-Datacenter+Performance+Analysis+of+a+Tensor+Processing+Unit 3. A Domain-Specific Supercomputer for Training Deep Neural Networks — Norman P. Jouppi, George Kurian, Sheng Li, and others, 2021 https://scholar.google.com/scholar?q=A+Domain-Specific+Supercomputer+for+Training+Deep+Neural+Networks 4. A Wafer-Scale Engine for Deep Learning — Sean Lie, Andrew H. Putnam, David Firestone, and Cerebras Systems team, 2021 https://scholar.google.com/scholar?q=A+Wafer-Scale+Engine+for+Deep+Learning 5. Scaling Graph Neural Networks with the Graphcore IPU — James H. Smith, et al. 
(Graphcore-affiliated authors in IPU architecture/application literature), 2022 https://scholar.google.com/scholar?q=Scaling+Graph+Neural+Networks+with+the+Graphcore+IPU 6. Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices — Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, Vivienne Sze, 2019 https://scholar.google.com/scholar?q=Eyeriss+v2:+A+Flexible+Accelerator+for+Emerging+Deep+Neural+Networks+on+Mobile+Devices 7. In-Network Computing for Machine Learning: Opportunities and Challenges — various survey authors in networking/ML systems literature; representative surveys include works by Mohammad Alizadeh, Yibo Zhu, and collaborators, 2021 https://scholar.google.com/scholar?q=In-Network+Computing+for+Machine+Learning:+Opportunities+and+Challenges 8. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019 https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism 9. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022 https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness 10. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Tri Dao, 2023 https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning 11. FlashAttention-3 — Tri Dao and collaborators, 2024 https://scholar.google.com/scholar?q=FlashAttention-3 12. FlashMLA — DeepSeek team, 2025 https://scholar.google.com/scholar?q=FlashMLA 13. 
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints — Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai, 2023 https://scholar.google.com/scholar?q=GQA:+Training+Generalized+Multi-Query+Transformer+Models+from+Multi-Head+Checkpoints 14. Fast Transformer Decoding: One Write-Head is All You Need — Noam Shazeer, 2019 https://scholar.google.com/scholar?q=Fast+Transformer+Decoding:+One+Write-Head+is+All+You+Need 15. DeepSeek-V3 Technical Report — DeepSeek-AI, 2024 https://scholar.google.com/scholar?q=DeepSeek-V3+Technical+Report 16. Attention Is All You Need — Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017 https://scholar.google.com/scholar?q=Attention+Is+All+You+Need 17. Wafer-Scale Deep Learning — Daniel Lie, Gary Lauterbach, Sean Lie and collaborators at Cerebras, 2021 https://scholar.google.com/scholar?q=Wafer-Scale+Deep+Learning 18. Distributed Deep Learning on a Wafer-Scale Engine — Cerebras Systems authors, 2022 https://scholar.google.com/scholar?q=Distributed+Deep+Learning+on+a+Wafer-Scale+Engine 19. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference — approx. enterprise systems / LLM serving authors, 2024 https://scholar.google.com/scholar?q=LMCache:+An+Efficient+KV+Cache+Layer+for+Enterprise-Scale+LLM+Inference 20. HotPrefix: Hotness-Aware KV Cache Scheduling for Efficient Prefix Sharing in LLM Inference Systems — approx. LLM systems authors, 2024 https://scholar.google.com/scholar?q=HotPrefix:+Hotness-Aware+KV+Cache+Scheduling+for+Efficient+Prefix+Sharing+in+LLM+Inference+Systems 21. Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference — approx. security / systems authors, 2024 https://scholar.google.com/scholar?q=Selective+KV-Cache+Sharing+to+Mitigate+Timing+Side-Channels+in+LLM+Inference 22. 
MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching — approx. MoE inference systems authors, 2024 https://scholar.google.com/scholar?q=MoE-Gen:+High-Throughput+MoE+Inference+on+a+Single+GPU+with+Module-Based+Batching 23. Accelerating Distributed MoE Training and Inference with Lina — approx. distributed systems / ML systems authors, 2024 https://scholar.google.com/scholar?q=Accelerating+Distributed+MoE+Training+and+Inference+with+Lina 24. Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference — approx. MoE deployment authors, 2024 https://scholar.google.com/scholar?q=Towards+MoE+Deployment:+Mitigating+Inefficiencies+in+Mixture-of-Expert+(MoE)+Inference 25. MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices — approx. accelerator architecture authors, 2024 https://scholar.google.com/scholar?q=MAS-Attention:+Memory-Aware+Stream+Processing+for+Attention+Acceleration+on+Resource-Constrained+Edge+Devices 26. REATA: An Efficient Vision Transformer Accelerator Featuring a Resource-Optimized Attention Design on Versal ACAP — approx. FPGA / accelerator authors, 2024 https://scholar.google.com/scholar?q=REATA:+An+Efficient+Vision+Transformer+Accelerator+Featuring+a+Resource-Optimized+Attention+Design+on+Versal+ACAP 27. Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning — approx. systems / compiler authors, 2024 https://scholar.google.com/scholar?q=Concerto:+Automatic+Communication+Optimization+and+Scheduling+for+Large-Scale+Deep+Learning 28. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3 29. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3 30. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3 31. AI Post Transformers: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3 32. AI Post Transformers: Continuous Batching for LLM Inference: Throughput and Latency Gains — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/continuous-batching-for-llm-inference-throughput-and-latency-gains/ 33. AI Post Transformers: SGLang: Efficient Language Model Program Execution — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/sglang-efficient-language-model-program-execution/ 34. AI Post Transformers: Speculative Speculative Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-speculative-speculative-decoding-1b7a10.mp3 35. AI Post Transformers: Jet-Nemotron and PostNAS for Faster Long Context — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-24-jet-nemotron-and-postnas-for-faster-long-436381.mp3 36. AI Post Transformers: FlexGen: High-Throughput LLM Inference on a Single GPU — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/flexgen-high-throughput-llm-inference-on-a-single-gpu/ Interactive Visualization: FlatAttention for Tile-Based Accelerator Inference
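The episode's contrast between MHA, MQA, and GQA comes down to how many KV heads must be cached and moved, and a back-of-envelope sizing makes the gap concrete. The formula below is standard; the model shapes are illustrative Llama-2-7B-like values (32 layers, 32 query heads, head dimension 128, fp16), not figures from the FlatAttention paper.

```python
def kv_cache_bytes(layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # The factor of 2 covers the separate K and V tensors per layer.
    return 2 * layers * n_kv_heads * head_dim * seq_len * dtype_bytes

seq = 4096  # context length for this comparison
mha = kv_cache_bytes(32, 32, 128, seq)  # every query head has its own K/V
gqa = kv_cache_bytes(32, 8, 128, seq)   # 8 shared KV groups (GQA)
mqa = kv_cache_bytes(32, 1, 128, seq)   # one shared KV head (MQA)
```

Under these assumptions MHA needs 2 GiB of KV cache per 4K-token sequence, GQA a quarter of that, and MQA a thirty-second. On a tile-based accelerator that cache traffic maps onto the on-chip network, which is why the paper treats multicast and reduction of K/V blocks as first-class co-design targets rather than a kernel detail.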

  8. 5 days ago

    IMO-Bench for Robust Mathematical Reasoning

    This episode explores a new benchmark suite, IMO-Bench, designed to test whether AI systems can do genuinely robust mathematical reasoning at Olympiad difficulty rather than merely produce correct final answers. It breaks down the benchmark into three distinct tasks—short-answer problem solving, full proof generation, and automatic proof grading—and argues that this decomposition better captures real mathematical competence than answer-centric evaluations like GSM8K or MATH, which may now be saturated or overly teachable. The discussion highlights why IMO-style problems are especially revealing: they require discovering invariants, constructions, and contradiction arguments that resist routine pattern matching and expose whether models can sustain long-horizon reasoning and self-correction. Listeners would find it interesting because it tackles a central question in AI evaluation—whether current benchmarks are measuring true reasoning or just benchmark-specific performance—and examines the promise and risks of using model-based autograders to scale proof assessment. Sources: 1. Towards Robust Mathematical Reasoning — Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V. Le, Junehyuk Jung, 2025 http://arxiv.org/abs/2511.01846 2. Training Verifiers to Solve Math Word Problems — Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman, 2021 https://scholar.google.com/scholar?q=Training+Verifiers+to+Solve+Math+Word+Problems 3. 
Measuring Mathematical Problem Solving With the MATH Dataset — Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt, 2021 https://scholar.google.com/scholar?q=Measuring+Mathematical+Problem+Solving+With+the+MATH+Dataset 4. Solving Quantitative Reasoning Problems with Language Models — Aakanksha Chowdhery and collaborators at Google Research, 2022 https://scholar.google.com/scholar?q=Solving+Quantitative+Reasoning+Problems+with+Language+Models 5. FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI — Elliot Glazer and collaborators, 2024 https://scholar.google.com/scholar?q=FrontierMath:+A+Benchmark+for+Evaluating+Advanced+Mathematical+Reasoning+in+AI 6. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models — Suzgun Mirac, et al. (BIG-bench collaboration), 2022 https://scholar.google.com/scholar?q=Beyond+the+Imitation+Game:+Quantifying+and+Extrapolating+the+Capabilities+of+Language+Models 7. Holistic Evaluation of Language Models — Percy Liang, Rishi Bommasani, Tony Lee, Dmitriy Turbiner, and collaborators, 2022 https://scholar.google.com/scholar?q=Holistic+Evaluation+of+Language+Models 8. Dynabench: Rethinking Benchmarking in NLP — Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, and collaborators, 2021 https://scholar.google.com/scholar?q=Dynabench:+Rethinking+Benchmarking+in+NLP 9. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, et al., 2023 https://scholar.google.com/scholar?q=Judging+LLM-as-a-Judge+with+MT-Bench+and+Chatbot+Arena 10. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment — Jun Gao, Huanle Liu, et al., 2023 https://scholar.google.com/scholar?q=G-Eval:+NLG+Evaluation+using+GPT-4+with+Better+Human+Alignment 11. 
Automatic Evaluation of Mathematical Proofs in Natural Language: A Survey — Various survey authors in educational technology and AI, 2020-2024 https://scholar.google.com/scholar?q=Automatic+Evaluation+of+Mathematical+Proofs+in+Natural+Language:+A+Survey 12. Towards Robust Mathematical Reasoning — Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V. Le, Junehyuk Jung, 2025 https://scholar.google.com/scholar?q=Towards+Robust+Mathematical+Reasoning 13. Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs — Various authors in neural theorem proving and autoformalization, 2022-2024 https://scholar.google.com/scholar?q=Draft,+Sketch,+and+Prove:+Guiding+Formal+Theorem+Provers+with+Informal+Proofs 14. Solving Olympiad Geometry without Human Demonstrations — Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, et al., 2024 https://scholar.google.com/scholar?q=Solving+Olympiad+Geometry+without+Human+Demonstrations 15. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models — Kaiyu Yang, Aidan O'Gara, et al., 2023 https://scholar.google.com/scholar?q=LeanDojo:+Theorem+Proving+with+Retrieval-Augmented+Language+Models 16. FrontierMath — Glazer et al., 2024 https://scholar.google.com/scholar?q=FrontierMath 17. Humanity's Last Exam — Phan et al., 2025 https://scholar.google.com/scholar?q=Humanity's+Last+Exam 18. GSM8K: Training Verifiers to Solve Math Word Problems — Cobbe et al., 2021 https://scholar.google.com/scholar?q=GSM8K:+Training+Verifiers+to+Solve+Math+Word+Problems 19. Gemini Deep Think at IMO 2025 — Luong and Lockhart, 2025 https://scholar.google.com/scholar?q=Gemini+Deep+Think+at+IMO+2025 20. Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination — approx. 
2025, authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=Reasoning+or+Memorization?+Unreliable+Results+of+Reinforcement+Learning+Due+to+Data+Contamination 21. Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning — approx. 2025, authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=Right+Is+Not+Enough:+The+Pitfalls+of+Outcome+Supervision+in+Training+LLMs+for+Math+Reasoning 22. Improve Mathematical Reasoning in Language Models by Automated Process Supervision — approx. 2025, authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=Improve+Mathematical+Reasoning+in+Language+Models+by+Automated+Process+Supervision 23. MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision — approx. 2025, authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=MM-PRM:+Enhancing+Multimodal+Mathematical+Reasoning+with+Scalable+Step-Level+Supervision 24. Solving Inequality Proofs with Large Language Models — approx. 2025, authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=Solving+Inequality+Proofs+with+Large+Language+Models 25. Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning — approx. 2025, authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=Beyond+Gold+Standards:+Epistemic+Ensemble+of+LLM+Judges+for+Formal+Mathematical+Reasoning 26. A Survey on Deep Learning for Theorem Proving — approx. survey authors unclear from snippet, recent https://scholar.google.com/scholar?q=A+Survey+on+Deep+Learning+for+Theorem+Proving 27. Proving Theorems Recursively — approx. 2025, authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=Proving+Theorems+Recursively 28. DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning — approx. 
2025, authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=DICE:+Detecting+In-distribution+Contamination+in+LLM's+Fine-tuning+Phase+for+Math+Reasoning 29. AI Post Transformers: Schoenfeld Theory Applied to Large Reasoning Models — Hal Turing & Dr. Ada Shannon, Sat, https://podcast.do-not-panic.com/episodes/schoenfeld-theory-applied-to-large-reasoning-models/ 30. AI Post Transformers: LLM Benchmark Robustness to Linguistic Variation — Hal Turing & Dr. Ada Shannon, Tue, https://podcast.do-not-panic.com/episodes/llm-benchmark-robustness-to-linguistic-variation/ 31. AI Post Transformers: Generalist Reward Modeling with Inference-Time Scaling — Hal Turing & Dr. Ada Shannon, Tue, https://podcast.do-not-panic.com/episodes/generalist-reward-modeling-with-inference-time-scaling/ 32. AI Post Transformers: Evolving Language Models Without Labels: EVOL-RL — Hal Turing & Dr. Ada Shannon, Fri, https://podcast.do-not-panic.com/episodes/evolving-language-models-without-labels-evol-rl/ Interactive Visualization: IMO-Bench for Robust Mathematical Reasoning
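The proof-grading task the episode discusses can be pictured as rubric scoring with an agreement check against human graders. The sketch below is hypothetical throughout: the rubric items, weights, and tolerance are invented for illustration and are not IMO-Bench's actual grading scheme, which the paper defines itself.

```python
# Hypothetical rubric-style proof grading. An autograder marks which
# rubric items a proof satisfies; agreement with a human reference
# grade is then measured within a tolerance. Illustrative only.
RUBRIC = {
    "states_invariant": 3,     # identifies the key invariant
    "key_construction": 2,     # exhibits the required construction
    "handles_edge_cases": 1,   # covers degenerate cases
    "conclusion_follows": 1,   # conclusion actually follows from the steps
}

def grade(marked_items):
    # Normalized score in [0, 1]: weighted share of satisfied items.
    total = sum(RUBRIC.values())
    earned = sum(w for item, w in RUBRIC.items() if item in marked_items)
    return earned / total

def agrees(auto_score, human_score, tol=0.15):
    # Simple agreement criterion between autograder and human grade.
    return abs(auto_score - human_score) <= tol
```

Even this toy version surfaces the risk the hosts flag: a model-based autograder that reliably spots surface rubric items (an invariant is *stated*) can still miss whether the argument *uses* them correctly, so agreement rates with human graders have to be validated, not assumed.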

Ratings & Reviews

3.7 out of 5 · 3 ratings

About

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.
