AI Post Transformers

SALCA for Sparse Long-Context Decoding

This episode explores why long-context LLM decoding becomes memory-bandwidth bound: once prompt prefill is done, each new token must scan an ever-growing KV cache, so inference is limited more by data movement than by raw compute. It explains sparse attention as the idea that only a small fraction of prior tokens matter at each step, and uses top-k recall to frame the core challenge: preserving the right token ranking while cutting memory traffic. The discussion centers on SALCA's main argument: a sparsity-aware accelerator can make sparse decoding practical by combining dominant-channel feature selection with asymmetric ultra-low-bit query/key prediction, reducing predictor traffic to roughly one-eighth of a standard 4-bit filtering baseline. The episode connects transformer inference theory, serving-system bottlenecks, and custom chip design into a concrete case for faster, more energy-efficient long-context generation.

Sources:
1. SALCA for Sparse Long-Context Decoding. https://arxiv.org/pdf/2604.24820
2. Fast Transformer Decoding: One Write-Head is All You Need. Noam Shazeer, 2019. https://scholar.google.com/scholar?q=Fast+Transformer+Decoding:+One+Write-Head+is+All+You+Need
3. Efficiently Scaling Transformer Inference. Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jeff Dean, et al., 2022. https://scholar.google.com/scholar?q=Efficiently+Scaling+Transformer+Inference
4. Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
5. QUEST: Query-Aware Sparsity for Efficient Long-Context LLM Inference. Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han, 2024. https://scholar.google.com/scholar?q=QUEST:+Query-Aware+Sparsity+for+Efficient+Long-Context+LLM+Inference
6. SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. Hanrui Wang, Zhekai Zhang, Song Han, 2020. https://scholar.google.com/scholar?q=SpAtten:+Efficient+Sparse+Attention+Architecture+with+Cascade+Token+and+Head+Pruning
7. Energon: Towards Efficient Acceleration of Transformers Using Dynamic Sparse Attention. Zhe Zhou, Junlin Liu, Zhenyu Gu, Guangyu Sun, 2021. https://scholar.google.com/scholar?q=Energon:+Towards+Efficient+Acceleration+of+Transformers+Using+Dynamic+Sparse+Attention
8. S2-Attention: Hardware-Aware Context Sharding Among Attention Heads. Xihui Lin, Yunan Zhang, Suyu Ge, Liliang Ren, Barun Patra, Vishrav Chaudhary, Hao Peng, Xia Song, 2024. https://scholar.google.com/scholar?q=S2-Attention:+Hardware-Aware+Context+Sharding+Among+Attention+Heads
9. SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining. Yifan Zhang, Zunhai Su, Shuhao Hu, Rui Yang, Wei Wu, Yulei Qian, Yuchen Xie, Xunliang Cai, 2026. https://scholar.google.com/scholar?q=SnapMLA:+Efficient+Long-Context+MLA+Decoding+via+Hardware-Aware+FP8+Quantized+Pipelining
10. ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs (https://arxiv.org/abs/2602.07721). Yanlin Qi et al., 2026. https://scholar.google.com/scholar?q=ParisKV:+Fast+and+Drift-Robust+KV-Cache+Retrieval+for+Long-Context+LLMs+(https://arxiv.org/abs/2602.07721)
11. LouisKV: Efficient KV Cache Retrieval for Long Input-Output Sequences (https://arxiv.org/abs/2510.11292). Wenbo Wu et al., 2025. https://scholar.google.com/scholar?q=LouisKV:+Efficient+KV+Cache+Retrieval+for+Long+Input-Output+Sequences+(https://arxiv.org/abs/2510.11292)
12. Efficient Low Rank Attention for Long-Context Inference in Large Language Models (https://arxiv.org/abs/2510.23649). Tenghui Li et al., 2025. https://scholar.google.com/scholar?q=Efficient+Low+Rank+Attention+for+Long-Context+Inference+in+Large+Language+Models+(https://arxiv.org/abs/2510.23649)
13. LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference (https://arxiv.org/abs/2503.08879). Guangtao Wang et al., 2025. https://scholar.google.com/scholar?q=LLMs+Know+What+to+Drop:+Self-Attention+Guided+KV+Cache+Eviction+for+Efficient+Long-Context+Inference+(https://arxiv.org/abs/2503.08879)
14. Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs (https://arxiv.org/abs/2602.05191). Wentao Ni et al., 2026. https://scholar.google.com/scholar?q=Double-P:+Hierarchical+Top-P+Sparse+Attention+for+Long-Context+LLMs+(https://arxiv.org/abs/2602.05191)
15. ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference (https://arxiv.org/abs/2502.00299). Xiang Liu et al., 2025. https://scholar.google.com/scholar?q=ChunkKV:+Semantic-Preserving+KV+Cache+Compression+for+Efficient+Long-Context+LLM+Inference+(https://arxiv.org/abs/2502.00299)
16. ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification (https://arxiv.org/abs/2405.14256). Yefei He et al., 2024. https://scholar.google.com/scholar?q=ZipCache:+Accurate+and+Efficient+KV+Cache+Quantization+with+Salient+Token+Identification+(https://arxiv.org/abs/2405.14256)
17. Accurate KV Cache Quantization with Outlier Tokens Tracing (https://arxiv.org/abs/2505.10938). Yi Su et al., 2025. https://scholar.google.com/scholar?q=Accurate+KV+Cache+Quantization+with+Outlier+Tokens+Tracing+(https://arxiv.org/abs/2505.10938)
18. A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts (https://arxiv.org/abs/2410.01485). Suyu Ge et al., 2024. https://scholar.google.com/scholar?q=A+Little+Goes+a+Long+Way:+Efficient+Long+Context+Training+and+Inference+with+Partial+Contexts+(https://arxiv.org/abs/2410.01485)
19. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3
20. AI Post Transformers: FlatAttention for Tile-Based Accelerator Inference. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-flatattention-for-tile-based-accelerator-56e6ca.mp3
21. AI Post Transformers: MiniMax Sparse Attention at Million-Token Scale. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-06-13-minimax-sparse-attention-at-million-toke-300108.mp3
22. AI Post Transformers: ScoutAttention for Efficient KV Cache Offloading. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-24-scoutattention-for-efficient-kv-cache-of-b26699.mp3
23. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
24. AI Post Transformers: Stochastic KV Routing for Cache Sharing. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-29-stochastic-kv-routing-for-cache-sharing-5fef63.mp3