AI Post Transformers

mcgrof

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

  1. 2 days ago

    PackKV Lossy Compression for KV Caches

    This episode explores PackKV, a method for shrinking the transformer KV cache during long-context inference by combining low-bit quantization with GPU-friendly repacking and lossy compression. It explains why KV cache growth can dominate memory use in large models, using examples where cache size exceeds model weights, and frames the problem as a systems bottleneck driven more by memory traffic than raw computation. The discussion compares PackKV to prior approaches such as KV quantization, token pruning, and offloading to CPU memory, highlighting the paper’s argument that compression is only useful if decompression is tightly integrated into the inference pipeline. A listener would find it interesting because it turns a seemingly low-level optimization into a broader claim about how future long-context LLM performance may depend as much on memory layout and kernel design as on model architecture.

    Sources:

    1. PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression — Bo Jiang, Taolue Yang, Youyuan Liu, Xubin He, Sheng Di, Sian Jin, 2025 http://arxiv.org/abs/2512.24449
    2. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time — Zichang Liu, Aditya Desai, Fangshuo Liao, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava, et al., 2023 https://scholar.google.com/scholar?q=Scissorhands:+Exploiting+the+Persistence+of+Importance+Hypothesis+for+LLM+KV+Cache+Compression+at+Test+Time
    3. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Zhao Song, Yuandong Tian, Clark Barrett, Zhangyang Wang, Beidi Chen, et al., 2023 https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models
    4. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache — Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu, 2024 https://scholar.google.com/scholar?q=KIVI:+A+Tuning-Free+Asymmetric+2bit+Quantization+for+KV+Cache
    5. SnapKV: LLM Knows What You are Looking for Before Generation — Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen, 2024 https://scholar.google.com/scholar?q=SnapKV:+LLM+Knows+What+You+are+Looking+for+Before+Generation
    6. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving — Yuyang Liu, Haotian Li, Yao Cheng, Siddhant Ray, Yizhou Huang, Qizhen Zhang, Kaixiang Du, Jinyang Yao, Shan Lu, Ganesh Ananthanarayanan, et al., 2024 https://scholar.google.com/scholar?q=CacheGen:+KV+Cache+Compression+and+Streaming+for+Fast+Large+Language+Model+Serving
    7. Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache — Zhuodong Zhang, Shang Liu, Ruobing Chen, Bhavya Kailkhura, Ben Chen, An Wang, 2024 https://scholar.google.com/scholar?q=Q-Hitter:+A+Better+Token+Oracle+for+Efficient+LLM+Inference+via+Sparse-Quantized+KV+Cache
    8. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling — Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, Wen Xiao, 2025 https://scholar.google.com/scholar?q=PyramidKV:+Dynamic+KV+Cache+Compression+based+on+Pyramidal+Information+Funneling
    9. Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution — Alessio Devoto, Maximilian Jeblick, Simon Jegou, 2025 https://scholar.google.com/scholar?q=Expected+Attention:+KV+Cache+Compression+by+Estimating+Attention+from+Future+Queries+Distribution
    10. TurboQuant: Online Vector Quantization with Near-Optimal Distortion — Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni, 2025 https://scholar.google.com/scholar?q=TurboQuant:+Online+Vector+Quantization+with+Near-Optimal+Distortion
    11. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang et al., 2025 https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
    12. HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse — Yuwei An et al., 2025 https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse
    13. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation — Shihao Wang et al., 2026 https://scholar.google.com/scholar?q=ProphetKV:+User-Query-Driven+Selective+Recomputation+for+Efficient+KV+Cache+Reuse+in+Retrieval-Augmented+Generation
    14. ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification — Yefei He et al., 2024 https://scholar.google.com/scholar?q=ZipCache:+Accurate+and+Efficient+KV+Cache+Quantization+with+Salient+Token+Identification
    15. KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs — Zunhai Su and Kehong Yuan, 2025 https://scholar.google.com/scholar?q=KVSink:+Understanding+and+Enhancing+the+Preservation+of+Attention+Sinks+in+KV+Cache+Quantization+for+LLMs
    16. ThinK: Thinner Key Cache by Query-Driven Pruning — Yuhui Xu et al., 2024 https://scholar.google.com/scholar?q=ThinK:+Thinner+Key+Cache+by+Query-Driven+Pruning
    17. KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head — Isaac Rehg, 2024 https://scholar.google.com/scholar?q=KV-Compress:+Paged+KV-Cache+Compression+with+Variable+Compression+Rates+per+Attention+Head
    18. Paged Attention Meets FlexAttention: Unlocking Long-Context Efficiency in Deployed Inference — Thomas Joshi et al., 2025 https://scholar.google.com/scholar?q=Paged+Attention+Meets+FlexAttention:+Unlocking+Long-Context+Efficiency+in+Deployed+Inference
    19. AI Post Transformers: TokenDance for Multi-Agent KV Cache Sharing — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-22-tokendance-for-multi-agent-kv-cache-shar-aa9b99.mp3
    20. AI Post Transformers: KVSwap for Disk-Aware Long-Context On-Device Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-16-kvswap-for-disk-aware-long-context-on-de-f3c15e.mp3
    21. AI Post Transformers: CacheFlow and 3D-Parallel KV Cache Restoration — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-01-cacheflow-and-3d-parallel-kv-cache-resto-8db883.mp3
    22. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3
    23. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3
    24. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3
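To make the episode's core mechanism concrete, here is a minimal pure-Python sketch of the general family of techniques PackKV builds on: per-group asymmetric low-bit quantization of KV values followed by bit-packing. This is an illustrative toy, not the paper's actual algorithm or kernel; the function names and the choice of 4-bit groups are assumptions.

```python
def quantize_group(values, bits=4):
    """Asymmetric quantization of one group of KV values to `bits` bits."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]  # integer codes in [0, levels]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Reconstruct approximate values from codes plus per-group (scale, lo)."""
    return [lo + c * scale for c in codes]

def pack_nibbles(codes):
    """Pack two 4-bit codes per byte -- a miniature stand-in for GPU repacking."""
    out = bytearray()
    for i in range(0, len(codes), 2):
        hi = codes[i + 1] if i + 1 < len(codes) else 0
        out.append((hi << 4) | codes[i])
    return bytes(out)
```

The key point the episode stresses survives even in this toy: the packed bytes are useless until a matching unpack/dequantize step runs, so compression only pays off when decompression is wired directly into the attention pipeline.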

  2. 2 days ago

    Reinforcement Learning in 2025: An Overview

    This episode explores a 2025 survey of reinforcement learning as a statement about how the field now organizes itself, covering value-based, policy-based, model-based, multi-agent, offline, and LLM-related RL. It explains core concepts like Markov decision processes, policies, value functions, delayed credit assignment, and the contrast between direct policy optimization and methods that estimate action values before deriving behavior. The discussion highlights why actor-critic methods became so central, how model-based RL uses world models to plan ahead, and why offline RL is difficult when agents must improve from fixed logged data rather than fresh interaction. Listeners would find it interesting because it turns a broad survey into a clear map of where reinforcement learning stands in 2025, including the tensions between elegant theory, unstable training, and the practical compromises that shaped modern RL.

    Sources:

    1. Reinforcement Learning: An Overview — Kevin Murphy, 2024 http://arxiv.org/abs/2412.05265
    2. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems — Sergey Levine, Aviral Kumar, George Tucker, Justin Fu, 2020 https://scholar.google.com/scholar?q=Offline+Reinforcement+Learning:+Tutorial,+Review,+and+Perspectives+on+Open+Problems
    3. D4RL: Datasets for Deep Data-Driven Reinforcement Learning — Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, Sergey Levine, 2020 https://scholar.google.com/scholar?q=D4RL:+Datasets+for+Deep+Data-Driven+Reinforcement+Learning
    4. Conservative Q-Learning for Offline Reinforcement Learning — Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine, 2020 https://scholar.google.com/scholar?q=Conservative+Q-Learning+for+Offline+Reinforcement+Learning
    5. Decision Transformer: Reinforcement Learning via Sequence Modeling — Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch, 2021 https://scholar.google.com/scholar?q=Decision+Transformer:+Reinforcement+Learning+via+Sequence+Modeling
    6. Reinforcement Learning: An Introduction — Richard S. Sutton and Andrew G. Barto, 2018 https://scholar.google.com/scholar?q=Reinforcement+Learning:+An+Introduction
    7. Algorithms for Reinforcement Learning — Csaba Szepesvari, 2010 https://scholar.google.com/scholar?q=Algorithms+for+Reinforcement+Learning
    8. Human-level control through deep reinforcement learning — Volodymyr Mnih, Koray Kavukcuoglu, David Silver, and others, 2015 https://scholar.google.com/scholar?q=Human-level+control+through+deep+reinforcement+learning
    9. Trust Region Policy Optimization — John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel, 2015 https://scholar.google.com/scholar?q=Trust+Region+Policy+Optimization
    10. Proximal Policy Optimization Algorithms — John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov, 2017 https://scholar.google.com/scholar?q=Proximal+Policy+Optimization+Algorithms
    11. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model — Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, and others, 2020 https://scholar.google.com/scholar?q=Mastering+Atari,+Go,+Chess+and+Shogi+by+Planning+with+a+Learned+Model
    12. Fine-Tuning Language Models from Human Preferences — Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, and others, 2019 https://scholar.google.com/scholar?q=Fine-Tuning+Language+Models+from+Human+Preferences
    13. Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafael Rafailov, Archit Sharma, Eric Mitchell, and others, 2023 https://scholar.google.com/scholar?q=Direct+Preference+Optimization:+Your+Language+Model+is+Secretly+a+Reward+Model
    14. On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization — Yong Lin et al., 2024 https://scholar.google.com/scholar?q=On+the+Limited+Generalization+Capability+of+the+Implicit+Reward+Model+Induced+by+Direct+Preference+Optimization
    15. Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL — Taku Yamagata, Ahmed Khalil, Raul Santos-Rodriguez, 2023 https://scholar.google.com/scholar?q=Q-learning+Decision+Transformer:+Leveraging+Dynamic+Programming+for+Conditional+Sequence+Modelling+in+Offline+RL
    16. Reinformer: Max-Return Sequence Modeling for Offline RL — Zifeng Zhuang et al., 2024 https://scholar.google.com/scholar?q=Reinformer:+Max-Return+Sequence+Modeling+for+Offline+RL
    17. Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning — Jialong Wu, Haoyu Ma, Chaoyi Deng, Mingsheng Long, 2023 https://scholar.google.com/scholar?q=Pre-training+Contextualized+World+Models+with+In-the-wild+Videos+for+Reinforcement+Learning
    18. PreLAR: World Model Pre-training with Learnable Action Representation — Lixuan Zhang, Meina Kan, Shiguang Shan, Xilin Chen, 2024 https://scholar.google.com/scholar?q=PreLAR:+World+Model+Pre-training+with+Learnable+Action+Representation
    19. Ctrl-World: A Controllable Generative World Model for Robot Manipulation — Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, Chelsea Finn, 2025 https://scholar.google.com/scholar?q=Ctrl-World:+A+Controllable+Generative+World+Model+for+Robot+Manipulation
    20. Skill Transfer and Discovery for Sim-to-Real Learning: A Representation-Based Viewpoint — Haitong Ma, Zhaolin Ren, Bo Dai, Na Li, 2024 https://scholar.google.com/scholar?q=Skill+Transfer+and+Discovery+for+Sim-to-Real+Learning:+A+Representation-Based+Viewpoint
    21. AI Post Transformers: DreamerV3 World Models Across 150 Tasks — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-20-dreamerv3-world-models-across-150-tasks-af5edb.mp3
    22. AI Post Transformers: Experience-Based Learning Beyond Human Data — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-19-experience-based-learning-beyond-human-d-b0caa4.mp3
    23. AI Post Transformers: Learning to Reason with 13 Parameters — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-14-learning-to-reason-with-13-parameters-54c87f.mp3
    24. AI Post Transformers: ASI-Evolve for Data, Architectures, and RL — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-05-asi-evolve-for-data-architectures-and-rl-197b2b.mp3
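The value-function machinery the hosts walk through can be sketched in a few lines. The following is a toy value-iteration example on a hypothetical three-state chain MDP (reach state 3 for reward 1); the transition table and discount factor are invented for illustration and stand in for the value-based methods the survey covers.

```python
# P[s][a] -> list of (prob, next_state, reward); state 3 is terminal (value 0).
P = {
    0: {"right": [(1.0, 1, 0.0)]},
    1: {"right": [(1.0, 2, 0.0)], "left": [(1.0, 0, 0.0)]},
    2: {"right": [(1.0, 3, 1.0)], "left": [(1.0, 1, 0.0)]},
}

def value_iteration(P, gamma=0.9, tol=1e-10):
    """Iterate the Bellman optimality backup until values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(
                sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```

Running it shows the delayed-credit-assignment point: the reward at the goal propagates backward, discounted once per step, giving V[2] = 1.0, V[1] = 0.9, V[0] = 0.81.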

  3. 2 days ago

    Training LLMs for Divide-and-Conquer Reasoning

    This episode explores a paper arguing that language models can reason more effectively at test time if they are trained to use divide-and-conquer strategies instead of defaulting to a single linear chain of thought. It explains the core distinction between ordinary step-by-step reasoning and structured decomposition into subproblems, then situates that idea alongside prior work such as Tree of Thoughts, Least-to-Most prompting, self-consistency, and recent reasoning-focused post-training. The discussion highlights the paper’s main claim that current post-training regimes bias models toward linear reasoning habits, which can make naive divide-and-conquer prompting underperform unless the decomposition behavior itself is explicitly trained. A listener would find it interesting because it gets at a central question in modern AI: whether better inference-time scaling comes from simply generating longer reasoning traces, or from teaching models to search, branch, and recombine intermediate results in a more algorithmic way.

    Sources:

    1. Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability — Xiao Liang, Zhong-Zhi Li, Zhenghao Lin, Eric Hancheng Jiang, Hengyuan Zhang, Yelong Shen, Kai-Wei Chang, Ying Nian Wu, Yeyun Gong, Weizhu Chen, 2026 http://arxiv.org/abs/2602.02477
    2. Tree of Thoughts: Deliberate Problem Solving with Large Language Models — Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan, 2023 https://scholar.google.com/scholar?q=Tree+of+Thoughts:+Deliberate+Problem+Solving+with+Large+Language+Models
    3. Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions — Eric Zelikman, Qian Huang, Gabriel Poesia, Noah Goodman, Nick Haber, 2023 https://scholar.google.com/scholar?q=Parsel:+Algorithmic+Reasoning+with+Language+Models+by+Composing+Decompositions
    4. Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability — Xiao Liang, Zhong-Zhi Li, Zhenghao Lin, Eric Hancheng Jiang, Hengyuan Zhang, Yelong Shen, Kai-Wei Chang, Ying Nian Wu, Yeyun Gong, Weizhu Chen, 2026 https://scholar.google.com/scholar?q=Training+LLMs+for+Divide-and-Conquer+Reasoning+Elevates+Test-Time+Scalability
    5. Self-Consistency Improves Chain of Thought Reasoning in Language Models — Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou, 2022 https://scholar.google.com/scholar?q=Self-Consistency+Improves+Chain+of+Thought+Reasoning+in+Language+Models
    6. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters — Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar, 2024 https://scholar.google.com/scholar?q=Scaling+LLM+Test-Time+Compute+Optimally+can+be+More+Effective+than+Scaling+Model+Parameters
    7. Learning to Reason with LLMs — OpenAI, 2024 https://scholar.google.com/scholar?q=Learning+to+Reason+with+LLMs
    8. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — Daya Guo, Dejian Yang, and the DeepSeek-AI team, 2025 https://scholar.google.com/scholar?q=DeepSeek-R1:+Incentivizing+Reasoning+Capability+in+LLMs+via+Reinforcement+Learning
    9. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models — Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, Ed Chi, 2022 https://scholar.google.com/scholar?q=Least-to-Most+Prompting+Enables+Complex+Reasoning+in+Large+Language+Models
    10. Decomposed Prompting: A Modular Approach for Solving Complex Tasks — Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, Ashish Sabharwal, 2022 https://scholar.google.com/scholar?q=Decomposed+Prompting:+A+Modular+Approach+for+Solving+Complex+Tasks
    11. DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition — Z. Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, and the DeepSeek team, 2025 https://scholar.google.com/scholar?q=DeepSeek-Prover-V2:+Advancing+Formal+Mathematical+Reasoning+via+Reinforcement+Learning+for+Subgoal+Decomposition
    12. Decompose, Analyze and Rethink: Solving Intricate Problems with Human-like Reasoning Cycle — Shangzi Xue, Zhenya Huang, Jiayu Liu, Xin Lin, Yuting Ning, Binbin Jin, Xin Li, Qi Liu, 2024 https://scholar.google.com/scholar?q=Decompose,+Analyze+and+Rethink:+Solving+Intricate+Problems+with+Human-like+Reasoning+Cycle
    13. Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving — Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yonghui Wu, Yuchen Wu, et al., 2025 https://scholar.google.com/scholar?q=Seed-Prover:+Deep+and+Broad+Reasoning+for+Automated+Theorem+Proving
    14. DAPO: An Open-Source LLM Reinforcement Learning System at Scale — Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, Mingxuan Wang, 2025 https://scholar.google.com/scholar?q=DAPO:+An+Open-Source+LLM+Reinforcement+Learning+System+at+Scale
    15. Continuous Chain of Thought Enables Parallel Exploration and Reasoning — Halil Alperen Gozeten et al., 2025 https://scholar.google.com/scholar?q=Continuous+Chain+of+Thought+Enables+Parallel+Exploration+and+Reasoning
    16. How to Think Step-by-Step: A Mechanistic Understanding of Chain-of-Thought Reasoning — Subhabrata Dutta et al., 2024 https://scholar.google.com/scholar?q=How+to+Think+Step-by-Step:+A+Mechanistic+Understanding+of+Chain-of-Thought+Reasoning
    17. Decompose-ToM: Enhancing Theory of Mind Reasoning in Large Language Models through Simulation and Task Decomposition — Sneheel Sarangi et al., 2025 https://scholar.google.com/scholar?q=Decompose-ToM:+Enhancing+Theory+of+Mind+Reasoning+in+Large+Language+Models+through+Simulation+and+Task+Decomposition
    18. Select-Then-Decompose: From Empirical Analysis to Adaptive Selection Strategy for Task Decomposition in Large Language Models — Shuodi Liu et al., 2025 https://scholar.google.com/scholar?q=Select-Then-Decompose:+From+Empirical+Analysis+to+Adaptive+Selection+Strategy+for+Task+Decomposition+in+Large+Language+Models
    19. LearNAT: Learning NL2SQL with AST-guided Task Decomposition for Large Language Models — Weibin Liao et al., 2025 https://scholar.google.com/scholar?q=LearNAT:+Learning+NL2SQL+with+AST-guided+Task+Decomposition+for+Large+Language+Models
    20. Large Language Models Reasoning Abilities Under Non-Ideal Conditions After RL-Fine-Tuning — Chang Tian et al., 2025 https://scholar.google.com/scholar?q=Large+Language+Models+Reasoning+Abilities+Under+Non-Ideal+Conditions+After+RL-Fine-Tuning
    21. AI Post Transformers: Learning to Reason with 13 Parameters — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-14-learning-to-reason-with-13-parameters-54c87f.mp3
    22. AI Post Transformers: IMO-Bench for Robust Mathematical Reasoning — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-imo-bench-for-robust-mathematical-reason-143489.mp3
    23. AI Post Transformers: Test-time Scaling for Multi-Agent Collaborative Reasoning — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-22-test-time-scaling-for-multi-agent-collab-082570.mp3
    24. AI Post Transformers: Chain-of-Thought Reasoning: A Brittle Mirage? — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/chain-of-thought-reasoning-a-brittle-mirage/
    25. AI Post Transformers: NeurIPS 2025: Reinforcement Learning for Reasoning in Large Language Models with One Training Example — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/neurips-2025-reinforcement-learning-for-reasoning-in-large-language-models-with/
    26. AI Post Transformers: TraceRL: Reinforcement Learning for Diffusion Language Models — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/tracerl-reinforcement-learning-for-diffusion-language-models/
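The structural contrast the episode keeps returning to, a linear chain versus branch-and-recombine, is easiest to see in classic algorithmic form. This toy sketch (not from the paper; the task of summing an integer range is invented purely for illustration) decomposes a problem into independent subproblems and recombines their partial answers, which is the control-flow shape the paper wants models to internalize.

```python
def solve(task):
    """Divide-and-conquer over a half-open range (lo, hi)."""
    lo, hi = task
    if hi - lo <= 2:                        # base case: solve directly (a short linear "chain")
        return sum(range(lo, hi))
    mid = (lo + hi) // 2                    # decompose into two independent subproblems
    left, right = solve((lo, mid)), solve((mid, hi))
    return left + right                     # recombine intermediate results
```

A linear chain-of-thought analogue would walk the range one element at a time; the recursive version instead produces a tree of subcalls whose depth grows logarithmically, which is the kind of test-time structure the paper argues must be explicitly trained rather than merely prompted.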

  4. 2 days ago

    Training Million-Token LLMs Beyond the Memory Barrier

    This episode explores how the OOMB training system tries to break the memory bottleneck that makes million-token language model training impractical, focusing on why training long contexts is much harder than simply extending inference-time context windows. It explains the paper’s core ideas in plain language, including chunk-recurrent training that recomputes activations during backpropagation, O(1)-style activation memory, and the harder remaining problem of storing and moving the KV cache across extremely long sequences. The discussion also weighs the paper’s central claim with healthy skepticism, asking whether fitting multi-million-token training steps on a single GPU proves genuinely useful long-range learning or mainly demonstrates a strong systems optimization. Listeners would find it interesting because it connects deep learning mechanics, hardware limits, and competing long-context strategies like Ring Attention into a clear debate about what real progress in long-context LLMs should look like.

    Sources:

    1. Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts — Wenhao Li, Daohai Yu, Gen Luo, Yuxin Zhang, Fei Chao, Rongrong Ji, Yifan Wu, Jiaxin Liu, Ziyang Gong, Zimu Liao, 2026 http://arxiv.org/abs/2602.02108
    2. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context — Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov, 2019 https://scholar.google.com/scholar?q=Transformer-XL:+Attentive+Language+Models+Beyond+a+Fixed-Length+Context
    3. Recurrent Memory Transformer — Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev, 2022 https://scholar.google.com/scholar?q=Recurrent+Memory+Transformer
    4. Ring Attention with Blockwise Transformers for Near-Infinite Context — Hao Liu, Matei Zaharia, Pieter Abbeel, 2023 https://scholar.google.com/scholar?q=Ring+Attention+with+Blockwise+Transformers+for+Near-Infinite+Context
    5. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention — Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal, 2024 https://scholar.google.com/scholar?q=Leave+No+Context+Behind:+Efficient+Infinite+Context+Transformers+with+Infini-attention
    6. LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models — Yingfeng Chen et al., 2024 https://scholar.google.com/scholar?q=LongLoRA:+Efficient+Fine-tuning+of+Long-Context+Large+Language+Models
    7. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon et al., 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
    8. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism — Mohammad Shoeybi et al., 2020 https://scholar.google.com/scholar?q=Megatron-LM:+Training+Multi-Billion+Parameter+Language+Models+Using+Model+Parallelism
    9. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints — Joshua Ainslie et al., 2023 https://scholar.google.com/scholar?q=GQA:+Training+Generalized+Multi-Query+Transformer+Models+from+Multi-Head+Checkpoints
    10. ZeRO-Offload: Democratizing Billion-Scale Model Training — Samyam Rajbhandari et al., 2020 https://scholar.google.com/scholar?q=ZeRO-Offload:+Democratizing+Billion-Scale+Model+Training
    11. Long Context Compression with Activation Beacon — approx. Liu et al., 2024 https://scholar.google.com/scholar?q=Long+Context+Compression+with+Activation+Beacon
    12. Boosting Long-Context Information Seeking via Query-Guided Activation Refilling — approx. unknown from snippet, 2024 or 2025 https://scholar.google.com/scholar?q=Boosting+Long-Context+Information+Seeking+via+Query-Guided+Activation+Refilling
    13. Kvlink: Accelerating Large Language Models via Efficient KV Cache Reuse — approx. unknown from snippet, 2024 or 2025 https://scholar.google.com/scholar?q=Kvlink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
    14. SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips — approx. unknown from snippet, 2024 or 2025 https://scholar.google.com/scholar?q=SuperOffload:+Unleashing+the+Power+of+Large-Scale+LLM+Training+on+Superchips
    15. SPPO: Efficient Long-Sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading — approx. unknown from snippet, 2024 or 2025 https://scholar.google.com/scholar?q=SPPO:+Efficient+Long-Sequence+LLM+Training+via+Adaptive+Sequence+Pipeline+Parallel+Offloading
    16. Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption — approx. unknown from snippet, 2024 or 2025 https://scholar.google.com/scholar?q=Keep+the+Cost+Down:+A+Review+on+Methods+to+Optimize+LLM's+KV-Cache+Consumption
    17. AI Post Transformers: DeepSeek-V4 and Practical Million-Token Context — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-25-deepseek-v4-and-practical-million-token-6f4de1.mp3
    18. AI Post Transformers: KVSwap for Disk-Aware Long-Context On-Device Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-16-kvswap-for-disk-aware-long-context-on-de-f3c15e.mp3
    19. AI Post Transformers: RetrievalAttention for Long-Context LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-17-retrievalattention-for-long-context-llm-ddf566.mp3
    20. AI Post Transformers: CacheFlow and 3D-Parallel KV Cache Restoration — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-01-cacheflow-and-3d-parallel-kv-cache-resto-8db883.mp3
    21. AI Post Transformers: Gated Linear Attention for Efficient Long Sequences — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-18-gated-linear-attention-for-efficient-lon-c858ab.mp3
    22. AI Post Transformers: Parallelizing DeltaNet Linear Transformers over Sequence Length — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-18-parallelizing-deltanet-linear-transforme-2d0377.mp3
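The recompute-instead-of-store idea behind chunk-recurrent training is the same trade underlying activation checkpointing, and it can be shown with a toy scalar recurrence. This sketch is illustrative only: `step` is a stand-in for a transformer block, and the checkpoint interval `k` is an invented parameter, not OOMB's actual mechanism.

```python
def step(h, x):
    """Stand-in for one transformer-block application (a cheap scalar update)."""
    return 0.5 * h + x

def forward_with_checkpoints(xs, k):
    """Run the full forward pass but keep only every k-th activation."""
    h, ckpts = 0.0, {0: 0.0}
    for t, x in enumerate(xs):
        h = step(h, x)
        if (t + 1) % k == 0:
            ckpts[t + 1] = h
    return h, ckpts

def activation_at(t, xs, ckpts, k):
    """Recover the activation after t steps by recomputing from the nearest checkpoint."""
    base = (t // k) * k
    h = ckpts[base]
    for i in range(base, t):
        h = step(h, xs[i])
    return h
```

Stored state drops from O(T) activations to O(T/k) checkpoints, at the cost of recomputing at most k steps during backpropagation, which is the essence of trading FLOPs for memory that the episode discusses.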

  5. 3 days ago

    DeepWalk and the Rise of Graph Embeddings

    This episode explores how DeepWalk helped launch modern graph representation learning by turning random walks over a social network into “sentences” and then applying the Skip-Gram ideas behind word2vec to learn dense node embeddings. It explains why that mattered in 2014: instead of relying on heavy spectral or matrix-factorization methods, DeepWalk offered an online, scalable way to learn reusable graph features that worked especially well when labeled data was scarce. The discussion digs into the paper’s main empirical claim that, on social-network benchmarks like BlogCatalog, Flickr, and YouTube, the method substantially improved node classification under sparse-label settings. It is interesting because the conversation goes beyond the headline result and asks what really drove the gains: the language-modeling objective, the community-biased random-walk sampler, or simply a better optimization setup for homophilous graphs.

    Sources:

    1. DeepWalk: Online Learning of Social Representations — Bryan Perozzi, Rami Al-Rfou, Steven Skiena, 2014 https://arxiv.org/pdf/1403.6652
    2. Distributed Representations of Words and Phrases and their Compositionality — Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean, 2013 https://scholar.google.com/scholar?q=Distributed+Representations+of+Words+and+Phrases+and+their+Compositionality
    3. Efficient Estimation of Word Representations in Vector Space — Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, 2013 https://scholar.google.com/scholar?q=Efficient+Estimation+of+Word+Representations+in+Vector+Space
    4. Learning Latent Social Dimensions for Link Prediction — Lei Tang, Huan Liu, 2011 https://scholar.google.com/scholar?q=Learning+Latent+Social+Dimensions+for+Link+Prediction
    5. LINE: Large-scale Information Network Embedding — Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, Qiaozhu Mei, 2015 https://scholar.google.com/scholar?q=LINE:+Large-scale+Information+Network+Embedding
    6. node2vec: Scalable Feature Learning for Networks — Aditya Grover, Jure Leskovec, 2016 https://scholar.google.com/scholar?q=node2vec:+Scalable+Feature+Learning+for+Networks
    7. Planetoid: Inductive Representation Learning on Large Graphs — Zhilin Yang, William W. Cohen, Ruslan Salakhutdinov, 2016 https://scholar.google.com/scholar?q=Planetoid:+Inductive+Representation+Learning+on+Large+Graphs
    8. Node Embedding for Homophilous Graphs with ARGEW: Augmentation of Random walks by Graph Edge Weights — authors not shown in snippet, recent; exact year not shown https://scholar.google.com/scholar?q=Node+Embedding+for+Homophilous+Graphs+with+ARGEW:+Augmentation+of+Random+walks+by+Graph+Edge+Weights
    9. Graph neural networks for graphs with heterophily: A survey — authors not shown in snippet, recent; exact year not shown https://scholar.google.com/scholar?q=Graph+neural+networks+for+graphs+with+heterophily:+A+survey
    10. Learning attribute and homophily measures through random walks — authors not shown in snippet, recent; exact year not shown https://scholar.google.com/scholar?q=Learning+attribute+and+homophily+measures+through+random+walks
    11. Graph Node Embedding by Neighborhood Prediction Based on Multiview Contrastive Learning — authors not shown in snippet, recent; exact year not shown https://scholar.google.com/scholar?q=Graph+Node+Embedding+by+Neighborhood+Prediction+Based+on+Multiview+Contrastive+Learning
    12. Dynamic graph representation learning with neural networks: A survey — authors not shown in snippet, recent; exact year not shown https://scholar.google.com/scholar?q=Dynamic+graph+representation+learning+with+neural+networks:+A+survey
    13. AI Post Transformers: GraphSAGE: Inductive Representation Learning on Large Graphs — Hal Turing & Dr. Ada Shannon https://podcast.do-not-panic.com/episodes/graphsage-inductive-representation-learning-on-large-graphs/
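The walks-as-sentences trick is simple enough to sketch directly. The following toy (uniform walks on an adjacency dict, then Skip-Gram-style (center, context) pairs from a window) follows the general recipe the episode describes; the function names, the tiny graph, and the uniform sampler are illustrative choices, and the learned-embedding stage (hierarchical-softmax Skip-Gram in the original) is omitted.

```python
import random

def random_walks(adj, num_walks, walk_len, rng):
    """Generate `num_walks` uniform random walks per node, DeepWalk-style."""
    walks, nodes = [], list(adj)
    for _ in range(num_walks):
        rng.shuffle(nodes)          # DeepWalk shuffles node order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

def skipgram_pairs(walk, window):
    """Treat the walk as a sentence: emit (center, context) training pairs."""
    pairs = []
    for i, center in enumerate(walk):
        for j in range(max(0, i - window), min(len(walk), i + window + 1)):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs
```

Feeding these pairs to any word2vec implementation yields node embeddings; node2vec's later contribution was biasing `rng.choice` toward breadth-first or depth-first exploration.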

  6. 3 days ago


    Geometric Memory in Deep Sequence Models

    This episode explores whether deep sequence models store knowledge as simple associative lookups or as geometric memories that encode broader relational structure. It discusses a recent paper arguing that, after memorizing graph facts in their weights, sequence models can answer multi-hop path queries as if they were making a much shorter move through embedding space, with learned representations resembling graph-embedding methods like node2vec and DeepWalk. The conversation highlights why that matters mechanistically: it suggests some forms of reasoning may be amortized into the model’s parameters during training rather than reconstructed step by step at inference time. Listeners would find it interesting for its sharp debate over what counts as real reasoning versus a clever shortcut, and for its caution about how far results from synthetic graph settings should generalize to large language models in the wild. Sources: 1. Deep sequence models tend to memorize geometrically; it is unclear why — Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar, 2025 http://arxiv.org/abs/2510.26745 2. DeepWalk: Online Learning of Social Representations — Bryan Perozzi, Rami Al-Rfou, Steven Skiena, 2014 https://scholar.google.com/scholar?q=DeepWalk:+Online+Learning+of+Social+Representations 3. node2vec: Scalable Feature Learning for Networks — Aditya Grover, Jure Leskovec, 2016 https://scholar.google.com/scholar?q=node2vec:+Scalable+Feature+Learning+for+Networks 4. Birth of a Transformer: A Memory Viewpoint — Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, Leon Bottou, 2023 https://scholar.google.com/scholar?q=Birth+of+a+Transformer:+A+Memory+Viewpoint 5. Deep sequence models tend to memorize geometrically; it is unclear why — Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar, 2025 https://scholar.google.com/scholar?q=Deep+sequence+models+tend+to+memorize+geometrically;+it+is+unclear+why 6. 
The Pitfalls of Next-Token Prediction — Gregor Bachmann, Vaishnavh Nagarajan, 2024 https://scholar.google.com/scholar?q=The+Pitfalls+of+Next-Token+Prediction 7. How Transformers Learn to Plan via Multi-Token Prediction — Jianhao Huang, Zhanpeng Zhou, Renqiu Xia, Baharan Mirzasoleiman, Weijie Su, Wei Huang, 2026 https://scholar.google.com/scholar?q=How+Transformers+Learn+to+Plan+via+Multi-Token+Prediction 8. DeepSeek-V3 Technical Report — DeepSeek-AI and collaborators, 2024 https://scholar.google.com/scholar?q=DeepSeek-V3+Technical+Report 9. Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More — Arvid Frydenlund, 2025 https://scholar.google.com/scholar?q=Language+Models,+Graph+Searching,+and+Supervision+Adulteration:+When+More+Supervision+is+Less+and+How+to+Make+More+More 10. Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries — Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, and Amir Globerson, 2024 https://scholar.google.com/scholar?q=Hopping+Too+Late:+Exploring+the+Limitations+of+Large+Language+Models+on+Multi-Hop+Queries 11. The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" — Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans, 2024 https://scholar.google.com/scholar?q=The+Reversal+Curse:+LLMs+trained+on+"A+is+B"+fail+to+learn+"B+is+A" 12. In-Context Denoising with One-Layer Transformers: Connections between Attention and Associative Memory Retrieval — Matthew Smart, Alberto Bietti, Anirvan M. Sengupta, 2025 https://scholar.google.com/scholar?q=In-Context+Denoising+with+One-Layer+Transformers:+Connections+between+Attention+and+Associative+Memory+Retrieval 13. 
In-Context Learning as Conditioned Associative Memory Retrieval — Weimin Wu, Teng-Yun Hsiao, Jerry Yao-Chieh Hu, Wenxin Zhang, Han Liu, 2025 https://scholar.google.com/scholar?q=In-Context+Learning+as+Conditioned+Associative+Memory+Retrieval 14. Position-Aware Relational Transformer for Knowledge Graph Embedding — Guangyao Li, Zequn Sun, Wei Hu, Gong Cheng, et al., 2023 https://scholar.google.com/scholar?q=Position-Aware+Relational+Transformer+for+Knowledge+Graph+Embedding 15. Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data — Rishabh Ranjan, Valter Hudovernik, Mark Znidar, Charilaos Kanatsoulis, Roshan Upendra, Mahmoud Mohammadi, Joe Meyer, Tom Palczewski, Carlos Guestrin, Jure Leskovec, 2025 https://scholar.google.com/scholar?q=Relational+Transformer:+Toward+Zero-Shot+Foundation+Models+for+Relational+Data 16. AI Post Transformers: In-Place Test-Time Training for Transformers — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3 17. AI Post Transformers: Mamba-3 for Efficient Sequence Modeling — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-16-mamba-3-for-efficient-sequence-modeling-97a22a.mp3 Interactive Visualization: Geometric Memory in Deep Sequence Models
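The associative-versus-geometric distinction the hosts debate can be made concrete with a toy example that is not from the paper: a directed ring where an "associative" memory answers a k-hop query with k sequential lookups, while a "geometric" memory embeds nodes on a circle so the same query becomes one rotation in embedding space. All names and the ring construction are hypothetical illustrations of the idea.

```python
import math

N = 12  # nodes on a directed ring: i -> (i + 1) % N

# Associative view: facts stored as a lookup table; a multi-hop
# query needs one retrieval per hop.
successor = {i: (i + 1) % N for i in range(N)}

def associative_khop(start, k):
    node = start
    for _ in range(k):              # k sequential lookups
        node = successor[node]
    return node

# Geometric view: nodes embedded on a circle so "one hop" is a
# fixed rotation; a k-hop query is a single move plus decoding.
def embed(i):
    theta = 2 * math.pi * i / N
    return (math.cos(theta), math.sin(theta))

def geometric_khop(start, k):
    x, y = embed(start)
    phi = 2 * math.pi * k / N       # rotate once by k hop-angles
    qx = x * math.cos(phi) - y * math.sin(phi)
    qy = x * math.sin(phi) + y * math.cos(phi)
    # decode: nearest node embedding to the rotated query point
    return min(range(N),
               key=lambda j: (embed(j)[0] - qx) ** 2
                           + (embed(j)[1] - qy) ** 2)
```

The interesting empirical claim is that trained sequence models look more like `geometric_khop`: the multi-hop answer is read off in one move, suggesting the traversal was amortized into the weights during training.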

  7. 3 days ago

    How Induction Heads Emerge in Transformers

    This episode explores how transformers split prediction between knowledge stored in their weights and information inferred from the current prompt, using the paper’s synthetic “bigram world” to make those mechanisms visible. It explains the distinction between global statistical knowledge and true in-context knowledge, then walks through induction heads as a concrete circuit for recalling earlier patterns and continuing them later. The discussion highlights the paper’s main finding that models learn easy dataset-wide averages first, while context-sensitive induction behavior emerges later and requires the right architecture, with two-layer transformers succeeding where one-layer models fail. Listeners would find it interesting because it turns a vague claim about in-context learning into a causal, mechanistic story about how temporary memory may actually form during training. Sources: 1. Birth of a Transformer: A Memory Viewpoint — Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, Leon Bottou, 2023 http://arxiv.org/abs/2306.00802 2. A Mathematical Framework for Transformer Circuits — Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, Chris Olah, 2021 https://scholar.google.com/scholar?q=A+Mathematical+Framework+for+Transformer+Circuits 3. 
In-context Learning and Induction Heads — Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, Chris Olah, 2022 https://scholar.google.com/scholar?q=In-context+Learning+and+Induction+Heads 4. Birth of a Transformer: A Memory Viewpoint — Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, Leon Bottou, 2023 https://scholar.google.com/scholar?q=Birth+of+a+Transformer:+A+Memory+Viewpoint 5. What Learning Algorithm Is In-Context Learning? Investigations with Linear Models — Ekin Akyurek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, Denny Zhou, 2023 https://scholar.google.com/scholar?q=What+Learning+Algorithm+Is+In-Context+Learning?+Investigations+with+Linear+Models 6. Data Distributional Properties Drive Emergent In-Context Learning in Transformers — Stephanie C. Y. Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, Jay McClelland, Felix Hill, 2022 https://scholar.google.com/scholar?q=Data+Distributional+Properties+Drive+Emergent+In-Context+Learning+in+Transformers 7. Transformer Feed-Forward Layers Are Key-Value Memories — Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy, 2021 https://scholar.google.com/scholar?q=Transformer+Feed-Forward+Layers+Are+Key-Value+Memories 8. Dissecting Recall of Factual Associations in Auto-Regressive Language Models — Mor Geva, Jasmijn Bastings, Katja Filippova, Amir Globerson, 2023 https://scholar.google.com/scholar?q=Dissecting+Recall+of+Factual+Associations+in+Auto-Regressive+Language+Models 9. What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation — Aaditya K. Singh, Ted Moskovitz, Felix Hill, Stephanie C. Y. Chan, Andrew M. 
Saxe, 2024 https://scholar.google.com/scholar?q=What+needs+to+go+right+for+an+induction+head?+A+mechanistic+study+of+in-context+learning+circuits+and+their+formation 10. Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks — Tianyu He, Darshil Doshi, Aritra Das, Andrey Gromov, 2024 https://scholar.google.com/scholar?q=Learning+to+grok:+Emergence+of+in-context+learning+and+skill+composition+in+modular+arithmetic+tasks 11. Selective Induction Heads: How Transformers Select Causal Structures In Context — Francesco D'Angelo, Francesco Croce, Nicolas Flammarion, 2025 https://scholar.google.com/scholar?q=Selective+Induction+Heads:+How+Transformers+Select+Causal+Structures+In+Context 12. Induction Head Toxicity Mechanistically Explains Repetition Curse in Large Language Models — Shuxun Wang, Qingyu Yin, Chak Tou Leong, Qiang Zhang, Linyi Yang, 2025 https://scholar.google.com/scholar?q=Induction+Head+Toxicity+Mechanistically+Explains+Repetition+Curse+in+Large+Language+Models 13. AI Post Transformers: In-Place Test-Time Training for Transformers — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3 14. AI Post Transformers: Linear Classifier Probes for Intermediate Layers — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-16-linear-classifier-probes-for-intermediat-927ae3.mp3 15. AI Post Transformers: Parallelizing DeltaNet Linear Transformers over Sequence Length — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-18-parallelizing-deltanet-linear-transforme-2d0377.mp3 16. AI Post Transformers: Gated Delta Networks for Long-Context Retrieval — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-17-gated-delta-networks-for-long-context-re-706d85.mp3 Interactive Visualization: How Induction Heads Emerge in Transformers
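The induction-head circuit discussed above has a simple functional description: find the most recent earlier occurrence of the current token and copy whatever followed it, the [A][B] ... [A] → [B] pattern. The sketch below is a behavioral caricature in plain Python, not the two-layer attention implementation (a previous-token head feeding a match-and-copy head) that the papers analyze.

```python
def induction_head_predict(tokens):
    """Predict the next token by pattern completion.

    Scan backwards for the most recent earlier occurrence of the
    final token and copy its successor -- the behavior that
    induction heads implement with two attention layers, and that
    one-layer transformers cannot express.
    """
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence: fall back to global statistics

print(induction_head_predict(list("abcab")))  # earlier "b" was followed by "c"
```

The `None` branch mirrors the paper's finding about training dynamics: dataset-wide bigram averages (the fallback) are learned first, and the context-sensitive copy mechanism emerges later.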

  8. 3 days ago

    Selective Classification with Deep Neural Networks

    This episode explores selective classification in deep neural networks: adding a post-hoc reject option so a trained model can abstain when its confidence falls below a calibrated threshold. It explains the key concepts of coverage, selective risk, and the risk-coverage tradeoff, arguing that a model should be judged not just by how often it is right, but by how often it chooses to answer. The discussion centers on the paper’s SGR method, which uses a held-out calibration set to choose a threshold that keeps selective risk below a target with high probability under an i.i.d. assumption, and compares softmax response with MC-dropout as confidence scores. Listeners would find it interesting because it gets at a practical question in AI deployment: not whether a model is always confident, but whether it can reliably know when to defer. Sources: 1. Selective Classification for Deep Neural Networks — Yonatan Geifman, Ran El-Yaniv, 2017 http://arxiv.org/abs/1705.08500 2. SelectiveNet: A Deep Neural Network with an Integrated Reject Option — Yonatan Geifman, Ran El-Yaniv, 2019 http://arxiv.org/abs/1901.09192 3. On Optimum Recognition Error and Reject Tradeoff — C. K. Chow, 1970 https://scholar.google.com/scholar?q=On+Optimum+Recognition+Error+and+Reject+Tradeoff 4. Selective Classification for Deep Neural Networks — Yonatan Geifman and Ran El-Yaniv, 2017 https://scholar.google.com/scholar?q=Selective+Classification+for+Deep+Neural+Networks 5. SelectiveNet: A Deep Neural Network with an Integrated Reject Option — Yonatan Geifman and Ran El-Yaniv, 2019 https://scholar.google.com/scholar?q=SelectiveNet:+A+Deep+Neural+Network+with+an+Integrated+Reject+Option 6. Selective Classification via One-Sided Prediction — Aditya Gangrade, Anil Kag, and Venkatesh Saligrama, 2021 https://scholar.google.com/scholar?q=Selective+Classification+via+One-Sided+Prediction 7. Classification with Reject Option — Radu Herbei and Marten H. 
Wegkamp, 2006 https://scholar.google.com/scholar?q=Classification+with+Reject+Option 8. Learning with Rejection — Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri, 2016 https://scholar.google.com/scholar?q=Learning+with+Rejection 9. Boosting with Abstention — Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri, 2016 https://scholar.google.com/scholar?q=Boosting+with+Abstention 10. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning — Yarin Gal and Zoubin Ghahramani, 2016 https://scholar.google.com/scholar?q=Dropout+as+a+Bayesian+Approximation:+Representing+Model+Uncertainty+in+Deep+Learning 11. On Calibration of Modern Neural Networks — Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger, 2017 https://scholar.google.com/scholar?q=On+Calibration+of+Modern+Neural+Networks 12. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles — Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell, 2017 https://scholar.google.com/scholar?q=Simple+and+Scalable+Predictive+Uncertainty+Estimation+using+Deep+Ensembles 13. Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift — Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek, 2019 https://scholar.google.com/scholar?q=Can+You+Trust+Your+Model's+Uncertainty?+Evaluating+Predictive+Uncertainty+Under+Dataset+Shift 14. A Novel Characterization of the Population Area Under the Risk Coverage Curve (AURC) and Rates of Finite Sample Estimators — Han Zhou, Jordy Van Landeghem, Teodora Popordanoska, and Matthew B. Blaschko, 2025 https://scholar.google.com/scholar?q=A+Novel+Characterization+of+the+Population+Area+Under+the+Risk+Coverage+Curve+(AURC)+and+Rates+of+Finite+Sample+Estimators 15. 
On the Foundations of Noise-Free Selective Classification — Ran El-Yaniv and Yair Wiener, 2010 https://scholar.google.com/scholar?q=On+the+Foundations+of+Noise-Free+Selective+Classification 16. Deep Residual Learning for Image Recognition — Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, 2016 https://scholar.google.com/scholar?q=Deep+Residual+Learning+for+Image+Recognition 17. Support Vector Machines with Embedded Reject Option — Giorgio Fumera and Fabio Roli, 2002 https://scholar.google.com/scholar?q=Support+Vector+Machines+with+Embedded+Reject+Option 18. A Deep Neural Network with an Integrated Reject Option — Yonatan Geifman and Ran El-Yaniv, 2019 https://scholar.google.com/scholar?q=A+Deep+Neural+Network+with+an+Integrated+Reject+Option 19. Augmenting the Softmax with Additional Confidence Scores for Improved Selective Classification with Out-of-Distribution Data — Guoxuan Xia, Christos-Savvas Bouganis, 2024 https://scholar.google.com/scholar?q=Augmenting+the+Softmax+with+Additional+Confidence+Scores+for+Improved+Selective+Classification+with+Out-of-Distribution+Data 20. GPify: Leveraging the Combined Strength of Normalizing Flow and Softmax For an Out-of-Distribution aware Confidence Score — Simon Kristoffersson Lind, Rudolph Triebel, Volker Kruger, 2026 https://scholar.google.com/scholar?q=GPify:+Leveraging+the+Combined+Strength+of+Normalizing+Flow+and+Softmax+For+an+Out-of-Distribution+aware+Confidence+Score 21. LogitAC: Logit Amplitude Constraints for Confidence Calibration and Out-of-Distribution Detection — Zongjing Cao, Yan Li, Byeong Seok Shin, 2024 https://scholar.google.com/scholar?q=LogitAC:+Logit+Amplitude+Constraints+for+Confidence+Calibration+and+Out-of-Distribution+Detection 22. Not all distributional shifts are equal: Fine-grained robust conformal inference — Jiahao Ai, Zhimei Ren, 2024 https://scholar.google.com/scholar?q=Not+all+distributional+shifts+are+equal:+Fine-grained+robust+conformal+inference 23. 
Wasserstein-regularized Conformal Prediction under General Distribution Shift — Rui Xu, Chao Chen, Yue Sun, Parvathinathan Venkitasubramaniam, Sihong Xie, 2025 https://scholar.google.com/scholar?q=Wasserstein-regularized+Conformal+Prediction+under+General+Distribution+Shift Interactive Visualization: Selective Classification with Deep Neural Networks
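The calibration idea in this episode can be sketched directly. This is a simplified plug-in version of threshold selection, assuming softmax confidence as the score: the paper's SGR additionally inflates the empirical risk with a binomial tail bound (solved numerically) so the risk guarantee holds with high probability rather than just on the calibration sample. Function names are hypothetical.

```python
def calibrate_threshold(confidences, correct, target_risk):
    """Pick the lowest confidence threshold whose empirical
    selective risk (error rate among accepted examples) on a
    held-out calibration set stays at or below target_risk.

    Lowering the threshold raises coverage but can raise risk:
    this sweep walks the risk-coverage curve from most to least
    confident and keeps the most permissive feasible point.
    """
    order = sorted(range(len(confidences)),
                   key=lambda i: -confidences[i])
    best = float("inf")        # reject everything if nothing works
    errors = 0
    for accepted, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        if errors / accepted <= target_risk:
            best = confidences[i]  # accept everything this confident
    return best

def selective_predict(conf, pred, threshold):
    """Answer only when confidence clears the calibrated threshold."""
    return pred if conf >= threshold else None  # None = abstain
```

On a calibration set with confidences `[0.9, 0.8, 0.7, 0.6]` and correctness `[True, True, False, True]`, a 25% risk target admits everything (threshold 0.6), while a 10% target must stop before the error, yielding threshold 0.8 and lower coverage.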

About

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.
