AI Post Transformers

mcgrof

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

  1.

    TokenDance for Multi-Agent KV Cache Sharing

    This episode explores TokenDance, a systems approach for serving many LLM-based agents more efficiently by collectively sharing transformer KV caches across synchronized conversation rounds. It explains why multi-agent workloads are fundamentally different from ordinary chat serving: agents persist across rounds, accumulate large KV caches, and often follow an “all-gather” pattern where each agent receives a mostly shared prompt plus its own private history, making standard prefix-based cache reuse ineffective. The discussion argues that the key innovation is shifting cache reuse from individual requests to the entire round of agents as a collective object, enabling memory savings and better scalability on the same GPU. Listeners interested in agent systems, inference infrastructure, and practical bottlenecks beyond model architecture will find it compelling for its concrete diagnosis of memory management as the real constraint. Sources: 1. TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing — Zhuohang Bian, Feiyang Wu, Chengrui Zhang, Hangcheng Dong, Yun Liang, Youwei Zhuo, 2026 http://arxiv.org/abs/2604.03143 2. TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing — Zhuohang Bian, Feiyang Wu, Chengrui Zhang, Hangcheng Dong, Yun Liang, Youwei Zhuo, 2026 https://scholar.google.com/scholar?q=TokenDance:+Scaling+Multi-Agent+LLM+Serving+via+Collective+KV+Cache+Sharing 3. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Hao Zhang, et al., 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 4. SGLang: Efficient Execution of Structured Language Model Programs — Lianmin Zheng, Weizhe Chen, Ying Sheng, Tianqi Chen, Ion Stoica, and collaborators, 2024 https://scholar.google.com/scholar?q=SGLang:+Efficient+Execution+of+Structured+Language+Model+Programs 5. 
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable — Xiangyao Yu and collaborators, 2024 https://scholar.google.com/scholar?q=Parrot:+Efficient+Serving+of+LLM-based+Applications+with+Semantic+Variable 6. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, et al., 2023 https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention 7. SGLang — SGLang team / related authors as cited in the paper, 2024 https://scholar.google.com/scholar?q=SGLang 8. Parrot — Authors as cited in the paper, 2024 https://scholar.google.com/scholar?q=Parrot 9. Autellix — Authors as cited in the paper, 2024 https://scholar.google.com/scholar?q=Autellix 10. Tokencake — Authors as cited in the paper, 2024 https://scholar.google.com/scholar?q=Tokencake 11. Generative Agents: Interactive Simulacra of Human Behavior — Joon Sung Park, Joseph O'Brien, Carrie Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein, 2023 https://scholar.google.com/scholar?q=Generative+Agents:+Interactive+Simulacra+of+Human+Behavior 12. Position-independent KV-cache reuse papers cited as [10, 34-36] — Authors as cited in the paper, 2024-2026 https://scholar.google.com/scholar?q=Position-independent+KV-cache+reuse+papers+cited+as+[10,+34-36] 13. OpenClaw — Authors as cited in the paper, 2024 https://scholar.google.com/scholar?q=OpenClaw 14. MoltBook — Authors as cited in the paper, 2024 https://scholar.google.com/scholar?q=MoltBook 15. DynTaskMAS: A Dynamic Task Graph-Driven Framework for Asynchronous and Parallel LLM-Based Multi-Agent Systems — approx. recent multi-agent systems authors, 2024/2025 https://scholar.google.com/scholar?q=DynTaskMAS:+A+Dynamic+Task+Graph-Driven+Framework+for+Asynchronous+and+Parallel+LLM-Based+Multi-Agent+Systems 16. Kairos: Low-Latency Multi-Agent Serving with Shared LLMs and Excessive Loads in the Public Cloud — approx. 
recent systems authors, 2024/2025 https://scholar.google.com/scholar?q=Kairos:+Low-Latency+Multi-Agent+Serving+with+Shared+LLMs+and+Excessive+Loads+in+the+Public+Cloud 17. CacheSlide: Unlocking Cross Position-Aware KV Cache Reuse for Accelerating LLM Serving — approx. recent LLM serving authors, 2024/2025 https://scholar.google.com/scholar?q=CacheSlide:+Unlocking+Cross+Position-Aware+KV+Cache+Reuse+for+Accelerating+LLM+Serving 18. Where Matters More Than What: Decoding-Aligned KV Cache Compression via Position-Aware Pseudo Queries — approx. recent KV compression authors, 2024/2025 https://scholar.google.com/scholar?q=Where+Matters+More+Than+What:+Decoding-Aligned+KV+Cache+Compression+via+Position-Aware+Pseudo+Queries 19. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — approx. recent KV reuse authors, 2024/2025 https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse 20. HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse — approx. recent RAG authors, 2024/2025 https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse 21. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation — approx. recent RAG/KV authors, 2024/2025 https://scholar.google.com/scholar?q=ProphetKV:+User-Query-Driven+Selective+Recomputation+for+Efficient+KV+Cache+Reuse+in+Retrieval-Augmented+Generation 22. Eigen Attention: Attention in Low-Rank Space for KV Cache Compression — approx. recent KV compression authors, 2024/2025 https://scholar.google.com/scholar?q=Eigen+Attention:+Attention+in+Low-Rank+Space+for+KV+Cache+Compression 23. PALU: KV-Cache Compression with Low-Rank Projection — approx. 
recent systems/ML authors, 2024/2025 https://scholar.google.com/scholar?q=PALU:+KV-Cache+Compression+with+Low-Rank+Projection 24. LORC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy — approx. recent KV compression authors, 2024/2025 https://scholar.google.com/scholar?q=LORC:+Low-Rank+Compression+for+LLMs+KV+Cache+with+a+Progressive+Compression+Strategy 25. AI Post Transformers: CacheSlide: Position-Aware KV Cache Reuse for Agent LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-cacheslide-position-aware-kv-cache-reuse-cd59c7.mp3 26. AI Post Transformers: ContiguousKV for Faster LLM Prefill KV Reuse — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-20-contiguouskv-for-faster-llm-prefill-kv-r-59f545.mp3 27. AI Post Transformers: KV Cache TTL for Multi-Turn Agent Scheduling — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-kv-cache-ttl-for-multi-turn-agent-schedu-996bf1.mp3 28. AI Post Transformers: Continuous Batching for LLM Inference: Throughput and Latency Gains — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/continuous-batching-for-llm-inference-throughput-and-latency-gains/ 29. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3 30. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3 31. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3 32. 
AI Post Transformers: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3 Interactive Visualization: TokenDance for Multi-Agent KV Cache Sharing
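
The memory argument in the episode summary can be sketched with back-of-envelope arithmetic: per-request KV caches versus storing the shared round context once for all agents. This is a toy accounting sketch, not TokenDance's implementation; the model shape (32 layers, 32 heads, head_dim 128, fp16) and the workload sizes are invented, roughly Llama-7B-scale assumptions.

```python
def kv_bytes(tokens, layers=32, heads=32, head_dim=128, dtype_bytes=2):
    # Each token stores a K and a V vector of heads*head_dim per layer.
    return tokens * layers * 2 * heads * head_dim * dtype_bytes

num_agents = 8
shared_tokens, private_tokens = 4000, 500   # invented workload numbers

# Baseline: every agent materializes shared-context + private-history KV itself.
per_request = num_agents * kv_bytes(shared_tokens + private_tokens)

# Collective: the shared round context is stored once, private histories per agent.
collective = kv_bytes(shared_tokens) + num_agents * kv_bytes(private_tokens)

print(f"per-request: {per_request / 2**30:.1f} GiB")
print(f"collective:  {collective / 2**30:.1f} GiB")
```

Under these assumed numbers the collective layout is 4.5x smaller; the ratio grows with agent count, since the shared block is paid once instead of per agent.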

  2.

    Benchmarking Test-Time Scaling for General LLM Agents

    This episode explores a paper that tests whether general LLM agents remain effective when search, coding, reasoning, and API/tool-use tasks are mixed together under one shared prompt, interface, and tool set rather than optimized benchmark-specific setups. It explains how the benchmark is built by unifying tasks from BrowseComp, WebVoyager, SWE-Bench Verified, Terminal-Bench, MathHay, Tau2-Bench, and MCP-Bench, forcing agents to infer the task type and select tools without domain-specific cues. The discussion highlights the paper’s core argument that conventional benchmarks can overstate capability by pre-structuring the environment, while a general setting better reflects real user requests and exposes weaknesses in planning, tool choice, and adaptation. Listeners would find it interesting for its clear look at test-time scaling in agents—giving the same model more turns or parallel attempts—and for its broader challenge to how agent intelligence should be evaluated. Sources: 1. Benchmark Test-Time Scaling of General LLM Agents — Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang, Hao Kang, Shuai Shao, Rong Jin, Chenyan Xiong, 2026 http://arxiv.org/abs/2602.18998 2. SWE-Bench — Jimenez et al., 2023 https://scholar.google.com/scholar?q=SWE-Bench 3. Terminal-Bench — Aleithan et al., 2024 https://scholar.google.com/scholar?q=Terminal-Bench 4. BrowseComp — presumably cited in paper; exact citation not provided in excerpt, 2024/2025 https://scholar.google.com/scholar?q=BrowseComp 5. Mind2Web — Deng/He et al. or benchmark authors cited as Wei et al. 2025 / He et al. 2024 in excerpt context, 2024/2025 https://scholar.google.com/scholar?q=Mind2Web 6. WebVoyager — Zhou et al., 2023 https://scholar.google.com/scholar?q=WebVoyager 7. Tau2-Bench — not specified in excerpt, likely 2025/2026 https://scholar.google.com/scholar?q=Tau2-Bench 8. MCP-Bench — not specified in excerpt, likely 2025/2026 https://scholar.google.com/scholar?q=MCP-Bench 9. 
Self-Consistency Improves Chain of Thought Reasoning in Language Models — Wang et al., 2022 https://scholar.google.com/scholar?q=Self-Consistency+Improves+Chain+of+Thought+Reasoning+in+Language+Models 10. Training Verifiers to Solve Math Word Problems — Cobbe et al., 2021 https://scholar.google.com/scholar?q=Training+Verifiers+to+Solve+Math+Word+Problems 11. Let's Verify Step by Step — Lightman et al., 2023 https://scholar.google.com/scholar?q=Let's+Verify+Step+by+Step 12. Quiet-STaR / test-time reasoning scaling related work — Zelikman et al., 2024 https://scholar.google.com/scholar?q=Quiet-STaR+/+test-time+reasoning+scaling+related+work 13. Snell et al. test-time scaling work — Snell et al., 2024 https://scholar.google.com/scholar?q=Snell+et+al.+test-time+scaling+work 14. Toolformer — Schick et al., 2023 https://scholar.google.com/scholar?q=Toolformer 15. Gorilla / APIBench-style tool-use work — Patil et al., 2024 https://scholar.google.com/scholar?q=Gorilla+/+APIBench-style+tool-use+work 16. Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents — approx. unknown from snippet, 2025/2026 https://scholar.google.com/scholar?q=Beyond+the+Context+Window:+A+Cost-Performance+Analysis+of+Fact-Based+Memory+vs.+Long-Context+LLMs+for+Persistent+Agents 17. Memory in the Age of AI Agents — approx. unknown from snippet, 2025/2026 https://scholar.google.com/scholar?q=Memory+in+the+Age+of+AI+Agents 18. Toward Conversational Agents with Context and Time Sensitive Long-Term Memory — approx. unknown from snippet, 2025/2026 https://scholar.google.com/scholar?q=Toward+Conversational+Agents+with+Context+and+Time+Sensitive+Long-Term+Memory 19. When LLM Judge Scores Look Good but Best-of-N Decisions Fail — approx. unknown from snippet, 2025/2026 https://scholar.google.com/scholar?q=When+LLM+Judge+Scores+Look+Good+but+Best-of-N+Decisions+Fail 20. 
When to Solve, When to Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning — approx. unknown from snippet, 2025/2026 https://scholar.google.com/scholar?q=When+to+Solve,+When+to+Verify:+Compute-Optimal+Problem+Solving+and+Generative+Verification+for+LLM+Reasoning 21. Scalable Best-of-N Selection for Large Language Models via Self-Certainty — approx. unknown from snippet, 2025/2026 https://scholar.google.com/scholar?q=Scalable+Best-of-N+Selection+for+Large+Language+Models+via+Self-Certainty 22. AgentClinic: A Multimodal Agent Benchmark to Evaluate AI in Simulated Clinical Environments — approx. unknown from snippet, 2025/2026 https://scholar.google.com/scholar?q=AgentClinic:+A+Multimodal+Agent+Benchmark+to+Evaluate+AI+in+Simulated+Clinical+Environments 23. DABStep: Data Agent Benchmark for Multi-Step Reasoning — approx. unknown from snippet, 2025/2026 https://scholar.google.com/scholar?q=DABStep:+Data+Agent+Benchmark+for+Multi-Step+Reasoning 24. GTA1: GUI Test-Time Scaling Agent — approx. unknown from snippet, 2025/2026 https://scholar.google.com/scholar?q=GTA1:+GUI+Test-Time+Scaling+Agent 25. AI Post Transformers: SkillsBench for Evaluating Agent Skills — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-14-skillsbench-for-evaluating-agent-skills-58bb1e.mp3 26. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3 27. AI Post Transformers: Memory Sparse Attention for 100M-Token Scaling — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-07-memory-sparse-attention-for-100m-token-s-377cff.mp3 28. AI Post Transformers: IMO-Bench for Robust Mathematical Reasoning — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-imo-bench-for-robust-mathematical-reason-143489.mp3 29. AI Post Transformers: Simple Self-Distillation for Better Code Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-02-simple-self-distillation-for-better-code-cc88e0.mp3 Interactive Visualization: Benchmarking Test-Time Scaling for General LLM Agents
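
The "parallel attempts" axis of test-time scaling discussed in this episode can be sketched as best-of-N with majority voting over final answers (self-consistency-style selection, as in the Wang et al. citation above). This is a control-flow sketch only: `run_agent` is a hypothetical deterministic stub standing in for a full tool-using rollout, and its outputs are invented.

```python
from collections import Counter

def run_agent(task, seed):
    # Stub: a real rollout would browse, write code, or call tools here.
    return ["42", "42", "41"][seed % 3]

def majority_vote(answers):
    # Keep the most common final answer across N independent attempts.
    return Counter(answers).most_common(1)[0][0]

attempts = [run_agent("toy task", s) for s in range(5)]
print(majority_vote(attempts))  # "42": 4 of 5 attempts agree
```

The same skeleton covers the "more turns" axis by letting each attempt run longer rather than sampling more of them; the benchmark's question is how accuracy moves as N grows.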

  3.

    CacheBlend for Fast RAG Serving

    This episode explores a systems paper on speeding up retrieval-augmented generation by reusing KV caches for frequently repeated retrieved documents, even when those documents are not exact prompt prefixes. It explains why long RAG prompts make prefill the main latency bottleneck, why standard prefix caching only helps in narrow cases, and why naive non-prefix cache reuse can hurt quality by ignoring cross-chunk attention between the query and retrieved passages. The discussion centers on CacheBlend’s core argument: selectively recomputing only the parts of a reused chunk that need updated context could preserve answer quality while significantly improving time-to-first-token. Listeners would find it interesting for its practical focus on the tradeoff between real-world serving speed and faithful multi-document reasoning, rather than on new model architectures. Sources: 1. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion — Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang, 2024 http://arxiv.org/abs/2405.16444 2. Prompt Cache: Modular Attention Reuse for Low-Latency Inference — Yao Fu, et al., 2024 https://scholar.google.com/scholar?q=Prompt+Cache:+Modular+Attention+Reuse+for+Low-Latency+Inference 3. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving — Junxian He, et al., 2024 https://scholar.google.com/scholar?q=CacheGen:+KV+Cache+Compression+and+Streaming+for+Fast+Large+Language+Model+Serving 4. RadixAttention for Efficient KV Cache Sharing in LLM Serving — LMSYS / SGLang authors, 2024 https://scholar.google.com/scholar?q=RadixAttention+for+Efficient+KV+Cache+Sharing+in+LLM+Serving 5. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention — Woosuk Kwon, et al., 2023 https://scholar.google.com/scholar?q=vLLM:+Easy,+Fast,+and+Cheap+LLM+Serving+with+PagedAttention 6. 
Memorizing Transformers — Angeliki Lazaridou, et al., 2022 https://scholar.google.com/scholar?q=Memorizing+Transformers 7. FlashAttention — Tri Dao, et al., 2022 https://scholar.google.com/scholar?q=FlashAttention 8. A Survey on Retrieval-Augmented Text Generation — Zhiheng Gao, et al., 2024 https://scholar.google.com/scholar?q=A+Survey+on+Retrieval-Augmented+Text+Generation 9. Kvlink: Accelerating Large Language Models via Efficient KV Cache Reuse — approx. recent systems/LLM serving authors, 2024/2025 https://scholar.google.com/scholar?q=Kvlink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse 10. An Experimental Study of KV Cache Reuse Strategies in Chunk-Level Caching Systems — approx. recent systems authors, 2024/2025 https://scholar.google.com/scholar?q=An+Experimental+Study+of+KV+Cache+Reuse+Strategies+in+Chunk-Level+Caching+Systems 11. Efficient Streaming Language Models with Attention Sinks — Xiao et al. / approximate, 2024 https://scholar.google.com/scholar?q=Efficient+Streaming+Language+Models+with+Attention+Sinks 12. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation — approx. survey authors, 2024/2025 https://scholar.google.com/scholar?q=Attention+Sink+in+Transformers:+A+Survey+on+Utilization,+Interpretation,+and+Mitigation 13. Long Context vs. RAG for LLMs: An Evaluation and Revisits — approx. recent RAG evaluation authors, 2024 https://scholar.google.com/scholar?q=Long+Context+vs.+RAG+for+LLMs:+An+Evaluation+and+Revisits 14. Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG — approx. recent RAG authors, 2024 https://scholar.google.com/scholar?q=Long-Context+LLMs+Meet+RAG:+Overcoming+Challenges+for+Long+Inputs+in+RAG 15. KV Cache Offloading for Context-Intensive Tasks — approx. recent systems authors, 2024/2025 https://scholar.google.com/scholar?q=KV+Cache+Offloading+for+Context-Intensive+Tasks 16. 
KVSwap: Disk-Aware KV Cache Offloading for Long-Context On-Device Inference — approx. recent systems authors, 2024/2025 https://scholar.google.com/scholar?q=KVSwap:+Disk-Aware+KV+Cache+Offloading+for+Long-Context+On-Device+Inference 17. AI Post Transformers: Episode: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-22-from-prefix-cache-to-fusion-rag-9c5d39.mp3 18. AI Post Transformers: CacheSlide: Position-Aware KV Cache Reuse for Agent LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-cacheslide-position-aware-kv-cache-reuse-cd59c7.mp3 19. AI Post Transformers: Prefill-as-a-Service for Cross-Datacenter KV Cache — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-19-prefill-as-a-service-for-cross-datacente-7560be.mp3 20. AI Post Transformers: KVSwap for Disk-Aware Long-Context On-Device Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-16-kvswap-for-disk-aware-long-context-on-de-f3c15e.mp3 21. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3 22. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3 Interactive Visualization: CacheBlend for Fast RAG Serving
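
The selective-recomputation idea described in the episode can be sketched as follows. This is not CacheBlend's actual implementation: the sketch compares each token's cached KV against a fresh full-context KV at one early layer and marks only the highest-deviation tokens for recomputation; the vectors, the single-layer comparison, and the 15% ratio are illustrative assumptions.

```python
import math

def select_recompute(kv_cached, kv_fresh, ratio=0.15):
    """Return indices of the tokens whose KV deviates most from the
    full-context prefill; only these get recomputed, the rest are reused."""
    dev = [math.dist(c, f) for c, f in zip(kv_cached, kv_fresh)]  # L2 per token
    k = max(1, int(ratio * len(dev)))
    order = sorted(range(len(dev)), key=lambda i: dev[i], reverse=True)
    return sorted(order[:k])

cached = [[0.0, 0.0]] * 10
fresh = [[0.0, 0.0]] * 10
fresh[7] = [3.0, 4.0]                   # token 7 sees very different cross-chunk context
print(select_recompute(cached, fresh))  # [7]
```

The speed/quality tradeoff lives in `ratio`: recompute nothing and you ignore cross-chunk attention; recompute everything and you are back to full prefill.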

  4.

    Directly Trained Spiking DQNs for Atari

    This episode explores a 2021 paper on Deep Spiking Q-Networks, asking whether a directly trained spiking version of DQN can compete with earlier conversion-based spiking reinforcement learning methods on Atari while retaining the energy-efficiency promise of spiking neural networks. It explains the technical foundations behind spiking networks, including leaky integrate-and-fire neurons, surrogate-gradient training, and why SNNs remain difficult to train and awkward on conventional GPU hardware despite their appeal for neuromorphic chips like TrueNorth and Loihi. The discussion also situates the paper against the legacy of the original DeepMind DQN work, arguing that the paper’s title deliberately invites scrutiny over whether it truly matches the breadth and ambition of the classic Atari benchmark. Listeners would find it interesting for its clear framing of both the hype and the hard practical questions around neuromorphic AI: not just whether spiking RL works, but where, on what hardware, and under what conditions its efficiency claims actually matter. Sources: 1. Human-Level Control through Directly-Trained Deep Spiking Q-Networks — Guisong Liu, Wenjie Deng, Xiurui Xie, Li Huang, Huajin Tang, 2021 http://arxiv.org/abs/2201.07211 2. Spiking Neural Networks for Machine Learning: An Overview — Wolfgang Maass and others; overview literature includes major contributors such as Thomas Pfeil, Emre Neftci, and Surya Ganguli across the field, Recent overview genre, especially 2023 https://scholar.google.com/scholar?q=Spiking+Neural+Networks+for+Machine+Learning:+An+Overview 3. Training Spiking Neural Networks Using Lessons From Deep Learning — Guillaume Bellec, Darjan Salaj, Anand Subramoney, Robert Legenstein, Wolfgang Maass, 2018 https://scholar.google.com/scholar?q=Training+Spiking+Neural+Networks+Using+Lessons+From+Deep+Learning 4.
Spiking Neural Networks in the Fourth Generation of Artificial Intelligence — Zhaofei Yu, Hanle Zheng, Yujie Wu, and others, 2023 https://scholar.google.com/scholar?q=Spiking+Neural+Networks+in+the+Fourth+Generation+of+Artificial+Intelligence 5. The Remarkable Robustness of Surrogate Gradient Learning for Instilling Complex Function in Spiking Neural Networks — Friedemann Zenke, Tim Vogels, 2021 https://scholar.google.com/scholar?q=The+Remarkable+Robustness+of+Surrogate+Gradient+Learning+for+Instilling+Complex+Function+in+Spiking+Neural+Networks 6. Human-level control through deep reinforcement learning — Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei Rusu, Joel Veness, Marc Bellemare, Alex Graves, Martin Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, Demis Hassabis, 2015 https://scholar.google.com/scholar?q=Human-level+control+through+deep+reinforcement+learning 7. Asynchronous Methods for Deep Reinforcement Learning — Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy Lillicrap, David Silver, Koray Kavukcuoglu, 2016 https://scholar.google.com/scholar?q=Asynchronous+Methods+for+Deep+Reinforcement+Learning 8. Deep Reinforcement Learning: An Overview — Yuxi Li, 2017 https://scholar.google.com/scholar?q=Deep+Reinforcement+Learning:+An+Overview 9. Reinforcement Learning: An Introduction — Richard S. Sutton, Andrew G. Barto, 1998; 2nd edition 2018 https://scholar.google.com/scholar?q=Reinforcement+Learning:+An+Introduction 10. Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-Based Optimization to Spiking Neural Networks — Emre O. Neftci, Hesham Mostafa, Friedemann Zenke, 2019 https://scholar.google.com/scholar?q=Surrogate+Gradient+Learning+in+Spiking+Neural+Networks:+Bringing+the+Power+of+Gradient-Based+Optimization+to+Spiking+Neural+Networks 11. 
Direct Training for Spiking Neural Networks: Faster, Larger, Better — Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, Luping Shi, 2019 https://scholar.google.com/scholar?q=Direct+Training+for+Spiking+Neural+Networks:+Faster,+Larger,+Better 12. Going Deeper With Directly-Trained Larger Spiking Neural Networks — Chaoteng Duan, Shikuang Deng, Xingting Wang, Meng Zhang, and others, 2022 https://scholar.google.com/scholar?q=Going+Deeper+With+Directly-Trained+Larger+Spiking+Neural+Networks 13. Threshold-Dependent Batch Normalization for Training Deep Spiking Neural Networks — Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, Luping Shi, 2021 https://scholar.google.com/scholar?q=Threshold-Dependent+Batch+Normalization+for+Training+Deep+Spiking+Neural+Networks 14. A million spiking-neuron integrated circuit with a scalable communication network and interface — Paul A. Merolla, John V. Arthur, Rodrigo Alvarez-Icaza, Andrew S. Cassidy, Jun Sawada, Filipp Akopyan, Bryan L. Jackson, Nabil Imam, Chen Guo, Yutaka Nakamura, Bernard Brezzo, Ivan Vo, Steven Esser, Rathinakumar Appuswamy, Brian Taba, Arnon Amir, Myron Flickner, William Risk, Rajit Manohar, Dharmendra Modha, 2014 https://scholar.google.com/scholar?q=A+million+spiking-neuron+integrated+circuit+with+a+scalable+communication+network+and+interface 15. Loihi: A Neuromorphic Manycore Processor with On-Chip Learning — Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta Jain, et al., 2018 https://scholar.google.com/scholar?q=Loihi:+A+Neuromorphic+Manycore+Processor+with+On-Chip+Learning 16. SpiNNaker: A 1-W 18-Core System-on-Chip for Massively-Parallel Neural Network Simulation — Steve B. Furber, Francesco Galluppi, Steve Temple, Luis A. Plana, 2014 https://scholar.google.com/scholar?q=SpiNNaker:+A+1-W+18-Core+System-on-Chip+for+Massively-Parallel+Neural+Network+Simulation 17. Benchmarking Neuromorphic Systems with Nengo — Terry C. 
Stewart, Dan Rasmussen, Xuan Choo, Aaron Voelker, and others, 2015-2017 era benchmarking work https://scholar.google.com/scholar?q=Benchmarking+Neuromorphic+Systems+with+Nengo 18. Playing Atari with Deep Reinforcement Learning — Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller, 2013 https://scholar.google.com/scholar?q=Playing+Atari+with+Deep+Reinforcement+Learning 19. Deep Reinforcement Learning with Double Q-learning — Hado van Hasselt, Arthur Guez, David Silver, 2016 https://scholar.google.com/scholar?q=Deep+Reinforcement+Learning+with+Double+Q-learning 20. Rainbow: Combining Improvements in Deep Reinforcement Learning — Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, David Silver, 2018 https://scholar.google.com/scholar?q=Rainbow:+Combining+Improvements+in+Deep+Reinforcement+Learning 21. Enabling Deep Spiking Neural Networks for Reinforcement Learning — Nitin Rathi, Gopalakrishnan Srinivasan, Priyadarshini Panda, Kaushik Roy, 2020 https://scholar.google.com/scholar?q=Enabling+Deep+Spiking+Neural+Networks+for+Reinforcement+Learning 22. Going Deeper in Spiking Neural Networks: VGG and Residual Architectures — Nitin Rathi, Gopalakrishnan Srinivasan, Priyadarshini Panda, Kaushik Roy, 2021 https://scholar.google.com/scholar?q=Going+Deeper+in+Spiking+Neural+Networks:+VGG+and+Residual+Architectures 23. Incorporating Learnable Membrane Time Constant to Enhance Learning of Spiking Neural Networks — Yuhang Fang, Zhaofei Yu, Tielin Zhang, et al., 2021 https://scholar.google.com/scholar?q=Incorporating+Learnable+Membrane+Time+Constant+to+Enhance+Learning+of+Spiking+Neural+Networks 24. Deep Residual Learning in Spiking Neural Networks — Yujie Wu, Yuhang Zhao, et al., 2021 https://scholar.google.com/scholar?q=Deep+Residual+Learning+in+Spiking+Neural+Networks 25. 
A Unified Optimization Framework of ANN-SNN Conversion: Towards Optimal Mapping from Activation Values to Firing Rates — approx. recent ANN-to-SNN conversion literature, 2023-2024 https://scholar.google.com/scholar?q=A+Unified+Optimization+Framework+of+ANN-SNN+Conversion:+Towards+Optimal+Mapping+from+Activation+Values+to+Firing+Rates 26. Towards High-Performance Spiking Transformers from ANN to SNN Conversion — approx. recent conversion/transformer authors, 2024 https://scholar.google.com/scholar?q=Towards+High-Performance+Spiking+Transformers+from+ANN+to+SNN+Conversion 27. Towards Training-Free and Accurate ANN-to-SNN Conversion via Activation-Aware Redistribution — approx. recent ANN-to-SNN conversion authors, 2024 https://scholar.google.com/scholar?q=Towards+Training-Free+and+Accurate+ANN-to-SNN+Conversion+via+Activation-Aware+Redistribution 28. Adaptive Surrogate Gradients for Sequential Reinforcement Learning in Spiking Neural Networks — approx. recent SNN RL authors, 2024-2025 https://scholar.google.com/scholar?q=Adaptive+Surrogate+Gradients+for+Sequential+Reinforcement+Learning+in+Spiking+Neural+Networks 29. Elucidating the Theoretical Underpinnings of Surrogate Gradient Learning in Spiking Neural Networks — approx. recent theoretical SNN authors, 2023-2024 https://scholar.google.com/scholar?q=Elucidating+the+Theoretical+Underpinnings+of+Surrogate+Gradient+Learning+in+Spiking+Neural+Networks 30. Spiking Reinforcement Learning Enhanced by Bioinspired Event Source of Multi-Dendrite Spiking Neuron and Dynamic Thresholds — approx. recent spiking RL authors, 2024-2025 https://scholar.google.com/scholar?q=Spiking+Reinforcement+Learning+Enhanced+by+Bioinspired+Event+Source+of+Multi-Dendrite+Spiking+Neuron+and+Dynamic+Thresholds 31. S2Act: Simple Spiking Actor — approx. recent spiking actor-critic authors, 2024-2025 https://scholar.google.com/scholar?q=S2Act:+Simple+Spiking+Actor 32. AI Post Transformers: Zero-Shot Context Gen
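
The leaky integrate-and-fire dynamics and surrogate-gradient trick discussed in this episode can be sketched in a few lines. The constants (decay `tau`, threshold, surrogate slope `alpha`) are illustrative defaults, not the paper's settings, and the surrogate here is a fast-sigmoid-style shape in the spirit of the Neftci et al. citation above.

```python
def lif_step(v, x, tau=2.0, v_th=1.0):
    """One LIF step: decay the membrane potential, integrate the input,
    emit a spike at threshold, then soft-reset by subtracting the threshold."""
    v = v / tau + x
    spike = 1.0 if v >= v_th else 0.0
    v -= spike * v_th
    return v, spike

def surrogate_grad(v, v_th=1.0, alpha=2.0):
    """Backward-pass stand-in for d(spike)/dv: the spike itself has zero
    gradient almost everywhere, so training uses this smooth bump instead."""
    return alpha / (2.0 * (1.0 + alpha * abs(v - v_th)) ** 2)

v, spikes = 0.0, []
for x in [0.6, 0.6, 0.6, 0.0]:       # constant drive, then silence
    v, s = lif_step(v, x)
    spikes.append(s)
print(spikes)                         # the neuron fires once the leaky sum crosses 1.0
```

Direct training backpropagates through `lif_step` over time using `surrogate_grad` in place of the spike's true derivative, which is exactly what conversion-based methods avoid by training a conventional ANN first.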

  5.

    Distilling Multi-Agent Reasoning into a Single LLM

    This episode explores a 2026 paper on AgentArk, which asks whether the reasoning gains of multi-agent LLM systems can be compressed into a single model, reducing the latency, token cost, and orchestration burden of running a “committee” of models at inference time. It explains multi-agent systems as setups where multiple model instances debate, critique, and revise one another, arguing that their real advantage comes less from the visible agent structure and more from iterative conflict-and-refinement dynamics that expose errors and improve reasoning. The discussion also breaks down the paper’s distillation framework—from outcome-based supervision to trajectory-based augmentation and process-aware distillation with process reward models that score intermediate reasoning steps, not just final answers. Listeners would find it interesting because it connects a major practical AI deployment problem—how to keep reasoning quality without paying for expensive test-time compute—to a concrete research attempt to internalize deliberation into one cheaper model. Sources: 1. AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent — Yinyi Luo, Yiqiao Jin, Weichen Yu, Mengqi Zhang, Srijan Kumar, Xiaoxiao Li, Weijie Xu, Xin Chen, Jindong Wang, 2026 http://arxiv.org/abs/2602.03955 2. Training Language Models to Self-Correct via Reinforcement Learning — Chen et al., 2025 https://scholar.google.com/scholar?q=Training+Language+Models+to+Self-Correct+via+Reinforcement+Learning 3. Debate Helps or Not? The Impact of Multi-Agent Structure Perturbation on LLM Reasoning — Kim et al., 2025 https://scholar.google.com/scholar?q=Debate+Helps+or+Not?+The+Impact+of+Multi-Agent+Structure+Perturbation+on+LLM+Reasoning 4. Systematic Study of Orchestration Strategies for Multi-Agent LLM Reasoning — Ke et al., 2026 https://scholar.google.com/scholar?q=Systematic+Study+of+Orchestration+Strategies+for+Multi-Agent+LLM+Reasoning 5. 
Improving Multi-Agent Debate with Critique and Revision for LLM Reasoning — Lan et al., 2024 https://scholar.google.com/scholar?q=Improving+Multi-Agent+Debate+with+Critique+and+Revision+for+LLM+Reasoning 6. Multi-Agent Consensus Reasoning with Large Language Models — Chen et al., 2024 https://scholar.google.com/scholar?q=Multi-Agent+Consensus+Reasoning+with+Large+Language+Models 7. MAD: Multi-Agent Debate with Large Language Models — Du et al., 2023 https://scholar.google.com/scholar?q=MAD:+Multi-Agent+Debate+with+Large+Language+Models 8. Reflexion: Language Agents with Verbal Reinforcement Learning — Shinn et al., 2023 https://scholar.google.com/scholar?q=Reflexion:+Language+Agents+with+Verbal+Reinforcement+Learning 9. STaR: Self-Taught Reasoner Bootstrapping Reasoning with Reasoning — Zelikman et al., 2022 https://scholar.google.com/scholar?q=STaR:+Self-Taught+Reasoner+Bootstrapping+Reasoning+with+Reasoning 10. Revisiting Multi-Agent Debate as Test-Time Scaling: When Does Multi-Agent Help? — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=Revisiting+Multi-Agent+Debate+as+Test-Time+Scaling:+When+Does+Multi-Agent+Help? 11. Revisiting multi-agent debate as test-time scaling: A systematic study of conditional effectiveness — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=Revisiting+multi-agent+debate+as+test-time+scaling:+A+systematic+study+of+conditional+effectiveness 12. How to Steal Reasoning Without Reasoning Traces — approx. 2024/2025 authors unclear from snippet, 2024/2025 https://scholar.google.com/scholar?q=How+to+Steal+Reasoning+Without+Reasoning+Traces 13. Sample, Don't Search: Rethinking Test-Time Alignment for Language Models — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=Sample,+Don't+Search:+Rethinking+Test-Time+Alignment+for+Language+Models 14. A survey on test-time scaling in large language models: What, how, where, and how well? 
— approx. 2025 survey authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=A+survey+on+test-time+scaling+in+large+language+models:+What,+how,+where,+and+how+well? 15. Optimizing the Last Mile: Test-Time Compute Strategies for Next-Generation Language Models — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=Optimizing+the+Last+Mile:+Test-Time+Compute+Strategies+for+Next-Generation+Language+Models 16. Symbolic mixture-of-experts: Adaptive skill-based routing for heterogeneous reasoning — approx. 2025 authors unclear from snippet, 2025 https://scholar.google.com/scholar?q=Symbolic+mixture-of-experts:+Adaptive+skill-based+routing+for+heterogeneous+reasoning 17. AI Post Transformers: Simple Self-Distillation for Better Code Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-02-simple-self-distillation-for-better-code-cc88e0.mp3 18. AI Post Transformers: Learning to Reason with 13 Parameters — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-14-learning-to-reason-with-13-parameters-54c87f.mp3 19. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3 Interactive Visualization: Distilling Multi-Agent Reasoning into a Single LLM
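The process-aware distillation idea the episode describes can be sketched in a few lines: a process reward model (PRM) scores each intermediate reasoning step, and the student's per-step loss is weighted by that score, so reliable steps from the multi-agent trace dominate the training signal. This is a minimal illustrative sketch under that assumption, not the paper's actual training code; all names are made up.

```python
def process_weighted_loss(step_losses, prm_scores):
    """Weight per-step distillation losses by PRM scores in [0, 1]."""
    assert len(step_losses) == len(prm_scores)
    total_weight = sum(prm_scores)
    if total_weight == 0:
        return 0.0  # no step judged trustworthy: contribute nothing
    return sum(l * w for l, w in zip(step_losses, prm_scores)) / total_weight

# A trace with one weak middle step: the PRM downweights it.
losses = [2.0, 5.0, 1.0]   # student cross-entropy per reasoning step
scores = [0.9, 0.1, 0.8]   # PRM judges step 2 unreliable
print(process_weighted_loss(losses, scores))  # 3.1 / 1.8 ≈ 1.72
```

The contrast with outcome-based supervision is that the weak step (loss 5.0) barely moves the objective, instead of being rewarded whenever the final answer happens to be right.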

  6.

    DreamerV3 World Models Across 150 Tasks

    This episode explores DreamerV3, a world-model reinforcement learning system that claims to use one main configuration across more than 150 tasks spanning Atari, ProcGen, DMLab, robot control, visual control, BSuite, and Minecraft. It explains how world models work—learning compact environment dynamics so an agent can train on imagined futures—and why that approach is appealing for sample efficiency but historically difficult because agents can overfit to inaccurate “fantasy” dynamics. The discussion highlights the paper’s central argument that robust world-model design may reduce the need for domain-specific retuning, while also stressing that “fixed hyperparameters” does not eliminate all domain engineering such as wrappers, action discretization, and evaluation choices. Listeners would find it interesting for its clear look at a major RL unification attempt, including why the results matter for scaling, sparse-reward tasks, and expensive real-world settings like robotics. Sources: 1. Mastering Diverse Domains through World Models — Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap, 2023 http://arxiv.org/abs/2301.04104 2. Mastering Atari with Discrete World Models — Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, Jimmy Ba, 2021 https://scholar.google.com/scholar?q=Mastering+Atari+with+Discrete+World+Models 3. Mastering Visual Continuous Control: Improved Data-Efficient Reinforcement Learning with Dreamer — Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, James Davidson, 2020 https://scholar.google.com/scholar?q=Mastering+Visual+Continuous+Control:+Improved+Data-Efficient+Reinforcement+Learning+with+Dreamer 4. Learning Latent Dynamics for Planning from Pixels — Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi, 2019 https://scholar.google.com/scholar?q=Learning+Latent+Dynamics+for+Planning+from+Pixels 5. 
MuZero — Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, et al., 2020 https://scholar.google.com/scholar?q=MuZero 6. IRIS: Efficient Video Pretraining for Reinforcement Learning — various authors as cited by the paper, 2023 https://scholar.google.com/scholar?q=IRIS:+Efficient+Video+Pretraining+for+Reinforcement+Learning 7. Temporal Difference Models / TD-MPC / TD-MPC2 — various authors including Nicklas Hansen and colleagues, 2022-2024 https://scholar.google.com/scholar?q=Temporal+Difference+Models+/+TD-MPC+/+TD-MPC2 8. MineRL BASALT / VPT-related Minecraft works — various authors including OpenAI and MineRL participants, 2021-2022 https://scholar.google.com/scholar?q=MineRL+BASALT+/+VPT-related+Minecraft+works 9. DrQ-v2 — Ilya Kostrikov, Denis Yarats, Rob Fergus, 2021 https://scholar.google.com/scholar?q=DrQ-v2 10. R2D2 — Steven Kapturowski, Georg Ostrovski, John Quan, et al., 2019 https://scholar.google.com/scholar?q=R2D2 11. STORM: Efficient Stochastic Transformer-based World Models for Reinforcement Learning — approx. Guo et al., 2023/2024 https://scholar.google.com/scholar?q=STORM:+Efficient+Stochastic+Transformer-based+World+Models+for+Reinforcement+Learning 12. Improving Transformer World Models for Data-Efficient RL — approx. recent 2023/2024 RL world-model authors, 2023/2024 https://scholar.google.com/scholar?q=Improving+Transformer+World+Models+for+Data-Efficient+RL 13. GIRL: Generative Imagination Reinforcement Learning via Information-Theoretic Hallucination Control — approx. recent MBRL authors, 2024/2025 https://scholar.google.com/scholar?q=GIRL:+Generative+Imagination+Reinforcement+Learning+via+Information-Theoretic+Hallucination+Control 14. Normalization Enhances Generalization in Visual Reinforcement Learning — approx. recent visual RL authors, 2024/2025 https://scholar.google.com/scholar?q=Normalization+Enhances+Generalization+in+Visual+Reinforcement+Learning 15. Understanding the Mechanisms of Fast Hyperparameter Transfer — approx. 
recent hyperparameter-transfer authors, 2024/2025 https://scholar.google.com/scholar?q=Understanding+the+Mechanisms+of+Fast+Hyperparameter+Transfer 16. Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration — approx. recent hyperparameter-transfer authors, 2024/2025 https://scholar.google.com/scholar?q=Completed+Hyperparameter+Transfer+across+Modules,+Width,+Depth,+Batch+and+Duration 17. AI Post Transformers: LeWorldModel: Stable Joint-Embedding World Models from Pixels — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-leworldmodel-stable-joint-embedding-worl-650f9f.mp3 18. AI Post Transformers: Zero-Shot Context Generalization in Reinforcement Learning from Few Training Contexts — Hal Turing & Dr. Ada Shannon, Tue, https://podcast.do-not-panic.com/episodes/zero-shot-context-generalization-in-reinforcement-learning-from-few-training-con/ 19. AI Post Transformers: Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning — Hal Turing & Dr. Ada Shannon, Fri, https://podcast.do-not-panic.com/episodes/contrastive-behavioral-similarity-embeddings-for-generalization-in-reinforcement/ 20. AI Post Transformers: HyperController: Fast, Stable Reinforcement Learning Hyperparameter Optimization — Hal Turing & Dr. Ada Shannon, Fri, https://podcast.do-not-panic.com/episodes/hypercontroller-fast-stable-reinforcement-learning-hyperparameter-optimization/ Interactive Visualization: DreamerV3 World Models Across 150 Tasks
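The "training in imagination" loop the episode explains can be sketched with a toy model: a learned dynamics function rolls a latent state forward without touching the real environment, and the policy is evaluated on that imagined trajectory. The dynamics and reward below are one-dimensional stand-ins, not DreamerV3's actual RSSM or return estimator.

```python
def imagine_rollout(z0, dynamics, reward, policy, horizon=5, gamma=0.99):
    """Roll the world model forward and accumulate a discounted return."""
    z, ret, discount = z0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(z)        # act on the latent state, not real observations
        z = dynamics(z, a)   # model-predicted next latent
        ret += discount * reward(z)
        discount *= gamma
    return ret

# Toy latent: dynamics drifts toward the action, reward favors z near 1.
ret = imagine_rollout(
    z0=0.0,
    dynamics=lambda z, a: 0.5 * z + 0.5 * a,
    reward=lambda z: -(z - 1.0) ** 2,
    policy=lambda z: 1.0,
)
print(ret)
```

The failure mode the episode flags lives in the `dynamics` argument: if the learned model is inaccurate, the policy optimizes returns in a fantasy the real environment never delivers.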

  7.

    Efficient KV Cache Sharing for Multi-LoRA Agents

    This episode explores a systems paper on making multi-agent LLM setups far more efficient by sharing most of the KV cache across agents that use the same base model with different LoRA adapters. It explains the core argument: for a shared long context, the backbone model’s hidden states are nearly identical across agents, while most role-specific differences come from LoRA’s low-rank adapter outputs, making it possible to store one shared base cache plus tiny agent-specific low-rank caches. The discussion breaks down how LoRA’s down- and up-projection structure enables this cache design, why “shared-A” multi-LoRA expands what can be shared, and how a custom Flash-LoRA-Attention kernel reconstructs adapter effects efficiently at inference time. Listeners would find it interesting because it connects transformer math to a concrete bottleneck in real agent systems—long prompts, repeated prefills, and exploding GPU memory—and examines whether the reported gains come from the cache-sharing idea itself, the kernel engineering, or both. Sources: 1. LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents — Hyesung Jeon, Hyeongju Ha, Jae-Joon Kim, 2026 http://arxiv.org/abs/2602.01053 2. LoRA: Low-Rank Adaptation of Large Language Models — Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, 2022 https://scholar.google.com/scholar?q=LoRA:+Low-Rank+Adaptation+of+Large+Language+Models 3. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré, 2022 https://scholar.google.com/scholar?q=FlashAttention:+Fast+and+Memory-Efficient+Exact+Attention+with+IO-Awareness 4. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Tri Dao, 2023 https://scholar.google.com/scholar?q=FlashAttention-2:+Faster+Attention+with+Better+Parallelism+and+Work+Partitioning 5. 
S-LoRA: Serving Thousands of Concurrent LoRA Adapters — Zhen Wang and collaborators, 2023 https://scholar.google.com/scholar?q=S-LoRA:+Serving+Thousands+of+Concurrent+LoRA+Adapters 6. MiLoRA: Efficient Serving for Multiple LoRA Adapters — Xia et al., 2024 https://scholar.google.com/scholar?q=MiLoRA:+Efficient+Serving+for+Multiple+LoRA+Adapters 7. MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning — Tian et al., 2024 https://scholar.google.com/scholar?q=MELoRA:+Mini-Ensemble+Low-Rank+Adapters+for+Parameter-Efficient+Fine-Tuning 8. Multi-Head Latent Attention — Ji et al. / DeepSeek-AI team, 2025 https://scholar.google.com/scholar?q=Multi-Head+Latent+Attention 9. ReAct: Synergizing Reasoning and Acting in Language Models — Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao, 2023 https://scholar.google.com/scholar?q=ReAct:+Synergizing+Reasoning+and+Acting+in+Language+Models 10. Tree of Thoughts: Deliberate Problem Solving with Large Language Models — Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas Griffiths, Yuan Cao, Karthik Narasimhan, 2023 https://scholar.google.com/scholar?q=Tree+of+Thoughts:+Deliberate+Problem+Solving+with+Large+Language+Models 11. KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs — approx. unknown from snippet, recent/2025-2026 https://scholar.google.com/scholar?q=KV+Packet:+Recomputation-Free+Context-Independent+KV+Caching+for+LLMs 12. Kvshare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse — approx. unknown from snippet, recent/2025-2026 https://scholar.google.com/scholar?q=Kvshare:+An+LLM+Service+System+with+Efficient+and+Effective+Multi-Tenant+KV+Cache+Reuse 13. Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management — approx. 
unknown from snippet, recent/2025-2026 https://scholar.google.com/scholar?q=Improving+the+Serving+Performance+of+Multi-LoRA+Large+Language+Models+via+Efficient+LoRA+and+KV+Cache+Management 14. AIRA: Activation-Informed Low-Rank Adaptation for Large Models — approx. unknown from snippet, recent/2025-2026 https://scholar.google.com/scholar?q=AIRA:+Activation-Informed+Low-Rank+Adaptation+for+Large+Models 15. Activation-guided Low-Rank Parameter Adaptation for Efficient Model Fine-Tuning — approx. unknown from snippet, recent/2025-2026 https://scholar.google.com/scholar?q=Activation-guided+Low-Rank+Parameter+Adaptation+for+Efficient+Model+Fine-Tuning 16. Capacity and Redundancy Trade-offs in Multi-Task Learning — approx. unknown from snippet, recent/2025-2026 https://scholar.google.com/scholar?q=Capacity+and+Redundancy+Trade-offs+in+Multi-Task+Learning 17. Align, Don't Divide: Revisiting the LoRA Architecture in Multi-Task Learning — approx. unknown from snippet, recent/2025-2026 https://scholar.google.com/scholar?q=Align,+Don't+Divide:+Revisiting+the+LoRA+Architecture+in+Multi-Task+Learning 18. AI Post Transformers: Doc-to-LoRA: Internalizing Context as LoRA — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-29-doc-to-lora-internalizing-context-as-lor-8dd5ec.mp3 19. AI Post Transformers: FAST26: Bidaw: Enhancing Key-Value Caching for Interactive LLM Serving via Bidirectional — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/fast26-bidaw-enhancing-key-value-caching-for-interactive-llm-serving-via-bidirec/ 20. AI Post Transformers: Quest: Query-Aware Sparsity for Efficient LLM Inference — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/quest-query-aware-sparsity-for-efficient-llm-inference/ 21. AI Post Transformers: Prefill-as-a-Service for Cross-Datacenter KV Cache — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-19-prefill-as-a-service-for-cross-datacente-7560be.mp3 22. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3 23. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3 24. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3 Interactive Visualization: Efficient KV Cache Sharing for Multi-LoRA Agents
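The cache decomposition this episode walks through follows from LoRA's algebra: an agent's key is k = W_k h + B_i (A h), so if all agents share the frozen base W_k and, under "shared-A" multi-LoRA, the down-projection A as well, the full d-dimensional base keys and the r-dimensional vector A h can each be cached once, and every agent reconstructs its own keys from its small B_i. A NumPy sketch of the shape arithmetic, with illustrative sizes rather than the paper's kernel:

```python
import numpy as np

d, r, n_agents = 64, 4, 8
rng = np.random.default_rng(0)
W_k = rng.normal(size=(d, d))          # frozen base key projection
A = rng.normal(size=(r, d))            # shared LoRA down-projection
Bs = [rng.normal(size=(d, r)) for _ in range(n_agents)]  # per-agent up-proj

h = rng.normal(size=d)                 # hidden state for one shared token
base_k = W_k @ h                       # cached once for all agents (d floats)
low_rank = A @ h                       # cached once too (r floats)

for B in Bs:
    k_full = (W_k + B @ A) @ h         # what naive per-agent caching stores
    k_recon = base_k + B @ low_rank    # rebuilt from the shared caches
    assert np.allclose(k_full, k_recon)

naive = n_agents * d                   # floats per token with per-agent caches
shared = d + r                         # floats per token with shared caches
print(naive, shared)                   # 512 68
```

With r much smaller than d, the per-token cache cost stops scaling with the number of agents, which is the memory argument behind the reported gains.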

  8.

    Program Synthesis with Large Language Models

    This episode explores a 2021 Google Research paper on whether large language models can synthesize short Python programs directly from natural-language descriptions, moving beyond code autocomplete into true program synthesis. It explains why this is difficult in general-purpose languages, contrasts classical search-based synthesis with transformer-based generation, and highlights the paper’s emphasis on execution-based evaluation, where code must actually run and pass tests rather than merely resemble reference solutions. The discussion covers the MBPP and MathQA-Python benchmarks, the effects of model scale from 244 million to 137 billion parameters, and the finding that larger models improve substantially, with the biggest model solving 59.6% of MBPP in a few-shot setting and fine-tuning on just 374 examples adding roughly 10 points. Listeners would find it interesting for its clear look at an early turning point when code LLMs began to show measurable, testable synthesis ability rather than just fluent code-like text. Sources: 1. Program Synthesis with Large Language Models — Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton, 2021 http://arxiv.org/abs/2108.07732 2. Program Synthesis — Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, 2017 https://scholar.google.com/scholar?q=Program+Synthesis 3. Neural Program Synthesis: A Survey — Michele Vallecorsa, Luca Quartana, Luca Pasquale and others, 2022 https://scholar.google.com/scholar?q=Neural+Program+Synthesis:+A+Survey 4. Program Synthesis with Large Language Models — Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton, 2021 https://scholar.google.com/scholar?q=Program+Synthesis+with+Large+Language+Models 5. 
A Survey on Neural Code Intelligence: From Program Representation to Program Synthesis — Uri Alon, Miltiadis Allamanis, Marc Brockschmidt and others, 2024 https://scholar.google.com/scholar?q=A+Survey+on+Neural+Code+Intelligence:+From+Program+Representation+to+Program+Synthesis 6. Evaluating Large Language Models Trained on Code — Mark Chen, Jerry Tworek, Heewoo Jun, et al., 2021 https://scholar.google.com/scholar?q=Evaluating+Large+Language+Models+Trained+on+Code 7. Language Models are Few-Shot Learners — Tom B. Brown, Benjamin Mann, Nick Ryder, et al., 2020 https://scholar.google.com/scholar?q=Language+Models+are+Few-Shot+Learners 8. CuBERT: BERT Models for Python Source Code Understanding — Rahul Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi, 2020 https://scholar.google.com/scholar?q=CuBERT:+BERT+Models+for+Python+Source+Code+Understanding 9. CodeBERT: A Pre-Trained Model for Programming and Natural Languages — Zhangyin Feng, Daya Guo, Duyu Tang, et al., 2020 https://scholar.google.com/scholar?q=CodeBERT:+A+Pre-Trained+Model+for+Programming+and+Natural+Languages 10. PyMT5: Multi-mode Translation of Natural Language and Python Code with Transformers — Colin Clement, Dawn Drain, Aakanksha S. Bhatia, et al., 2020 https://scholar.google.com/scholar?q=PyMT5:+Multi-mode+Translation+of+Natural+Language+and+Python+Code+with+Transformers 11. DeepCoder: Learning to Write Programs — Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, et al., 2017 https://scholar.google.com/scholar?q=DeepCoder:+Learning+to+Write+Programs 12. RobustFill: Neural Program Learning under Noisy I/O — Rishabh Singh, Abhishek Gulwani, 2017 https://scholar.google.com/scholar?q=RobustFill:+Neural+Program+Learning+under+Noisy+I/O 13. 
DreamCoder: Bootstrapping Inductive Program Synthesis with Wake-Sleep Library Learning — Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sablé-Meyer, Lucas Morales, Luke Hewitt, Josh Tenenbaum, Armando Solar-Lezama, 2021 https://scholar.google.com/scholar?q=DreamCoder:+Bootstrapping+Inductive+Program+Synthesis+with+Wake-Sleep+Library+Learning 14. Learning to Infer Graphics Programs from Hand-Drawn Images — Augustus Odena, Charles Sutton, 2020 https://scholar.google.com/scholar?q=Learning+to+Infer+Graphics+Programs+from+Hand-Drawn+Images 15. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms — Aida Amini, Saeideh Bakhshi, Sivan Ray Choi, et al., 2019 https://scholar.google.com/scholar?q=MathQA:+Towards+Interpretable+Math+Word+Problem+Solving+with+Operation-Based+Formalisms 16. Allamanis et al. 2018 Survey on Machine Learning for Code — Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, Charles Sutton, 2018 https://scholar.google.com/scholar?q=Allamanis+et+al.+2018+Survey+on+Machine+Learning+for+Code 17. Chain-of-Code: Reasoning with a Language Model-Augmented Code Emulator — Li et al. (approx.), 2024 https://scholar.google.com/scholar?q=Chain-of-Code:+Reasoning+with+a+Language+Model-Augmented+Code+Emulator 18. OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement — Zhang et al. (approx.), 2024 https://scholar.google.com/scholar?q=OpenCodeInterpreter:+Integrating+Code+Generation+with+Execution+and+Refinement 19. CodePRM: Execution Feedback-Enhanced Process Reward Model for Code Generation — Wang et al. (approx.), 2024 https://scholar.google.com/scholar?q=CodePRM:+Execution+Feedback-Enhanced+Process+Reward+Model+for+Code+Generation 20. CodeMonkeys: Scaling Test-Time Compute for Software Engineering — anonymous/uncertain from snippet, 2024 or 2025 https://scholar.google.com/scholar?q=CodeMonkeys:+Scaling+Test-Time+Compute+for+Software+Engineering 21. 
AI Post Transformers: CODEGEN: Open Language Model for Code Synthesis — Hal Turing & Dr. Ada Shannon, Fri, https://podcast.do-not-panic.com/episodes/codegen-open-language-model-for-code-synthesis/ 22. AI Post Transformers: Simple Self-Distillation for Better Code Generation — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-02-simple-self-distillation-for-better-code-cc88e0.mp3 23. AI Post Transformers: CWM: Code Generation with World Models — Hal Turing & Dr. Ada Shannon, Sat, https://podcast.do-not-panic.com/episodes/cwm-code-generation-with-world-models/ 24. AI Post Transformers: CodeI/O: Reasoning Patterns Through Code Input-Output Prediction — Hal Turing & Dr. Ada Shannon, Tue, https://podcast.do-not-panic.com/episodes/codeio-reasoning-patterns-through-code-input-output-prediction/ Interactive Visualization: Program Synthesis with Large Language Models
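Execution-based evaluation, as the episode describes it, means a candidate program counts as solved only if it runs and passes the task's assert statements, not if it merely resembles a reference solution. A minimal MBPP-style harness (task and helper names are illustrative):

```python
def passes(candidate_src, test_srcs):
    """Execute a candidate, then its tests; any exception means failure."""
    env = {}
    try:
        exec(candidate_src, env)        # define the candidate function
        for t in test_srcs:
            exec(t, env)                # asserts raise on failure
        return True
    except Exception:
        return False

task_tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"  # fluent code-like text, wrong program

print(passes(good, task_tests), passes(bad, task_tests))  # True False
```

The point of the harness is exactly the paper's point: the `bad` candidate would look fine to a similarity metric but fails the moment it is executed.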

Ratings & Reviews

3.7
out of 5
3 ratings

About

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.
