AI Post Transformers

mcgrof

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

  1. 23 hours ago

    EverMemOS for Long-Horizon Agent Memory

    This episode explores EverMemOS, a memory system for long-lived AI agents that tries to organize past interactions into structured, higher-level semantic “scenes” instead of relying on flat retrieval alone. It explains why bigger context windows and standard RAG often fail when agents accumulate stale preferences, conflicting facts, and fragmented conversational traces, arguing that the real problem is not just forgetting but poorly organized remembering. The discussion walks through the paper’s core design, including MemCells, MemScenes, semantic consolidation, and reconstructive recollection, framing the system as a state-management layer around transformers rather than a new model architecture. A listener would find it interesting because it connects abstract memory research to practical agent failures and offers a concrete alternative for building assistants that can reason more reliably over long time horizons. A toy code sketch of the scene-based store follows the source list.

    Sources:
    1. EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning — Chuanrui Hu, Xingze Gao, Zuyi Zhou, Dannong Xu, Yi Bai, Xintong Li, Hui Zhang, Tong Li, Chong Zhang, Lidong Bing, Yafeng Deng, 2026. http://arxiv.org/abs/2601.02163
    2. MemoryBank: Enhancing Large Language Models with Long-Term Memory — Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, Yanlin Wang, 2023. https://scholar.google.com/scholar?q=MemoryBank:+Enhancing+Large+Language+Models+with+Long-Term+Memory
    3. MemGPT: Towards LLMs as Operating Systems — Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Joseph E. Gonzalez, 2023. https://scholar.google.com/scholar?q=MemGPT:+Towards+LLMs+as+Operating+Systems
    4. A Survey on the Memory Mechanism of Large Language Model based Agents — Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, Ji-Rong Wen, 2024. https://scholar.google.com/scholar?q=A+Survey+on+the+Memory+Mechanism+of+Large+Language+Model+based+Agents
    5. MemOS: A Memory OS for AI System — Zhiyu Li, Shichao Song, Chenyang Xi, Hanyu Wang, Chen Tang, Simin Niu, Ding Chen, et al., 2025. https://scholar.google.com/scholar?q=MemOS:+A+Memory+OS+for+AI+System
    6. Memory OS of AI Agent — Jiazheng Kang, Mingming Ji, Zhe Zhao, Ting Bai, 2025. https://scholar.google.com/scholar?q=Memory+OS+of+AI+Agent
    7. Zep: A Temporal Knowledge Graph Architecture for Agent Memory — Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, Daniel Chalef, 2025. https://scholar.google.com/scholar?q=Zep:+A+Temporal+Knowledge+Graph+Architecture+for+Agent+Memory
    8. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory — Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, Deshraj Yadav, 2025. https://scholar.google.com/scholar?q=Mem0:+Building+Production-Ready+AI+Agents+with+Scalable+Long-Term+Memory
    9. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory — Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, Dong Yu, 2024. https://scholar.google.com/scholar?q=LongMemEval:+Benchmarking+Chat+Assistants+on+Long-Term+Interactive+Memory
    10. Evaluating Very Long-Term Conversational Memory of LLM Agents — Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang, 2024. https://scholar.google.com/scholar?q=Evaluating+Very+Long-Term+Conversational+Memory+of+LLM+Agents
    11. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack — Yurii Kuratov et al., 2024. https://scholar.google.com/scholar?q=BABILong:+Testing+the+Limits+of+LLMs+with+Long+Context+Reasoning-in-a-Haystack
    12. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks — Yushi Bai et al., 2024/2025. https://scholar.google.com/scholar?q=LongBench+v2:+Towards+Deeper+Understanding+and+Reasoning+on+Realistic+Long-context+Multitasks
    13. Memory-Aware and Uncertainty-Guided Retrieval for Multi-Hop Question Answering — Yuelyu Ji, Rui Meng, Zhuochun Li, Daqing He, 2025. https://scholar.google.com/scholar?q=Memory-Aware+and+Uncertainty-Guided+Retrieval+for+Multi-Hop+Question+Answering
    14. BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression — Yuankai Li, Jia-Chen Gu, Di Wu, Kai-Wei Chang, Nanyun Peng, 2024/2025. https://scholar.google.com/scholar?q=BRIEF:+Bridging+Retrieval+and+Inference+for+Multi-hop+Reasoning+via+Compression
    15. Explicit v.s. Implicit Memory: Exploring Multi-hop Complex Reasoning Over Personalized Information — Zeyu Zhang et al., 2025. https://scholar.google.com/scholar?q=Explicit+v.s.+Implicit+Memory:+Exploring+Multi-hop+Complex+Reasoning+Over+Personalized+Information
    16. Handling Preference Drift in Capturing Dynamic User Preferences for Streaming Session-Based Recommendations — Oussama Alahoum, Boudjemaa Boudaa, Laouni Djafri, 2025/2026. https://scholar.google.com/scholar?q=Handling+Preference+Drift+in+Capturing+Dynamic+User+Preferences+for+Streaming+Session-Based+Recommendations
    17. Episodic Memory in AI Agents Poses Risks That Should Be Studied and Mitigated — Chad DeChant, 2025. https://scholar.google.com/scholar?q=Episodic+Memory+in+AI+Agents+Poses+Risks+That+Should+Be+Studied+and+Mitigated
    18. AI Post Transformers: Memory Sparse Attention for 100M-Token Scaling — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-07-memory-sparse-attention-for-100m-token-s-377cff.mp3
    19. AI Post Transformers: Recursive Language Models for Arbitrarily Long Prompts — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-recursive-language-models-for-arbitraril-fbcd1c.mp3
    20. AI Post Transformers: Neural Computers as Learned Latent Runtimes — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-11-neural-computers-as-learned-latent-runti-9fa282.mp3
    21. AI Post Transformers: How Induction Heads Emerge in Transformers — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-03-how-induction-heads-emerge-in-transforme-a7bfcb.mp3
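
    A minimal, hypothetical Python sketch of the MemCell/MemScene idea. The class names follow the paper, but the grouping heuristic (bag-of-words cosine similarity) and the recency-biased recollection are our own stand-ins for the paper's semantic consolidation and reconstructive recollection, not the actual EverMemOS algorithms.

      # Toy MemCell/MemScene store; similarity heuristic is an assumption.
      from dataclasses import dataclass, field
      from collections import Counter
      import math
      import time

      @dataclass
      class MemCell:
          """One atomic memory: a distilled fact or interaction trace."""
          text: str
          timestamp: float = field(default_factory=time.time)

      @dataclass
      class MemScene:
          """A higher-level 'scene' grouping semantically related MemCells."""
          topic_words: Counter = field(default_factory=Counter)
          cells: list = field(default_factory=list)

      def _cosine(a: Counter, b: Counter) -> float:
          num = sum(a[w] * b[w] for w in set(a) & set(b))
          den = math.sqrt(sum(v * v for v in a.values())) * \
                math.sqrt(sum(v * v for v in b.values()))
          return num / den if den else 0.0

      class EverMemLikeStore:
          def __init__(self, merge_threshold: float = 0.3):
              self.scenes: list[MemScene] = []
              self.merge_threshold = merge_threshold

          def consolidate(self, cell: MemCell) -> MemScene:
              """Route a new MemCell into the most similar MemScene,
              or open a new scene if nothing is close enough."""
              words = Counter(cell.text.lower().split())
              best, best_sim = None, 0.0
              for scene in self.scenes:
                  sim = _cosine(words, scene.topic_words)
                  if sim > best_sim:
                      best, best_sim = scene, sim
              if best is None or best_sim < self.merge_threshold:
                  best = MemScene()
                  self.scenes.append(best)
              best.topic_words.update(words)
              best.cells.append(cell)
              return best

          def recollect(self, query: str, k: int = 3) -> list:
              """Rank scenes by query similarity, then return the most
              recent cells from the best scene (newest wins over stale)."""
              q = Counter(query.lower().split())
              if not self.scenes:
                  return []
              scene = max(self.scenes, key=lambda s: _cosine(q, s.topic_words))
              return sorted(scene.cells, key=lambda c: c.timestamp)[-k:]

      store = EverMemLikeStore()
      store.consolidate(MemCell("user prefers dark mode in the editor"))
      store.consolidate(MemCell("user switched editor theme to light mode"))
      print([c.text for c in store.recollect("editor theme preference")])

    Because recollection favors the newest cells in a scene, a stale preference naturally loses to its update, which is the failure mode flat retrieval handles poorly.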

  2. 23 hours ago

    Explicit Information Transmission for Context Compression

    This episode explores a paper on long-context compression that argues standard “soft compression” methods, which rely on learned memory or gist tokens, lose information because those tokens get overwritten across layers and fail to coordinate what each slot should retain. It explains the paper’s alternative design, which keeps the language model backbone frozen and instead explicitly transmits information from hidden states into a small set of latent slots through a two-stage process: selecting useful signals across layers, then globally allocating token information to slots with a transport-based assignment. The discussion highlights why this matters for deployment, where long contexts and growing KV caches make inference expensive, while also noting the risks of latent compression for exact recall, citations, and fine-grained factual detail. Listeners would find it interesting for both the strong benchmark results, where the method substantially outperforms prior compressors on several QA datasets, and the debate over whether those gains on a 512-token testbed really translate to the much larger context problems practitioners care about. A toy transport-based allocation sketch follows the source list.

    Sources:
    1. Context Compression via Explicit Information Transmission — Jiangnan Ye, Hanqi Yan, Zhenyi Shen, Heng Chang, Ye Mao, Yulan He, 2026. http://arxiv.org/abs/2602.03784
    2. Compressive Transformers for Long-Range Sequence Modelling — Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap, 2019. https://scholar.google.com/scholar?q=Compressive+Transformers+for+Long-Range+Sequence+Modelling
    3. Learning to Compress Prompts with Gist Tokens — Jesse Mu, Xiang Lisa Li, Noah D. Goodman, 2023. https://scholar.google.com/scholar?q=Learning+to+Compress+Prompts+with+Gist+Tokens
    4. Adapting Language Models to Compress Contexts — Alexis Chevalier, Alexander Wettig, Anirudh Ajith, Danqi Chen, 2023. https://scholar.google.com/scholar?q=Adapting+Language+Models+to+Compress+Contexts
    5. A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression — Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Xinting Huang, Dong Yu, Zhicheng Dou, 2024. https://scholar.google.com/scholar?q=A+Silver+Bullet+or+a+Compromise+for+Full+Attention?+A+Comprehensive+Study+of+Gist+Token-based+Context+Compression
    6. In-context Autoencoder for Context Compression in a Large Language Model — Tao Ge, Jing Hu, Haixun Wang, Si-Qing Chen, Furu Wei, 2024. https://scholar.google.com/scholar?q=In-context+Autoencoder+for+Context+Compression+in+a+Large+Language+Model
    7. 500xCompressor: Generalized Prompt Compression for Large Language Models — Zongqian Li, Yixuan Su, Nigel Collier, 2025. https://scholar.google.com/scholar?q=500xCompressor:+Generalized+Prompt+Compression+for+Large+Language+Models
    8. Long Context Compression with Activation Beacon — Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou, 2025. https://scholar.google.com/scholar?q=Long+Context+Compression+with+Activation+Beacon
    9. RazorAttention: Efficient KV Cache Compression through Retrieval Heads — authors unclear from source, 2025/2026. https://scholar.google.com/scholar?q=RazorAttention:+Efficient+KV+Cache+Compression+through+Retrieval+Heads
    10. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning — authors unclear from source, 2025/2026. https://scholar.google.com/scholar?q=Not+All+Heads+Matter:+A+Head-Level+KV+Cache+Compression+Method+with+Integrated+Retrieval+and+Reasoning
    11. Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-k — authors unclear from source, 2025/2026. https://scholar.google.com/scholar?q=Efficient+Context+Selection+for+Long-Context+QA:+No+Tuning,+No+Iteration,+Just+Adaptive-k
    12. TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection — authors unclear from source, 2025/2026. https://scholar.google.com/scholar?q=TokenSelect:+Efficient+Long-Context+Inference+and+Length+Extrapolation+for+LLMs+via+Dynamic+Token-Level+KV+Cache+Selection
    13. Generative Adapter: Contextualizing Language Models in Parameters with a Single Forward Pass — authors unclear from source, 2025/2026. https://scholar.google.com/scholar?q=Generative+Adapter:+Contextualizing+Language+Models+in+Parameters+with+a+Single+Forward+Pass
    14. Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning — authors unclear from source, 2025/2026. https://scholar.google.com/scholar?q=Demystifying+the+Roles+of+LLM+Layers+in+Retrieval,+Knowledge,+and+Reasoning
    15. AI Post Transformers: DeepSeek-V4 and Practical Million-Token Context — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-25-deepseek-v4-and-practical-million-token-6f4de1.mp3
    16. AI Post Transformers: Gated Linear Attention for Efficient Long Sequences — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-18-gated-linear-attention-for-efficient-lon-c858ab.mp3
    17. AI Post Transformers: Doc-to-LoRA: Internalizing Context as LoRA — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-29-doc-to-lora-internalizing-context-as-lor-8dd5ec.mp3
    18. AI Post Transformers: Memory Sparse Attention for 100M-Token Scaling — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-07-memory-sparse-attention-for-100m-token-s-377cff.mp3
    19. AI Post Transformers: Recursive Language Models for Arbitrarily Long Prompts — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-recursive-language-models-for-arbitraril-fbcd1c.mp3
    20. AI Post Transformers: How Induction Heads Emerge in Transformers — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-03-how-induction-heads-emerge-in-transforme-a7bfcb.mp3
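
    A toy stand-in for the “global allocation” stage described above: entropic optimal transport (Sinkhorn iterations) between T token states and K latent slots. The paper's actual selection and allocation networks and its training objective are not reproduced here; the shapes, random cost matrix, and uniform marginals are all illustrative assumptions.

      # Sinkhorn-based token-to-slot allocation; all shapes are made up.
      import numpy as np

      def sinkhorn_plan(cost, n_iters=200, eps=0.05):
          """Return a transport plan P (T x K) with uniform marginals:
          each token spreads its mass across slots, each slot receives
          an equal share of the total token mass."""
          T, K = cost.shape
          a = np.full(T, 1.0 / T)           # token masses
          b = np.full(K, 1.0 / K)           # slot capacities
          Kmat = np.exp(-cost / eps)        # Gibbs kernel
          u, v = np.ones(T), np.ones(K)
          for _ in range(n_iters):
              u = a / (Kmat @ v)
              v = b / (Kmat.T @ u)
          return u[:, None] * Kmat * v[None, :]

      rng = np.random.default_rng(0)
      T, K, d = 512, 16, 64                 # 512-token context, 16 latent slots
      tokens = rng.standard_normal((T, d))
      slots = rng.standard_normal((K, d))
      cost = -tokens @ slots.T              # low cost = high token/slot affinity
      cost = cost / np.abs(cost).max()      # normalize so exp(-cost/eps) is stable
      P = sinkhorn_plan(cost)
      compressed = (P.T @ tokens) * K       # each slot: transport-weighted token mix
      print(compressed.shape)               # (16, 64)

    The point of the transport constraint is coordination: every slot is forced to take an equal share of mass, so slots cannot all collapse onto the same salient tokens the way independently trained gist tokens can.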

  3. 23 hours ago

    Generative Modeling via Drifting in One Step

    This episode explores a 2026 paper, Generative Modeling via Drifting, which argues that the hard transport process behind modern generative models can be moved into training so that inference becomes a single forward pass. It explains the core idea of a pushforward distribution, introduces the paper’s notion of a drifting field that nudges generated samples toward the data distribution during optimization, and frames equilibrium as the point where those updates no longer need to move samples. The discussion compares this approach with GANs, diffusion models, flow matching, and other fast one-step systems, highlighting the tradeoff between low-latency generation and the quality advantages of multi-step correction. A listener would find it interesting because it lays out a possible new generative modeling paradigm and tests whether one-shot generation can become more than just an accelerated approximation of diffusion. A toy drift-field training sketch follows the source list.

    Sources:
    1. Generative Modeling via Drifting — Mingyang Deng, He Li, Tianhong Li, Yilun Du, Kaiming He, 2026. http://arxiv.org/abs/2602.04770
    2. Generative Adversarial Nets — Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, 2014. https://neurips.cc/virtual/2014/poster/4618
    3. Generative Moment Matching Networks — Yujia Li, Kevin Swersky, Rich Zemel, 2015. https://proceedings.mlr.press/v37/li15.html
    4. Consistency Models — Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever, 2023. https://icml.cc/virtual/2023/poster/24593
    5. Adversarial Diffusion Distillation — Stability AI researchers, 2023. https://stability.ai/research/adversarial-diffusion-distillation
    6. A Kernel Two-Sample Test — Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander Smola, 2012. https://www.jmlr.org/beta/papers/v13/gretton12a.html
    7. Wasserstein GAN — Martin Arjovsky, Soumith Chintala, Léon Bottou, 2017. https://icml.cc/virtual/2017/poster/799
    8. Density Estimation using Real NVP — Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio, 2017. https://openreview.net/forum?id=HkpbnH9lx
    9. Glow: Generative Flow with Invertible 1x1 Convolutions — Diederik P. Kingma, Prafulla Dhariwal, 2018. https://openai.com/index/glow/
    10. Flow Matching for Generative Modeling — Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, Matt Le, 2023. https://openreview.net/forum?id=PqvMRDCJT9t
    11. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow — Xingchao Liu, Chengyue Gong, Qiang Liu, 2022. https://openreview.net/forum?id=gWxpdtQpiYV
    12. Large Scale GAN Training for High Fidelity Natural Image Synthesis — Andrew Brock, Jeff Donahue, Karen Simonyan, 2018. https://huggingface.co/papers/1809.11096
    13. Diffusion Models Beat GANs on Image Synthesis — Prafulla Dhariwal, Alex Nichol, 2021. https://openreview.net/forum?id=AAWuCvzaVt
    14. High-Resolution Image Synthesis with Latent Diffusion Models — Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer, 2022. https://openaccess.thecvf.com/content/CVPR2022/html/Rombach_High-Resolution_Image_Synthesis_With_Latent_Diffusion_Models_CVPR_2022_paper.html
    15. Scalable Diffusion Models with Transformers — William Peebles, Saining Xie, 2023. https://openaccess.thecvf.com/content/ICCV2023/html/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.html
    16. Denoising Diffusion Probabilistic Models — Jonathan Ho, Ajay Jain, Pieter Abbeel, 2020. https://scholar.google.com/scholar?q=Denoising+Diffusion+Probabilistic+Models
    17. Progressive Distillation for Fast Sampling of Diffusion Models — Tim Salimans, Jonathan Ho, 2022. https://scholar.google.com/scholar?q=Progressive+Distillation+for+Fast+Sampling+of+Diffusion+Models
    18. Unsupervised Image-to-Image Translation Networks — Ming-Yu Liu, Thomas Breuel, Jan Kautz, 2017. https://scholar.google.com/scholar?q=Unsupervised+Image-to-Image+Translation+Networks
    19. Auto-Encoding Variational Bayes — Diederik P. Kingma, Max Welling, 2013. https://scholar.google.com/scholar?q=Auto-Encoding+Variational+Bayes
    20. One-step diffusion with distribution matching distillation — approx. diffusion-distillation literature, recent. https://scholar.google.com/scholar?q=One-step+diffusion+with+distribution+matching+distillation
    21. One-step diffusion distillation via deep equilibrium models — approx. diffusion-distillation / equilibrium-model authors, recent. https://scholar.google.com/scholar?q=One-step+diffusion+distillation+via+deep+equilibrium+models
    22. Discrete Flow Matching — approx. flow-matching authors, recent. https://scholar.google.com/scholar?q=Discrete+Flow+Matching
    23. Elucidating the design choice of probability paths in flow matching for forecasting — approx. forecasting / flow-matching authors, recent. https://scholar.google.com/scholar?q=Elucidating+the+design+choice+of+probability+paths+in+flow+matching+for+forecasting
    24. Mixed Autoregressive and Diffusion Transformers for Continuous Image Generation — approx. hybrid AR-diffusion authors, recent. https://scholar.google.com/scholar?q=Mixed+Autoregressive+and+Diffusion+Transformers+for+Continuous+Image+Generation
    25. ACDiT: Interpolating autoregressive conditional modeling and diffusion transformer — approx. ACDiT authors, 2025. https://scholar.google.com/scholar?q=ACDiT:+Interpolating+autoregressive+conditional+modeling+and+diffusion+transformer
    26. AI Post Transformers: Paris: Decentralized Open-Weight Diffusion Model — Hal Turing & Dr. Ada Shannon, 2025. https://podcast.do-not-panic.com/episodes/paris-decentralized-open-weight-diffusion-model/
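
    A toy, numpy-only illustration of the drifting idea: a one-step generator is trained so that a drift field, which pulls generated samples toward data and pushes them away from each other, vanishes at equilibrium. The RBF-kernel drift below is our own MMD-flow-style stand-in, and the linear generator and 2-D Gaussian data are illustrative assumptions, not the paper's actual field or architecture.

      # Toy one-step generator trained against a hand-rolled drift field.
      import numpy as np

      rng = np.random.default_rng(0)
      data = rng.standard_normal((1000, 2)) @ np.array([[1.5, 0.4], [0.0, 0.7]]) + 2.0

      def drift(x, data_batch, gen_batch, bw=2.0):
          """Kernel-weighted attraction to data minus repulsion from peers."""
          def pull(x, ys, sign):
              diff = ys[None, :, :] - x[:, None, :]             # (B, N, 2)
              w = np.exp(-(diff ** 2).sum(-1) / (2 * bw ** 2))  # RBF weights
              return sign * (w[..., None] * diff).mean(axis=1)
          return pull(x, data_batch, +1.0) + pull(x, gen_batch, -1.0)

      W = np.eye(2); b = np.zeros(2)          # one-step linear generator x = W z + b
      for step in range(2000):
          z = rng.standard_normal((256, 2))
          x = z @ W.T + b
          d = drift(x, data[rng.integers(0, 1000, 256)], x)
          # Regress generator outputs onto the drifted targets x + d; for a
          # linear generator the least-squares gradient is closed-form.
          err = -d                             # dL/dx for L = 0.5||x - (x + d)||^2
          W -= 0.2 * (err.T @ z) / len(z)
          b -= 0.2 * err.mean(axis=0)

      z = rng.standard_normal((1000, 2))
      samples = z @ W.T + b
      print("data mean ", data.mean(axis=0))   # the two means should roughly agree
      print("model mean", samples.mean(axis=0))

    Equilibrium here is exactly the episode's framing: when the drift field averages to zero over generated samples, training stops moving them, and sampling is a single forward pass through the generator.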

  4. 23 hours ago

    LAPS for Length-Aware LLM Serving

    This episode explores LAPS, a serving system for large language models that treats long prompt prefills and short multi-turn re-prefills as fundamentally different workloads instead of batching them together. It explains why user-perceived latency, especially time to first token, suffers when tiny follow-up requests get stuck behind large compute-heavy context loads, and how LAPS models the boundary between compute-bound and memory-bound prefills to separate them more intelligently. The discussion covers LAPS’s dual-queue design, its temporal and spatial disaggregation strategies, and engineering choices like short-request waiting windows, length-aware smart batching, and CUDA Graph execution. Listeners would find it interesting because it connects low-level scheduling and KV-cache behavior to the everyday experience of whether chat systems feel fast and responsive. A simplified dual-queue scheduler sketch follows the source list.

    Sources:
    1. LAPS: A Length-Aware-Prefill LLM Serving System — Jianshu She, Zonghang Li, Hongchao Du, Shangyu Wu, Wenhao Zheng, Eric Xing, Zhengzhong Liu, Huaxiu Yao, Jason Xue, Qirong Ho, 2026. http://arxiv.org/abs/2601.11589
    2. ORCA: A Distributed Serving System for Transformer-Based Generative Models — Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun, 2022. https://scholar.google.com/scholar?q=ORCA:+A+Distributed+Serving+System+for+Transformer-Based+Generative+Models
    3. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
    4. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills — Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, 2023. https://scholar.google.com/scholar?q=SARATHI:+Efficient+LLM+Inference+by+Piggybacking+Decodes+with+Chunked+Prefills
    5. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024. https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving
    6. BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving — Wanyi Zheng, Minxian Xu, Shengye Song, Kejiang Ye, 2025. https://scholar.google.com/scholar?q=BucketServe:+Bucket-Based+Dynamic+Batching+for+Smart+and+Efficient+LLM+Inference+Serving
    7. DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving — Foteini Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic, 2024. https://scholar.google.com/scholar?q=DéjàVu:+KV-cache+Streaming+for+Fast,+Fault-tolerant+Generative+LLM+Serving
    8. Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving — approx. contemporary LLM systems authors, 2025. https://scholar.google.com/scholar?q=Prefill-Decode+Aggregation+or+Disaggregation?+Unifying+Both+for+Goodput-Optimized+LLM+Serving
    9. FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving — approx. contemporary LLM serving authors, 2025. https://scholar.google.com/scholar?q=FlowPrefill:+Decoupling+Preemption+from+Prefill+Scheduling+Granularity+to+Mitigate+Head-of-Line+Blocking+in+LLM+Serving
    10. BurstGPT: A Real-World Workload Dataset to Optimize LLM Serving Systems — approx. systems/data-center workload authors, 2025. https://scholar.google.com/scholar?q=BurstGPT:+A+Real-World+Workload+Dataset+to+Optimize+LLM+Serving+Systems
    11. SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling — approx. cloud systems authors, 2025. https://scholar.google.com/scholar?q=SageServe:+Optimizing+LLM+Serving+on+Cloud+Data+Centers+with+Forecast+Aware+Auto-Scaling
    12. ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production — approx. production-serving measurement authors, 2025. https://scholar.google.com/scholar?q=ServeGen:+Workload+Characterization+and+Generation+of+Large+Language+Model+Serving+in+Production
    13. Fairness in Serving Large Language Models — approx. theory/systems fairness authors, 2025. https://scholar.google.com/scholar?q=Fairness+in+Serving+Large+Language+Models
    14. FairBatching: Fairness-Aware Batch Formation for LLM Inference — approx. LLM inference scheduling authors, 2025. https://scholar.google.com/scholar?q=FairBatching:+Fairness-Aware+Batch+Formation+for+LLM+Inference
    15. Locality-Aware Fair Scheduling in LLM Serving — approx. LLM serving systems authors, 2025. https://scholar.google.com/scholar?q=Locality-Aware+Fair+Scheduling+in+LLM+Serving
    16. AI Post Transformers: Splitwise: Phase-Split LLM Inference — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-26-splitwise-phase-split-llm-inference-e8945b.mp3
    17. AI Post Transformers: Prefill-as-a-Service for Cross-Datacenter KV Cache — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-19-prefill-as-a-service-for-cross-datacente-7560be.mp3
    18. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
    19. AI Post Transformers: Breaking the Prefix Barrier with Shared KV Cache — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-24-breaking-the-prefix-barrier-with-shared-a5e5a6.mp3
    20. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3
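
    A simplified Python sketch of the dual-queue idea. The fixed token threshold standing in for LAPS's compute-bound/memory-bound boundary model, the waiting-window value, and the batching policy are all illustrative assumptions, not the paper's tuned parameters.

      # Toy dual-queue prefill scheduler; thresholds are assumptions.
      import heapq
      import time
      from dataclasses import dataclass, field

      PREFILL_BOUNDARY_TOKENS = 2048   # above this, treat prefill as compute-bound
      SHORT_WAIT_WINDOW_S = 0.005      # briefly hold short requests to batch them

      @dataclass(order=True)
      class Request:
          arrival: float                       # heap orders by arrival time
          prompt_tokens: int = field(compare=False)
          request_id: int = field(compare=False)

      class DualQueueScheduler:
          """Keep short multi-turn re-prefills out of the shadow of long
          context loads by giving each class its own queue."""
          def __init__(self):
              self.short_q: list = []          # memory-bound re-prefills
              self.long_q: list = []           # compute-bound context loads

          def submit(self, req: Request):
              q = (self.short_q if req.prompt_tokens <= PREFILL_BOUNDARY_TOKENS
                   else self.long_q)
              heapq.heappush(q, req)

          def next_batch(self, max_batch_tokens: int = 8192):
              # Short requests get priority so TTFT stays low; wait a tiny
              # window so several of them can share one batch.
              if self.short_q:
                  deadline = self.short_q[0].arrival + SHORT_WAIT_WINDOW_S
                  time.sleep(max(0.0, deadline - time.time()))
                  batch, used = [], 0
                  while (self.short_q and
                         used + self.short_q[0].prompt_tokens <= max_batch_tokens):
                      r = heapq.heappop(self.short_q)
                      batch.append(r)
                      used += r.prompt_tokens
                  return "short", batch
              if self.long_q:
                  # One compute-bound prefill at a time keeps it from
                  # starving the short queue for long stretches.
                  return "long", [heapq.heappop(self.long_q)]
              return "idle", []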

  5. 23 hours ago

    Why LLM Serving Needs Mathematical Optimization

    This episode explores a position paper arguing that modern LLM serving has outgrown simple heuristics like FIFO, shortest-queue routing, and LRU eviction. It explains why transformer inference creates harder control problems than conventional stateless serving, focusing on continuous batching, KV-cache growth, and the tension between compute-heavy prefill and memory-bound decode phases. The discussion highlights the paper’s central claim that serving systems need explicit objective-driven optimization for routing, admission control, scheduling, and cache management, while also questioning where formal methods would truly outperform today’s stronger heuristic baselines such as vLLM and PagedAttention-inspired designs. Listeners would find it interesting because it connects low-level serving mechanics to real product tradeoffs like latency, throughput, and cache churn, showing why infrastructure choices increasingly shape LLM performance. A small worked routing example follows the source list.

    Sources:
    1. Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics — Zijie Zhou, 2026. http://arxiv.org/abs/2605.01280
    2. PREBLE: Efficient Distributed Prompt Scheduling for LLM Serving — Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang, 2024. https://scholar.google.com/scholar?q=PREBLE:+Efficient+Distributed+Prompt+Scheduling+for+LLM+Serving
    3. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang, 2024. https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving
    4. Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation — Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C. S. Lui, Wei Chen, Carlee Joe-Wong, 2026. https://scholar.google.com/scholar?q=Semantic+Caching+for+Low-Cost+LLM+Serving:+From+Offline+Learning+to+Online+Adaptation
    5. POLAR: Online Learning for LoRA Adapter Caching and Routing in Edge LLM Serving — Shaoang Li, Jian Li, 2026. https://scholar.google.com/scholar?q=POLAR:+Online+Learning+for+LoRA+Adapter+Caching+and+Routing+in+Edge+LLM+Serving
    6. Orca: A Distributed Serving System for Transformer-Based Generative Models — Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun, 2022. https://scholar.google.com/scholar?q=Orca:+A+Distributed+Serving+System+for+Transformer-Based+Generative+Models
    7. Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023. https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
    8. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve — Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, Ramachandran Ramjee, 2024. https://scholar.google.com/scholar?q=Taming+Throughput-Latency+Tradeoff+in+LLM+Inference+with+Sarathi-Serve
    9. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills — Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, 2023. https://scholar.google.com/scholar?q=SARATHI:+Efficient+LLM+Inference+by+Piggybacking+Decodes+with+Chunked+Prefills
    10. Faster LLM Inference using DBMS-Inspired Preemption and Cache Replacement Policies — Kyoungmin Kim, Jiacheng Li, Kijae Hong, Anastasia Ailamaki, 2024. https://scholar.google.com/scholar?q=Faster+LLM+Inference+using+DBMS-Inspired+Preemption+and+Cache+Replacement+Policies
    11. DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving — Ying Yuan et al., 2026. https://scholar.google.com/scholar?q=DualMap:+Enabling+Both+Cache+Affinity+and+Load+Balancing+for+Distributed+LLM+Serving
    12. Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving — Ke Cheng et al., 2024. https://scholar.google.com/scholar?q=Slice-Level+Scheduling+for+High+Throughput+and+Load+Balanced+LLM+Serving
    13. A Predictive and Synergistic Two-Layer Scheduling Framework for LLM Serving — Yue Zhang et al., 2025. https://scholar.google.com/scholar?q=A+Predictive+and+Synergistic+Two-Layer+Scheduling+Framework+for+LLM+Serving
    14. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang et al., 2025. https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse
    15. HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse — Yuwei An et al., 2025. https://scholar.google.com/scholar?q=HyperRAG:+Enhancing+Quality-Efficiency+Tradeoffs+in+Retrieval-Augmented+Generation+with+Reranker+KV-Cache+Reuse
    16. ProphetKV: User-Query-Driven Selective Recomputation for Efficient KV Cache Reuse in Retrieval-Augmented Generation — Shihao Wang et al., 2026. https://scholar.google.com/scholar?q=ProphetKV:+User-Query-Driven+Selective+Recomputation+for+Efficient+KV+Cache+Reuse+in+Retrieval-Augmented+Generation
    17. dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving — Bingyang Wu et al., 2024. https://scholar.google.com/scholar?q=dLoRA:+Dynamically+Orchestrating+Requests+and+Adapters+for+LoRA+LLM+Serving
    18. SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference — Hengrui Zhang et al., 2025. https://scholar.google.com/scholar?q=SPAD:+Specialized+Prefill+and+Decode+Hardware+for+Disaggregated+LLM+Inference
    19. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
    20. AI Post Transformers: Lookahead Q-Cache for Consistent KV Eviction — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-25-lookahead-q-cache-for-consistent-kv-evic-d97b09.mp3
    21. AI Post Transformers: LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-25-lookaheadkv-fast-and-accurate-kv-9cfc9f.mp3
    22. AI Post Transformers: KV Cache TTL for Multi-Turn Agent Scheduling — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-09-kv-cache-ttl-for-multi-turn-agent-schedu-996bf1.mp3
    23. AI Post Transformers: CacheFlow and 3D-Parallel KV Cache Restoration — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-01-cacheflow-and-3d-parallel-kv-cache-resto-8db883.mp3
    24. AI Post Transformers: ContiguousKV for Faster LLM Prefill KV Reuse — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-20-contiguouskv-for-faster-llm-prefill-kv-r-59f545.mp3
    25. AI Post Transformers: Breaking the Prefix Barrier with Shared KV Cache — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-24-breaking-the-prefix-barrier-with-shared-a5e5a6.mp3
    26. AI Post Transformers: TokenDance for Multi-Agent KV Cache Sharing — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-22-tokendance-for-multi-agent-kv-cache-shar-aa9b99.mp3
    27. AI Post Transformers: RetrievalAttention for Long-Context LLM Inference — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-17-retrievalattention-for-long-context-llm-ddf566.mp3
    28. AI Post Transformers: KVSwap for Disk-Aware Long-Context On-Device Inference — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-16-kvswap-for-disk-aware-long-context-on-de-f3c15e.mp3
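
    To make the paper's thesis concrete, here is one illustrative instance in Python: replacing a shortest-queue heuristic with an explicit optimization. We route requests with known prefill costs across replicas to minimize the makespan, using the standard linear program with an auxiliary variable t. The costs and replica count are made up; this example is ours, not the paper's.

      # Heuristic routing vs. an explicit LP formulation (scipy required).
      import numpy as np
      from scipy.optimize import linprog

      costs = np.array([9.0, 7.0, 4.0, 4.0, 3.0, 2.0, 2.0, 1.0])  # prefill cost/request
      R = 3                                                       # replicas
      N = len(costs)

      # Variables: x[i, r] = fraction of request i routed to replica r, plus t.
      # Minimize t  s.t.  sum_r x[i, r] = 1   and   sum_i c_i x[i, r] <= t.
      c = np.zeros(N * R + 1); c[-1] = 1.0
      A_eq = np.zeros((N, N * R + 1))
      for i in range(N):
          A_eq[i, i * R:(i + 1) * R] = 1.0
      b_eq = np.ones(N)
      A_ub = np.zeros((R, N * R + 1))
      for r in range(R):
          for i in range(N):
              A_ub[r, i * R + r] = costs[i]
          A_ub[r, -1] = -1.0
      b_ub = np.zeros(R)
      res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                    bounds=[(0, None)] * (N * R + 1))
      print("LP makespan bound:", res.x[-1])   # total/R = 32/3 ~= 10.67

      # Greedy shortest-queue baseline for comparison (FIFO arrival order):
      loads = np.zeros(R)
      for cost in costs:
          loads[loads.argmin()] += cost
      print("shortest-queue makespan:", loads.max())   # 11.0 on this instance

    Even on eight requests the heuristic leaves a gap (11 vs. a 10.67 lower bound), which is the paper's point in miniature: the objective is explicit, so the slack is measurable instead of invisible.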

  6. 1 day ago

    Backpropagation Through Time Explained

    This episode explores Paul Werbos’s 1990 paper on Backpropagation Through Time and explains how ordinary backpropagation extends to systems whose state evolves over time. It walks through the core idea of unrolling a recurrent or dynamic system into a time-indexed computation graph, then applying reverse-mode differentiation to compute exact gradients across both layers and time steps. The discussion also places BPTT in historical context, connecting it to earlier work on backpropagation, automatic differentiation, and alternative recurrent learning methods like real-time recurrent learning. Listeners would find it interesting because it shows how a foundational training method for sequence models, control systems, and differentiable simulations emerged from a simple but powerful reframing of memory and time in neural computation. A minimal numpy implementation follows the source list.

    Sources:
    1. Backpropagation Through Time: What It Does and How to Do It — Paul J. Werbos, 1990 (uploaded PDF). https://podcast.do-not-panic.com/uploaded-pdfs/2026-05-08T22-17-15-230Z-Backpropagation-through-time-what-it-does-and-how-to-do-it.pdf
    2. Learning representations by back-propagating errors — David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams, 1986. https://scholar.google.com/scholar?q=Learning+representations+by+back-propagating+errors
    3. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks — Ronald J. Williams, David Zipser, 1989. https://scholar.google.com/scholar?q=A+Learning+Algorithm+for+Continually+Running+Fully+Recurrent+Neural+Networks
    4. Backpropagation Through Time: What It Does and How to Do It — Paul J. Werbos, 1990. https://scholar.google.com/scholar?q=Backpropagation+Through+Time:+What+It+Does+and+How+to+Do+It
    5. Learning long-term dependencies with gradient descent is difficult — Yoshua Bengio, Patrice Simard, Paolo Frasconi, 1994. https://scholar.google.com/scholar?q=Learning+long-term+dependencies+with+gradient+descent+is+difficult
    6. Long Short-Term Memory — Sepp Hochreiter, Jürgen Schmidhuber, 1997. https://scholar.google.com/scholar?q=Long+Short-Term+Memory
    7. Taylor expansion of the accumulated rounding error — Seppo Linnainmaa, 1976. https://scholar.google.com/scholar?q=Taylor+expansion+of+the+accumulated+rounding+error
    8. Fast Exact Multiplication by the Hessian — Barak A. Pearlmutter, 1994. https://scholar.google.com/scholar?q=Fast+Exact+Multiplication+by+the+Hessian
    9. Automatic Differentiation in Machine Learning: a Survey — Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind, 2018. https://scholar.google.com/scholar?q=Automatic+Differentiation+in+Machine+Learning:+a+Survey
    10. A review of automatic differentiation and its efficient implementation — Charles C. Margossian, 2019. https://scholar.google.com/scholar?q=A+review+of+automatic+differentiation+and+its+efficient+implementation
    11. Generalization of Back-Propagation to Recurrent Neural Networks — Fernando J. Pineda, 1987. https://scholar.google.com/scholar?q=Generalization+of+Back-Propagation+to+Recurrent+Neural+Networks
    12. Finding Structure in Time — Jeffrey L. Elman, 1990. https://scholar.google.com/scholar?q=Finding+Structure+in+Time
    13. BP(lambda): Online Learning via Synthetic Gradients — authors unclear from source, recent. https://scholar.google.com/scholar?q=BP(lambda):+Online+Learning+via+Synthetic+Gradients
    14. Streaming Propagation Through Time: A New Computational Paradigm for Recurrent Neural Networks — approx. modern recurrent-learning authors, recent. https://scholar.google.com/scholar?q=Streaming+Propagation+Through+Time:+A+New+Computational+Paradigm+for+Recurrent+Neural+Networks
    15. Combining Truncated BPTT and Truncated RTRL for LSTM Training — Jakob Stefan Weber, recent. https://scholar.google.com/scholar?q=Combining+Truncated+BPTT+and+Truncated+RTRL+for+LSTM+Training
    16. Second-order forward-mode optimization of recurrent neural networks for neuroscience — approx. modern neuroscience/optimization authors, recent. https://scholar.google.com/scholar?q=Second-order+forward-mode+optimization+of+recurrent+neural+networks+for+neuroscience
    17. Sample-Based Hybrid Mode Control: Asymptotically Optimal Switching of Algorithmic and Non-Differentiable Control Modes — approx. modern control authors, recent. https://scholar.google.com/scholar?q=Sample-Based+Hybrid+Mode+Control:+Asymptotically+Optimal+Switching+of+Algorithmic+and+Non-Differentiable+Control+Modes
    18. On the differentiability of the value function of switched linear systems under arbitrary and controlled switching — approx. control theory authors, recent. https://scholar.google.com/scholar?q=On+the+differentiability+of+the+value+function+of+switched+linear+systems+under+arbitrary+and+controlled+switching
    19. AI Post Transformers: Long Short-Term Memory and Vanishing Gradients — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-19-long-short-term-memory-and-vanishing-gra-72448c.mp3
    20. AI Post Transformers: When Spectral Gradient Updates Help Deep Learning — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-when-spectral-gradient-updates-help-deep-9c8441.mp3
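
    A minimal numpy implementation of the unroll-then-reverse idea: a vanilla tanh RNN is unrolled for T steps, and reverse-mode differentiation accumulates exact gradients across both the output layer and every time step. The shapes and the toy task (predicting the mean of the inputs from the final state) are our own illustrative choices.

      # Vanilla-RNN BPTT from scratch; toy regression task.
      import numpy as np

      rng = np.random.default_rng(0)
      T, n_in, n_h = 10, 3, 16
      Wx = rng.standard_normal((n_h, n_in)) * 0.3
      Wh = rng.standard_normal((n_h, n_h)) * 0.3
      Wo = rng.standard_normal((1, n_h)) * 0.3

      def forward(xs):
          hs = [np.zeros(n_h)]                     # h_0 = 0
          for t in range(T):                       # unroll the recurrence
              hs.append(np.tanh(Wx @ xs[t] + Wh @ hs[-1]))
          return hs, Wo @ hs[-1]

      def bptt(xs, target):
          hs, y = forward(xs)
          dy = y - target                          # dL/dy for L = 0.5 (y - target)^2
          dWo = np.outer(dy, hs[-1])
          dh = Wo.T @ dy                           # gradient entering the last state
          dWx, dWh = np.zeros_like(Wx), np.zeros_like(Wh)
          for t in reversed(range(T)):             # walk the unrolled graph backwards
              dpre = dh * (1.0 - hs[t + 1] ** 2)   # through the tanh nonlinearity
              dWx += np.outer(dpre, xs[t])         # same weights, every time step
              dWh += np.outer(dpre, hs[t])
              dh = Wh.T @ dpre                     # hand gradient to the prior step
          return dWx, dWh, dWo, 0.5 * float((y - target) ** 2)

      lr = 0.05
      for step in range(500):
          xs = rng.standard_normal((T, n_in))
          target = xs.mean()
          dWx, dWh, dWo, loss = bptt(xs, target)
          Wx -= lr * dWx; Wh -= lr * dWh; Wo -= lr * dWo
      print("final loss:", loss)

    The two lines that define BPTT are the gradient accumulation into dWx/dWh at every step (the weights are shared across time) and the hand-off dh = Wh.T @ dpre, whose repeated application is exactly where vanishing and exploding gradients come from.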

  7. 1 day ago

    Can Models Learn from Long Context?

    This episode explores CL-bench, a benchmark designed to test whether language models can actually learn task-specific knowledge from long, messy context and then reason with it, rather than merely retrieving facts or mimicking examples. It explains the distinction between long-context understanding, in-context learning, and the stronger notion of context learning, using examples like legal codes, product manuals, and experimental notebooks to show what real-world adaptation demands. The discussion highlights how the benchmark’s 500 contexts, 1,899 tasks, and dense binary verification rubrics are built to stress models on rule-following, procedural reasoning, and inferring governing relationships from data. Listeners would find it interesting because it gets at a central question in modern AI: whether bigger context windows actually make systems more capable, or just better at holding more text without truly learning from it. A toy rubric-scoring harness follows the source list.

    Sources:
    1. CL-bench: A Benchmark for Context Learning — Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, Huaibing Xie, Jianglu Hu, Shaolei Wang, Weichao Wang, Yanling Xiao, Yiting Liu, Zenan Xu, Zhen Guo, Pluto Zhou, Tao Gui, Zuxuan Wu, Xipeng Qiu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Di Wang, Shunyu Yao, 2026. http://arxiv.org/abs/2602.03587
    2. Language Models are Few-Shot Learners — Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan and others, 2020. https://scholar.google.com/scholar?q=Language+Models+are+Few-Shot+Learners
    3. MetaICL: Learning to Learn In Context — Sewon Min, Mike Lewis, Luke Zettlemoyer, Hannaneh Hajishirzi, 2021. https://scholar.google.com/scholar?q=MetaICL:+Learning+to+Learn+In+Context
    4. Transformers learn in-context by gradient descent — Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov, 2022. https://scholar.google.com/scholar?q=Transformers+learn+in-context+by+gradient+descent
    5. Lost in the Middle: How Language Models Use Long Contexts — Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, 2023. https://scholar.google.com/scholar?q=Lost+in+the+Middle:+How+Language+Models+Use+Long+Contexts
    6. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding — Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 2023. https://scholar.google.com/scholar?q=LongBench:+A+Bilingual,+Multitask+Benchmark+for+Long+Context+Understanding
    7. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack — Yurii Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev, 2024. https://scholar.google.com/scholar?q=BABILong:+Testing+the+Limits+of+LLMs+with+Long+Context+Reasoning-in-a-Haystack
    8. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks — Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li, 2024. https://scholar.google.com/scholar?q=LongBench+v2:+Towards+Deeper+Understanding+and+Reasoning+on+Realistic+Long-context+Multitasks
    9. NoLiMa: Long-Context Evaluation Beyond Literal Matching — Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, David Seunghyun Yoon, Hinrich Schütze, 2025. https://scholar.google.com/scholar?q=NoLiMa:+Long-Context+Evaluation+Beyond+Literal+Matching
    10. LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion — Zhan Ling et al., 2025. https://scholar.google.com/scholar?q=LongReason:+A+Synthetic+Long-Context+Reasoning+Benchmark+via+Context+Expansion
    11. DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities — Tianyi Zhuang et al., 2025. https://scholar.google.com/scholar?q=DocPuzzle:+A+Process-Aware+Benchmark+for+Evaluating+Realistic+Long-Context+Reasoning+Capabilities
    12. In-Context Learning Creates Task Vectors — Roee Hendel, Mor Geva, Amir Globerson, 2023. https://scholar.google.com/scholar?q=In-Context+Learning+Creates+Task+Vectors
    13. In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering — Sheng Liu, Haotian Ye, Lei Xing, James Zou, 2024. https://scholar.google.com/scholar?q=In-context+Vectors:+Making+In+Context+Learning+More+Effective+and+Controllable+Through+Latent+Space+Steering
    14. Task Vectors in In-Context Learning: Emergence, Formation, and Benefit — Liu Yang, Ziqian Lin, Kangwook Lee, Dimitris Papailiopoulos, Robert Nowak, 2025. https://scholar.google.com/scholar?q=Task+Vectors+in+In-Context+Learning:+Emergence,+Formation,+and+Benefit
    15. Learn to Memorize: Scalable Continual Learning in Semiparametric Models with Mixture-of-Neighbors Induction Memory — Guangyue Peng, Tao Ge, Wen Luo, Wei Li, Houfeng Wang, 2025. https://scholar.google.com/scholar?q=Learn+to+Memorize:+Scalable+Continual+Learning+in+Semiparametric+Models+with+Mixture-of-Neighbors+Induction+Memory
    16. AI Post Transformers: How Induction Heads Emerge in Transformers — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-03-how-induction-heads-emerge-in-transforme-a7bfcb.mp3
    17. AI Post Transformers: Real Context Size and Context Rot — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-07-real-context-size-and-context-rot-56cbb4.mp3
    18. AI Post Transformers: DeepSeek-V4 and Practical Million-Token Context — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-25-deepseek-v4-and-practical-million-token-6f4de1.mp3
    19. AI Post Transformers: In-Place Test-Time Training for Transformers — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3
    20. AI Post Transformers: Training LLMs for Divide-and-Conquer Reasoning — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-04-training-llms-for-divide-and-conquer-rea-ea6e22.mp3
    21. AI Post Transformers: Inverse IFEval: Unlearning LLM Cognitive Inertia — Hal Turing & Dr. Ada Shannon, 2025. https://podcast.do-not-panic.com/episodes/inverse-ifeval-unlearning-llm-cognitive-inertia/
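
    A sketch of the dense binary-rubric scoring described above: every task carries a list of binary checks, and a response earns credit per check it passes. The check functions and the toy statute task are invented for illustration; CL-bench's actual rubrics are human-written and far richer.

      # Toy rubric-scoring harness; the task and checks are hypothetical.
      import re
      from dataclasses import dataclass
      from typing import Callable

      @dataclass
      class RubricCheck:
          description: str
          passes: Callable[[str], bool]    # binary verdict on the model's answer

      @dataclass
      class Task:
          prompt: str
          checks: list

      def score(task: Task, answer: str) -> float:
          """Fraction of binary rubric checks the answer satisfies."""
          return sum(c.passes(answer) for c in task.checks) / len(task.checks)

      # Toy task: the long context (not shown) contains a fictional statute
      # whose rules the model must apply, not merely retrieve.
      task = Task(
          prompt="Given the statute in the context, may a 17-year-old sign the lease?",
          checks=[
              RubricCheck("cites the age-of-majority clause",
                          lambda a: "section 4.2" in a.lower()),
              RubricCheck("reaches the correct verdict",
                          lambda a: re.search(r"\bmay not\b|\bcannot\b", a.lower()) is not None),
              RubricCheck("notes the guardian co-signature exception",
                          lambda a: "guardian" in a.lower()),
          ],
      )
      answer = "Under Section 4.2 the tenant cannot sign alone; a guardian must co-sign."
      print(f"rubric score: {score(task, answer):.2f}")   # 1.00

    Binary checks make partial credit legible: a model that retrieves the clause but botches the verdict scores 0.67, separating retrieval from the rule application the benchmark is actually after.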

  8. 1 day ago

    How Models Detect Hidden Activation Steering

    This episode explores a mechanistic interpretability study asking whether a language model can detect when a concept has been injected into its hidden activations and, in some cases, identify what that concept was. It explains the difference between detection and identification, walks through activation steering in the residual stream, and highlights the paper’s controlled experiments on Gemma3-27B across 500 concepts, including a strong result of moderate detection with zero false positives under several prompt styles. The discussion also focuses on the paper’s argument that this reporting behavior emerges mainly during post-training, especially preference optimization, rather than from pretraining alone. Listeners would find it interesting because it turns a provocative claim about model “introspection” into a concrete circuit-level question about what internal features and gates may be doing. A toy detection experiment follows the source list.

    Sources:
    1. Mechanisms of Introspective Awareness — Uzay Macar, Li Yang, Atticus Wang, Peter Wallich, Emmanuel Ameisen, Jack Lindsey, 2026. http://arxiv.org/abs/2603.21396
    2. Emergent Introspective Awareness in Large Language Models — Jack Lindsey, 2025. https://scholar.google.com/scholar?q=Emergent+Introspective+Awareness+in+Large+Language+Models
    3. Looking Inward: Language Models Can Learn About Themselves by Introspection — Felix J. Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans, 2024. https://scholar.google.com/scholar?q=Looking+Inward:+Language+Models+Can+Learn+About+Themselves+by+Introspection
    4. Activation Addition: Steering Language Models Without Optimization — Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, Monte MacDiarmid, Chris Olah, 2023. https://scholar.google.com/scholar?q=Activation+Addition:+Steering+Language+Models+Without+Optimization
    5. Representation Engineering: A Top-Down Approach to AI Transparency — Andy Zou, Long Phan, Sarah Chen, James Campbell, Richard Ngo, Adam Jermyn, Stephen McAleer, Alexander Tamkin, 2023. https://scholar.google.com/scholar?q=Representation+Engineering:+A+Top-Down+Approach+to+AI+Transparency
    6. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet — Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, and others, 2024. https://scholar.google.com/scholar?q=Scaling+Monosemanticity:+Extracting+Interpretable+Features+from+Claude+3+Sonnet
    7. Circuit Tracing: Revealing Computational Graphs in Language Models — Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, and others, 2025. https://scholar.google.com/scholar?q=Circuit+Tracing:+Revealing+Computational+Graphs+in+Language+Models
    8. Steering Vector Fields for Context-Aware Inference-Time Control in Large Language Models — authors unclear from source, 2025/2026. https://scholar.google.com/scholar?q=Steering+Vector+Fields+for+Context-Aware+Inference-Time+Control+in+Large+Language+Models
    9. No Training Wheels: Steering Vectors for Bias Correction at Inference Time — authors unclear from source, 2025/2026. https://scholar.google.com/scholar?q=No+Training+Wheels:+Steering+Vectors+for+Bias+Correction+at+Inference+Time
    10. A mechanistic understanding of alignment algorithms: A case study on DPO and toxicity — authors unclear from source, 2025/2026. https://scholar.google.com/scholar?q=A+mechanistic+understanding+of+alignment+algorithms:+A+case+study+on+DPO+and+toxicity
    11. How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis — authors unclear from source, 2025/2026. https://scholar.google.com/scholar?q=How+Does+DPO+Reduce+Toxicity?+A+Mechanistic+Neuron-Level+Analysis
    12. Refusal in language models is mediated by a single direction — authors unclear from source, 2025/2026. https://scholar.google.com/scholar?q=Refusal+in+language+models+is+mediated+by+a+single+direction
    13. Beyond I'm Sorry, I Can't: Dissecting Large-Language-Model Refusal — authors unclear from source, 2025/2026. https://scholar.google.com/scholar?q=Beyond+I'm+Sorry,+I+Can't:+Dissecting+Large-Language-Model+Refusal
    14. Surgical, cheap, and flexible: Mitigating false refusal in language models via single vector ablation — authors unclear from source, 2025/2026. https://scholar.google.com/scholar?q=Surgical,+cheap,+and+flexible:+Mitigating+false+refusal+in+language+models+via+single+vector+ablation
    15. Residual Stream Analysis with Multi-Layer SAEs — authors unclear from source, 2025/2026. https://scholar.google.com/scholar?q=Residual+stream+analysis+with+multi-layer+saes
    16. AI Post Transformers: Anthropic: Introspective Awareness in LLMs — Hal Turing & Dr. Ada Shannon, 2025. https://podcast.do-not-panic.com/episodes/anthropic-introspective-awareness-in-llms/
    17. AI Post Transformers: Neural Chameleons and Evading Activation Monitors — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-14-neural-chameleons-and-evading-activation-bc470e.mp3
    18. AI Post Transformers: Advancing Mechanistic Interpretability with Sparse Autoencoders — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/advancing-mechanistic-interpretability-with-sparse-autoencoders/
    19. AI Post Transformers: How Induction Heads Emerge in Transformers — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-03-how-induction-heads-emerge-in-transforme-a7bfcb.mp3
    20. AI Post Transformers: Self-Improving Pretraining With Post-Trained Models — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-02-self-improving-pretraining-with-post-tra-e37460.mp3
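
    A toy, numpy-only illustration of the injection-detection setup described above. The real experiments hook a transformer's residual stream (Gemma3-27B in the paper); here we substitute synthetic Gaussian activations and a linear-probe detector so the sketch stays self-contained. The dimension, injection strength, and trial counts are all made-up parameters.

      # Toy concept-injection detection with zero-false-positive calibration.
      import numpy as np

      rng = np.random.default_rng(0)
      d, n = 512, 2000
      concepts = rng.standard_normal((500, d))           # 500 concept directions
      concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

      def residual_activation(steered: bool, alpha: float = 6.0):
          h = rng.standard_normal(d)                     # baseline residual stream
          if steered:                                    # activation addition
              h = h + alpha * concepts[rng.integers(500)]
          return h

      # Detection: flag an injection when the best concept-direction match
      # exceeds a threshold calibrated for zero false positives on clean runs.
      clean_scores = [np.abs(concepts @ residual_activation(False)).max()
                      for _ in range(n)]
      threshold = max(clean_scores)                      # zero-FP calibration
      hits = sum(np.abs(concepts @ residual_activation(True)).max() > threshold
                 for _ in range(n))
      print(f"detection rate at zero false positives: {hits / n:.2f}")

    Even this crude probe lands in "moderate detection, zero false positives" territory, which makes the paper's real question sharper: the interesting finding is not that injections are detectable in principle, but that the model itself reports them, and that this reporting ability appears mainly after preference optimization.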
