AI Post Transformers

mcgrof

3.7 (3)
Technology
Daily

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

1 day ago

Information-Aware KV Cache Compression for Long Reasoning

This episode explores Information-Aware KV Cache Compression for Long Reasoning, a paper about making long-context inference cheaper and more reliable by deciding which KV-cache tokens to keep during extended reasoning. It explains why long prefilling and long decoding turn the cache into a major memory bottleneck, and why common heuristics such as sliding windows or recent-attention-based retention can discard tokens that only become important much later. The discussion centers on the paper’s claim that future usefulness is better captured by information-theoretic signals like predictive entropy and Forward Influence, with experiments showing that attention-ranked tokens help short-horizon predictions while entropy-ranked tokens matter more over long horizons. Listeners get a concrete account of how InfoKV blends recent attention with per-layer entropy-based scoring to improve the tradeoff between memory savings and long-range reasoning quality. Sources: 1. Information-Aware KV Cache Compression for Long Reasoning — Jushi Kai, Zhuiri Xiao, Alexandra Birch, Zhouhan Lin, 2026 http://arxiv.org/abs/2606.26875 2. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time — Zichang Liu, Aditya Desai, Fangshuo Liao, Anshumali Shrivastava, et al., 2023 https://scholar.google.com/scholar?q=Scissorhands:+Exploiting+the+Persistence+of+Importance+Hypothesis+for+LLM+KV+Cache+Compression+at+Test+Time 3. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Beidi Chen, Christopher Re, et al., 2023 https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models 4. SnapKV: LLM Knows What You are Looking for Before Generation — Yuhong Li, Yingbing Huang, Bowen Yang, Patrick Lewis, Deming Chen, et al., 2024 https://scholar.google.com/scholar?q=SnapKV:+LLM+Knows+What+You+are+Looking+for+Before+Generation 5. 
Information-Aware KV Cache Compression for Long Reasoning — Jushi Kai, Zhuiri Xiao, Alexandra Birch, Zhouhan Lin, 2026 https://scholar.google.com/scholar?q=Information-Aware+KV+Cache+Compression+for+Long+Reasoning 6. Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution — Alessio Devoto, Maximilian Jeblick, Simon Jegou, 2025 https://scholar.google.com/scholar?q=Expected+Attention:+KV+Cache+Compression+by+Estimating+Attention+from+Future+Queries+Distribution 7. Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning — Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim, 2025 https://scholar.google.com/scholar?q=Reasoning+Path+Compression:+Compressing+Generation+Trajectories+for+Efficient+LLM+Reasoning 8. Compressing Context to Enhance Inference Efficiency of Large Language Models — Yucheng Li, Bo Dong, Chenghua Lin, Frank Guerin, 2023 https://scholar.google.com/scholar?q=Compressing+Context+to+Enhance+Inference+Efficiency+of+Large+Language+Models 9. FreqKV: Key-Value Compression in Frequency Domain for Context Window Extension — Jushi Kai et al., 2026 https://scholar.google.com/scholar?q=FreqKV:+Key-Value+Compression+in+Frequency+Domain+for+Context+Window+Extension 10. LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion — Zhan Ling et al., 2025 https://scholar.google.com/scholar?q=LongReason:+A+Synthetic+Long-Context+Reasoning+Benchmark+via+Context+Expansion 11. Attention Reveals More Than Tokens: Training-Free Long-Context Reasoning with Attention-guided Retrieval — Yuwei Zhang et al., 2025 https://scholar.google.com/scholar?q=Attention+Reveals+More+Than+Tokens:+Training-Free+Long-Context+Reasoning+with+Attention-guided+Retrieval 12. 
Uncertainty Quantification for Hallucination Detection in Large Language Models: Foundations, Methodology, and Future Directions — Sungmin Kang et al., 2025 https://scholar.google.com/scholar?q=Uncertainty+Quantification+for+Hallucination+Detection+in+Large+Language+Models:+Foundations,+Methodology,+and+Future+Directions 13. Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations — Christian Tomani et al., 2024 https://scholar.google.com/scholar?q=Uncertainty-Based+Abstention+in+LLMs+Improves+Safety+and+Reduces+Hallucinations 14. KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction — Jang-Hyun Kim et al., 2025 https://scholar.google.com/scholar?q=KVzip:+Query-Agnostic+KV+Cache+Compression+with+Context+Reconstruction 15. Can LLMs Maintain Fundamental Abilities under KV Cache Compression? — Xiang Liu et al., 2025 https://scholar.google.com/scholar?q=Can+LLMs+Maintain+Fundamental+Abilities+under+KV+Cache+Compression? 16. KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches — Jiayi Yuan et al., 2024 https://scholar.google.com/scholar?q=KV+Cache+Compression,+But+What+Must+We+Give+in+Return?+A+Comprehensive+Benchmark+of+Long+Context+Capable+Approaches 17. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang et al., 2025 https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse 18. AI Post Transformers: IndexMem: Learned KV-Cache Eviction for Long-Context LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-12-indexmem-learned-kv-cache-eviction-for-l-132c2a.mp3 19. AI Post Transformers: When Quantization Hurts Reasoning Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-17-when-quantization-hurts-reasoning-models-eca9e7.mp3 20. 
AI Post Transformers: Hyper-Scaling LLM Inference with KV Cache Compression — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/hyper-scaling-llm-inference-with-kv-cache-compression/ 21. AI Post Transformers: Lattice: Fixed-Slot Compression for Transformer Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-11-lattice-fixed-slot-compression-for-trans-5509ea.mp3 22. AI Post Transformers: Adaptive Compression Techniques for Efficient LLM Inference — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/adaptive-compression-techniques-for-efficient-llm-inference/ 23. AI Post Transformers: Explicit Information Transmission for Context Compression — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-explicit-information-transmission-for-co-24e3c2.mp3 24. AI Post Transformers: When LoRA Helps Under KV Cache Compression — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-12-when-lora-helps-under-kv-cache-compressi-76dda6.mp3
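
The retention idea discussed in this episode, mixing a recent-attention signal with an entropy signal when deciding which cached tokens to keep, can be illustrated with a toy scorer. This is a minimal sketch and not the paper's InfoKV algorithm: the function names, the min-max normalization, the mixing weight `alpha`, and the use of one entropy value per token are all assumptions made for illustration.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def _normalize(values):
    """Min-max scale a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_tokens_to_keep(recent_attention, entropies, budget, alpha=0.5):
    """Return sorted indices of the `budget` highest-scoring cached tokens.

    Hypothetical blend (the paper's actual scoring rule may differ):
    score = alpha * normalized attention + (1 - alpha) * normalized entropy
    """
    if len(recent_attention) != len(entropies):
        raise ValueError("need one attention value and one entropy per token")
    attn = _normalize(recent_attention)
    ent = _normalize(entropies)
    scores = [alpha * a + (1.0 - alpha) * e for a, e in zip(attn, ent)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])
```

Setting `alpha=1.0` reduces the sketch to pure attention-based retention, and `alpha=0.0` to pure entropy-based retention, which makes it easy to compare the two extremes on the same inputs.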
1 day ago

JETSPEC and Parallel Tree Speculative Decoding

This episode explores JETSPEC, a 2026 inference paper on speculative decoding that asks whether a language model can draft an entire tree of future tokens in parallel while preserving causal consistency and actually reducing latency on long generations. It explains why autoregressive decoding remains a serving bottleneck for long proofs, code completions, and assistant replies, even when the underlying transformer model itself is unchanged. The discussion compares JetSpec’s approach with Medusa, EAGLE-3, and DFlash, focusing on the central tradeoff between stronger path-conditioned drafts that are slow to produce and cheaper parallel drafts that risk internally inconsistent branches. Listeners would find it interesting because it turns a very practical systems problem, why powerful GPUs still feel slow at inference time, into a concrete debate about the next generation of real-world decoding optimizations. Sources: 1. JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting — Lanxiang Hu, Zhaoxiang Feng, Yulun Wu, Haoran Yuan, Yujie Zhao, Yu-Yang Qian, Bojun Wang, Peng Zhao, Daxin Jiang, Yibo Zhu, Tajana Rosing, Hao Zhang, 2026 http://arxiv.org/abs/2606.18394 2. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023 https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding 3. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao, 2024 https://scholar.google.com/scholar?q=Medusa:+Simple+LLM+Inference+Acceleration+Framework+with+Multiple+Decoding+Heads 4. EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024 https://scholar.google.com/scholar?q=EAGLE-2:+Faster+Inference+of+Language+Models+with+Dynamic+Draft+Trees 5. 
DFlash: Block Diffusion for Flash Speculative Decoding — Jian Chen, Yesheng Liang, Zhijian Liu, 2026 https://scholar.google.com/scholar?q=DFlash:+Block+Diffusion+for+Flash+Speculative+Decoding 6. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2025 https://scholar.google.com/scholar?q=EAGLE-3:+Scaling+up+Inference+Acceleration+of+Large+Language+Models+via+Training-Time+Test 7. SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification — Xupeng Miao et al., 2023 https://scholar.google.com/scholar?q=SpecInfer:+Accelerating+Generative+Large+Language+Model+Serving+with+Tree-based+Speculative+Inference+and+Verification 8. DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding — Jiebin Zhang et al., 2026 https://scholar.google.com/scholar?q=DFlare:+Scaling+Up+Draft+Capacity+for+Block+Diffusion+Speculative+Decoding 9. TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification — Haoyun Jiang et al., 2026 https://scholar.google.com/scholar?q=TriSpec:+Ternary+Speculative+Decoding+via+Lightweight+Proxy+Verification 10. ParallelSpec: Parallel Drafter for Efficient Speculative Decoding — Zilin Xiao et al., 2024 https://arxiv.org/abs/2410.05589 11. Mamba Drafters for Speculative Decoding — Daewon Choi et al., 2025 https://arxiv.org/abs/2506.01206 12. OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding — Ramchalam Kinattinkara Ramakrishnan et al., 2025 https://arxiv.org/abs/2507.02659 13. Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge — Bin Xiao et al., 2024 https://arxiv.org/abs/2405.00263 14. Make Every Draft Count: Hidden State based Speculative Decoding — Yuetao Chen et al., 2026 https://arxiv.org/abs/2602.21224 15. When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding? 
— Tianyu Liu et al., 2026 https://arxiv.org/abs/2604.26412 16. MoE-Spec: Expert Budgeting for Efficient Speculative Decoding — Bradley McDanel et al., 2026 https://arxiv.org/abs/2602.16052 17. Beat the long tail: Distribution-Aware Speculative Decoding for RL Training — Zelei Shao et al., 2025 https://arxiv.org/abs/2511.13841 18. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3 19. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3 20. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3 Interactive Visualization: JETSPEC and Parallel Tree Speculative Decoding
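
For readers new to tree-based speculative decoding, the verification step can be sketched in a few lines. This is a generic greedy-verification sketch, not JetSpec's drafting or verification procedure: the nested-dict tree layout and the `target_next_token` callable are assumptions for illustration, and a real system would score every tree node in a single batched forward pass rather than calling the target model once per step.

```python
def verify_draft_tree(prefix, draft_tree, target_next_token):
    """Greedy verification of a drafted token tree.

    prefix: list of tokens generated so far.
    draft_tree: nested dict {token: subtree} of drafted continuations.
    target_next_token: callable(list_of_tokens) -> the target model's
        greedy next token (hypothetical stand-in for a real model).

    Returns the tokens to append: the longest drafted path the target
    agrees with, followed by one token chosen by the target itself.
    """
    accepted = []
    node = draft_tree
    while True:
        wanted = target_next_token(prefix + accepted)
        accepted.append(wanted)  # the target's own choice is always valid output
        if wanted not in node:
            return accepted      # no drafted branch matches, so stop here
        node = node[wanted]      # follow the branch the target agreed with
```

Because every emitted token is the target's own greedy choice, the output matches plain greedy decoding; the draft tree only determines how many tokens are confirmed per verification round.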
2 days ago

DAK: Direct GPU Memory Offloading for LLMs

This episode explores DAK, a Cornell systems paper arguing that LLM inference on tiered-memory machines can be faster when offloaded weights and KV-cache blocks are fetched directly into on-chip shared memory instead of being prefetched and staged through GPU HBM. It breaks down the tradeoffs among HBM capacity, HBM bandwidth, KV-cache growth during decoding, and prior approaches such as FlexGen, vLLM’s PagedAttention, and emerging KV offload systems like LMCache. The discussion focuses on DAK’s core technical idea: using Hopper’s Tensor Memory Accelerator inside custom GEMM and FlashAttention kernels so data movement and computation are co-designed, reducing bounce buffers, HBM contention, and pipeline bubbles while aggregating bandwidth from multiple memory tiers. Listeners would find it interesting because it turns a low-level memory-path decision into a concrete argument about when offloading is merely a fallback and when it becomes a real performance advantage for serving larger models, longer contexts, or bigger batches. Sources: 1. DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference — Shouxu Lin, Zhiyuan Guo, Jiaxin Lin, 2026 http://arxiv.org/abs/2604.26074 2. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning — Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He, 2021 https://scholar.google.com/scholar?q=ZeRO-Infinity:+Breaking+the+GPU+Memory+Wall+for+Extreme+Scale+Deep+Learning 3. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Ion Stoica, Percy Liang, Ce Zhang, and colleagues, 2023 https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU 4. 
Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 5. DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference — Shouxu Lin, Zhiyuan Guo, Jiaxin Lin, 2026 https://scholar.google.com/scholar?q=DAK:+Direct-Access-Enabled+GPU+Memory+Offloading+with+Optimal+Efficiency+for+LLM+Inference 6. PIE: Pooling CPU Memory for LLM Inference — Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, Ion Stoica, 2024 https://scholar.google.com/scholar?q=PIE:+Pooling+CPU+Memory+for+LLM+Inference 7. NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference — Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu, 2025 https://scholar.google.com/scholar?q=NEO:+Saving+GPU+Memory+Crisis+with+CPU+Offloading+for+Online+LLM+Inference 8. Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip — Luigi Fusco, Mikhail Khalilov, Marcin Chrapek, Giridhar Chukkapalli, Thomas Schulthess, Torsten Hoefler, 2024 https://scholar.google.com/scholar?q=Understanding+Data+Movement+in+Tightly+Coupled+Heterogeneous+Systems:+A+Case+Study+with+the+Grace+Hopper+Superchip 9. FengHuang: Next-Generation Memory Orchestration for AI Inferencing — Jiamin Li, Lei Qu, Tao Zhang, Grigory Chirkov, Shuotao Xu, Peng Cheng, Lidong Zhou, 2025 https://scholar.google.com/scholar?q=FengHuang:+Next-Generation+Memory+Orchestration+for+AI+Inferencing 10. Reducing Transformer Key-Value Cache Size with Cross-Layer Attention — William Brandon et al., 2024 https://scholar.google.com/scholar?q=Reducing+Transformer+Key-Value+Cache+Size+with+Cross-Layer+Attention 11. 
xKV: Cross-Layer SVD for KV-Cache Compression — Chi-Chih Chang et al., 2025 https://scholar.google.com/scholar?q=xKV:+Cross-Layer+SVD+for+KV-Cache+Compression 12. XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression — Haoqi Yang et al., 2025 https://scholar.google.com/scholar?q=XQuant:+Achieving+Ultra-Low+Bit+KV+Cache+Quantization+with+Cross-Layer+Compression 13. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management — Wonbeom Lee et al., 2024 https://scholar.google.com/scholar?q=InfiniGen:+Efficient+Generative+Inference+of+Large+Language+Models+with+Dynamic+KV+Cache+Management 14. Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching — Yanhao Dong et al., 2025 https://scholar.google.com/scholar?q=Accelerating+LLM+Inference+Throughput+via+Asynchronous+KV+Cache+Prefetching 15. KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference — Huan Yang et al., 2025 https://scholar.google.com/scholar?q=KVShare:+Semantic-Aware+Key-Value+Cache+Sharing+for+Efficient+Large+Language+Model+Inference 16. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Ji Lin et al., 2023 https://scholar.google.com/scholar?q=AWQ:+Activation-aware+Weight+Quantization+for+LLM+Compression+and+Acceleration 17. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression — Tim Dettmers et al., 2023 https://scholar.google.com/scholar?q=SpQR:+A+Sparse-Quantized+Representation+for+Near-Lossless+LLM+Weight+Compression 18. AI Post Transformers: Beluga: CXL Memory Pooling for LLM KV Cache — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-27-beluga-cxl-memory-pooling-for-llm-kv-cac-b6142f.mp3 19. AI Post Transformers: ScoutAttention for Efficient KV Cache Offloading — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-24-scoutattention-for-efficient-kv-cache-of-b26699.mp3 20. AI Post Transformers: InfiniGen for Efficient Long-Context LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-18-infinigen-for-efficient-long-context-llm-143d77.mp3 21. AI Post Transformers: Mooncake for KV Cache-Centric LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-05-mooncake-for-kv-cache-centric-llm-servin-1086d0.mp3 22. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3 23. AI Post Transformers: LAPS for Length-Aware LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3
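
The KV-cache growth mentioned in this episode follows from simple arithmetic for a standard attention layout: each token stores one key and one value vector per KV head in every layer. The sketch below assumes a hypothetical model configuration (32 layers, 8 KV heads, head width 128, 2-byte values); none of these numbers come from the DAK paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading 2 accounts for storing both a key and a value vector.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical configuration: 32 layers, 8 KV heads of width 128, fp16 values.
for seq_len in (1024, 8192, 32768):
    gib = kv_cache_bytes(32, 8, 128, seq_len) / 2**30
    print(f"{seq_len:>6} tokens -> {gib:.3f} GiB")
```

With these assumed numbers each cached token costs 128 KiB, so a single 32,768-token sequence needs 4 GiB, and the total scales linearly with batch size. That linear growth is why long contexts and large batches compete with model weights for HBM capacity.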
3 days ago

Prefix-Tuning for Efficient Text Generation

This episode explores the 2021 prefix-tuning paper and asks whether a large language model can be adapted to new generation tasks by learning a small continuous prompt while keeping the full model frozen. It explains where prefix tuning fits within parameter-efficient fine-tuning, contrasting it with full fine-tuning, adapters, ordinary prompting, in-context learning, AutoPrompt, and soft prompt tuning. The discussion highlights the paper’s two main evaluation settings, structured data-to-text generation on E2E, WebNLG, and DART with GPT-2, and abstractive summarization on XSUM with BART, while stressing that these are meaningfully different tests despite being grouped under one headline. It also digs into the core technical idea that the learned prefix acts as trainable internal state visible to attention throughout the network, making the method an early and elegant approach to low-storage task adaptation even if later methods like LoRA proved more practical. Sources: 1. Prefix-Tuning: Optimizing Continuous Prompts for Generation — Xiang Lisa Li, Percy Liang, 2021 http://arxiv.org/abs/2101.00190 2. Prefix-Tuning: Optimizing Continuous Prompts for Generation — Xiang Lisa Li; Percy Liang, 2021 https://scholar.google.com/scholar?q=Prefix-Tuning:+Optimizing+Continuous+Prompts+for+Generation 3. The Power of Scale for Parameter-Efficient Prompt Tuning — Brian Lester; Rami Al-Rfou; Noah Constant, 2021 https://scholar.google.com/scholar?q=The+Power+of+Scale+for+Parameter-Efficient+Prompt+Tuning 4. When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations — Aleksandar Petrov; Philip H. S. Torr; Adel Bibi, 2023 https://scholar.google.com/scholar?q=When+Do+Prompting+and+Prefix-Tuning+Work?+A+Theory+of+Capabilities+and+Limitations 5. LoRA: Low-Rank Adaptation of Large Language Models — Edward J. 
Hu; Yelong Shen; Phillip Wallis; Zeyuan Allen-Zhu; Yuanzhi Li; Shean Wang; Lu Wang; Weizhu Chen, 2021 https://scholar.google.com/scholar?q=LoRA:+Low-Rank+Adaptation+of+Large+Language+Models 6. Parameter-efficient Transfer Learning for NLP — Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly, 2019 https://scholar.google.com/scholar?q=Parameter-efficient+Transfer+Learning+for+NLP 7. Exploring Versatile Generative Language Model via Parameter-Efficient Transfer Learning — Zhaojiang Lin, Andrea Madotto, and Pascale Fung, 2020 https://scholar.google.com/scholar?q=Exploring+Versatile+Generative+Language+Model+via+Parameter-Efficient+Transfer+Learning 8. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts — Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh, 2020 https://scholar.google.com/scholar?q=AutoPrompt:+Eliciting+Knowledge+from+Language+Models+with+Automatically+Generated+Prompts 9. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning — Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta, 2020 https://scholar.google.com/scholar?q=Intrinsic+Dimensionality+Explains+the+Effectiveness+of+Language+Model+Fine-Tuning 10. Can Unconditional Language Models Recover Arbitrary Sentences? — Nishant Subramani, Samuel R. Bowman, and Kyunghyun Cho, 2020 https://scholar.google.com/scholar?q=Can+Unconditional+Language+Models+Recover+Arbitrary+Sentences? 11. Universality and Limitations of Prompt Tuning — Yihan Wang et al., 2023 https://arxiv.org/abs/2305.18787 12. Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency — Jerry Yao-Chieh Hu et al., 2024 https://arxiv.org/abs/2411.16525 13. Memory Limitations of Prompt Tuning in Transformers — Maxime Meyer et al., 2025 https://arxiv.org/abs/2509.00421 14. 
Parameter-Efficient Fine-Tuning for Medical Text Summarization: A Comparative Study of Lora, Prompt Tuning, and Full Fine-Tuning — Ulugbek Shernazarov et al., 2026 https://arxiv.org/abs/2603.21970 15. Task Singular Vectors: Reducing Task Interference in Model Merging — Antonio Andrea Gargiulo et al., 2024 https://arxiv.org/abs/2412.00081 16. Task Vector Quantization for Memory-Efficient Model Merging — Youngeun Kim et al., 2025 https://arxiv.org/abs/2503.06921 17. Last One Standing: A Comparative Analysis of Security and Privacy of Soft Prompt Tuning, LoRA, and In-Context Learning — Rui Wen et al., 2023 https://arxiv.org/abs/2310.11397 18. Progressive Prompts: Continual Learning for Language Models — Anastasia Razdaibiedina et al., 2023 https://arxiv.org/abs/2301.12314 19. AI Post Transformers: Benchmarking PEFT Techniques for Large Language Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-20-benchmarking-peft-techniques-for-large-l-41bbf5.mp3 20. AI Post Transformers: Learning to Reason with 13 Parameters — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-14-learning-to-reason-with-13-parameters-54c87f.mp3
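
The idea that a learned prefix acts as trainable state visible to attention can be sketched for a single attention head and a single query. This is a simplified illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, only one layer is shown, and training of the prefix vectors is omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def attend_with_prefix(query, keys, values, prefix_keys, prefix_values):
    """Same attention, but extra prefix key/value rows are visible to the query.

    In prefix tuning only the prefix rows would be trained; the model that
    produces `query`, `keys`, and `values` stays frozen.
    """
    return attend(query, prefix_keys + keys, prefix_values + values)
```

With empty prefix lists the function reduces to ordinary attention, which shows that the prefix changes what the frozen model attends to without changing any of its weights.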
3 days ago

RMSNorm: Simplifying Layer Normalization for Sequence Models

This episode explores the 2019 RMSNorm paper, which asks whether LayerNorm’s mean-subtraction step is actually necessary or whether controlling activation scale is the part that really stabilizes training. It explains how RMSNorm keeps LayerNorm’s rescaling behavior while dropping explicit centering, and how the paper’s pRMSNorm variant estimates the normalization term from only a small subset of features to reduce cost further. The discussion covers experiments in machine translation, image classification, image-caption retrieval, and question answering, where model quality stayed roughly comparable while reported runtime improved, with smaller gains in transformers and much larger ones in older RNN-based systems. Listeners would find it interesting because it turns a seemingly minor mathematical tweak into a broader argument about efficiency, optimization stability, and how much claimed speedups depend on the era and quality of the baseline implementation. Sources: 1. Root Mean Square Layer Normalization — Biao Zhang, Rico Sennrich, 2019 http://arxiv.org/abs/1910.07467 2. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift — Sergey Ioffe, Christian Szegedy, 2015 https://scholar.google.com/scholar?q=Batch+Normalization:+Accelerating+Deep+Network+Training+by+Reducing+Internal+Covariate+Shift 3. Layer Normalization — Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton, 2016 https://scholar.google.com/scholar?q=Layer+Normalization 4. Root Mean Square Layer Normalization — Biao Zhang, Rico Sennrich, 2019 https://scholar.google.com/scholar?q=Root+Mean+Square+Layer+Normalization 5. On Layer Normalization in the Transformer Architecture — Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu, 2020 https://scholar.google.com/scholar?q=On+Layer+Normalization+in+the+Transformer+Architecture 6. 
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks — Tim Salimans, Diederik P. Kingma, 2016 https://scholar.google.com/scholar?q=Weight+Normalization:+A+Simple+Reparameterization+to+Accelerate+Training+of+Deep+Neural+Networks 7. How Does Batch Normalization Help Optimization? — Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry, 2018 https://scholar.google.com/scholar?q=How+Does+Batch+Normalization+Help+Optimization? 8. Understanding Batch Normalization — Nils Bjorck, Carla P. Gomes, Bart Selman, Kilian Q. Weinberger, 2018 https://scholar.google.com/scholar?q=Understanding+Batch+Normalization 9. Norm Matters: Efficient and Accurate Normalization Schemes in Deep Networks — Elad Hoffer, Ron Banner, Itay Golan, Daniel Soudry, 2018 https://scholar.google.com/scholar?q=Norm+Matters:+Efficient+and+Accurate+Normalization+Schemes+in+Deep+Networks 10. Group Normalization — Yuxin Wu, Kaiming He, 2018 https://scholar.google.com/scholar?q=Group+Normalization 11. Residual Learning Without Normalization via Better Initialization — Hongyi Zhang, Yann N. Dauphin, Tengyu Ma, 2019 https://scholar.google.com/scholar?q=Residual+Learning+Without+Normalization+via+Better+Initialization 12. Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning — Bingchen Zhao et al., 2023 https://scholar.google.com/scholar?q=Tuning+LayerNorm+in+Attention:+Towards+Efficient+Multi-Modal+LLM+Finetuning 13. LayerNorm: A key component in parameter-efficient fine-tuning — Taha ValizadehAslani and Hualou Liang, 2024 https://scholar.google.com/scholar?q=LayerNorm:+A+key+component+in+parameter-efficient+fine-tuning 14. Efficiency in Focus: LayerNorm as a Catalyst for Fine-tuning Medical Visual Language Pre-trained Models — Jiawei Chen et al., 2024 https://scholar.google.com/scholar?q=Efficiency+in+Focus:+LayerNorm+as+a+Catalyst+for+Fine-tuning+Medical+Visual+Language+Pre-trained+Models 15. 
The Curse of Depth in Large Language Models — Wenfang Sun et al., 2025 https://scholar.google.com/scholar?q=The+Curse+of+Depth+in+Large+Language+Models 16. Just One Layer Norm Guarantees Stable Extrapolation — Juliusz Ziomek, George Whittle, Michael A. Osborne, 2025 https://scholar.google.com/scholar?q=Just+One+Layer+Norm+Guarantees+Stable+Extrapolation 17. Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers — Gavia Gray et al., 2024 https://scholar.google.com/scholar?q=Normalization+Layer+Per-Example+Gradients+are+Sufficient+to+Predict+Gradient+Noise+Scale+in+Transformers 18. AI Post Transformers: Keel: Post-LayerNorm Is Back: Stable, ExpressivE, and Deep — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/keel-post-layernorm-is-back-stable-expressive-and-deep/ 19. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3 20. AI Post Transformers: Long Short-Term Memory and Vanishing Gradients — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-19-long-short-term-memory-and-vanishing-gra-72448c.mp3
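
The difference between LayerNorm, RMSNorm, and the partial variant discussed in this episode is easiest to see side by side. The sketch below is a plain-Python illustration for one feature vector, assuming a small `eps` for numerical safety (a common implementation choice) and assuming the partial variant estimates the RMS from the first `ceil(n * p)` features; the function names are hypothetical.

```python
import math


def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm: subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gain, bias)]


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root mean square, with no centering and no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def partial_rms_norm(x, gain, p=0.25, eps=1e-6):
    """pRMSNorm-style sketch: estimate the RMS from the first ceil(n * p) features."""
    k = max(1, math.ceil(len(x) * p))
    rms = math.sqrt(sum(v * v for v in x[:k]) / k + eps)
    return [g * v / rms for v, g in zip(x, gain)]
```

For an input whose mean is already zero, `rms_norm` and `layer_norm` (with zero bias) give the same result, which is one way to see that RMSNorm keeps the rescaling and drops only the centering step.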
3 days ago

ReasonCACHE: Learning Reasoning Without Weight Updates

This episode explores ReasonCACHE, a method for improving multi-step reasoning in large language models by keeping the backbone frozen and training a compact per-layer key-value memory instead of updating billions of weights. It situates the paper against in-context learning, many-shot prompting, prefix tuning, LoRA, and context-distillation work, explaining how learned latent memory sits between raw prompting and full fine-tuning. The discussion centers on the paper’s real claim and its main point of skepticism: whether these learned caches actually teach a reusable reasoning procedure or mostly compress and elicit abilities the model already had. Listeners would find it interesting because it connects a concrete new method to a larger debate about how LLMs acquire reasoning skills, while also highlighting the practical payoff of avoiding huge prompts, quadratic attention costs, and brittle long-context setups.

Sources:

1. ReasonCACHE: Teaching LLMs To Reason Without Weight Updates — Sharut Gupta, Phillip Isola, Stefanie Jegelka, David Lopez-Paz, Kartik Ahuja, Mark Ibrahim, Mohammad Pezeshki, 2026 http://arxiv.org/abs/2602.02366
2. Prefix-Tuning: Optimizing Continuous Prompts for Generation — Xiang Lisa Li, Percy Liang, 2021 https://arxiv.org/abs/2101.00190
3. The Power of Scale for Parameter-Efficient Prompt Tuning — Brian Lester, Rami Al-Rfou, Noah Constant, 2021 https://arxiv.org/abs/2104.08691
4. P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks — Xiao Liu, Kaixuan Ji, Yicheng Fu, Zhengxiao Du, Zhilin Yang, Jie Tang, 2022 https://arxiv.org/abs/2110.07602
5. LoRA: Low-Rank Adaptation of Large Language Models — Edward J. Hu, Yelong Shen, Phillip Wallis, Weizhu Chen, et al., 2021 https://arxiv.org/abs/2106.09685
6. Adapting Language Models to Compress Contexts — Alexis Chevalier, Alexander Wettig, Anirudh Ajith, Danqi Chen, 2023 https://arxiv.org/abs/2305.14788
7. Learning to Compress Prompts with Gist Tokens — Jesse Mu, Xiang Lisa Li, Noah Goodman, 2023 https://arxiv.org/abs/2304.08467
8. Deliberation in Latent Space via Differentiable Cache Augmentation — Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam, 2024 https://arxiv.org/abs/2412.17747
9. When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations — Aleksandar Petrov, Philip H. S. Torr, Adel Bibi, 2023 https://scholar.google.com/scholar?q=When+Do+Prompting+and+Prefix-Tuning+Work?+A+Theory+of+Capabilities+and+Limitations
10. Many-Shot In-Context Learning — Rishabh Agarwal et al., 2024 https://scholar.google.com/scholar?q=Many-Shot+In-Context+Learning
11. Cartridges: Lightweight and general-purpose long context representations via self-study — Sabri Eyuboglu et al., 2025 https://scholar.google.com/scholar?q=Cartridges:+Lightweight+and+general-purpose+long+context+representations+via+self-study
12. Great Memory, Shallow Reasoning: Limits of kNN-LMs — Shangyi Geng, Wenting Zhao, Alexander M. Rush, 2024 https://scholar.google.com/scholar?q=Great+Memory,+Shallow+Reasoning:+Limits+of+kNN-LMs
13. Training Plug-n-Play Knowledge Modules with Deep Context Distillation — Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vulić, Alessandro Sordoni, 2025 https://scholar.google.com/scholar?q=Training+Plug-n-Play+Knowledge+Modules+with+Deep+Context+Distillation
14. More is not always better? Enhancing Many-Shot In-Context Learning with Differentiated and Reweighting Objectives — Xiaoqing Zhang et al., 2025 https://scholar.google.com/scholar?q=More+is+not+always+better?+Enhancing+Many-Shot+In-Context+Learning+with+Differentiated+and+Reweighting+Objectives
15. HyperAttention: Long-context Attention in Near-Linear Time — Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David P. Woodruff, Amir Zandieh, 2023 https://scholar.google.com/scholar?q=HyperAttention:+Long-context+Attention+in+Near-Linear+Time
16. Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning — Ling Team et al., 2025 https://scholar.google.com/scholar?q=Every+Attention+Matters:+An+Efficient+Hybrid+Architecture+for+Long-Context+Reasoning
17. AI Post Transformers: When Many-Shot CoT Becomes Test-Time Learning — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-when-many-shot-cot-becomes-test-time-lea-c25bfe.mp3
18. AI Post Transformers: Can Models Learn from Long Context? — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-can-models-learn-from-long-context-77533e.mp3
19. AI Post Transformers: How Induction Heads Emerge in Transformers — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-03-how-induction-heads-emerge-in-transforme-a7bfcb.mp3
20. AI Post Transformers: Latent Reasoning with Normalizing Flows — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-06-latent-reasoning-with-normalizing-flows-6ee916.mp3
21. AI Post Transformers: Training LLMs for Divide-and-Conquer Reasoning — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-04-training-llms-for-divide-and-conquer-rea-ea6e22.mp3
22. AI Post Transformers: Why Open Relational Foundation Models Fail — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-22-why-open-relational-foundation-models-fa-c303c6.mp3

Interactive Visualization: ReasonCACHE: Learning Reasoning Without Weight Updates
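To make the "compact per-layer key-value memory" idea from the ReasonCACHE episode concrete, here is a minimal sketch of one attention layer that prepends learned memory slots to its keys and values. It is an illustration in the spirit of prefix tuning, not the paper's implementation: the function names, the single-head layout, the slot count, and the omission of causal masking are all assumptions made for brevity.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_memory(queries, keys, values, mem_keys, mem_values):
    """Single-head attention with learned memory slots prepended to K and V.

    queries, keys, values: per-token vectors produced by the frozen backbone.
    mem_keys, mem_values: the only trainable tensors for this layer
    (hypothetical names). Causal masking is omitted to keep the sketch short.
    """
    all_keys = mem_keys + keys
    all_values = mem_values + values
    scale = math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in all_keys]
        w = softmax(scores)
        out = [sum(wj * vj[i] for wj, vj in zip(w, all_values))
               for i in range(len(all_values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def random_vectors(rng, count, dim):
    """Stand-in activations; a real model would supply these."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
seq_len, dim, num_slots = 4, 8, 3
queries = random_vectors(rng, seq_len, dim)
keys = random_vectors(rng, seq_len, dim)
values = random_vectors(rng, seq_len, dim)
mem_keys = random_vectors(rng, num_slots, dim)
mem_values = random_vectors(rng, num_slots, dim)

outputs, weights = attend_with_memory(queries, keys, values, mem_keys, mem_values)
# Each token now attends over num_slots + seq_len positions.
print(len(weights[0]))
```

In a training setup along these lines, gradients would flow only into `mem_keys` and `mem_values` for each layer while every backbone weight stays fixed, which is what keeps the trainable footprint small compared with full fine-tuning.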
4 days ago

HELM: Holistic Evaluation of Language Models

This episode explores the HELM framework for evaluating language models, arguing that once models become general-purpose infrastructure, single-dataset accuracy benchmarks are too narrow to capture their real-world behavior. It explains how HELM organizes evaluation across 30 models, 16 core scenarios, and seven metric families, measuring not just accuracy but also calibration, robustness, fairness, bias, toxicity, and efficiency under standardized conditions. The discussion highlights why HELM’s scenario-by-metric grid and targeted side studies on issues like reasoning, memorization, copyright, and disinformation matter: they make gaps in measurement visible instead of hiding them behind a single leaderboard score. A listener would find it interesting because it shows how benchmark design reflects values, and why model rankings can be misleading if they ignore confidence, harm, and cost.

Sources:

1. HELM: Holistic Evaluation of Language Models https://arxiv.org/pdf/2211.09110
2. Equality of Opportunity in Supervised Learning — Moritz Hardt, Eric Price, Nathan Srebro, 2016 https://arxiv.org/abs/1610.02413
3. Language (Technology) is Power: A Critical Survey of "Bias" in NLP — Su Lin Blodgett, Solon Barocas, Hal Daume III, Hanna Wallach, 2020 https://arxiv.org/abs/2005.14050
4. StereoSet: Measuring stereotypical bias in pretrained language models — Moin Nadeem, Anna Bethke, Siva Reddy, 2020 https://arxiv.org/abs/2004.09456
5. BBQ: A Hand-Built Bias Benchmark for Question Answering — Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, Samuel R. Bowman, 2021 https://arxiv.org/abs/2110.08193
6. Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification — Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, Lucy Vasserman, 2019 https://arxiv.org/abs/1903.04561
7. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models — Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith, 2020 https://arxiv.org/abs/2009.11462
8. Challenges in Detoxifying Language Models — Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, Po-Sen Huang, 2021 https://arxiv.org/abs/2109.07445
9. ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection — Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, Ece Kamar, 2022 https://arxiv.org/abs/2203.09509
10. On the Opportunities and Risks of Foundation Models — Rishi Bommasani et al., 2021 https://scholar.google.com/scholar?q=On+the+Opportunities+and+Risks+of+Foundation+Models
11. The EleutherAI Language Model Evaluation Harness — Leo Gao et al., 2021 https://scholar.google.com/scholar?q=The+EleutherAI+Language+Model+Evaluation+Harness
12. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models — Aarohi Srivastava et al., 2022 https://scholar.google.com/scholar?q=Beyond+the+Imitation+Game:+Quantifying+and+Extrapolating+the+Capabilities+of+Language+Models
13. Dynabench: Rethinking Benchmarking in NLP — Douwe Kiela et al., 2021 https://scholar.google.com/scholar?q=Dynabench:+Rethinking+Benchmarking+in+NLP
14. What Will it Take to Fix Benchmarking in Natural Language Understanding? — Samuel R. Bowman, George Dahl, 2021 https://scholar.google.com/scholar?q=What+Will+it+Take+to+Fix+Benchmarking+in+Natural+Language+Understanding?
15. Rethinking Benchmark and Contamination for Language Models with Rephrased Samples — Shuo Yang et al., 2023 https://scholar.google.com/scholar?q=Rethinking+Benchmark+and+Contamination+for+Language+Models+with+Rephrased+Samples
16. Investigating Data Contamination in Modern Benchmarks for Large Language Models — Chunyuan Deng et al., 2023 https://scholar.google.com/scholar?q=Investigating+Data+Contamination+in+Modern+Benchmarks+for+Large+Language+Models
17. Benchmark Data Contamination of Large Language Models: A Survey — Cheng Xu et al., 2024 https://scholar.google.com/scholar?q=Benchmark+Data+Contamination+of+Large+Language+Models:+A+Survey
18. Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead — Vidhisha Balachandran et al., 2025 https://scholar.google.com/scholar?q=Inference-Time+Scaling+for+Complex+Tasks:+Where+We+Stand+and+What+Lies+Ahead
19. WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models — Kangyun Ning et al., 2024 https://scholar.google.com/scholar?q=WTU-EVAL:+A+Whether-or-Not+Tool+Usage+Evaluation+Benchmark+for+Large+Language+Models
20. T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step — Zehui Chen et al., 2023 https://scholar.google.com/scholar?q=T-Eval:+Evaluating+the+Tool+Utilization+Capability+of+Large+Language+Models+Step+by+Step
21. AI Post Transformers: IMO-Bench for Robust Mathematical Reasoning — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-imo-bench-for-robust-mathematical-reason-143489.mp3
22. AI Post Transformers: Real Context Size and Context Rot — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-07-real-context-size-and-context-rot-56cbb4.mp3
23. AI Post Transformers: Qwen3Guard: Streaming Three-Way Safety Classification for LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-qwen3guard-streaming-three-way-safety-cl-26b0ef.mp3
24. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3

Interactive Visualization: HELM: Holistic Evaluation of Language Models
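The scenario-by-metric grid described in the HELM episode can be pictured as a small coverage table in which unmeasured cells are listed explicitly. The sketch below is only an illustration of that idea: the seven metric families come from the episode description, while the scenario names, the set of measured cells, and the helper functions are hypothetical and do not reflect HELM's actual code or results.

```python
from itertools import product

# Metric families named in the episode description.
METRICS = ["accuracy", "calibration", "robustness", "fairness",
           "bias", "toxicity", "efficiency"]

# Hypothetical scenario names, chosen only for illustration.
SCENARIOS = ["question_answering", "summarization", "toxicity_detection"]

# Hypothetical set of (scenario, metric) cells that have been measured.
measured = {
    ("question_answering", "accuracy"),
    ("question_answering", "calibration"),
    ("summarization", "accuracy"),
    ("toxicity_detection", "toxicity"),
}


def missing_cells(measured_cells, scenarios, metrics):
    """Return every grid cell that has no measurement, in grid order."""
    return [(s, m) for s, m in product(scenarios, metrics)
            if (s, m) not in measured_cells]


def coverage(measured_cells, scenarios, metrics):
    """Fraction of the full grid that has been measured."""
    total = len(scenarios) * len(metrics)
    gaps = len(missing_cells(measured_cells, scenarios, metrics))
    return (total - gaps) / total


gaps = missing_cells(measured, SCENARIOS, METRICS)
print(f"{len(gaps)} unmeasured cells, coverage {coverage(measured, SCENARIOS, METRICS):.2f}")
```

Listing the gaps rather than averaging over whatever was measured is the design point: a single aggregate score would silently skip the empty cells, whereas the grid keeps them visible.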
4 days ago

Learning Facts at Scale with Active Reading

This episode explores Active Reading, a training method that tries to move facts from documents into a model’s weights so it can answer closed-book questions without retrieval. It explains how the approach generates document-specific study materials such as paraphrases, active-recall prompts, timelines, analogies, and associations, and argues that this pedagogical synthetic data works better than simply rereading raw text or producing generic QA pairs. The discussion highlights reported gains from about 16% to 66% on a Wikipedia-based factual recall benchmark and strong relative improvement on finance documents, along with the larger WikiExpert-8B result that reportedly beats bigger models on factual QA after training on a trillion synthetic tokens. It also digs into the paper’s main weaknesses, including missing equal-compute baselines and possible benchmark coupling, which makes the episode interesting for listeners who want both the promise and the limits of using training curricula, rather than new architectures, to improve factual memory.

Sources:

1. Learning Facts at Scale with Active Reading — Jessy Lin, Vincent-Pierre Berges, Xilun Chen, Wen-Tau Yih, Gargi Ghosh, Barlas Oğuz, 2025 http://arxiv.org/abs/2508.09494
2. Training Question Answering Models From Synthetic Data — Raul Puri, Ryan Spring, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, 2020 https://scholar.google.com/scholar?q=Training+Question+Answering+Models+From+Synthetic+Data
3. Self-Instruct: Aligning Language Models with Self-Generated Instructions — Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi, 2022 https://scholar.google.com/scholar?q=Self-Instruct:+Aligning+Language+Models+with+Self-Generated+Instructions
4. Textbooks Are All You Need — Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sébastien Bubeck, Ronen Eldan, Yuanzhi Li, et al., 2023 https://scholar.google.com/scholar?q=Textbooks+Are+All+You+Need
5. Learning Facts at Scale with Active Reading — Jessy Lin, Vincent-Pierre Berges, Xilun Chen, Wen-Tau Yih, Gargi Ghosh, Barlas Oğuz, 2025 https://scholar.google.com/scholar?q=Learning+Facts+at+Scale+with+Active+Reading
6. How Much Knowledge Can You Pack Into the Parameters of a Language Model? — Adam Roberts, Colin Raffel, Noam Shazeer, 2020 https://scholar.google.com/scholar?q=How+Much+Knowledge+Can+You+Pack+Into+the+Parameters+of+a+Language+Model?
7. Large Language Models Struggle to Learn Long-Tail Knowledge — Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, Colin Raffel, 2022 https://scholar.google.com/scholar?q=Large+Language+Models+Struggle+to+Learn+Long-Tail+Knowledge
8. Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? — Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, Jonathan Herzig, 2024 https://scholar.google.com/scholar?q=Does+Fine-Tuning+LLMs+on+New+Knowledge+Encourage+Hallucinations?
9. Measuring short-form factuality in large language models — Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, William Fedus, 2024 https://scholar.google.com/scholar?q=Measuring+short-form+factuality+in+large+language+models
10. Synthetic Continued Pretraining — Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candes, Tatsunori Hashimoto, 2024 https://scholar.google.com/scholar?q=Synthetic+Continued+Pretraining
11. Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs — Oded Ovadia, Menachem Brief, Moshik Mishaeli, Oren Elisha, 2023 https://scholar.google.com/scholar?q=Fine-Tuning+or+Retrieval?+Comparing+Knowledge+Injection+in+LLMs
12. How New Data Permeates LLM Knowledge and How to Dilute It — Chen Sun, Renat Aksitov, Andrey Zhmoginov, Nolan Andrew Miller, Max Vladymyrov, Ulrich Rueckert, Been Kim, Mark Sandler, 2025 https://scholar.google.com/scholar?q=How+New+Data+Permeates+LLM+Knowledge+and+How+to+Dilute+It
13. Memory Layers at Scale — Vincent-Pierre Berges, Barlas Oguz, Daniel Haziza, Wen-Tau Yih, Luke Zettlemoyer, Gargi Ghosh, 2024 https://scholar.google.com/scholar?q=Memory+Layers+at+Scale
14. Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification — Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, Julia Kempe, 2024 https://scholar.google.com/scholar?q=Beyond+Model+Collapse:+Scaling+Up+with+Synthesized+Data+Requires+Verification
15. Strong Model Collapse — Elvis Dohmatob, Yunzhen Feng, Arjun Subramonian, Julia Kempe, 2024 https://scholar.google.com/scholar?q=Strong+Model+Collapse
16. Retrieval meets Long Context Large Language Models — Peng Xu et al., 2023 https://scholar.google.com/scholar?q=Retrieval+meets+Long+Context+Large+Language+Models
17. Expect the Unexpected: FailSafe Long Context QA for Finance — Kiran Kamble et al., 2025 https://scholar.google.com/scholar?q=Expect+the+Unexpected:+FailSafe+Long+Context+QA+for+Finance
18. A Parametric Memory Head for Continual Generative Retrieval — Kidist Amde Mekonnen, Yubao Tang, Maarten de Rijke, 2026 https://scholar.google.com/scholar?q=A+Parametric+Memory+Head+for+Continual+Generative+Retrieval
19. AI Post Transformers: Self-Improving Pretraining With Post-Trained Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-02-self-improving-pretraining-with-post-tra-e37460.mp3
20. AI Post Transformers: Training Modular KV Caches at Scale — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-15-training-modular-kv-caches-at-scale-382577.mp3
21. AI Post Transformers: Experimental Comparison of Agentic and Enhanced RAG — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-14-experimental-comparison-of-agentic-and-e-37d8bc.mp3

Interactive Visualization: Learning Facts at Scale with Active Reading
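As a rough picture of how document-specific study materials like those in the Active Reading episode might be produced, the sketch below builds one prompt per study strategy and passes each to a text generator. The five strategy names follow the episode description; the prompt wording, the fixed template table, and the `generate` callable are hypothetical stand-ins, and the paper's actual pipeline may differ substantially.

```python
# Hypothetical prompt templates, one per study strategy named in the episode.
STRATEGY_TEMPLATES = {
    "paraphrase": "Rewrite the document below in different words, keeping every fact.\n\n{document}",
    "active_recall": "Write questions with answers that test recall of facts in the document below.\n\n{document}",
    "timeline": "List the events in the document below in chronological order.\n\n{document}",
    "analogy": "Explain the key facts in the document below using analogies.\n\n{document}",
    "association": "Link each fact in the document below to related concepts.\n\n{document}",
}


def build_study_prompts(document):
    """Return one filled-in prompt per strategy for a single document."""
    return {name: template.format(document=document)
            for name, template in STRATEGY_TEMPLATES.items()}


def synthesize_study_materials(document, generate):
    """Run each prompt through `generate`, a caller-supplied text generator.

    `generate` is a placeholder for whatever model call is available;
    it takes a prompt string and returns a string.
    """
    prompts = build_study_prompts(document)
    return {name: generate(prompt) for name, prompt in prompts.items()}


def stub_generate(prompt):
    """Stand-in generator so the sketch runs without a model."""
    return f"[generated text for a {len(prompt)}-character prompt]"


document = "Example document text about a single topic."
materials = synthesize_study_materials(document, stub_generate)
for name in materials:
    print(name)
```

The resulting texts would then be used as training data in place of, or alongside, the raw document, which is the contrast the episode draws with simply rereading the original text.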

See all (737)

3.7 out of 5

3 ratings

Creator: mcgrof
Years active: 2025 - 2026
Episodes: 737
Rating: Clean
Show website

AI Post Transformers

Technology

Twice weekly

AI Post Transformers

Information-Aware KV Cache Compression for Long Reasoning

JETSPEC and Parallel Tree Speculative Decoding

DAK: Direct GPU Memory Offloading for LLMs

Prefix-Tuning for Efficient Text Generation

RMSNorm: Simplifying Layer Normalization for Sequence Models

ReasonCACHE: Learning Reasoning Without Weight Updates

HELM: Holistic Evaluation of Language Models

Learning Facts at Scale with Active Reading

Ratings and reviews

About

Information

You might also like
