AI Post Transformers

mcgrof

3.7 (3)
Technology
Daily

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

19 hours ago

LLMServingSim 2.0 for Disaggregated LLM Serving

This episode explores LLMServingSim 2.0, a simulator designed to model how large language models behave when they are served on mixed hardware fleets with separated compute, memory, and networking resources rather than a uniform GPU cluster. It explains the practical serving concepts that shape user experience, including prefill versus decode, time to first token, time per output token, prefix caching, KV-cache movement, and why latency problems emerge from interactions among batching, routing, placement, and interconnect contention rather than a single bottleneck. The discussion highlights the paper’s core idea of a Model Serving Group, which combines queueing, scheduling, operation mapping, memory modeling, and power modeling into one runtime-style unit driven by measured hardware profiles instead of purely theoretical kernel estimates. Listeners would find it interesting because it shows how modern AI performance depends not just on better models, but on the messy systems engineering tradeoffs that determine speed, efficiency, and scalability in real deployments. Sources: 1. LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure — Jaehong Cho, Hyunmin Choi, Guseul Heo, Jongse Park, 2026 http://arxiv.org/abs/2602.23036 2. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving — Yinmin Zhong et al., 2024 https://scholar.google.com/scholar?q=DistServe:+Disaggregating+Prefill+and+Decoding+for+Goodput-optimized+Large+Language+Model+Serving 3. P/D-Serve: Serving Disaggregated Large Language Model at Scale — Yibo Jin et al., 2024 https://scholar.google.com/scholar?q=P/D-Serve:+Serving+Disaggregated+Large+Language+Model+at+Scale 4. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference — Yuhan Liu, Yihua Cheng, Jiayi Yao, et al., 2025 https://scholar.google.com/scholar?q=LMCache:+An+Efficient+KV+Cache+Layer+for+Enterprise-Scale+LLM+Inference 5. 
Mooncake: Trading More Storage for Less Computation - A KVCache-centric Architecture for Serving LLM Chatbot — Ruoyu Qin et al., 2025 https://scholar.google.com/scholar?q=Mooncake:+Trading+More+Storage+for+Less+Computation+-+A+KVCache-centric+Architecture+for+Serving+LLM+Chatbot 6. NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing — Guseul Heo et al., 2024 https://scholar.google.com/scholar?q=NeuPIMs:+NPU-PIM+Heterogeneous+Acceleration+for+Batched+LLM+Inferencing 7. Frontier: Towards Comprehensive and Accurate LLM Inference Simulation — Yicheng Feng et al., 2026 https://scholar.google.com/scholar?q=Frontier:+Towards+Comprehensive+and+Accurate+LLM+Inference+Simulation 8. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang et al., 2025 https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse 9. Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference — Kexin Chu et al., 2025 https://scholar.google.com/scholar?q=Selective+KV-Cache+Sharing+to+Mitigate+Timing+Side-Channels+in+LLM+Inference 10. Prefill-Decode Aggregation or Disaggregation? Unifying Both for Goodput-Optimized LLM Serving — Chao Wang et al., 2025 https://scholar.google.com/scholar?q=Prefill-Decode+Aggregation+or+Disaggregation?+Unifying+Both+for+Goodput-Optimized+LLM+Serving 11. Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference — Hao Zhang et al., 2025 https://scholar.google.com/scholar?q=Enhancing+LLM+Efficiency:+Targeted+Pruning+for+Prefill-Decode+Disaggregation+in+Inference 12. AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding — Zikun Li et al., 2025 https://scholar.google.com/scholar?q=AdaServe:+SLO-Customized+LLM+Serving+with+Fine-Grained+Speculative+Decoding 13. 
SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding — Ziyi Zhang et al., 2025 https://scholar.google.com/scholar?q=SwiftSpec:+Ultra-Low+Latency+LLM+Decoding+by+Scaling+Asynchronous+Speculative+Decoding 14. Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving — Rui Li et al., 2025 https://scholar.google.com/scholar?q=Nightjar:+Dynamic+Adaptive+Speculative+Decoding+for+Large+Language+Models+Serving 15. Hetis: Serving LLMs in Heterogeneous GPU Clusters with Fine-grained and Dynamic Parallelism — Zizhao Mo et al., 2025 https://scholar.google.com/scholar?q=Hetis:+Serving+LLMs+in+Heterogeneous+GPU+Clusters+with+Fine-grained+and+Dynamic+Parallelism 16. AI Post Transformers: Why LLM Serving Needs Mathematical Optimization — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-why-llm-serving-needs-mathematical-optim-647fc6.mp3 17. AI Post Transformers: FengHuang for Rack-Scale LLM Inference Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-12-fenghuang-for-rack-scale-llm-inference-m-62708e.mp3 18. AI Post Transformers: Vistara Brings CXL Memory to Hyperscale — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-11-vistara-brings-cxl-memory-to-hyperscale-b5199e.mp3 19. AI Post Transformers: Characterizing LLM KV Cache Workloads in Production — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/characterizing-llm-kv-cache-workloads-in-production/ 20. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3 21. AI Post Transformers: LPU Chip for Low-Latency LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-20-lpu-chip-for-low-latency-llm-inference-be13c3.mp3 22. 
AI Post Transformers: TokenDance for Multi-Agent KV Cache Sharing — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-22-tokendance-for-multi-agent-kv-cache-shar-aa9b99.mp3
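
The two user-facing latency metrics named in this episode's description, time to first token and time per output token, can be sketched from a per-request log of token timestamps. This is a minimal illustration, not code from LLMServingSim 2.0: the `RequestTrace` type and its field names are hypothetical, and the definitions assume at least one output token was produced.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request log: arrival time plus one timestamp per output token."""
    arrival_s: float
    token_times_s: List[float]


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay from arrival to the first output token (queueing plus prefill)."""
    return trace.token_times_s[0] - trace.arrival_s


def time_per_output_token(trace: RequestTrace) -> float:
    """Mean gap between consecutive output tokens after the first (decode phase)."""
    n = len(trace.token_times_s)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (trace.token_times_s[-1] - trace.token_times_s[0]) / (n - 1)


# Made-up timestamps, in seconds, for one request.
trace = RequestTrace(arrival_s=0.0, token_times_s=[0.5, 0.6, 0.7, 0.8])
print(time_to_first_token(trace), time_per_output_token(trace))
```

Keeping the two metrics separate matters because prefill and decode stress different resources, which is why the episode treats them as distinct phases.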
1 day ago

Information-Aware KV Cache Compression for Long Reasoning

This episode explores Information-Aware KV Cache Compression for Long Reasoning, a paper about making long-context inference cheaper and more reliable by deciding which KV-cache tokens to keep during extended reasoning. It explains why long prefilling and long decoding turn the cache into a major memory bottleneck, and why common heuristics such as sliding windows or recent-attention-based retention can discard tokens that only become important much later. The discussion centers on the paper’s claim that future usefulness is better captured by information-theoretic signals like predictive entropy and Forward Influence, with experiments showing that attention-ranked tokens help short-horizon predictions while entropy-ranked tokens matter more over long horizons. Listeners get a concrete account of how InfoKV blends recent attention with per-layer entropy-based scoring to improve the tradeoff between memory savings and long-range reasoning quality. Sources: 1. Information-Aware KV Cache Compression for Long Reasoning — Jushi Kai, Zhuiri Xiao, Alexandra Birch, Zhouhan Lin, 2026 http://arxiv.org/abs/2606.26875 2. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time — Zichang Liu, Aditya Desai, Fangshuo Liao, Anshumali Shrivastava, et al., 2023 https://scholar.google.com/scholar?q=Scissorhands:+Exploiting+the+Persistence+of+Importance+Hypothesis+for+LLM+KV+Cache+Compression+at+Test+Time 3. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Beidi Chen, Christopher Re, et al., 2023 https://scholar.google.com/scholar?q=H2O:+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models 4. SnapKV: LLM Knows What You are Looking for Before Generation — Yuhong Li, Yingbing Huang, Bowen Yang, Patrick Lewis, Deming Chen, et al., 2024 https://scholar.google.com/scholar?q=SnapKV:+LLM+Knows+What+You+are+Looking+for+Before+Generation 5. 
Information-Aware KV Cache Compression for Long Reasoning — Jushi Kai, Zhuiri Xiao, Alexandra Birch, Zhouhan Lin, 2026 https://scholar.google.com/scholar?q=Information-Aware+KV+Cache+Compression+for+Long+Reasoning 6. Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution — Alessio Devoto, Maximilian Jeblick, Simon Jegou, 2025 https://scholar.google.com/scholar?q=Expected+Attention:+KV+Cache+Compression+by+Estimating+Attention+from+Future+Queries+Distribution 7. Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning — Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim, 2025 https://scholar.google.com/scholar?q=Reasoning+Path+Compression:+Compressing+Generation+Trajectories+for+Efficient+LLM+Reasoning 8. Compressing Context to Enhance Inference Efficiency of Large Language Models — Yucheng Li, Bo Dong, Chenghua Lin, Frank Guerin, 2023 https://scholar.google.com/scholar?q=Compressing+Context+to+Enhance+Inference+Efficiency+of+Large+Language+Models 9. FreqKV: Key-Value Compression in Frequency Domain for Context Window Extension — Jushi Kai et al., 2026 https://scholar.google.com/scholar?q=FreqKV:+Key-Value+Compression+in+Frequency+Domain+for+Context+Window+Extension 10. LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion — Zhan Ling et al., 2025 https://scholar.google.com/scholar?q=LongReason:+A+Synthetic+Long-Context+Reasoning+Benchmark+via+Context+Expansion 11. Attention Reveals More Than Tokens: Training-Free Long-Context Reasoning with Attention-guided Retrieval — Yuwei Zhang et al., 2025 https://scholar.google.com/scholar?q=Attention+Reveals+More+Than+Tokens:+Training-Free+Long-Context+Reasoning+with+Attention-guided+Retrieval 12. 
Uncertainty Quantification for Hallucination Detection in Large Language Models: Foundations, Methodology, and Future Directions — Sungmin Kang et al., 2025 https://scholar.google.com/scholar?q=Uncertainty+Quantification+for+Hallucination+Detection+in+Large+Language+Models:+Foundations,+Methodology,+and+Future+Directions 13. Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations — Christian Tomani et al., 2024 https://scholar.google.com/scholar?q=Uncertainty-Based+Abstention+in+LLMs+Improves+Safety+and+Reduces+Hallucinations 14. KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction — Jang-Hyun Kim et al., 2025 https://scholar.google.com/scholar?q=KVzip:+Query-Agnostic+KV+Cache+Compression+with+Context+Reconstruction 15. Can LLMs Maintain Fundamental Abilities under KV Cache Compression? — Xiang Liu et al., 2025 https://scholar.google.com/scholar?q=Can+LLMs+Maintain+Fundamental+Abilities+under+KV+Cache+Compression? 16. KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches — Jiayi Yuan et al., 2024 https://scholar.google.com/scholar?q=KV+Cache+Compression,+But+What+Must+We+Give+in+Return?+A+Comprehensive+Benchmark+of+Long+Context+Capable+Approaches 17. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse — Jingbo Yang et al., 2025 https://scholar.google.com/scholar?q=KVLink:+Accelerating+Large+Language+Models+via+Efficient+KV+Cache+Reuse 18. AI Post Transformers: IndexMem: Learned KV-Cache Eviction for Long-Context LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-12-indexmem-learned-kv-cache-eviction-for-l-132c2a.mp3 19. AI Post Transformers: When Quantization Hurts Reasoning Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-17-when-quantization-hurts-reasoning-models-eca9e7.mp3 20. 
AI Post Transformers: Hyper-Scaling LLM Inference with KV Cache Compression — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/hyper-scaling-llm-inference-with-kv-cache-compression/ 21. AI Post Transformers: Lattice: Fixed-Slot Compression for Transformer Memory — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-11-lattice-fixed-slot-compression-for-trans-5509ea.mp3 22. AI Post Transformers: Adaptive Compression Techniques for Efficient LLM Inference — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/adaptive-compression-techniques-for-efficient-llm-inference/ 23. AI Post Transformers: Explicit Information Transmission for Context Compression — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-explicit-information-transmission-for-co-24e3c2.mp3 24. AI Post Transformers: When LoRA Helps Under KV Cache Compression — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-12-when-lora-helps-under-kv-cache-compressi-76dda6.mp3
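
The idea described above, ranking cached tokens by a blend of recent attention and an entropy-based signal and keeping only a fixed budget, can be sketched as follows. This is an illustration under stated assumptions, not the paper's scoring rule: the blend weight `alpha`, the max-scaling step, and all function names are hypothetical.

```python
import math
from typing import List


def predictive_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def _scale_to_unit(xs: List[float]) -> List[float]:
    """Divide by the maximum so both signals share a 0..1 range (an assumption)."""
    top = max(xs)
    return [x / top for x in xs] if top > 0 else [0.0 for _ in xs]


def select_tokens_to_keep(attention: List[float], entropy: List[float],
                          budget: int, alpha: float = 0.5) -> List[int]:
    """Return the positions of the `budget` highest-scoring tokens, in order.

    alpha = 1.0 ranks purely by recent attention; alpha = 0.0 purely by entropy.
    """
    a = _scale_to_unit(attention)
    h = _scale_to_unit(entropy)
    scores = [alpha * ai + (1.0 - alpha) * hi for ai, hi in zip(a, h)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Made-up per-token signals for a four-token cache.
attention = [0.9, 0.1, 0.0, 0.5]
entropy = [0.0, 2.0, 0.1, 0.2]
print(select_tokens_to_keep(attention, entropy, budget=2, alpha=0.5))
print(select_tokens_to_keep(attention, entropy, budget=2, alpha=1.0))
```

In this toy input the attention-only ranking drops the high-entropy token at position 1, while the blended ranking retains it, which mirrors the short-horizon versus long-horizon contrast the episode discusses.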
1 day ago

JETSPEC and Parallel Tree Speculative Decoding

This episode explores JETSPEC, a 2026 inference paper on speculative decoding that asks whether a language model can draft an entire tree of future tokens in parallel while preserving causal consistency and actually reducing latency on long generations. It explains why autoregressive decoding remains a serving bottleneck for long proofs, code completions, and assistant replies, even when the underlying transformer model itself is unchanged. The discussion compares JetSpec’s approach with Medusa, EAGLE-3, and DFlash, focusing on the central tradeoff between stronger path-conditioned drafts that are slow to produce and cheaper parallel drafts that risk internally inconsistent branches. Listeners would find it interesting because it turns a very practical systems problem, why powerful GPUs still feel slow at inference time, into a concrete debate about the next generation of real-world decoding optimizations. Sources: 1. JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting — Lanxiang Hu, Zhaoxiang Feng, Yulun Wu, Haoran Yuan, Yujie Zhao, Yu-Yang Qian, Bojun Wang, Peng Zhao, Daxin Jiang, Yibo Zhu, Tajana Rosing, Hao Zhang, 2026 http://arxiv.org/abs/2606.18394 2. Fast Inference from Transformers via Speculative Decoding — Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023 https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding 3. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads — Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao, 2024 https://scholar.google.com/scholar?q=Medusa:+Simple+LLM+Inference+Acceleration+Framework+with+Multiple+Decoding+Heads 4. EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024 https://scholar.google.com/scholar?q=EAGLE-2:+Faster+Inference+of+Language+Models+with+Dynamic+Draft+Trees 5. 
DFlash: Block Diffusion for Flash Speculative Decoding — Jian Chen, Yesheng Liang, Zhijian Liu, 2026 https://scholar.google.com/scholar?q=DFlash:+Block+Diffusion+for+Flash+Speculative+Decoding 6. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test — Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2025 https://scholar.google.com/scholar?q=EAGLE-3:+Scaling+up+Inference+Acceleration+of+Large+Language+Models+via+Training-Time+Test 7. SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification — Xupeng Miao et al., 2023 https://scholar.google.com/scholar?q=SpecInfer:+Accelerating+Generative+Large+Language+Model+Serving+with+Tree-based+Speculative+Inference+and+Verification 8. DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding — Jiebin Zhang et al., 2026 https://scholar.google.com/scholar?q=DFlare:+Scaling+Up+Draft+Capacity+for+Block+Diffusion+Speculative+Decoding 9. TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification — Haoyun Jiang et al., 2026 https://scholar.google.com/scholar?q=TriSpec:+Ternary+Speculative+Decoding+via+Lightweight+Proxy+Verification 10. ParallelSpec: Parallel Drafter for Efficient Speculative Decoding — Zilin Xiao et al., 2024 https://arxiv.org/abs/2410.05589 11. Mamba Drafters for Speculative Decoding — Daewon Choi et al., 2025 https://arxiv.org/abs/2506.01206 12. OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding — Ramchalam Kinattinkara Ramakrishnan et al., 2025 https://arxiv.org/abs/2507.02659 13. Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge — Bin Xiao et al., 2024 https://arxiv.org/abs/2405.00263 14. Make Every Draft Count: Hidden State based Speculative Decoding — Yuetao Chen et al., 2026 https://arxiv.org/abs/2602.21224 15. When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding? 
— Tianyu Liu et al., 2026 https://arxiv.org/abs/2604.26412 16. MoE-Spec: Expert Budgeting for Efficient Speculative Decoding — Bradley McDanel et al., 2026 https://arxiv.org/abs/2602.16052 17. Beat the long tail: Distribution-Aware Speculative Decoding for RL Training — Zelei Shao et al., 2025 https://arxiv.org/abs/2511.13841 18. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3 19. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3 20. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3 Interactive Visualization: JETSPEC and Parallel Tree Speculative Decoding
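
The verification step that tree-based speculative decoding relies on can be sketched in a few lines. This is a simplified greedy, exact-match variant with a toy target model, not JetSpec's drafting method and not the rejection-sampling rule from the original speculative decoding paper. The function names and the toy `target_next` model are hypothetical, and a real system would score all draft positions in one batched forward pass rather than one call per token.

```python
from typing import Callable, List, Sequence, Tuple

TargetFn = Callable[[Tuple[int, ...]], int]


def verify_branch(context: Sequence[int], draft: Sequence[int],
                  target_next: TargetFn) -> List[int]:
    """Accept draft tokens until the first mismatch with the target's greedy choice."""
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(tuple(context) + tuple(accepted))
        if tok != expected:
            # First mismatch: emit the target's token instead and stop.
            return accepted + [expected]
        accepted.append(tok)
    # Whole draft accepted: the target contributes one extra token.
    return accepted + [target_next(tuple(context) + tuple(accepted))]


def verify_tree(context: Sequence[int], branches: Sequence[Sequence[int]],
                target_next: TargetFn) -> List[int]:
    """Keep the branch that yields the most tokens for this verification step."""
    return max((verify_branch(context, b, target_next) for b in branches), key=len)


def toy_target(prefix: Tuple[int, ...]) -> int:
    """Toy stand-in for a target model: always predicts (last token + 1) mod 10."""
    return (prefix[-1] + 1) % 10


branches = [[2, 3, 9], [2, 3, 4, 5], [7]]
print(verify_tree([1], branches, toy_target))
```

Every branch yields at least one token, so the output always matches what plain greedy decoding would have produced; the speedup comes only from how many draft tokens survive per verification step.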
2 days ago

DAK: Direct GPU Memory Offloading for LLMs

This episode explores DAK, a Cornell systems paper arguing that LLM inference on tiered-memory machines can be faster when offloaded weights and KV-cache blocks are fetched directly into on-chip shared memory instead of being prefetched and staged through GPU HBM. It breaks down the tradeoffs among HBM capacity, HBM bandwidth, KV-cache growth during decoding, and prior approaches such as FlexGen, vLLM’s PagedAttention, and emerging KV offload systems like LMCache. The discussion focuses on DAK’s core technical idea: using Hopper’s Tensor Memory Accelerator inside custom GEMM and FlashAttention kernels so data movement and computation are co-designed, reducing bounce buffers, HBM contention, and pipeline bubbles while aggregating bandwidth from multiple memory tiers. Listeners would find it interesting because it turns a low-level memory-path decision into a concrete argument about when offloading is merely a fallback and when it becomes a real performance advantage for serving larger models, longer contexts, or bigger batches. Sources: 1. DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference — Shouxu Lin, Zhiyuan Guo, Jiaxin Lin, 2026 http://arxiv.org/abs/2604.26074 2. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning — Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He, 2021 https://scholar.google.com/scholar?q=ZeRO-Infinity:+Breaking+the+GPU+Memory+Wall+for+Extreme+Scale+Deep+Learning 3. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU — Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Ion Stoica, Percy Liang, Ce Zhang, and colleagues, 2023 https://scholar.google.com/scholar?q=FlexGen:+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU 4. 
Efficient Memory Management for Large Language Model Serving with PagedAttention — Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention 5. DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference — Shouxu Lin, Zhiyuan Guo, Jiaxin Lin, 2026 https://scholar.google.com/scholar?q=DAK:+Direct-Access-Enabled+GPU+Memory+Offloading+with+Optimal+Efficiency+for+LLM+Inference 6. PIE: Pooling CPU Memory for LLM Inference — Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, Ion Stoica, 2024 https://scholar.google.com/scholar?q=PIE:+Pooling+CPU+Memory+for+LLM+Inference 7. NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference — Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu, 2025 https://scholar.google.com/scholar?q=NEO:+Saving+GPU+Memory+Crisis+with+CPU+Offloading+for+Online+LLM+Inference 8. Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip — Luigi Fusco, Mikhail Khalilov, Marcin Chrapek, Giridhar Chukkapalli, Thomas Schulthess, Torsten Hoefler, 2024 https://scholar.google.com/scholar?q=Understanding+Data+Movement+in+Tightly+Coupled+Heterogeneous+Systems:+A+Case+Study+with+the+Grace+Hopper+Superchip 9. FengHuang: Next-Generation Memory Orchestration for AI Inferencing — Jiamin Li, Lei Qu, Tao Zhang, Grigory Chirkov, Shuotao Xu, Peng Cheng, Lidong Zhou, 2025 https://scholar.google.com/scholar?q=FengHuang:+Next-Generation+Memory+Orchestration+for+AI+Inferencing 10. Reducing Transformer Key-Value Cache Size with Cross-Layer Attention — William Brandon et al., 2024 https://scholar.google.com/scholar?q=Reducing+Transformer+Key-Value+Cache+Size+with+Cross-Layer+Attention 11. 
xKV: Cross-Layer SVD for KV-Cache Compression — Chi-Chih Chang et al., 2025 https://scholar.google.com/scholar?q=xKV:+Cross-Layer+SVD+for+KV-Cache+Compression 12. XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression — Haoqi Yang et al., 2025 https://scholar.google.com/scholar?q=XQuant:+Achieving+Ultra-Low+Bit+KV+Cache+Quantization+with+Cross-Layer+Compression 13. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management — Wonbeom Lee et al., 2024 https://scholar.google.com/scholar?q=InfiniGen:+Efficient+Generative+Inference+of+Large+Language+Models+with+Dynamic+KV+Cache+Management 14. Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching — Yanhao Dong et al., 2025 https://scholar.google.com/scholar?q=Accelerating+LLM+Inference+Throughput+via+Asynchronous+KV+Cache+Prefetching 15. KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference — Huan Yang et al., 2025 https://scholar.google.com/scholar?q=KVShare:+Semantic-Aware+Key-Value+Cache+Sharing+for+Efficient+Large+Language+Model+Inference 16. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Ji Lin et al., 2023 https://scholar.google.com/scholar?q=AWQ:+Activation-aware+Weight+Quantization+for+LLM+Compression+and+Acceleration 17. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression — Tim Dettmers et al., 2023 https://scholar.google.com/scholar?q=SpQR:+A+Sparse-Quantized+Representation+for+Near-Lossless+LLM+Weight+Compression 18. AI Post Transformers: Beluga: CXL Memory Pooling for LLM KV Cache — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-27-beluga-cxl-memory-pooling-for-llm-kv-cac-b6142f.mp3 19. AI Post Transformers: ScoutAttention for Efficient KV Cache Offloading — Hal Turing & Dr. 
Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-24-scoutattention-for-efficient-kv-cache-of-b26699.mp3 20. AI Post Transformers: InfiniGen for Efficient Long-Context LLM Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-18-infinigen-for-efficient-long-context-llm-143d77.mp3 21. AI Post Transformers: Mooncake for KV Cache-Centric LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-05-mooncake-for-kv-cache-centric-llm-servin-1086d0.mp3 22. AI Post Transformers: Memory-Bound, Not Bandwidth-Limited Batch-1 LLM Decode — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-02-memory-bound-not-bandwidth-limited-batch-114799.mp3 23. AI Post Transformers: LAPS for Length-Aware LLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-laps-for-length-aware-llm-serving-0c6149.mp3
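
The claim that offloading can aggregate bandwidth from multiple memory tiers can be illustrated with a back-of-the-envelope model. This is an idealized sketch, not DAK's cost model: it assumes both tiers are read fully in parallel with no contention or staging overhead, and the bandwidth figures and function names are hypothetical.

```python
def optimal_hbm_fraction(bw_hbm: float, bw_host: float) -> float:
    """Fraction of the data to place in HBM so both tiers finish at the same time."""
    return bw_hbm / (bw_hbm + bw_host)


def read_time(size_gb: float, frac_hbm: float, bw_hbm: float, bw_host: float) -> float:
    """Time to read everything when the tiers stream in parallel (slowest tier wins)."""
    t_hbm = size_gb * frac_hbm / bw_hbm
    t_host = size_gb * (1.0 - frac_hbm) / bw_host
    return max(t_hbm, t_host)


# Hypothetical numbers: 100 GB of weights, 3000 GB/s HBM, 450 GB/s host link.
size_gb, bw_hbm, bw_host = 100.0, 3000.0, 450.0
frac = optimal_hbm_fraction(bw_hbm, bw_host)
print(read_time(size_gb, 1.0, bw_hbm, bw_host))   # everything in HBM
print(read_time(size_gb, frac, bw_hbm, bw_host))  # balanced split across tiers
```

Under these assumptions the balanced split behaves like a single tier whose bandwidth is the sum of the two, which is the sense in which offloading stops being a fallback. Any staging through HBM would eat into that gain.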
3 days ago

Prefix-Tuning for Efficient Text Generation

This episode explores the 2021 prefix-tuning paper and asks whether a large language model can be adapted to new generation tasks by learning a small continuous prompt while keeping the full model frozen. It explains where prefix tuning fits within parameter-efficient fine-tuning, contrasting it with full fine-tuning, adapters, ordinary prompting, in-context learning, AutoPrompt, and soft prompt tuning. The discussion highlights the paper’s two main evaluation settings, structured data-to-text generation on E2E, WebNLG, and DART with GPT-2, and abstractive summarization on XSUM with BART, while stressing that these are meaningfully different tests despite being grouped under one headline. It also digs into the core technical idea that the learned prefix acts as trainable internal state visible to attention throughout the network, making the method an early and elegant approach to low-storage task adaptation even if later methods like LoRA proved more practical. Sources: 1. Prefix-Tuning: Optimizing Continuous Prompts for Generation — Xiang Lisa Li, Percy Liang, 2021 http://arxiv.org/abs/2101.00190 2. Prefix-Tuning: Optimizing Continuous Prompts for Generation — Xiang Lisa Li; Percy Liang, 2021 https://scholar.google.com/scholar?q=Prefix-Tuning:+Optimizing+Continuous+Prompts+for+Generation 3. The Power of Scale for Parameter-Efficient Prompt Tuning — Brian Lester; Rami Al-Rfou; Noah Constant, 2021 https://scholar.google.com/scholar?q=The+Power+of+Scale+for+Parameter-Efficient+Prompt+Tuning 4. When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations — Aleksandar Petrov; Philip H. S. Torr; Adel Bibi, 2023 https://scholar.google.com/scholar?q=When+Do+Prompting+and+Prefix-Tuning+Work?+A+Theory+of+Capabilities+and+Limitations 5. LoRA: Low-Rank Adaptation of Large Language Models — Edward J. 
Hu; Yelong Shen; Phillip Wallis; Zeyuan Allen-Zhu; Yuanzhi Li; Shean Wang; Lu Wang; Weizhu Chen, 2021 https://scholar.google.com/scholar?q=LoRA:+Low-Rank+Adaptation+of+Large+Language+Models 6. Parameter-efficient Transfer Learning for NLP — Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly, 2019 https://scholar.google.com/scholar?q=Parameter-efficient+Transfer+Learning+for+NLP 7. Exploring Versatile Generative Language Model via Parameter-Efficient Transfer Learning — Zhaojiang Lin, Andrea Madotto, and Pascale Fung, 2020 https://scholar.google.com/scholar?q=Exploring+Versatile+Generative+Language+Model+via+Parameter-Efficient+Transfer+Learning 8. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts — Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh, 2020 https://scholar.google.com/scholar?q=AutoPrompt:+Eliciting+Knowledge+from+Language+Models+with+Automatically+Generated+Prompts 9. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning — Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta, 2020 https://scholar.google.com/scholar?q=Intrinsic+Dimensionality+Explains+the+Effectiveness+of+Language+Model+Fine-Tuning 10. Can Unconditional Language Models Recover Arbitrary Sentences? — Nishant Subramani, Samuel R. Bowman, and Kyunghyun Cho, 2020 https://scholar.google.com/scholar?q=Can+Unconditional+Language+Models+Recover+Arbitrary+Sentences? 11. Universality and Limitations of Prompt Tuning — Yihan Wang et al., 2023 https://arxiv.org/abs/2305.18787 12. Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency — Jerry Yao-Chieh Hu et al., 2024 https://arxiv.org/abs/2411.16525 13. Memory Limitations of Prompt Tuning in Transformers — Maxime Meyer et al., 2025 https://arxiv.org/abs/2509.00421 14. 
Parameter-Efficient Fine-Tuning for Medical Text Summarization: A Comparative Study of Lora, Prompt Tuning, and Full Fine-Tuning — Ulugbek Shernazarov et al., 2026 https://arxiv.org/abs/2603.21970 15. Task Singular Vectors: Reducing Task Interference in Model Merging — Antonio Andrea Gargiulo et al., 2024 https://arxiv.org/abs/2412.00081 16. Task Vector Quantization for Memory-Efficient Model Merging — Youngeun Kim et al., 2025 https://arxiv.org/abs/2503.06921 17. Last One Standing: A Comparative Analysis of Security and Privacy of Soft Prompt Tuning, LoRA, and In-Context Learning — Rui Wen et al., 2023 https://arxiv.org/abs/2310.11397 18. Progressive Prompts: Continual Learning for Language Models — Anastasia Razdaibiedina et al., 2023 https://arxiv.org/abs/2301.12314 19. AI Post Transformers: Benchmarking PEFT Techniques for Large Language Models — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-20-benchmarking-peft-techniques-for-large-l-41bbf5.mp3 20. AI Post Transformers: Learning to Reason with 13 Parameters — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-14-learning-to-reason-with-13-parameters-54c87f.mp3
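
The description of the learned prefix as trainable state visible to attention can be made concrete with a single-head toy example. This is a minimal sketch in plain Python, not the paper's implementation: the vectors are made up, no training is shown, and `attend_with_prefix` is a hypothetical helper that prepends prefix key/value vectors to those computed by a frozen model.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for one query over a list of keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def attend_with_prefix(query: Vector, keys: List[Vector], values: List[Vector],
                       prefix_keys: List[Vector], prefix_values: List[Vector]) -> Vector:
    """Same attention, but learned prefix vectors are prepended to keys and values.

    Only the prefix vectors would be trained; the model producing `keys` and
    `values` stays frozen.
    """
    return attend(query, prefix_keys + keys, prefix_values + values)


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
base = attend(query, keys, values)
steered = attend_with_prefix(query, keys, values, [[5.0, 0.0]], [[0.0, 0.0]])
print(base, steered)
```

With an empty prefix the output is unchanged, so the prefix is the only lever for adapting the model's behavior, and it is the only thing stored per task.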
3 days ago

RMSNorm: Simplifying Layer Normalization for Sequence Models

This episode explores the 2019 RMSNorm paper, which asks whether LayerNorm’s mean-subtraction step is actually necessary or whether controlling activation scale is the part that really stabilizes training. It explains how RMSNorm keeps LayerNorm’s rescaling behavior while dropping explicit centering, and how the paper’s pRMSNorm variant estimates the normalization term from only a small subset of features to reduce cost further. The discussion covers experiments in machine translation, image classification, image-caption retrieval, and question answering, where model quality stayed roughly comparable while reported runtime improved, with smaller gains in transformers and much larger ones in older RNN-based systems. Listeners would find it interesting because it turns a seemingly minor mathematical tweak into a broader argument about efficiency, optimization stability, and how much claimed speedups depend on the era and quality of the baseline implementation.
Sources:
1. Root Mean Square Layer Normalization — Biao Zhang, Rico Sennrich, 2019 http://arxiv.org/abs/1910.07467
2. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift — Sergey Ioffe, Christian Szegedy, 2015 https://scholar.google.com/scholar?q=Batch+Normalization:+Accelerating+Deep+Network+Training+by+Reducing+Internal+Covariate+Shift
3. Layer Normalization — Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton, 2016 https://scholar.google.com/scholar?q=Layer+Normalization
4. Root Mean Square Layer Normalization — Biao Zhang, Rico Sennrich, 2019 https://scholar.google.com/scholar?q=Root+Mean+Square+Layer+Normalization
5. On Layer Normalization in the Transformer Architecture — Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu, 2020 https://scholar.google.com/scholar?q=On+Layer+Normalization+in+the+Transformer+Architecture
6. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks — Tim Salimans, Diederik P. Kingma, 2016 https://scholar.google.com/scholar?q=Weight+Normalization:+A+Simple+Reparameterization+to+Accelerate+Training+of+Deep+Neural+Networks
7. How Does Batch Normalization Help Optimization? — Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry, 2018 https://scholar.google.com/scholar?q=How+Does+Batch+Normalization+Help+Optimization?
8. Understanding Batch Normalization — Nils Bjorck, Carla P. Gomes, Bart Selman, Kilian Q. Weinberger, 2018 https://scholar.google.com/scholar?q=Understanding+Batch+Normalization
9. Norm Matters: Efficient and Accurate Normalization Schemes in Deep Networks — Elad Hoffer, Ron Banner, Itay Golan, Daniel Soudry, 2018 https://scholar.google.com/scholar?q=Norm+Matters:+Efficient+and+Accurate+Normalization+Schemes+in+Deep+Networks
10. Group Normalization — Yuxin Wu, Kaiming He, 2018 https://scholar.google.com/scholar?q=Group+Normalization
11. Residual Learning Without Normalization via Better Initialization — Hongyi Zhang, Yann N. Dauphin, Tengyu Ma, 2019 https://scholar.google.com/scholar?q=Residual+Learning+Without+Normalization+via+Better+Initialization
12. Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning — Bingchen Zhao et al., 2023 https://scholar.google.com/scholar?q=Tuning+LayerNorm+in+Attention:+Towards+Efficient+Multi-Modal+LLM+Finetuning
13. LayerNorm: A key component in parameter-efficient fine-tuning — Taha ValizadehAslani and Hualou Liang, 2024 https://scholar.google.com/scholar?q=LayerNorm:+A+key+component+in+parameter-efficient+fine-tuning
14. Efficiency in Focus: LayerNorm as a Catalyst for Fine-tuning Medical Visual Language Pre-trained Models — Jiawei Chen et al., 2024 https://scholar.google.com/scholar?q=Efficiency+in+Focus:+LayerNorm+as+a+Catalyst+for+Fine-tuning+Medical+Visual+Language+Pre-trained+Models
15. The Curse of Depth in Large Language Models — Wenfang Sun et al., 2025 https://scholar.google.com/scholar?q=The+Curse+of+Depth+in+Large+Language+Models
16. Just One Layer Norm Guarantees Stable Extrapolation — Juliusz Ziomek, George Whittle, Michael A. Osborne, 2025 https://scholar.google.com/scholar?q=Just+One+Layer+Norm+Guarantees+Stable+Extrapolation
17. Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers — Gavia Gray et al., 2024 https://scholar.google.com/scholar?q=Normalization+Layer+Per-Example+Gradients+are+Sufficient+to+Predict+Gradient+Noise+Scale+in+Transformers
18. AI Post Transformers: Keel: Post-LayerNorm Is Back: Stable, ExpressivE, and Deep — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/keel-post-layernorm-is-back-stable-expressive-and-deep/
19. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
20. AI Post Transformers: Long Short-Term Memory and Vanishing Gradients — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-19-long-short-term-memory-and-vanishing-gra-72448c.mp3
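
For listeners who want to see the RMSNorm episode's core idea in code, here is a minimal sketch, not the paper's reference implementation. It contrasts LayerNorm (center, then rescale) with RMSNorm (rescale only) and with a pRMSNorm-style estimate taken from a leading slice of the features. The `eps` constant, the default `p=0.25`, and all function names are illustrative assumptions, and the bias term is left out for brevity.

```python
import math

def rms(values, eps=1e-8):
    # Root mean square of a list of floats; eps is an assumed stability constant.
    return math.sqrt(sum(v * v for v in values) / len(values) + eps)

def layer_norm(x, gain):
    # LayerNorm without bias: center first, then rescale by the standard deviation.
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    scale = rms(centered)  # the RMS of centered values is the standard deviation
    return [g * c / scale for g, c in zip(gain, centered)]

def rms_norm(x, gain):
    # RMSNorm: no centering, only rescaling by the RMS of the raw features.
    scale = rms(x)
    return [g * v / scale for g, v in zip(gain, x)]

def partial_rms_norm(x, gain, p=0.25):
    # pRMSNorm-style: estimate the RMS from the first ceil(p * n) features only.
    # p=0.25 is a hypothetical default chosen for this sketch.
    k = max(1, math.ceil(p * len(x)))
    scale = rms(x[:k])
    return [g * v / scale for g, v in zip(gain, x)]
```

On an input whose mean is already zero, `layer_norm` and `rms_norm` return the same values, which is one way to see that centering is the only step RMSNorm removes.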
3 days ago

ReasonCACHE: Learning Reasoning Without Weight Updates

This episode explores ReasonCACHE, a method for improving multi-step reasoning in large language models by keeping the backbone frozen and training a compact per-layer key-value memory instead of updating billions of weights. It situates the paper against in-context learning, many-shot prompting, prefix tuning, LoRA, and context-distillation work, explaining how learned latent memory sits between raw prompting and full fine-tuning. The discussion centers on the paper’s real claim and its main point of skepticism: whether these learned caches actually teach a reusable reasoning procedure or mostly compress and elicit abilities the model already had. Listeners would find it interesting because it connects a concrete new method to a larger debate about how LLMs acquire reasoning skills, while also highlighting the practical payoff of avoiding huge prompts, quadratic attention costs, and brittle long-context setups.
Sources:
1. ReasonCACHE: Teaching LLMs To Reason Without Weight Updates — Sharut Gupta, Phillip Isola, Stefanie Jegelka, David Lopez-Paz, Kartik Ahuja, Mark Ibrahim, Mohammad Pezeshki, 2026 http://arxiv.org/abs/2602.02366
2. Prefix-Tuning: Optimizing Continuous Prompts for Generation — Xiang Lisa Li, Percy Liang, 2021 https://arxiv.org/abs/2101.00190
3. The Power of Scale for Parameter-Efficient Prompt Tuning — Brian Lester, Rami Al-Rfou, Noah Constant, 2021 https://arxiv.org/abs/2104.08691
4. P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks — Xiao Liu, Kaixuan Ji, Yicheng Fu, Zhengxiao Du, Zhilin Yang, Jie Tang, 2022 https://arxiv.org/abs/2110.07602
5. LoRA: Low-Rank Adaptation of Large Language Models — Edward J. Hu, Yelong Shen, Phillip Wallis, Weizhu Chen, et al., 2021 https://arxiv.org/abs/2106.09685
6. Adapting Language Models to Compress Contexts — Alexis Chevalier, Alexander Wettig, Anirudh Ajith, Danqi Chen, 2023 https://arxiv.org/abs/2305.14788
7. Learning to Compress Prompts with Gist Tokens — Jesse Mu, Xiang Lisa Li, Noah Goodman, 2023 https://arxiv.org/abs/2304.08467
8. Deliberation in Latent Space via Differentiable Cache Augmentation — Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam, 2024 https://arxiv.org/abs/2412.17747
9. When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations — Aleksandar Petrov, Philip H. S. Torr, Adel Bibi, 2023 https://scholar.google.com/scholar?q=When+Do+Prompting+and+Prefix-Tuning+Work?+A+Theory+of+Capabilities+and+Limitations
10. Many-Shot In-Context Learning — Rishabh Agarwal et al., 2024 https://scholar.google.com/scholar?q=Many-Shot+In-Context+Learning
11. Cartridges: Lightweight and general-purpose long context representations via self-study — Sabri Eyuboglu et al., 2025 https://scholar.google.com/scholar?q=Cartridges:+Lightweight+and+general-purpose+long+context+representations+via+self-study
12. Great Memory, Shallow Reasoning: Limits of kNN-LMs — Shangyi Geng, Wenting Zhao, Alexander M. Rush, 2024 https://scholar.google.com/scholar?q=Great+Memory,+Shallow+Reasoning:+Limits+of+kNN-LMs
13. Training Plug-n-Play Knowledge Modules with Deep Context Distillation — Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vulić, Alessandro Sordoni, 2025 https://scholar.google.com/scholar?q=Training+Plug-n-Play+Knowledge+Modules+with+Deep+Context+Distillation
14. More is not always better? Enhancing Many-Shot In-Context Learning with Differentiated and Reweighting Objectives — Xiaoqing Zhang et al., 2025 https://scholar.google.com/scholar?q=More+is+not+always+better?+Enhancing+Many-Shot+In-Context+Learning+with+Differentiated+and+Reweighting+Objectives
15. HyperAttention: Long-context Attention in Near-Linear Time — Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David P. Woodruff, Amir Zandieh, 2023 https://scholar.google.com/scholar?q=HyperAttention:+Long-context+Attention+in+Near-Linear+Time
16. Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning — Ling Team et al., 2025 https://scholar.google.com/scholar?q=Every+Attention+Matters:+An+Efficient+Hybrid+Architecture+for+Long-Context+Reasoning
17. AI Post Transformers: When Many-Shot CoT Becomes Test-Time Learning — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-15-when-many-shot-cot-becomes-test-time-lea-c25bfe.mp3
18. AI Post Transformers: Can Models Learn from Long Context? — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-05-can-models-learn-from-long-context-77533e.mp3
19. AI Post Transformers: How Induction Heads Emerge in Transformers — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-03-how-induction-heads-emerge-in-transforme-a7bfcb.mp3
20. AI Post Transformers: Latent Reasoning with Normalizing Flows — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-06-latent-reasoning-with-normalizing-flows-6ee916.mp3
21. AI Post Transformers: Training LLMs for Divide-and-Conquer Reasoning — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-04-training-llms-for-divide-and-conquer-rea-ea6e22.mp3
22. AI Post Transformers: Why Open Relational Foundation Models Fail — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-06-22-why-open-relational-foundation-models-fa-c303c6.mp3
Interactive Visualization: ReasonCACHE: Learning Reasoning Without Weight Updates
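
To make the "learned key-value memory on a frozen backbone" idea from the ReasonCACHE episode concrete, here is a rough single-head sketch in the spirit of prefix tuning. These notes do not spell out the paper's exact parameterization or training loop, so treat this as an illustration under assumptions: `cache_keys` and `cache_values` stand in for the trainable per-layer memory, everything else stands in for the frozen model, and all names are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Scaled dot-product attention for a single query vector.
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

def attend_with_cache(query, keys, values, cache_keys, cache_values):
    # The learned cache entries are prepended to the keys and values computed
    # from the actual input; the attention computation itself is unchanged.
    return attend(query, cache_keys + keys, cache_values + values)
```

With an empty cache, `attend_with_cache` reduces to plain attention. In this sketch only the cache entries would receive gradient updates during training, which is what keeps the backbone weights untouched.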
4 days ago

HELM: Holistic Evaluation of Language Models

This episode explores the HELM framework for evaluating language models, arguing that once models become general-purpose infrastructure, single-dataset accuracy benchmarks are too narrow to capture their real-world behavior. It explains how HELM organizes evaluation across 30 models, 16 core scenarios, and seven metric families, measuring not just accuracy but also calibration, robustness, fairness, bias, toxicity, and efficiency under standardized conditions. The discussion highlights why HELM’s scenario-by-metric grid and targeted side studies on issues like reasoning, memorization, copyright, and disinformation matter: they make gaps in measurement visible instead of hiding them behind a single leaderboard score. A listener would find it interesting because it shows how benchmark design reflects values, and why model rankings can be misleading if they ignore confidence, harm, and cost.
Sources:
1. HELM: Holistic Evaluation of Language Models https://arxiv.org/pdf/2211.09110
2. Equality of Opportunity in Supervised Learning — Moritz Hardt, Eric Price, Nathan Srebro, 2016 https://arxiv.org/abs/1610.02413
3. Language (Technology) is Power: A Critical Survey of "Bias" in NLP — Su Lin Blodgett, Solon Barocas, Hal Daume III, Hanna Wallach, 2020 https://arxiv.org/abs/2005.14050
4. StereoSet: Measuring stereotypical bias in pretrained language models — Moin Nadeem, Anna Bethke, Siva Reddy, 2020 https://arxiv.org/abs/2004.09456
5. BBQ: A Hand-Built Bias Benchmark for Question Answering — Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, Samuel R. Bowman, 2021 https://arxiv.org/abs/2110.08193
6. Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification — Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, Lucy Vasserman, 2019 https://arxiv.org/abs/1903.04561
7. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models — Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith, 2020 https://arxiv.org/abs/2009.11462
8. Challenges in Detoxifying Language Models — Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, Po-Sen Huang, 2021 https://arxiv.org/abs/2109.07445
9. ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection — Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, Ece Kamar, 2022 https://arxiv.org/abs/2203.09509
10. On the Opportunities and Risks of Foundation Models — Rishi Bommasani et al., 2021 https://scholar.google.com/scholar?q=On+the+Opportunities+and+Risks+of+Foundation+Models
11. The EleutherAI Language Model Evaluation Harness — Leo Gao et al., 2021 https://scholar.google.com/scholar?q=The+EleutherAI+Language+Model+Evaluation+Harness
12. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models — Aarohi Srivastava et al., 2022 https://scholar.google.com/scholar?q=Beyond+the+Imitation+Game:+Quantifying+and+Extrapolating+the+Capabilities+of+Language+Models
13. Dynabench: Rethinking Benchmarking in NLP — Douwe Kiela et al., 2021 https://scholar.google.com/scholar?q=Dynabench:+Rethinking+Benchmarking+in+NLP
14. What Will it Take to Fix Benchmarking in Natural Language Understanding? — Samuel R. Bowman, George Dahl, 2021 https://scholar.google.com/scholar?q=What+Will+it+Take+to+Fix+Benchmarking+in+Natural+Language+Understanding?
15. Rethinking Benchmark and Contamination for Language Models with Rephrased Samples — Shuo Yang et al., 2023 https://scholar.google.com/scholar?q=Rethinking+Benchmark+and+Contamination+for+Language+Models+with+Rephrased+Samples
16. Investigating Data Contamination in Modern Benchmarks for Large Language Models — Chunyuan Deng et al., 2023 https://scholar.google.com/scholar?q=Investigating+Data+Contamination+in+Modern+Benchmarks+for+Large+Language+Models
17. Benchmark Data Contamination of Large Language Models: A Survey — Cheng Xu et al., 2024 https://scholar.google.com/scholar?q=Benchmark+Data+Contamination+of+Large+Language+Models:+A+Survey
18. Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead — Vidhisha Balachandran et al., 2025 https://scholar.google.com/scholar?q=Inference-Time+Scaling+for+Complex+Tasks:+Where+We+Stand+and+What+Lies+Ahead
19. WTU-EVAL: A Whether-or-Not Tool Usage Evaluation Benchmark for Large Language Models — Kangyun Ning et al., 2024 https://scholar.google.com/scholar?q=WTU-EVAL:+A+Whether-or-Not+Tool+Usage+Evaluation+Benchmark+for+Large+Language+Models
20. T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step — Zehui Chen et al., 2023 https://scholar.google.com/scholar?q=T-Eval:+Evaluating+the+Tool+Utilization+Capability+of+Large+Language+Models+Step+by+Step
21. AI Post Transformers: IMO-Bench for Robust Mathematical Reasoning — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-imo-bench-for-robust-mathematical-reason-143489.mp3
22. AI Post Transformers: Real Context Size and Context Rot — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-07-real-context-size-and-context-rot-56cbb4.mp3
23. AI Post Transformers: Qwen3Guard: Streaming Three-Way Safety Classification for LLMs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-16-qwen3guard-streaming-three-way-safety-cl-26b0ef.mp3
24. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
Interactive Visualization: HELM: Holistic Evaluation of Language Models
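
The HELM episode notes that accuracy alone can hide whether a model's confidence is trustworthy. As a small illustration of what a calibration metric measures, here is a sketch of expected calibration error with equal-width bins. This is one standard formulation and is assumed here for illustration; these notes do not state exactly how HELM computes calibration, and the function name and the `n_bins=10` default are choices made for this example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # Group predictions into equal-width confidence bins, then average
    # |accuracy - mean confidence| weighted by each bin's share of predictions.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 lands in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, h in bucket if h) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece
```

A model that answers with 0.8 confidence but is right only half the time scores about 0.3 on those predictions, even if its raw accuracy matches a better-calibrated model.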


Creator: mcgrof
Years active: 2025 - 2026
Episodes: 738
Rating: Clean
Show website: AI Post Transformers


