AI Post Transformers

mcgrof

3.7 (3)
Technology
Updated Daily

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

1d ago

CacheFlow: Optimal 3D-Parallel KV Cache Restoration

This episode explores CacheFlow, a system for restoring the KV cache that lets LLM serving systems reload prior context into GPU memory quickly. The discussion centers on a scheduling problem: rather than choosing between recomputing attention states or loading cached tensors from CPU, disk, or another node, CacheFlow treats restoration as a coordination problem across tokens, layers, GPUs, and concurrent requests simultaneously. It highlights a two-pointer "meet in the middle" technique applied along both token chunks and model layers, with an offline-profiled crossover point determining which strategy dominates for a given sequence length. The hosts also stress that the paper proves an optimality bound for its scheduling policy rather than just benchmarking against baselines, distinguishing it from typical serving-infrastructure papers. Listeners interested in reducing time-to-first-token for long-context chatbots, coding agents, or retrieval-heavy pipelines will find the reframing of restoration as a multi-dimensional scheduling problem, rather than a single per-request tradeoff, particularly compelling.

Sources:
1. CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration - Sean Nian, Jiahao Fang, Qilong Feng, Zhiyu Wu, Fan Lai, 2026 http://arxiv.org/abs/2604.25080
2. Compute or load KV cache? Why not both? (Cake) - Shuowei Jin, Xueshen Liu, Qingzhao Zhang, Z. Morley Mao, 2025 https://scholar.google.com/scholar?q=Compute+or+load+KV+cache%3F+Why+not+both%3F+%28Cake%29
3. Fast state restoration in LLM serving with HCache - Shiwei Gao, Youmin Chen, Jiwu Shu, 2025 https://scholar.google.com/scholar?q=Fast+state+restoration+in+LLM+serving+with+HCache
4. Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot - Ruoyu Qin, Zheming Li, Weiran He, et al., 2025 https://scholar.google.com/scholar?q=Mooncake%3A+Trading+more+storage+for+less+computation+%E2%80%94+a+KVCache-centric+architecture+for+serving+LLM+chatbot
5. LMCache: An efficient KV cache layer for enterprise-scale LLM inference - Yuhan Liu, Yihua Cheng, Jiayi Yao, et al., 2025 https://scholar.google.com/scholar?q=LMCache%3A+An+efficient+KV+cache+layer+for+enterprise-scale+LLM+inference
6. CacheBlend: Fast large language model serving for RAG with cached knowledge fusion - Jiayi Yao, Hanchen Li, Yuhan Liu, et al., 2025 https://scholar.google.com/scholar?q=Cacheblend%3A+Fast+large+language+model+serving+for+RAG+with+cached+knowledge+fusion
7. KVFlow: Efficient prefix caching for accelerating LLM-based multi-agent workflows - Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, et al., 2025 https://scholar.google.com/scholar?q=Kvflow%3A+Efficient+prefix+caching+for+accelerating+LLM-based+multi-agent+workflows
8. Continuum: Efficient and robust multi-turn LLM agent scheduling with KV cache time-to-live - Hanchen Li, Qiuyang Mang, Runyuan He, et al., 2026 https://scholar.google.com/scholar?q=Continuum%3A+Efficient+and+robust+multi-turn+LLM+agent+scheduling+with+KV+cache+time-to-live
1d ago

Compute or Load? Cake's Smart KV Cache Scheduler

This episode examines "Compute Or Load KV Cache? Why Not Both?" (Jin, Liu, et al., University of Michigan), which introduces Cake, a scheduling system for LLM inference. The discussion traces the KV cache problem from its roots in the attention mechanism through the rise of prefix caching at production providers like OpenAI, Anthropic, and DeepSeek, then explains why loading a cached prefix isn't automatically cheap: most cache hits land on slow disk tiers rather than fast GPU memory. The core insight covered is that compute cost per chunk rises across a sequence while I/O cost stays flat, which lets Cake run a "meet in the middle" two-pointer scheduler that computes early chunks on GPU while simultaneously loading later chunks from storage. Listeners interested in LLM serving infrastructure will find concrete numbers throughout, like the 30-second prefill tax on a 72,000-token input, that ground the tradeoff between recomputation and I/O in real system constraints rather than abstract theory.

Sources:
1. Compute Or Load KV Cache? Why Not Both? - Shuowei Jin, Xueshen Liu, Qingzhao Zhang, Z. Morley Mao, 2024 http://arxiv.org/abs/2410.03065
2. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving - Qin, R. et al. (Moonshot AI / Kimi), 2024 https://scholar.google.com/scholar?q=Mooncake%3A+A+KVCache-centric+Disaggregated+Architecture+for+LLM+Serving
3. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model - DeepSeek-AI, 2024 https://scholar.google.com/scholar?q=DeepSeek-V2%3A+A+Strong%2C+Economical%2C+and+Efficient+Mixture-of-Experts+Language+Model
4. CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion - Yao, J., Li, H., Liu, Y., et al., 2024 https://scholar.google.com/scholar?q=CacheBlend%3A+Fast+Large+Language+Model+Serving+with+Cached+Knowledge+Fusion
5. Preble: Efficient Distributed Prompt Scheduling for LLM Serving - Srivatsa, V. et al., 2024 https://scholar.google.com/scholar?q=Preble%3A+Efficient+Distributed+Prompt+Scheduling+for+LLM+Serving
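
The "meet in the middle" idea in this episode can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not Cake's actual scheduler: the function name `meet_in_the_middle`, the per-chunk cost lists, and the greedy rule (the worker that would finish its next chunk sooner claims it) are all hypothetical choices made for illustration.

```python
def meet_in_the_middle(compute_cost, load_cost):
    """Toy two-pointer schedule over prompt chunks.

    The GPU recomputes chunks from the front of the sequence while the
    loader fetches cached chunks from the back, until the pointers meet.
    compute_cost[i] and load_cost[i] are assumed per-chunk times.
    """
    front, back = 0, len(compute_cost) - 1
    t_compute = t_load = 0.0          # time at which each worker is next free
    computed, loaded = [], []
    while front <= back:
        # Greedy rule (an assumption): give the next chunk to whichever
        # worker would finish its own next chunk earlier.
        if t_compute + compute_cost[front] <= t_load + load_cost[back]:
            t_compute += compute_cost[front]
            computed.append(front)
            front += 1
        else:
            t_load += load_cost[back]
            loaded.append(back)
            back -= 1
    return computed, loaded, max(t_compute, t_load)


# Compute cost rises along the sequence, I/O cost per chunk stays flat.
compute_cost = [1, 2, 3, 4, 5, 6]
load_cost = [4, 4, 4, 4, 4, 4]
computed, loaded, makespan = meet_in_the_middle(compute_cost, load_cost)
print(computed, loaded, makespan)
```

With these made-up costs, recomputing everything would take 21 time units and loading everything 24, while the overlapped schedule finishes in 10.
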
1d ago

Fast State Restoration for Evicted LLM KV Caches

"Fast State Restoration in LLM Serving with HCache" tackles a hidden cost of running LLM chat services: when GPU memory pressure forces eviction of a conversation's KV cache, restoring that state currently means either recomputing it from scratch (20-26x slower than no restoration) or streaming the full cache back from storage over PCIe (6.5-13x slower). Drawing on traces from ShareGPT4 and L-Eval, the discussion lays out why eviction is the common case rather than an edge case: a single A100-40GB holds only enough KV cache for a handful of live conversations at once. The episode walks through the researchers' proposed middle path: caching the hidden state (one layer upstream of the key/value projection) instead of the KV cache itself, then reconstructing K and V on demand via a cheap matrix multiplication. It's a systems paper grounded in first-principles reasoning about transformer architecture before any benchmarks are run, making the case for why this approach should be faster on theoretical grounds alone. Listeners interested in the practical engineering trade-offs behind serving long, multi-turn LLM conversations at scale will find the framing of "recompute vs. offload vs. something smaller in between" a clear lens on a problem most users never realize is happening.

Sources:
1. Fast State Restoration in LLM Serving with HCache - Shiwei Gao, Youmin Chen, Jiwu Shu, 2024 http://arxiv.org/abs/2410.05004
2. Training Deep Nets with Sublinear Memory Cost - Tianqi Chen, Bin Xu, Chiyuan Zhang, Carlos Guestrin, 2016 https://scholar.google.com/scholar?q=Training+Deep+Nets+with+Sublinear+Memory+Cost
3. Efficient Memory Management for Large Language Model Serving with PagedAttention - Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al., 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
4. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving - Ruoyu Qin, Zheming Li, Weiran He, et al. (Moonshot AI and Tsinghua University), 2024/2025 https://scholar.google.com/scholar?q=Mooncake%3A+A+KVCache-centric+Disaggregated+Architecture+for+LLM+Serving
5. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving - Yuhan Liu, Junchen Jiang, et al. (University of Chicago), 2024 https://scholar.google.com/scholar?q=CacheGen%3A+KV+Cache+Compression+and+Streaming+for+Fast+Large+Language+Model+Serving
6. CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion - Jiayi Yao, Hanchen Li, Yuhan Liu, et al., 2025 https://scholar.google.com/scholar?q=CacheBlend%3A+Fast+Large+Language+Model+Serving+for+RAG+with+Cached+Knowledge+Fusion
7. SGLang: Efficient Execution of Structured Language Model Programs - Lianmin Zheng, Liangsheng Yin, Ying Sheng, et al., 2024 https://scholar.google.com/scholar?q=SGLang%3A+Efficient+Execution+of+Structured+Language+Model+Programs
8. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention - Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo, 2024 (USENIX ATC) https://scholar.google.com/scholar?q=Cost-Efficient+Large+Language+Model+Serving+for+Multi-turn+Conversations+with+CachedAttention
9. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints - Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai, 2023 (EMNLP) https://scholar.google.com/scholar?q=GQA%3A+Training+Generalized+Multi-Query+Transformer+Models+from+Multi-Head+Checkpoints
10. Prompt Cache: Modular Attention Reuse for Low-Latency Inference - In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong, 2024 (MLSys) https://scholar.google.com/scholar?q=Prompt+Cache%3A+Modular+Attention+Reuse+for+Low-Latency+Inference
11. Efficiently Programming Large Language Models using SGLang - Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, 2023/2024 https://scholar.google.com/scholar?q=Efficiently+Programming+Large+Language+Models+using+SGLang
12. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU - Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang, 2023 (ICML) https://scholar.google.com/scholar?q=FlexGen%3A+High-Throughput+Generative+Inference+of+Large+Language+Models+with+a+Single+GPU
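
The hidden-state idea described in this episode can be sketched in a few lines. This is a simplified illustration, not HCache's implementation: it assumes a single layer with plain multi-head attention where K and V each have the same width as the hidden state, and it leaves out normalization, positional encoding, and head splitting. The helper names (`matmul`, `project_kv`, `count`) and the tiny random weights are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def project_kv(hidden, w_k, w_v):
    """Rebuild one layer's keys and values from that layer's hidden states."""
    return matmul(hidden, w_k), matmul(hidden, w_v)


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def count(matrix):
    """Number of scalar values held in a matrix."""
    return sum(len(row) for row in matrix)


random.seed(0)
n_tokens, d_model = 3, 4
hidden = rand_matrix(n_tokens, d_model)   # input to the K/V projections
w_k = rand_matrix(d_model, d_model)       # model weights, always resident on GPU
w_v = rand_matrix(d_model, d_model)

keys, values = project_kv(hidden, w_k, w_v)   # what a KV cache would store

stored = hidden                               # what a hidden-state cache stores
restored_keys, restored_values = project_kv(stored, w_k, w_v)

print(restored_keys == keys and restored_values == values)
print(count(stored), "values stored instead of", count(keys) + count(values))
```

Under these assumptions the stored state is half the size of the KV pair it replaces, at the price of two matrix multiplications per layer on restore. Models that use grouped-query attention have smaller K and V, so the ratio would differ there.
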
1d ago

IMPRESS: A Multi-Tier KV Storage System for Faster LLM Prefill

This episode explores IMPRESS, a systems paper from Zhejiang University and Huawei Cloud researchers presented at USENIX FAST 2025, which tackles a specific bottleneck in large language model inference: the time delay before a model produces its first response token when cached context has spilled onto disk. The discussion traces how modern LLM applications (retrieval-augmented generation, multi-turn chatbots, and plugin frameworks) prepend large chunks of context that inflate prefill costs superlinearly, with one cited example showing a 2,600-token plugin prompt stretching time-to-first-token by nine times on OPT-30B. It covers how prior work like vLLM's PagedAttention and AttentionStore addressed prefix KV caching but hit a wall once caches outgrew GPU and CPU memory and moved to slower disk storage, where I/O latency can consume up to 98 percent of total delay. The conversation traces IMPRESS's key insight: repurposing an importance-scoring technique from H2O (originally used to decide what to evict during decoding) to instead decide what's worth loading from disk before prefill even starts, claiming up to 2.8x lower latency with comparable accuracy. Listeners interested in the practical engineering tradeoffs behind making large-context LLM applications faster and cheaper to run will find the discussion's grounding in measured attention patterns, rather than benchmark tweaking, particularly compelling.

Sources:
1. IMPRESS: A Multi-Tier KV Storage System for Faster LLM Prefill https://www.usenix.org/system/files/fast25-chen-weijian-impress.pdf
2. Efficient Memory Management for Large Language Model Serving with PagedAttention - Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica (UC Berkeley), 2023 https://scholar.google.com/scholar?q=Efficient+Memory+Management+for+Large+Language+Model+Serving+with+PagedAttention
3. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models - Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen, 2023 https://scholar.google.com/scholar?q=H2O%3A+Heavy-Hitter+Oracle+for+Efficient+Generative+Inference+of+Large+Language+Models
4. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention (AttentionStore) - Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, and colleagues (industry/academic collaboration), 2024 https://scholar.google.com/scholar?q=Cost-Efficient+Large+Language+Model+Serving+for+Multi-turn+Conversations+with+CachedAttention+%28AttentionStore%29
5. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving - Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu (Moonshot AI, Tsinghua University), 2024 https://scholar.google.com/scholar?q=Mooncake%3A+A+KVCache-centric+Disaggregated+Architecture+for+LLM+Serving
6. AttentionStore: Cost-Effective Attention Reuse across Multi-Turn Conversations in Large Language Model Serving - Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo, 2024 https://scholar.google.com/scholar?q=AttentionStore%3A+Cost-Effective+Attention+Reuse+across+Multi-Turn+Conversations+in+Large+Language+Model+Serving
7. Retrieval Head Mechanistically Explains Long-Context Factuality - Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, Yao Fu, 2024 https://scholar.google.com/scholar?q=Retrieval+Head+Mechanistically+Explains+Long-Context+Factuality
8. RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation - Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin, 2024 https://scholar.google.com/scholar?q=RAGCache%3A+Efficient+Knowledge+Caching+for+Retrieval-Augmented+Generation
1d ago

Small Collectives, Big TLB Cost: Reverse Address Translation in Scale-Up GPU Pods

This episode examines the "last mile" problem in multi-GPU scale-up fabrics: when a remote memory request arrives at a destination GPU carrying a Network Physical Address, that address means nothing locally until it's converted back to a System Physical Address through Reverse Address Translation. The discussion traces why this destination-side translation problem is genuinely new: decades of TLB optimization research have assumed the initiating processor controls the access pattern, while NVLink and UALink fabrics flip that model, forcing the receiving GPU to translate incoming requests with no warning and no control. The hosts connect this seemingly niche hardware detail to a concrete workload: Mixture-of-Experts models rely on All-to-All dispatch and gather collectives, implemented in libraries like NCCL and RCCL, that cross this translation step twice per layer across dozens of layers. To quantify the impact, the paper extends the ASTRA-sim2.0 simulator with an Omnet++ network backend to model packet-level UALink Clos topologies, generating realistic All-to-All traffic via Microsoft's MSCCLang. Listeners interested in the unglamorous plumbing beneath large-scale AI infrastructure will find a compelling case that address translation, not just compute or bandwidth, may be a hidden bottleneck for the collective communication patterns underpinning today's largest inference deployments.

Sources:
1. Amel Fatima's Reverse Address Translation in GPU Fabrics https://arxiv.org/pdf/2604.02473
2. Trans-FW: Short Circuiting Page Table Walk in Multi-GPU Systems via Remote Forwarding - Bingyao Li, Jieming Yin, Anup Holey, Youtao Zhang, Jun Yang, Xulong Tang, 2023 https://scholar.google.com/scholar?q=Trans-FW%3A+Short+Circuiting+Page+Table+Walk+in+Multi-GPU+Systems+via+Remote+Forwarding
3. ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale - William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna, 2023 https://scholar.google.com/scholar?q=ASTRA-sim2.0%3A+Modeling+Hierarchical+Networks+and+Disaggregated+Systems+for+Large-model+Training+at+Scale
4. MSCCLang: Microsoft Collective Communication Language - Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi, Yifan Xiong, 2023 https://scholar.google.com/scholar?q=MSCCLang%3A+Microsoft+Collective+Communication+Language
5. Tutel: Adaptive Mixture-of-Experts at Scale - Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al., 2023 https://scholar.google.com/scholar?q=Tutel%3A+Adaptive+Mixture-of-Experts+at+Scale
6. Optimizing distributed ML communication with fused computation-collective operations - Kishore Punniyamurthy, Khaled Hamidouche, Bradford M. Beckmann, 2024 https://scholar.google.com/scholar?q=Optimizing+distributed+ML+communication+with+fused+computation-collective+operations
7. TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference - Raja Gond, Nipun Kwatra, Ramachandran Ramjee, 2025 https://scholar.google.com/scholar?q=TokenWeave%3A+Efficient+Compute-Communication+Overlap+for+Distributed+LLM+Inference
2d ago

Context is All You Need: Fixing OOD Drift Without Retraining

This episode explores CONTXT, a training-free method for correcting distribution shift by adding a single precomputed "context vector" directly into a model's internal activations: no fine-tuning, no paired prompts, and no gradient updates required. The discussion traces the paper's neuroscience grounding in dual-process theory, where the hippocampus rapidly encodes context and hands it to the prefrontal cortex to amplify relevant features and suppress irrelevant ones, and examines how faithfully that analogy maps onto a simple additive vector operation. It also clarifies the distinction between domain generalization (no access to target data at all) and test-time adaptation (unlabeled target data available at inference), situating CONTXT within existing activation-steering approaches like those requiring token-level paired prompts. Listeners get a walkthrough of the core math (h plus alpha times an index vector, extendable to multiple stacked contexts for simultaneous edits like adjusting tone while removing sarcasm) before the hosts turn to concrete demonstrations, including a striking out-of-distribution image classification example. The conversation is notable for its skepticism: one host pushes back hard on whether a two-region brain theory can really license a one-line vector subtraction, making this as much a critique of steering-paper rigor as an explainer of the method itself.

Sources:
1. Context is All You Need - Jean Erik Delanois, Shruti Joshi, Ryan Golden, Teresa Nick, Maxim Bazhenov, 2026 http://arxiv.org/abs/2604.04364
2. In Search of Lost Domain Generalization - Ishaan Gulrajani, David Lopez-Paz, 2020 https://scholar.google.com/scholar?q=In+Search+of+Lost+Domain+Generalization
3. Deep CORAL: Correlation Alignment for Deep Domain Adaptation - Baochen Sun, Kate Saenko, 2016 https://scholar.google.com/scholar?q=Deep+CORAL%3A+Correlation+Alignment+for+Deep+Domain+Adaptation
4. Domain-Adversarial Training of Neural Networks - Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, Victor Lempitsky, 2016 https://scholar.google.com/scholar?q=Domain-Adversarial+Training+of+Neural+Networks
5. Deeper, Broader and Artier Domain Generalization (the PACS dataset) - Da Li, Yongxin Yang, Yi-Zhe Song, Timothy M. Hospedales, 2017 https://scholar.google.com/scholar?q=Deeper%2C+Broader+and+Artier+Domain+Generalization+%28the+PACS+dataset%29
6. Tent: Fully Test-Time Adaptation by Entropy Minimization - Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, Trevor Darrell, 2021 https://scholar.google.com/scholar?q=Tent%3A+Fully+Test-Time+Adaptation+by+Entropy+Minimization
7. Improving Robustness Against Common Corruptions by Covariate Shift Adaptation - Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, Matthias Bethge, 2020 https://scholar.google.com/scholar?q=Improving+Robustness+Against+Common+Corruptions+by+Covariate+Shift+Adaptation
8. Do We Really Need to Access the Source Data? Source Hypothesis Transfer for Unsupervised Domain Adaptation - Jian Liang, Dapeng Hu, Jiashi Feng, 2020 https://scholar.google.com/scholar?q=Do+We+Really+Need+to+Access+the+Source+Data%3F+Source+Hypothesis+Transfer+for+Unsupervised+Domain+Adaptation
9. MEMO: Test Time Robustness via Adaptation and Augmentation - Marvin Zhang, Sergey Levine, Chelsea Finn, 2022 https://scholar.google.com/scholar?q=MEMO%3A+Test+Time+Robustness+via+Adaptation+and+Augmentation
10. Steering Llama 2 via Contrastive Activation Addition - Panickssery, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., Turner, A. M., 2023 https://scholar.google.com/scholar?q=Steering+Llama+2+via+Contrastive+Activation+Addition
11. Representation Engineering: A Top-Down Approach to AI Transparency - Zou, A., Gao, L., Greenblatt, R., et al., 2023 https://scholar.google.com/scholar?q=Representation+Engineering%3A+A+Top-Down+Approach+to+AI+Transparency
12. Extracting Latent Steering Vectors from Pretrained Language Models - Subramani, N., Suresh, N., Peters, M. E., 2022 https://scholar.google.com/scholar?q=Extracting+Latent+Steering+Vectors+from+Pretrained+Language+Models
13. Distributionally Robust Neural Networks for Group Shifts (GroupDRO) - Sagawa, S., Koh, P. W., Hashimoto, T. B., Liang, P., 2020 https://scholar.google.com/scholar?q=Distributionally+Robust+Neural+Networks+for+Group+Shifts+%28GroupDRO%29
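
The "h plus alpha times a vector" operation mentioned in this episode, including the stacked multi-context case, can be sketched as follows. This is only an illustration of the arithmetic: the function name `apply_contexts`, the example vectors, and the scaling values are hypothetical, and how CONTXT actually derives its context vectors is not covered here.

```python
def apply_contexts(h, contexts):
    """Return h + sum(alpha_i * v_i) for a list of (alpha_i, v_i) pairs.

    h is one activation vector; each v_i is assumed to be a precomputed
    context vector of the same length. No gradients or training involved.
    """
    out = list(h)
    for alpha, v in contexts:
        if len(v) != len(out):
            raise ValueError("context vector must match activation size")
        out = [o + alpha * x for o, x in zip(out, v)]
    return out


h = [1.0, 2.0, 3.0]
contexts = [
    (0.5, [2.0, 0.0, -2.0]),   # a positive alpha adds a context
    (-1.0, [0.0, 1.0, 0.0]),   # a negative alpha subtracts one
]
steered = apply_contexts(h, contexts)
print(steered)
```

Because the edits are plain additions, their order does not matter, which is what makes stacking several contexts straightforward.
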
2d ago

Post-Training Science: Scaling Laws for SFT and LoRA

This episode explores "Post-Training Science for Supervised Fine-Tuning," a study from Baseten researchers that treats SFT hyperparameter choices (learning rate, batch size, LoRA rank, epochs, and optimizer) as empirical questions rather than inherited folklore. Using controlled, one-variable-at-a-time sweeps across Qwen3 and Llama models ranging from 0.6 billion to 235 billion parameters (including dense and mixture-of-experts architectures), the hosts unpack findings like a surprisingly stable optimal LoRA learning rate that holds flat across two orders of magnitude in model scale, and a batch size that behaves more like a compute-cost tradeoff than a quality lever. They dig into how LoRA stacks up against full fine-tuning, with LoRA recovering a median 98% of full fine-tuning's gains using a fraction of the trainable parameters, and discuss where increasing LoRA rank stops paying off. Along the way, they flag a methodological wrinkle worth scrutinizing: the same evaluator used to construct the training data is also used to judge the fine-tuned model's output quality. Listeners interested in practical, evidence-based guidance for production fine-tuning, rather than another one-off trick, will find concrete, scale-tested defaults here.

Sources:
1. Post-Training Science: Scaling Laws for SFT and LoRA https://www.datocms-assets.com/104802/1781805778-baseten-research-sft.pdf
2. LoRA: Low-Rank Adaptation of Large Language Models - Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen (Microsoft Research), 2021 (ICLR 2022) https://scholar.google.com/scholar?q=LoRA%3A+Low-Rank+Adaptation+of+Large+Language+Models
3. QLoRA: Efficient Finetuning of Quantized LLMs - Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer (University of Washington), 2023 (NeurIPS 2023) https://scholar.google.com/scholar?q=QLoRA%3A+Efficient+Finetuning+of+Quantized+LLMs
4. LIMA: Less Is More for Alignment - Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, Omer Levy (Meta AI / external collaborators), 2023 (NeurIPS 2023) https://scholar.google.com/scholar?q=LIMA%3A+Less+Is+More+for+Alignment
5. Muon: An optimizer for hidden layers in neural networks / Muon is Scalable for LLM Training - Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, Jeremy Bernstein (original Muon, 2024); Moonshot AI / Kimi team, led by Jingyuan Liu and Jianlin Su, et al. (scaling follow-up, 2025), 2024-2025 https://scholar.google.com/scholar?q=Muon%3A+An+optimizer+for+hidden+layers+in+neural+networks+%2F+Muon+is+Scalable+for+LLM+Training
6. Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy Without Increasing Inference Time - Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, et al., 2022 https://scholar.google.com/scholar?q=Model+Soups%3A+Averaging+Weights+of+Multiple+Fine-Tuned+Models+Improves+Accuracy+Without+Increasing+Inference+Time
7. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning - Armen Aghajanyan, Luke Zettlemoyer, Sonal Gupta, 2020 https://scholar.google.com/scholar?q=Intrinsic+Dimensionality+Explains+the+Effectiveness+of+Language+Model+Fine-Tuning
8. S-LoRA: Serving Thousands of Concurrent LoRA Adapters - Ying Sheng, Shiyi Cao, Dacheng Li, et al., 2023 https://scholar.google.com/scholar?q=S-LoRA%3A+Serving+Thousands+of+Concurrent+LoRA+Adapters
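
To make "a fraction of the trainable parameters" concrete, here is a small arithmetic sketch for a single weight matrix. It relies on the standard LoRA formulation (the frozen weight of shape d_out x d_in gets a trainable update B·A, with B of shape d_out x r and A of shape r x d_in). The 4096 x 4096 layer size and rank 16 are illustrative assumptions, not numbers from the study.

```python
def full_trainable_params(d_in, d_out):
    """Trainable values when the whole d_out x d_in matrix is fine-tuned."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Trainable values in the low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


d_in = d_out = 4096   # hypothetical projection size
rank = 16             # hypothetical LoRA rank

full = full_trainable_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
print(full, lora, lora / full)
```

For this made-up layer the adapter trains under 1% of the values that full fine-tuning would. Doubling the rank doubles the adapter size, which is why the point where extra rank stops helping matters in practice.
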
2d ago

Prompt Boundary-Aware Scheduling with Event Tensors for Dynamic Kernels

This episode explores EventTensor, a compiler abstraction from Carnegie Mellon and collaborators (presented at MLSys 2026) that treats synchronization events as first-class tensors for compiling GPU megakernels. The discussion covers how encoding true data dependencies (rather than waiting for entire kernels to finish) enables fine-grained scheduling, illustrated through a split-K summation example and a symbolic batch-size template that avoids recompilation when shapes change. A key focus is how the system handles Mixture-of-Experts routing, where dependencies aren't known until runtime, via data-dependent event counters and task triggering computed from router outputs. The hosts also unpack the tradeoffs between static and dynamic scheduling, showing that static wins on predictable dense workloads while dynamic pays off only under genuine irregularity like MoE. Benchmark results show up to 1.40x speedups over cuBLAS+NCCL, 1.23x over Triton/FlashInfer on MoE layers, and end-to-end gains of 1.48x over vLLM, making this a concrete look at how compile-time and runtime scheduling can be unified without sacrificing performance.

Sources:
1. Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel - Hongyi Jin, Bohan Hou, Guanjie Wang, Ruihang Lai, Jinqi Chen, Zihao Ye, Yaxing Cai, Yixin Dong, Xinhao Cheng, Zhihao Zhang, Yilong Zhao, Yingyi Huang, Lijie Yang, Jinchen Jiang, Gabriele Oliaro, Jianan Ji, Xupeng Miao, Vinod Grover, Todd C. Mowry, Zhihao Jia, Tianqi Chen, 2026 http://arxiv.org/abs/2604.13327
2. Legion: Expressing Locality and Independence with Logical Regions - Michael Bauer, Sean Treichler, Elliott Slaughter, Alex Aiken, 2012 https://scholar.google.com/scholar?q=Legion%3A+Expressing+Locality+and+Independence+with+Logical+Regions
3. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures - Cédric Augonnet, Samuel Thibault, Raymond Namyst, Pierre-André Wacrenier, 2011 https://scholar.google.com/scholar?q=StarPU%3A+A+Unified+Platform+for+Task+Scheduling+on+Heterogeneous+Multicore+Architectures
4. Dynamic Control Flow in Large-Scale Machine Learning - Yuan Yu, Martín Abadi, Paul Barham, Eugene Brevdo, Mike Burrows, Andy Davis, Jeff Dean, Sanjay Ghemawat, Tim Harley, Peter Hawkins, Mark Hong, Rajat Monga, Derek Murray, Xiaoqiang Zheng, and others (Google Brain), 2018 https://scholar.google.com/scholar?q=Dynamic+Control+Flow+in+Large-Scale+Machine+Learning
5. Ray: A Distributed Framework for Emerging AI Applications - Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica, 2018 https://scholar.google.com/scholar?q=Ray%3A+A+Distributed+Framework+for+Emerging+AI+Applications
6. Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs - Cheng, X., Zhang, Z., Zhou, Y., Ji, J., Jiang, J., Zhao, Z., et al. (overlapping author list with this paper), 2025 https://scholar.google.com/scholar?q=Mirage+Persistent+Kernel%3A+A+Compiler+and+Runtime+for+Mega-Kernelizing+Tensor+Programs
7. Look ma, no bubbles! Designing a low-latency megakernel for Llama-1B - Spector, B., Juravsky, J., Sul, S., Dugan, O., Lim, D., Fu, D., Arora, S., Ré, C., 2025 https://scholar.google.com/scholar?q=Look+ma%2C+no+bubbles%21+Designing+a+low-latency+megakernel+for+Llama-1B
8. A Framework for Fine-Grained Synchronization of Dependent GPU Kernels (CuSync) - Jangda, A., Maleki, S., Dehnavi, M. M., Musuvathi, M., Saarikivi, O., 2024 https://scholar.google.com/scholar?q=A+Framework+for+Fine-Grained+Synchronization+of+Dependent+GPU+Kernels+%28CuSync%29
9. Graphene: An IR for Optimized Tensor Computations on GPUs - Hagedorn, B., Fan, B., Chen, H., Cecka, C., Garland, M., Grover, V., 2023 https://scholar.google.com/scholar?q=Graphene%3A+An+IR+for+Optimized+Tensor+Computations+on+GPUs
10. FlashMoE: Fast Distributed MoE in a Single Kernel - Aimuyo, O. J., Oh, B., Singh, R., 2025 https://scholar.google.com/scholar?q=FlashMoE%3A+Fast+Distributed+MoE+in+a+Single+Kernel
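
The split-K summation example mentioned in this episode can be mimicked on the CPU to show the dependency pattern. This is a loose sketch, not the Event Tensor API: the function `split_k_matmul`, the `pending` counter standing in for a synchronization event, and the sequential loop standing in for parallel GPU tasks are all assumptions made for illustration.

```python
def split_k_matmul(a, b, num_splits):
    """Compute a @ b by splitting the shared K dimension into slices.

    Each slice yields a partial product. A counter plays the role of an
    event: the reduction runs once every partial has signalled completion,
    rather than after some coarser whole-kernel barrier.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    m, k, n = len(a), len(b), len(b[0])
    bounds = [(s * k // num_splits, (s + 1) * k // num_splits)
              for s in range(num_splits)]
    partials = []
    pending = num_splits          # event counter: partials still outstanding
    result = None
    for lo, hi in bounds:         # on a GPU these would run concurrently
        partials.append([[sum(a[i][p] * b[p][j] for p in range(lo, hi))
                          for j in range(n)] for i in range(m)])
        pending -= 1              # this slice signals the event
        if pending == 0:          # last signal triggers the reduction task
            result = [[sum(part[i][j] for part in partials)
                       for j in range(n)] for i in range(m)]
    return result


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 0], [0, 1]]
product = split_k_matmul(a, b, num_splits=2)
print(product)
```

The reduction depends only on its own partials, so in a real megakernel unrelated tiles would not have to wait for it.
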

3.7 out of 5 (3 Ratings)

Creator: mcgrof
Years Active: 2025 - 2026
Episodes: 801
Rating: Clean