This episode explores JetSpec, a 2026 inference paper on speculative decoding that asks whether a language model can draft an entire tree of future tokens in parallel while preserving causal consistency and actually reducing latency on long generations. It explains why autoregressive decoding remains a serving bottleneck for long proofs, code completions, and assistant replies, even when the underlying transformer model is unchanged. The discussion compares JetSpec’s approach with Medusa, EAGLE-3, and DFlash, focusing on the central tradeoff between stronger path-conditioned drafts that are slow to produce and cheaper parallel drafts that risk internally inconsistent branches. It turns a practical systems problem, namely why powerful GPUs still feel slow at inference time, into a concrete debate about the next generation of real-world decoding optimizations.

Sources:
1. JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting. Lanxiang Hu, Zhaoxiang Feng, Yulun Wu, Haoran Yuan, Yujie Zhao, Yu-Yang Qian, Bojun Wang, Peng Zhao, Daxin Jiang, Yibo Zhu, Tajana Rosing, Hao Zhang, 2026. http://arxiv.org/abs/2606.18394
2. Fast Inference from Transformers via Speculative Decoding. Yaniv Leviathan, Matan Kalman, Yossi Matias, 2023. https://scholar.google.com/scholar?q=Fast+Inference+from+Transformers+via+Speculative+Decoding
3. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao, 2024. https://scholar.google.com/scholar?q=Medusa:+Simple+LLM+Inference+Acceleration+Framework+with+Multiple+Decoding+Heads
4. EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2024. https://scholar.google.com/scholar?q=EAGLE-2:+Faster+Inference+of+Language+Models+with+Dynamic+Draft+Trees
5. DFlash: Block Diffusion for Flash Speculative Decoding. Jian Chen, Yesheng Liang, Zhijian Liu, 2026. https://scholar.google.com/scholar?q=DFlash:+Block+Diffusion+for+Flash+Speculative+Decoding
6. EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang, 2025. https://scholar.google.com/scholar?q=EAGLE-3:+Scaling+up+Inference+Acceleration+of+Large+Language+Models+via+Training-Time+Test
7. SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification. Xupeng Miao et al., 2023. https://scholar.google.com/scholar?q=SpecInfer:+Accelerating+Generative+Large+Language+Model+Serving+with+Tree-based+Speculative+Inference+and+Verification
8. DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding. Jiebin Zhang et al., 2026. https://scholar.google.com/scholar?q=DFlare:+Scaling+Up+Draft+Capacity+for+Block+Diffusion+Speculative+Decoding
9. TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification. Haoyun Jiang et al., 2026. https://scholar.google.com/scholar?q=TriSpec:+Ternary+Speculative+Decoding+via+Lightweight+Proxy+Verification
10. ParallelSpec: Parallel Drafter for Efficient Speculative Decoding. Zilin Xiao et al., 2024. https://arxiv.org/abs/2410.05589
11. Mamba Drafters for Speculative Decoding. Daewon Choi et al., 2025. https://arxiv.org/abs/2506.01206
12. OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding. Ramchalam Kinattinkara Ramakrishnan et al., 2025. https://arxiv.org/abs/2507.02659
13. Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge. Bin Xiao et al., 2024. https://arxiv.org/abs/2405.00263
14. Make Every Draft Count: Hidden State based Speculative Decoding. Yuetao Chen et al., 2026. https://arxiv.org/abs/2602.21224
15. When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding? Tianyu Liu et al., 2026. https://arxiv.org/abs/2604.26412
16. MoE-Spec: Expert Budgeting for Efficient Speculative Decoding. Bradley McDanel et al., 2026. https://arxiv.org/abs/2602.16052
17. Beat the long tail: Distribution-Aware Speculative Decoding for RL Training. Zelei Shao et al., 2025. https://arxiv.org/abs/2511.13841
18. AI Post Transformers: Speculative Decoding in Real vLLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
19. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
20. AI Post Transformers: Serving MoE Models with Disaggregated Expert Parallelism. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-19-serving-moe-models-with-disaggregated-ex-6979d2.mp3

Interactive Visualization: JETSPEC and Parallel Tree Speculative Decoding
Information
- Show
- Frequency: Updated Daily
- Published: June 28, 2026 at 12:00 AM UTC
- Rating: Clean
