The provided documents describe the development and evolution of EAGLE, a high-efficiency framework designed to accelerate Large Language Model (LLM) inference through speculative sampling. By performing autoregression at the feature level rather than the token level and incorporating shifted token sequences to manage sampling uncertainty, the original EAGLE achieves significant speedups while provably preserving the exact output distribution of the target model. The technology has progressed into EAGLE-2, which introduces dynamic draft trees, and EAGLE-3, which further improves performance by fusing multi-layer features and removing the feature-regression constraint during training. These advances yield speedups of up to 6.5x in latency and roughly double throughput, and the methods are compatible with modern reasoning models and popular serving frameworks such as vLLM and SGLang. Overall, the sources highlight a shift toward test-time scaling and more expressive draft models as a way to overcome the inherently slow, sequential nature of autoregressive text generation.
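The lossless guarantee mentioned above comes from the standard speculative-sampling acceptance rule (used by EAGLE and its successors): each drafted token is accepted with probability min(1, p/q), where p and q are the target and draft model probabilities; on rejection, a replacement token is sampled from the normalized residual max(0, p - q). A minimal sketch of that verification step, with toy dictionaries standing in for real model distributions (all names here are illustrative, not from the papers):

```python
import random

def speculative_accept(draft_tokens, q_probs, p_probs, rng=random.random):
    """Standard speculative-sampling accept/reject loop (illustrative sketch).

    draft_tokens -- tokens proposed by the draft model, in order
    q_probs[i][t] -- draft-model probability of token t at step i
    p_probs[i][t] -- target-model probability of token t at step i
    Returns the accepted prefix; on the first rejection it appends one
    token resampled from the residual distribution and stops. This rule
    makes the output exactly follow the target model's distribution.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i][tok], q_probs[i][tok]
        if rng() < min(1.0, p / q):  # accept with probability min(1, p/q)
            accepted.append(tok)
        else:
            # On rejection: resample from normalized max(0, p - q).
            residual = {t: max(0.0, p_probs[i][t] - q_probs[i][t])
                        for t in p_probs[i]}
            z = sum(residual.values())
            r, acc = rng() * z, 0.0
            for t, w in residual.items():
                acc += w
                if r <= acc:
                    accepted.append(t)
                    break
            break
    return accepted
```

If draft and target agree exactly (q equals p), min(1, p/q) is 1 and every drafted token is accepted, which is why a well-trained draft head translates directly into fewer target-model forward passes.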
Sources:
1) EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang. Peking University, Microsoft Research, University of Waterloo, Vector Institute. January 26, 2024. https://arxiv.org/pdf/2401.15077
2) EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang. Peking University, Microsoft Research, University of Waterloo, Vector Institute. November 12, 2024. https://aclanthology.org/2024.emnlp-main.422.pdf
3) EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang. Peking University, Microsoft Research, University of Waterloo, Vector Institute. April 23, 2025. https://arxiv.org/pdf/2503.01840
4) An Introduction to Speculative Decoding for Reducing Latency in AI Inference. Jamie Li, Chenhan Yu, Hao Guo. NVIDIA. September 17, 2025. https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/