The Gist Talk

Offloading LLM Attention: Q-Shipping and KV-Side Compute

The source provides an extensive overview of strategies, collectively termed Q-shipping and KV-side compute, aimed at overcoming the memory-bandwidth bottleneck of Large Language Model (LLM) inference, particularly in the decode phase.
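The intuition behind Q-shipping can be made concrete with a back-of-the-envelope sketch. During decode, each new token's small query vector must attend over the entire KV cache, so moving the cache to the query costs far more bandwidth than shipping the query to the cache and returning only the attention output. The sizes and names below are illustrative assumptions, not figures from the source:

```python
import numpy as np

# Hypothetical single-head decode-step sizes (illustrative assumptions).
n_ctx, d_head = 4096, 128        # cached tokens, head dimension
dtype_bytes = 2                  # e.g. fp16 storage

def attend(q, K, V):
    """One query vector attending over the full KV cache (single head)."""
    scores = (K @ q) / np.sqrt(d_head)   # (n_ctx,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                         # (d_head,)

rng = np.random.default_rng(0)
q = rng.standard_normal(d_head)
K = rng.standard_normal((n_ctx, d_head))
V = rng.standard_normal((n_ctx, d_head))
out = attend(q, K, V)

# Bytes moved per decode step, per head:
kv_traffic = 2 * n_ctx * d_head * dtype_bytes  # pull K and V to the query
q_ship = 2 * d_head * dtype_bytes              # ship q out, get output back
print(kv_traffic // q_ship)                    # ratio grows with context length
```

The ratio equals the context length here, which is why computing attention where the KV cache already resides (KV-side compute) and moving only Q and the output can sidestep the bandwidth wall.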