18 June

When Quantization Hurts Reasoning Models

This episode explores how quantization affects reasoning models, asking how far weights, activations, and KV caches can be compressed before multi-step reasoning starts to fail. It explains the main quantization strategies in practical serving terms: weight-only methods such as AWQ and GPTQ, weight-activation schemes such as W8A8 and W4A4, and KV cache compression for long decoding traces. The discussion argues that reasoning models are unusually fragile because small numerical errors can compound across long solution paths, which makes calibration quality and benchmark choice far more important than they are for ordinary chat models. The episode takes a concrete look at the tradeoff between cheaper inference and reliable reasoning, grounded in evaluations across model families from 1.5B to 70B parameters and difficult benchmarks in math, science, and code.

Sources:

1. Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models. Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, Lu Hou, 2025. http://arxiv.org/abs/2504.04823
2. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han, 2023. https://scholar.google.com/scholar?q=SmoothQuant:+Accurate+and+Efficient+Post-Training+Quantization+for+Large+Language+Models
3. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han, 2024. https://scholar.google.com/scholar?q=AWQ:+Activation-aware+Weight+Quantization+for+LLM+Compression+and+Acceleration
4. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami, 2024. https://scholar.google.com/scholar?q=KVQuant:+Towards+10+Million+Context+Length+LLM+Inference+with+KV+Cache+Quantization
5. Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization Degradation for Mathematical Reasoning. Zhen Li, Yupeng Su, Runming Yang, Congkai Xie, Zheng Wang, Zhongwei Xie, Ngai Wong, Hongxia Yang, 2025. https://scholar.google.com/scholar?q=Quantization+Meets+Reasoning:+Exploring+LLM+Low-Bit+Quantization+Degradation+for+Mathematical+Reasoning
6. Evaluating Quantized Large Language Models. Shiyao Li et al., 2024. https://scholar.google.com/scholar?q=Evaluating+Quantized+Large+Language+Models
7. FlatQuant: Flatness Matters for LLM Quantization. Yuxuan Sun et al., 2024. https://scholar.google.com/scholar?q=FlatQuant:+Flatness+Matters+for+LLM+Quantization
8. s1: Simple Test-Time Scaling. Niklas Muennighoff et al., 2025. https://scholar.google.com/scholar?q=s1:+Simple+Test-Time+Scaling
9. What Makes Low-Bit Quantization-Aware Training Work for Reasoning LLMs? A Systematic Study. Keyu Lv et al., 2026. https://arxiv.org/abs/2601.14888
10. Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation. Richard J. Young, 2026. https://arxiv.org/abs/2603.20172
11. On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models. Sree Harsha Tanneru et al., 2024. https://arxiv.org/abs/2406.10625
12. Activation Outliers in Transformer Quantization: Reproduction, Statistical Analysis, and Deployment Tradeoffs. Pranav Kumar Kaliaperumal, 2026. https://arxiv.org/abs/2603.04308
13. KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems. Hancheng Ye et al., 2025. https://arxiv.org/abs/2510.12872
14. AI Post Transformers: Affordable Large-Scale Decoding Through Model-System Co-Design. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-19-affordable-large-scale-decoding-through-e1d7ed.mp3
15. AI Post Transformers: Mooncake for KV Cache-Centric LLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-06-05-mooncake-for-kv-cache-centric-llm-servin-1086d0.mp3
16. AI Post Transformers: IndexMem: Learned KV-Cache Eviction for Long-Context LLMs. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-06-12-indexmem-learned-kv-cache-eviction-for-l-132c2a.mp3
17. AI Post Transformers: Speculative Decoding in Real vLLM Serving. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
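
For readers who want a feel for the bit widths named in the episode description (the "8" and "4" in W8A8 and W4A4), here is a minimal sketch of symmetric round-to-nearest quantization. It is an illustration under stated assumptions, not the method of any paper listed above: the single per-tensor scale, the random Gaussian values standing in for a weight row, and the helper names `quantize_dequantize` and `mean_abs_error` are all hypothetical choices made for this example. Methods such as AWQ and GPTQ refine this kind of baseline using calibration data.

```python
import random


def quantize_dequantize(values, bits):
    """Symmetric round-to-nearest quantization with one scale per tensor.

    Illustrative baseline only; calibration-based methods do more than this.
    """
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for v in values:
        q = round(v / scale)                # nearest integer level
        q = max(-qmax, min(qmax, q))        # clamp to the representable range
        out.append(q * scale)               # map back to floating point
    return out


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
# Assumption: Gaussian noise as a stand-in for one row of model weights.
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

err8 = mean_abs_error(weights, quantize_dequantize(weights, 8))
err4 = mean_abs_error(weights, quantize_dequantize(weights, 4))
print(f"8-bit mean abs error: {err8:.5f}")
print(f"4-bit mean abs error: {err4:.5f}")
```

On this toy input the 4-bit reconstruction error is noticeably larger than the 8-bit one, because the same value range is covered by far fewer integer levels. The sketch says nothing about how such errors affect a real model's reasoning accuracy; that is the question the episode's sources evaluate.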

Show: AI Post Transformers
Frequency: Updated daily
Published: 18 June 2026 at 00:00 UTC
Rating: Clean
