AI Post Transformers

RMSNorm: Simplifying Layer Normalization for Sequence Models

This episode explores the 2019 RMSNorm paper, which asks whether LayerNorm’s mean-subtraction step is actually necessary or whether controlling activation scale is the part that really stabilizes training. It explains how RMSNorm keeps LayerNorm’s rescaling behavior while dropping explicit centering, and how the paper’s pRMSNorm variant estimates the normalization term from only a small subset of features to reduce cost further. The discussion covers experiments in machine translation, image classification, image-caption retrieval, and question answering, where model quality stayed roughly comparable while reported runtime improved, with smaller gains in transformers and much larger ones in older RNN-based systems. Listeners may find it interesting because it turns a seemingly minor mathematical tweak into a broader argument about efficiency, optimization stability, and how much claimed speedups depend on the era and quality of the baseline implementation.

Sources:
1. Root Mean Square Layer Normalization. Biao Zhang, Rico Sennrich, 2019. http://arxiv.org/abs/1910.07467
2. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Sergey Ioffe, Christian Szegedy, 2015. https://scholar.google.com/scholar?q=Batch+Normalization:+Accelerating+Deep+Network+Training+by+Reducing+Internal+Covariate+Shift
3. Layer Normalization. Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton, 2016. https://scholar.google.com/scholar?q=Layer+Normalization
4. Root Mean Square Layer Normalization. Biao Zhang, Rico Sennrich, 2019. https://scholar.google.com/scholar?q=Root+Mean+Square+Layer+Normalization
5. On Layer Normalization in the Transformer Architecture. Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu, 2020. https://scholar.google.com/scholar?q=On+Layer+Normalization+in+the+Transformer+Architecture
6. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. Tim Salimans, Diederik P. Kingma, 2016. https://scholar.google.com/scholar?q=Weight+Normalization:+A+Simple+Reparameterization+to+Accelerate+Training+of+Deep+Neural+Networks
7. How Does Batch Normalization Help Optimization? Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry, 2018. https://scholar.google.com/scholar?q=How+Does+Batch+Normalization+Help+Optimization?
8. Understanding Batch Normalization. Nils Bjorck, Carla P. Gomes, Bart Selman, Kilian Q. Weinberger, 2018. https://scholar.google.com/scholar?q=Understanding+Batch+Normalization
9. Norm Matters: Efficient and Accurate Normalization Schemes in Deep Networks. Elad Hoffer, Ron Banner, Itay Golan, Daniel Soudry, 2018. https://scholar.google.com/scholar?q=Norm+Matters:+Efficient+and+Accurate+Normalization+Schemes+in+Deep+Networks
10. Group Normalization. Yuxin Wu, Kaiming He, 2018. https://scholar.google.com/scholar?q=Group+Normalization
11. Residual Learning Without Normalization via Better Initialization. Hongyi Zhang, Yann N. Dauphin, Tengyu Ma, 2019. https://scholar.google.com/scholar?q=Residual+Learning+Without+Normalization+via+Better+Initialization
12. Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning. Bingchen Zhao et al., 2023. https://scholar.google.com/scholar?q=Tuning+LayerNorm+in+Attention:+Towards+Efficient+Multi-Modal+LLM+Finetuning
13. LayerNorm: A key component in parameter-efficient fine-tuning. Taha ValizadehAslani and Hualou Liang, 2024. https://scholar.google.com/scholar?q=LayerNorm:+A+key+component+in+parameter-efficient+fine-tuning
14. Efficiency in Focus: LayerNorm as a Catalyst for Fine-tuning Medical Visual Language Pre-trained Models. Jiawei Chen et al., 2024. https://scholar.google.com/scholar?q=Efficiency+in+Focus:+LayerNorm+as+a+Catalyst+for+Fine-tuning+Medical+Visual+Language+Pre-trained+Models
15. The Curse of Depth in Large Language Models. Wenfang Sun et al., 2025. https://scholar.google.com/scholar?q=The+Curse+of+Depth+in+Large+Language+Models
16. Just One Layer Norm Guarantees Stable Extrapolation. Juliusz Ziomek, George Whittle, Michael A. Osborne, 2025. https://scholar.google.com/scholar?q=Just+One+Layer+Norm+Guarantees+Stable+Extrapolation
17. Normalization Layer Per-Example Gradients are Sufficient to Predict Gradient Noise Scale in Transformers. Gavia Gray et al., 2024. https://scholar.google.com/scholar?q=Normalization+Layer+Per-Example+Gradients+are+Sufficient+to+Predict+Gradient+Noise+Scale+in+Transformers
18. AI Post Transformers: Keel: Post-LayerNorm Is Back: Stable, ExpressivE, and Deep. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/keel-post-layernorm-is-back-stable-expressive-and-deep/
19. AI Post Transformers: Deep Kernel Fusion for Transformer Decoding. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-05-15-deep-kernel-fusion-for-transformer-decod-b1a703.mp3
20. AI Post Transformers: Long Short-Term Memory and Vanishing Gradients. Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-19-long-short-term-memory-and-vanishing-gra-72448c.mp3
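
For readers who want to see the idea from the episode summary in concrete form, here is a minimal pure-Python sketch of the three operations it contrasts: LayerNorm-style centering and rescaling, RMSNorm-style rescaling only, and a pRMSNorm-style estimate taken from a leading fraction of the features. It is an illustration under stated assumptions, not the paper's reference implementation: the function names, the `eps` constant and its placement inside the square root, the default `p=0.25`, and the omission of LayerNorm's learned gain and bias are all choices made here for brevity.

```python
import math


def layer_norm(x, eps=1e-8):
    """LayerNorm-style normalization (no learned gain/bias, for brevity):
    subtract the mean, then divide by the standard deviation."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, gain=None, eps=1e-8):
    """RMSNorm-style normalization: divide by the root mean square.
    There is no mean subtraction, so only the scale is controlled."""
    n = len(x)
    rms = math.sqrt(sum(v * v for v in x) / n + eps)
    if gain is None:
        gain = [1.0] * n  # assumed default: identity gain
    return [g * v / rms for g, v in zip(gain, x)]


def partial_rms_norm(x, p=0.25, gain=None, eps=1e-8):
    """pRMSNorm-style normalization: estimate the RMS from only the first
    ceil(p * n) features, then rescale every feature with that estimate."""
    n = len(x)
    k = max(1, math.ceil(p * n))
    rms = math.sqrt(sum(v * v for v in x[:k]) / k + eps)
    if gain is None:
        gain = [1.0] * n  # assumed default: identity gain
    return [g * v / rms for g, v in zip(gain, x)]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    print("LayerNorm:", layer_norm(x))
    print("RMSNorm:  ", rms_norm(x))
    print("pRMSNorm: ", partial_rms_norm(x, p=0.5))
```

Two properties are easy to check with this sketch: rescaling the input by a constant leaves the `rms_norm` output essentially unchanged, and on an input whose mean is already zero, `rms_norm` and `layer_norm` agree.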