This episode explores a theory paper that asks when spectral matrix updates should outperform standard Euclidean gradient methods in deep networks and transformers. It explains how spectral updates replace a gradient matrix with its polar factor, preserving the singular-vector directions while flattening every singular value to one, and argues that this geometry can help when the incoming activations have low stable rank while the gradient spreads its energy across many singular values (a high nuclear-to-spectral-norm ratio). The discussion connects this criterion to the practical excitement around spectral-style optimizers such as Muon, contrasting them with curvature-based methods like K-FAC and Shampoo. Listeners would find it interesting because the episode turns a seemingly niche optimizer trick into a concrete, testable claim about the hidden geometry of neural network training. A short code sketch of these quantities follows the source list below.

Sources:
1. When do spectral gradient updates help in deep learning? — Damek Davis, Dmitriy Drusvyatskiy, 2025. http://arxiv.org/abs/2512.04299
2. Shampoo: Preconditioned Stochastic Tensor Optimization — Vineet Gupta, Tomer Koren, Yoram Singer, 2018. https://scholar.google.com/scholar?q=Shampoo:+Preconditioned+Stochastic+Tensor+Optimization
3. K-FAC: Kronecker-Factored Approximate Curvature for Neural Network Optimization — James Martens, Roger Grosse, 2015. https://scholar.google.com/scholar?q=K-FAC:+Kronecker-Factored+Approximate+Curvature+for+Neural+Network+Optimization
4. Muon: An optimizer for hidden layers in neural networks — Keller Jordan and collaborators, 2024. https://scholar.google.com/scholar?q=Muon:+An+optimizer+for+hidden+layers+in+neural+networks
5. Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation — authors vary by version; commonly cited in transformer-dynamics discussions, 2021. https://scholar.google.com/scholar?q=Deep+Transformers+without+Shortcuts:+Modifying+Self-attention+for+Faithful+Signal+Propagation
6. On the Softmax Bottleneck of Recurrent Language Models — Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen, 2018. https://scholar.google.com/scholar?q=On+the+Softmax+Bottleneck+of+Recurrent+Language+Models
7. Representation Degeneration Problem in Training Natural Language Generation Models — Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, Tie-Yan Liu, 2019. https://scholar.google.com/scholar?q=Representation+Degeneration+Problem+in+Training+Natural+Language+Generation+Models
8. Neural Collapse: A Terminal Phase of Deep Learning Training — Vardan Papyan, X. Y. Han, David L. Donoho, 2020. https://scholar.google.com/scholar?q=Neural+Collapse:+A+Terminal+Phase+of+Deep+Learning+Training
9. Understanding Dimensional Collapse in Contrastive Self-supervised Learning — Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian, 2021. https://scholar.google.com/scholar?q=Understanding+Dimensional+Collapse+in+Contrastive+Self-supervised+Learning
10. The Intrinsic Dimension of Objective Landscapes — Chunyuan Li, Heerad Farkhoor, Rosanne Liu, Jason Yosinski, 2018. https://scholar.google.com/scholar?q=The+Intrinsic+Dimension+of+Objective+Landscapes
11. Random Features for Large-Scale Kernel Machines — Ali Rahimi, Benjamin Recht, 2007. https://scholar.google.com/scholar?q=Random+Features+for+Large-Scale+Kernel+Machines
12. A Random Matrix Perspective on Random Features for Compositional Kernels — Florent Krzakala, Lenka Zdeborová, and collaborators, 2019. https://scholar.google.com/scholar?q=A+Random+Matrix+Perspective+on+Random+Features+for+Compositional+Kernels
13. The Surprising Effectiveness of Random Features for Structured Data — various authors; representative random-features comparison literature, 2010s-2020s. https://scholar.google.com/scholar?q=The+Surprising+Effectiveness+of+Random+Features+for+Structured+Data
14. Spectral Gradient Descent — Yair Carmon, John C. Duchi, Oliver Hinder, Aaron Sidford, 2021. https://scholar.google.com/scholar?q=Spectral+Gradient+Descent
15. A Kronecker-factored approximate Fisher matrix for convolution layers — Roger Grosse, James Martens, 2016. https://scholar.google.com/scholar?q=A+Kronecker-factored+approximate+Fisher+matrix+for+convolution+layers
16. Feature Learning in Infinite-Width Neural Networks — Greg Yang, Edward J. Hu, 2021. https://scholar.google.com/scholar?q=Feature+Learning+in+Infinite-Width+Neural+Networks
17. Neural Collapse: A Review and Synthesis — Vardan Papyan, X. Y. Han, David L. Donoho, 2023. https://scholar.google.com/scholar?q=Neural+Collapse:+A+Review+and+Synthesis
18. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning — Armen Aghajanyan, Sonia Gupta, Luke Zettlemoyer, 2021. https://scholar.google.com/scholar?q=Intrinsic+Dimensionality+Explains+the+Effectiveness+of+Language+Model+Fine-Tuning
19. Understanding transformers for time series: Rank structure, flow-of-ranks, and compressibility — recent transformer interpretability and theory work; authors uncertain. https://scholar.google.com/scholar?q=Understanding+transformers+for+time+series:+Rank+structure,+flow-of-ranks,+and+compressibility
20. Tuning stable rank shrinkage: Aiming at the overlooked structural risk in fine-tuning — recent fine-tuning and representation-learning work; authors uncertain. https://scholar.google.com/scholar?q=Tuning+stable+rank+shrinkage:+Aiming+at+the+overlooked+structural+risk+in+fine-tuning
21. Unraveling the gradient descent dynamics of transformers — recent optimization-theory work; authors uncertain. https://scholar.google.com/scholar?q=Unraveling+the+gradient+descent+dynamics+of+transformers
22. AI Post Transformers: Adam: A Method for Stochastic Optimization — Hal Turing & Dr. Ada Shannon, 2025. https://podcast.do-not-panic.com/episodes/adam-a-method-for-stochastic-optimization/
23. AI Post Transformers: AdamW: Decoupled Weight Decay Regularization for Adaptive Gradient Algorithms — Hal Turing & Dr. Ada Shannon, 2025. https://podcast.do-not-panic.com/episodes/adamw-decoupled-weight-decay-regularization-for-adaptive-gradient-algorithms/
24. AI Post Transformers: In-Context Learning as Implicit Learning Algorithms — Hal Turing & Dr. Ada Shannon, 2025. https://podcast.do-not-panic.com/episodes/in-context-learning-as-implicit-learning-algorithms/

Interactive Visualization: When Spectral Gradient Updates Help Deep Learning
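To make the geometry concrete, here is a minimal NumPy sketch of the quantities discussed above. The function names, the learning rate, and the toy matrices are illustrative choices rather than anything from the paper, and the polar factor is computed with an explicit SVD for clarity; Muon itself approximates it with a Newton-Schulz iteration to avoid the SVD cost.

```python
import numpy as np

def stable_rank(A: np.ndarray) -> float:
    """Stable rank ||A||_F^2 / ||A||_2^2: small when the matrix's
    energy is concentrated in a few singular directions."""
    s = np.linalg.svd(A, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)

def nuclear_effective_rank(A: np.ndarray) -> float:
    """Nuclear-to-spectral ratio ||A||_* / ||A||_2: large when the
    singular values are spread out rather than dominated by one."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(s.sum() / s[0])

def spectral_update(G: np.ndarray, lr: float = 0.02) -> np.ndarray:
    """Spectral step: replace the gradient with its polar factor U @ Vt,
    which keeps the singular-vector directions of G but flattens every
    singular value to 1 (computed here via an explicit SVD)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -lr * (U @ Vt)

# Toy illustration: activations with a few dominant directions versus a
# dense stand-in gradient whose singular values are spread out.
rng = np.random.default_rng(0)
X = rng.normal(size=(512, 8)) @ rng.normal(size=(8, 256))  # low-rank activations
G = rng.normal(size=(256, 128))                            # stand-in gradient

print(f"stable rank of activations X: {stable_rank(X):.2f}")          # at most 8
print(f"effective rank of gradient G: {nuclear_effective_rank(G):.2f}")
dW = spectral_update(G)
s = np.linalg.svd(dW, compute_uv=False)
print(f"singular values of the update all equal lr: {np.allclose(s, 0.02)}")
```

The toy matrices mimic the regime the episode singles out: activations whose stable rank is small next to a gradient whose singular values are spread, which is where flattening the spectrum changes the update geometry the most.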
Information
- Show: AI Post Transformers
- Frequency: Updated Daily
- Published: April 8, 2026 at 12:00 AM UTC
- Rating: Clean
