This episode explores a paper claiming that reinforcement-learning post-training can produce large math-reasoning gains in 7B–8B instruction-tuned models while updating as few as 13 parameters through a TinyLoRA setup. The discussion explains how this differs from standard LoRA and full fine-tuning, why the result matters for ideas like intrinsic dimension, and why it may suggest RL is steering latent capabilities already present in pretrained models rather than teaching entirely new knowledge. It also contrasts supervised fine-tuning with RL with verifiable rewards, arguing that on benchmarks like GSM8K, AIME, AMC, and MATH500, RL may improve behaviors like search, persistence, and token allocation. Listeners would find it interesting because it probes whether headline-grabbing “reasoning” gains are genuine evidence of new reasoning ability or a surprisingly cheap way to better elicit and control capabilities models already have.

Sources:

1. Learning to Reason in 13 Parameters — John X. Morris, Niloofar Mireshghallah, Mark Ibrahim, Saeed Mahloujifar, 2026. http://arxiv.org/abs/2602.04118
2. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou, 2022. https://scholar.google.com/scholar?q=Chain-of-Thought+Prompting+Elicits+Reasoning+in+Large+Language+Models
3. STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning — Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah Goodman, Percy Liang, 2022. https://scholar.google.com/scholar?q=STaR:+Self-Taught+Reasoner+Bootstrapping+Reasoning+With+Reasoning
4. Let’s Verify Step by Step — Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe, 2023. https://scholar.google.com/scholar?q=Let’s+Verify+Step+by+Step
5. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek-AI authors, 2025. https://scholar.google.com/scholar?q=DeepSeek-R1:+Incentivizing+Reasoning+Capability+in+LLMs+via+Reinforcement+Learning
6. LoRA: Low-Rank Adaptation of Large Language Models — Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, et al., 2021. https://scholar.google.com/scholar?q=LoRA:+Low-Rank+Adaptation+of+Large+Language+Models
7. LoRA-XS — Bałazy et al., 2025. https://scholar.google.com/scholar?q=LoRA-XS
8. The Intrinsic Dimension of Objective Landscapes — Chunyuan Li, Heerad Farkhoor, Rosanne Liu, Jason Yosinski, 2018. https://scholar.google.com/scholar?q=The+Intrinsic+Dimension+of+Objective+Landscapes
9. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning — Armen Aghajanyan, Luke Zettlemoyer, Sonal Gupta, 2020. https://scholar.google.com/scholar?q=Intrinsic+Dimensionality+Explains+the+Effectiveness+of+Language+Model+Fine-Tuning
10. VeRA — Kopiczko et al., 2023. https://scholar.google.com/scholar?q=VeRA
11. VB-LoRA — Li et al., 2024. https://scholar.google.com/scholar?q=VB-LoRA
12. AdaLoRA — Qingru Zhang, Minshuo Chen, Alexander Bukharin, et al., 2023. https://scholar.google.com/scholar?q=AdaLoRA
13. Prompt Tuning — Brian Lester, Rami Al-Rfou, Noah Constant, 2021. https://scholar.google.com/scholar?q=Prompt+Tuning
14. Prefix-Tuning: Optimizing Continuous Prompts for Generation — Xiang Lisa Li, Percy Liang, 2021. https://scholar.google.com/scholar?q=Prefix-Tuning:+Optimizing+Continuous+Prompts+for+Generation
15. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models — Elad Ben Zaken, Yoav Goldberg, Shauli Ravfogel, 2022. https://scholar.google.com/scholar?q=BitFit:+Simple+Parameter-efficient+Fine-tuning+for+Transformer-based+Masked+Language-models
16. OpenAI o1 / Learning to Reason with Reinforcement Learning — OpenAI et al., 2024. https://scholar.google.com/scholar?q=OpenAI+o1+/+Learning+to+Reason+with+Reinforcement+Learning
17. DeepSeek-R1 / Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — Shao et al., 2024. https://scholar.google.com/scholar?q=DeepSeek-R1+/+Incentivizing+Reasoning+Capability+in+LLMs+via+Reinforcement+Learning
18. One Example Is Enough: Learning to Reason from Single Demonstrations with RL — Wang et al., 2025. https://scholar.google.com/scholar?q=One+Example+Is+Enough:+Learning+to+Reason+from+Single+Demonstrations+with+RL
19. A Thousand Examples Are Enough: Data-efficient SFT for Reasoning — Ye et al., 2025. https://scholar.google.com/scholar?q=A+Thousand+Examples+Are+Enough:+Data-efficient+SFT+for+Reasoning
20. DoRA / Weight-Decomposed Low-Rank Adaptation — Liu et al., 2024. https://scholar.google.com/scholar?q=DoRA+/+Weight-Decomposed+Low-Rank+Adaptation
21. Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning — author list not confirmed, 2025–2026. https://scholar.google.com/scholar?q=Beyond+Two-Stage+Training+/+Beyond+two-stage+training:+Cooperative+SFT+and+RL+for+LLM+reasoning
22. Beyond Outcome Verification: Verifiable Process Reward Models for Structured Reasoning — author list not confirmed, 2025–2026. https://scholar.google.com/scholar?q=Beyond+Outcome+Verification:+Verifiable+Process+Reward+Models+for+Structured+Reasoning
23. RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents — author list not confirmed, 2025–2026. https://scholar.google.com/scholar?q=RLVMR:+Reinforcement+Learning+with+Verifiable+Meta-Reasoning+Rewards+for+Robust+Long-Horizon+Agents
24. X-LoRA: Mixture of Low-Rank Adapter Experts, a Flexible Framework for Large Language Models with Applications in Protein Mechanics and Molecular Design — author list not confirmed, 2024–2025. https://scholar.google.com/scholar?q=X-LoRA:+Mixture+of+Low-Rank+Adapter+Experts,+a+Flexible+Framework+for+Large+Language+Models+with+Applications+in+Protein+Mechanics+and+Molecular+Design
25. Task-Aware LoRA Adapter Composition via Similarity Retrieval in Vector Databases — author list not confirmed, 2025–2026. https://scholar.google.com/scholar?q=Task-Aware+LoRA+Adapter+Composition+via+Similarity+Retrieval+in+Vector+Databases
26. AI Post Transformers: NeurIPS 2025: Reinforcement Learning for Reasoning in Large Language Models with One Training Example — Hal Turing & Dr. Ada Shannon, 2025. https://podcast.do-not-panic.com/episodes/neurips-2025-reinforcement-learning-for-reasoning-in-large-language-models-with/
27. AI Post Transformers: Doc-to-LoRA: Internalizing Context as LoRA — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-03-29-doc-to-lora-internalizing-context-as-lor-8dd5ec.mp3
28. AI Post Transformers: In-Place Test-Time Training for Transformers — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-09-in-place-test-time-training-for-transfor-d0b976.mp3
29. AI Post Transformers: MEMSEARCHER: Reinforcement Learning for LLM Memory Management — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-04-memsearcher-reinforcement-learning-for-l-e9ad84.mp3
30. AI Post Transformers: Simple Self-Distillation for Better Code Generation — Hal Turing & Dr. Ada Shannon, 2026. https://podcast.do-not-panic.com/episodes/2026-04-02-simple-self-distillation-for-better-code-cc88e0.mp3

Interactive Visualization: Learning to Reason with 13 Parameters
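For listeners wondering how a 13-parameter update is even mechanically possible, here is a minimal sketch, not the paper's actual code: in the spirit of LoRA-XS and the intrinsic-dimension papers above, a single 13-dimensional trainable vector is expanded through frozen random projections into a low-rank weight update on a frozen base layer. The class name `TinyLoRALinear`, the rank, the seed, and the projection scheme are all illustrative assumptions.

```python
import torch

# Hypothetical "TinyLoRA"-style layer (illustrative, not the paper's method).
# The pretrained weight and all random projections are frozen; only theta,
# a d-dimensional vector (d=13), is trained. theta is expanded through a
# fixed random basis into a rank x rank core between frozen LoRA factors.
class TinyLoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, rank: int = 4, d: int = 13):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained layer
        out_f, in_f = base.weight.shape
        g = torch.Generator().manual_seed(0)
        # Frozen random LoRA factors: A is (rank, in), B is (out, rank).
        self.register_buffer("A", torch.randn(rank, in_f, generator=g) / in_f ** 0.5)
        self.register_buffer("B", torch.randn(out_f, rank, generator=g) / rank ** 0.5)
        # Frozen random basis mapping theta (d scalars) to the rank x rank core.
        self.register_buffer("P", torch.randn(d, rank * rank, generator=g) / d ** 0.5)
        # The ONLY trainable parameters: d scalars (13 by default).
        self.theta = torch.nn.Parameter(torch.zeros(d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rank = self.A.shape[0]
        core = (self.theta @ self.P).view(rank, rank)  # 13 params -> rank x rank
        delta = self.B @ core @ self.A                 # low-rank weight update
        return self.base(x) + x @ delta.T

layer = TinyLoRALinear(torch.nn.Linear(64, 64))
n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(n_trainable)  # -> 13
```

Because `theta` starts at zero, the layer initially matches the frozen base exactly; an RL objective then has only a 13-dimensional search space, which is the sense in which such a setup probes the intrinsic dimension of the adaptation task.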