This episode explores IMO-Bench, a new benchmark suite designed to test whether AI systems can perform genuinely robust mathematical reasoning at Olympiad difficulty rather than merely produce correct final answers. It breaks the benchmark down into three distinct tasks: short-answer problem solving, full proof generation, and automatic proof grading. The episode argues that this decomposition captures real mathematical competence better than answer-centric evaluations such as GSM8K or MATH, which may now be saturated or overly teachable. The discussion highlights why IMO-style problems are especially revealing: they require discovering invariants, constructions, and contradiction arguments that resist routine pattern matching, and they expose whether models can sustain long-horizon reasoning and self-correction. Listeners will find the episode interesting because it tackles a central question in AI evaluation, namely whether current benchmarks measure true reasoning or only benchmark-specific performance, and because it examines the promise and risks of using model-based autograders to scale proof assessment.

Sources:

1. Towards Robust Mathematical Reasoning — Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Clara Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu H. Trinh, Quoc V. Le, Junehyuk Jung, 2025. http://arxiv.org/abs/2511.01846
2. Training Verifiers to Solve Math Word Problems (GSM8K) — Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman, 2021. https://scholar.google.com/scholar?q=Training+Verifiers+to+Solve+Math+Word+Problems
3. Measuring Mathematical Problem Solving With the MATH Dataset — Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt, 2021. https://scholar.google.com/scholar?q=Measuring+Mathematical+Problem+Solving+With+the+MATH+Dataset
4. Solving Quantitative Reasoning Problems with Language Models — Aitor Lewkowycz and collaborators at Google Research, 2022. https://scholar.google.com/scholar?q=Solving+Quantitative+Reasoning+Problems+with+Language+Models
5. FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI — Elliot Glazer and collaborators, 2024. https://scholar.google.com/scholar?q=FrontierMath:+A+Benchmark+for+Evaluating+Advanced+Mathematical+Reasoning+in+AI
6. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models — Aarohi Srivastava et al. (BIG-bench collaboration), 2022. https://scholar.google.com/scholar?q=Beyond+the+Imitation+Game:+Quantifying+and+Extrapolating+the+Capabilities+of+Language+Models
7. Holistic Evaluation of Language Models — Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, and collaborators, 2022. https://scholar.google.com/scholar?q=Holistic+Evaluation+of+Language+Models
8. Dynabench: Rethinking Benchmarking in NLP — Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, and collaborators, 2021. https://scholar.google.com/scholar?q=Dynabench:+Rethinking+Benchmarking+in+NLP
9. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, et al., 2023. https://scholar.google.com/scholar?q=Judging+LLM-as-a-Judge+with+MT-Bench+and+Chatbot+Arena
10. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment — Yang Liu, Dan Iter, et al., 2023. https://scholar.google.com/scholar?q=G-Eval:+NLG+Evaluation+using+GPT-4+with+Better+Human+Alignment
11. Automatic Evaluation of Mathematical Proofs in Natural Language: A Survey — survey authors in educational technology and AI, 2020–2024. https://scholar.google.com/scholar?q=Automatic+Evaluation+of+Mathematical+Proofs+in+Natural+Language:+A+Survey
12. Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs — authors in neural theorem proving and autoformalization, 2022–2024. https://scholar.google.com/scholar?q=Draft,+Sketch,+and+Prove:+Guiding+Formal+Theorem+Provers+with+Informal+Proofs
13. Solving Olympiad Geometry without Human Demonstrations — Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, et al., 2024. https://scholar.google.com/scholar?q=Solving+Olympiad+Geometry+without+Human+Demonstrations
14. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models — Kaiyu Yang, Aidan Swope, et al., 2023. https://scholar.google.com/scholar?q=LeanDojo:+Theorem+Proving+with+Retrieval-Augmented+Language+Models
15. Humanity's Last Exam — Phan et al., 2025. https://scholar.google.com/scholar?q=Humanity's+Last+Exam
16. Gemini Deep Think at IMO 2025 — Luong and Lockhart, 2025. https://scholar.google.com/scholar?q=Gemini+Deep+Think+at+IMO+2025
17. Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination — authors unclear from snippet, c. 2025. https://scholar.google.com/scholar?q=Reasoning+or+Memorization?+Unreliable+Results+of+Reinforcement+Learning+Due+to+Data+Contamination
18. Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning — authors unclear from snippet, c. 2025. https://scholar.google.com/scholar?q=Right+Is+Not+Enough:+The+Pitfalls+of+Outcome+Supervision+in+Training+LLMs+for+Math+Reasoning
19. Improve Mathematical Reasoning in Language Models by Automated Process Supervision — authors unclear from snippet, c. 2025. https://scholar.google.com/scholar?q=Improve+Mathematical+Reasoning+in+Language+Models+by+Automated+Process+Supervision
20. MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision — authors unclear from snippet, c. 2025. https://scholar.google.com/scholar?q=MM-PRM:+Enhancing+Multimodal+Mathematical+Reasoning+with+Scalable+Step-Level+Supervision
21. Solving Inequality Proofs with Large Language Models — authors unclear from snippet, c. 2025. https://scholar.google.com/scholar?q=Solving+Inequality+Proofs+with+Large+Language+Models
22. Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning — authors unclear from snippet, c. 2025. https://scholar.google.com/scholar?q=Beyond+Gold+Standards:+Epistemic+Ensemble+of+LLM+Judges+for+Formal+Mathematical+Reasoning
23. A Survey on Deep Learning for Theorem Proving — survey authors unclear from snippet, recent. https://scholar.google.com/scholar?q=A+Survey+on+Deep+Learning+for+Theorem+Proving
24. Proving Theorems Recursively — authors unclear from snippet, c. 2025. https://scholar.google.com/scholar?q=Proving+Theorems+Recursively
25. DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning — authors unclear from snippet, c. 2025. https://scholar.google.com/scholar?q=DICE:+Detecting+In-distribution+Contamination+in+LLM's+Fine-tuning+Phase+for+Math+Reasoning
26. AI Post Transformers: Schoenfeld Theory Applied to Large Reasoning Models — Hal Turing & Dr. Ada Shannon, Sat. https://podcast.do-not-panic.com/episodes/schoenfeld-theory-applied-to-large-reasoning-models/
27. AI Post Transformers: LLM Benchmark Robustness to Linguistic Variation — Hal Turing & Dr. Ada Shannon, Tue. https://podcast.do-not-panic.com/episodes/llm-benchmark-robustness-to-linguistic-variation/
28. AI Post Transformers: Generalist Reward Modeling with Inference-Time Scaling — Hal Turing & Dr. Ada Shannon, Tue. https://podcast.do-not-panic.com/episodes/generalist-reward-modeling-with-inference-time-scaling/
29. AI Post Transformers: Evolving Language Models Without Labels: EVOL-RL — Hal Turing & Dr. Ada Shannon, Fri. https://podcast.do-not-panic.com/episodes/evolving-language-models-without-labels-evol-rl/

Interactive Visualization: IMO-Bench for Robust Mathematical Reasoning
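The gap the episode dwells on, between answer-centric scoring and rubric-based proof grading, can be sketched in a few lines of Python. This is a minimal illustration, not IMO-Bench's actual interface: the function names, rubric keys, and weights below are all hypothetical.

```python
# Hypothetical sketch contrasting two evaluation styles discussed in the
# episode. Neither function reflects IMO-Bench's real API.

def check_answer(predicted: str, reference: str) -> bool:
    """Answer-centric scoring: normalize and compare final answers only.
    A flawed proof that lands on the right number still passes."""
    norm = lambda s: s.strip().lower().replace(" ", "")
    return norm(predicted) == norm(reference)

def grade_proof(rubric_scores: dict[str, float],
                weights: dict[str, float]) -> float:
    """Rubric-based proof grading: a weighted average over criteria
    (e.g. key invariant identified, construction given, all cases closed).
    Missing criteria default to a score of 0."""
    total = sum(weights.values())
    return sum(w * rubric_scores.get(k, 0.0) for k, w in weights.items()) / total

# A proof that found the invariant but skipped the construction and only
# half-closed the case analysis earns partial credit under the rubric,
# even though answer matching alone would report a clean pass or fail.
print(check_answer("  42 ", "42"))  # True
print(grade_proof({"invariant": 1.0, "cases": 0.5},
                  {"invariant": 2.0, "cases": 1.0, "construction": 1.0}))  # 0.625
```

The design point is that rubric aggregation exposes *which* reasoning steps are missing, which is exactly what a model-based autograder must judge reliably before it can scale proof assessment.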