As I’ve been recapping the fundamentals of post-training to wrap up my RLHF / Post-training book, I knew I needed to get Finbarr Timbers back on the podcast to talk about the state of play. Over the last few months we’ve had many discussions on what we’d need to do to take an Olmo-style recipe to the frontier, supported by Finbarr’s extensive reading of recent model technical reports. To prepare for this, I put together a summary slide deck on the key historical post-training recipes (the path from InstructGPT to today) and on today’s key open frontier models. This deck is summarized below as the technical summary, but we do spend 20-35 minutes on it in the podcast, so watching on YouTube is likely the best experience for this one.

I previously interviewed Finbarr in December of 2024, shortly after the release of o1 and Tülu 3 (and before he joined Ai2), on the “We are so back” era of RL.

Chapters:
* 00:00 Introduction & Olmo reflections
* 06:28 Post-train recipes review (history)
* 23:00 2026’s model recipes (MiMo Flash, DeepSeek V4, GLM 5, Kimi K2.6, etc.)
* 39:05 Open-ended post-training discussions
* 48:22 Career advice in the LLM race

Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other Interconnects interviews, go here. For more educational post-training videos, see the course I’m putting together.

Technical Summary

These are notes cleaned up from a slide deck created with AI assistance, mostly useful as a discussion topic and reference.

The shape of a post-training recipe has changed more in the last year than in the prior three.
* 2022–2023 (InstructGPT): one pipeline, SFT → reward model → RL.
* 2024 (Llama 3, Tülu 3, etc.): open recipes formalize SFT → DPO → RL with verifiable rewards. Closed recipes use many stages of RLHF.
* 2025 (DeepSeek R1): reasoning RL (R1) makes large-scale RL the centerpiece.
* 2026 (MiMo Flash V2): recipes fragment into many specialist models that are merged back into one.

The new thing: MOPD

Multi-teacher On-Policy Distillation (MOPD) is the pattern showing up across the 2026 frontier.
* Train N domain-specialist teachers (each: SFT, then RL on the relevant domains).
* Train one general student by sampling its own trajectories (this is the final post-trained model).
* On each rollout, minimize the reverse KL to the relevant teacher’s output distribution, token by token (that is, KL(student ‖ teacher), evaluated on tokens the student itself sampled).

Lineage: MiMo Flash v2 introduced it → DeepSeek V4 & Nemotron 3 Ultra scale it to >10 teachers.

Why did MOPD emerge?
* RL got expensive and conflict-prone. Mixing math, code, and agentic RL in one run eventually trades capabilities off against each other.
* Specialists are cheap to make / organizationally scalable. SFT-then-RL on a single domain is well understood and parallelizable. As post-training becomes more complex, scaling it across organizations is a big win.
* On-policy distillation matured. Literature and know-how continued to emerge through the RLVR renaissance.

Sources: DeepSeek V4 §5.1, MiMo-V2-Flash

Key historical recipes

InstructGPT (Mar. 2022): the canonical 3 steps · paper
* SFT on human demonstrations
* Reward model trained on human comparisons
* PPO against the reward model

Llama 2 (Jul. 2023): multi-stage RLHF · paper · interconnects recap
* SFT, then iterative RLHF over multiple rounds
* Each round: rejection sampling → PPO
* Two reward models, one for helpfulness and one for safety

Llama 3 (Jul. 2024): a complex multi-stage recipe with simpler optimizers · paper · interconnects recap
* Per round: reward model → sample K per prompt → rejection sampling → SFT → DPO
* No online RL: the RM only filters samples. Run over 6 rounds, with the best models from each round seeding the next

Tülu 3 (Nov. 2024): simple three-stage post-training · paper · interconnects recap

Curated prompts → SFT → DPO → RLVR (RL with verifiable rewards; the acronym was coined in this paper).

OLMo 3 (Dec. 2025): a reasoning update to the Tülu 3 recipe · paper · interconnects recap

DeepSeek R1 (Jan. 2025): RL as the centerpiece · paper · interconnects recap

The recipe:
* R1-Zero: pure RL (GRPO) on the base, no SFT; used to seed reasoning behaviors for the full run, not a separate product
* R1: cold-start SFT → reasoning RL → rejection-sampling SFT → final RL → distill to dense
* A big change in recipes: large-scale RLVR as the primary driver, with SFT used to distill and refine RL behaviors

DeepSeek evolution after V3
* V3 · Dec ‘24: SFT + GRPO RL.
* R1 · Jan ‘25: multi-stage RL; reasoning emerges.
* V3.1 · Aug ‘25: hybrid think / non-think in one model.
* V3.2 · Dec ‘25: 6 specialists via RL → SFT distillation → one mixed GRPO.
* V4 · Apr ‘26: 10+ domain experts → MOPD.

2026 style recipes!

MiMo Flash v2 (Jan. 2026): where MOPD started · paper

Stages: Stage 1 SFT → Stage 2 train ~6 domain-specialist teachers (with older style post-training recipes) → Stage 3 MOPD into a single student.

First clean articulation of multi-teacher on-policy distillation as the consolidation step; it replaces a single monolithic RL stage with distill-from-specialists.

Nemotron 3 Ultra (Jun. 2026): two rounds, many teachers · paper

Stages: SFT → multi-teacher on-policy distillation, run over two iterations, with >10 teachers spanning reasoning, code, math, and agentic domains.

Novel: multi-round MOPD across different domains (distill, then re-distill from refreshed teachers).

MAI-Thinking-1 (Jun. 2026): closer to R1 than V4 · announcement

Stages: mid-trained base → 3 specialist RL “climbs” (e.g. STEM) → trace-distillation SFT to consolidate the climbs → a final RL climb → MAI-Thinking-1.

Closer to DeepSeek R1 than to V4: multi-stage RL with trace-distillation SFT to consolidate, not on-policy MOPD.

Not the only lab without MOPD!

Kimi K2.5 (Jan. 2026): agentic, multimodal · paper · blog

Stages: text-only SFT → joint text–vision RL across coding, vision, reasoning, agentic tasks. (No mention of MOPD.)

GLM-5 (Feb. 2026): staged RL by capability · paper

Stages: Base → SFT → Reasoning RL → Agentic RL → General RL.

Transcript

00:00:00 Nathan Lambert: Hello, we are back on an Interconnects conversation. I don’t really say I do interviews. People criticize me ‘cause I interrupt the guests too much. ‘Cause I’m not a good interviewer, but I’m here to entertain people. Um, this is also fun for me because I’m trying to make, like, a post-training course, and it kind of fits as, uh, in the advanced end of this. So it’s kind of a crossover between Interconnects content and other stuff that I’ve been spending my time on this summer. So I’m happy to welcome Finbarr back. I think... Are you the first return guest? I haven’t checked.

00:00:37 Finbarr Timbers: Oh, wow.

00:00:37 Nathan Lambert: Um, Finbarr and I worked on this sort of post-training recipe stuff for a while at Ai2. Um, I left recently. This is one of Finbarr’s last days at Ai2. It’s already been announced. It’s not a spoiler here. So we’re gonna kind of reflect on some things on building post-training recipes for Olmo. Um, then we have a little, like, review slide deck and notes on the kind of state and evolution of frontier post-training recipes over time, which is pretty interesting because there’s, what is it, like two to four kind of canonical recipes that there has been. So it’s kind of interesting when you see the field converge on something new, which it’s doing right now with multi-teacher on-policy distillation. For some reason, that’s a bit of a mouthful. It is a long acronym. And then we’ll just kind of end with various discussion points on post-training and what we’re up to. So, happy to give you the floor if you have any hot takes you wanna start with to get people to, draw people in. Otherwise, I think, uh, I’m excited to kind of reflect on this, ‘cause I know you’ve been reading a ton of papers recently and kind of prep, laying some of this groundwork.

00:01:43 Finbarr Timbers: Well, yeah.
I mean, today is my last day at Ai2, so it- it’s ki- it feels very appropriate to be, to be talking to you as you’re the one who recruited me to Ai2. So, uh, yeah, that’s pretty special, and it’s great to be, uh, yeah, the, the first repeat guest. I feel honored, uh, to be back on. So yeah, thanks, uh, for having me.

00:02:03 Nathan Lambert: Yeah. Do we wanna start with Olmo? I think that-

00:02:05 Finbarr Timbers: Sure

00:02:06 Nathan Lambert: ... people... I think I, uh, need to do this carefully, but I’ve talked about Olmo 3’s post-training many times to people. I haven’t done this in a very direct way on the podcast, but I would say that post-training Olmo 3 to make this reasoning model was a major accomplishment for many individuals to do this. But also, the complexity of what we were doing was pushing against the limits of Ai2’s organizational capacity, and a lot of modern post-training is, like, your ability to wrangle compute and data into a work stream. And in order to do that in a complicated way, you really are wrangling an org chart. And that’s like part of why it’s like Olmo 3 was, by its nature, pretty late as a reasoning model. It was, like, a pretty rigid reasoning model, and that’s, like, partially reflected in the recipe being pretty simple. But then when you, like, compare it to all these new recipes with tool use and multi-teacher distillation and all of this, it’s just like a, a, a fork in the road where it’s like you could do this very simple thing and make a strong recipe, but it is not representative of what all the frontier labs are doing. And I think that that kind of fork in being able to say that things are similar happened kind of after Tülu 3, where Tülu 3, I think, was also much simpler with this three-stage SFT-DPO-RL recipe. But that simpler recipe was probably closer in outcome to what the labs are doing, but now d