For our second Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with John Yang, a Stanford PhD student and creator of the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench.
This interview covers:
- Why every frontier model scored 0% at launch — until GPT-5.5 cracked the first task (cmatrix)
- Why ProgramBench grades the runnable artifact, not the implementation — and lets models build in any language
- The post-training tell hiding in plain sight: how much models (especially GPT) love Python, even when it looks like a handicap
- The reward-hacking problem: models with internet access cheated up to 36% of the time, and nine LLM judges still couldn't agree on what counts as cheating
- The lineage from SWE-bench to SWE-smith to CodeClash, and what ProgramBench needs from expert contributors to grow
Full interview/transcript: https://snorkel.ai/blog/benchtalks-john-yang-programbench/
Information
- Published: June 3, 2026 at 6:00 p.m. UTC
- Length: 54 min
- Episode: 2
- Rating: Clean
