Benchtalks #2: John Yang (SWE-bench, ProgramBench) - The future of coding benchmarks

For our second Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with John Yang, a Stanford PhD student and creator of the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench.

This interview covers: 

  • Why every frontier model scored 0% at launch — until GPT-5.5 cracked the first task (cmatrix)
  • Why ProgramBench grades the runnable artifact, not the implementation — and lets models build in any language
  • The post-training tell hiding in plain sight: how much models (especially GPT) love Python, even when it looks like a handicap
  • The reward-hacking problem: models with internet access cheated up to 36% of the time, and nine LLM judges still couldn't agree on what counts as cheating
  • The lineage from SWE-bench to SWE-smith to CodeClash, and what ProgramBench needs from expert contributors to grow

Full interview/transcript: https://snorkel.ai/blog/benchtalks-john-yang-programbench/