June 3
54 min
Video

Benchtalks #2: John Yang (SWE-bench, ProgramBench) - The future of coding benchmarks

For our second Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with John Yang, a Stanford PhD student and creator of the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench.

This interview covers:

- Why every frontier model scored 0% at launch, until GPT-5.5 cracked the first task (cmatrix)
- Why ProgramBench grades the runnable artifact, not the implementation, and lets models build in any language
- The post-training tell hiding in plain sight: how much models (especially GPT) love Python, even when it looks like a handicap
- The reward-hacking problem: models with internet access cheated up to 36% of the time, and nine LLM judges still couldn't agree on what counts as cheating
- The lineage from SWE-bench to SWE-smith to CodeClash, and what ProgramBench needs from expert contributors to grow

Full interview/transcript: https://snorkel.ai/blog/benchtalks-john-yang-programbench/

Show: Benchtalks
Published: June 3, 2026 at 6:00 p.m. UTC
Length: 54 min
Episode: 2
Rating: Clean
