Benchtalks

Snorkel AI

Benchtalks is Snorkel AI's podcast series at the intersection of AI evaluation, data quality, and real-world impact. Hosted by the Snorkel team, each episode brings together researchers, practitioners, and leaders to dig into the questions that matter most as AI benchmarks grow more sophisticated, dynamic, and reflective of the complexity found in real-world deployments. We explore the full stack of what it takes to build AI that actually works — from the design of rigorous, open benchmarks that close the gap between what we measure and what we encounter in production, to the expert-in-the-loop data creation and curation pipelines that make reliable evaluation possible. Along the way, we get into reinforcement learning, reward modeling, and the evolving science of data quality that underpins it all. Whether you're building agents that operate over long horizons, crafting rubrics that go beyond pass/fail, or trying to understand what "good" looks like for a multi-artifact deliverable — this is the conversation for you. New episodes drop regularly on YouTube and wherever you get your podcasts. Follow us @SnorkelAI (LinkedIn, X, YouTube) to stay current as the field moves fast. 

Episodes

  1. 6d ago ·  Video

    Benchtalks #3: Parth Asawa (Continual Learning Bench) - We Taught AI Everything Except How to Learn

    For our third Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with Parth Asawa, a UC Berkeley PhD student advised by Matei Zaharia and Joey Gonzalez, and the creator of Continual Learning Bench, the first standardized benchmark for measuring whether AI systems actually learn from experience over time.

    This interview covers:

    - Why your coding agent re-reads your entire codebase from scratch every single session, and why a human developer stops doing that after task one
    - How the field got so good at scaling that it quietly set aside the original question: can a model actually learn on its own after training ends?
    - The poker task: if your opponent calls every hand, a human exploits it in minutes, and models run in separate sessions don't figure it out
    - The gain metric: why cumulative reward alone can't tell you if a system is learning or just smarter to begin with, and how stateful minus stateless performance isolates actual learning
    - What Anthropic's Fable release revealed, including how they used Continual Learning Bench to show Fable outperformed systems using Opus or Sonnet as a backbone
    - Why in-context learning still leads the leaderboard, and why Parth's long-run bet is on parametric systems
    - The false dichotomy in AI safety: unsafe open models vs. consolidated power, and what Parth and Joey Gonzalez propose as the third path

    Full interview/transcript: snorkel.ai/blog/benchtalks-parth-asawa-continual-learning-bench/

    51 min
