1 day ago

Ship faster, measure better: experimentation in the age of AI

Summary
How do you know if the thing you just shipped actually worked? On this episode of The Experimentation Edge, host Ashley Stirrup, CMO of GrowthBook, sits down with Kevin Yang, Executive Director and Head of Experimentation at JPMorgan Chase, who has spent six years building experimentation across Chase's digital platforms. Kevin shares how his team turned experimentation into more than a billion dollars of estimated value, why the losing experiments matter more than the winners, and the simple chart exercise he uses to prove that a million-dollar change is invisible without a control group. He and Ashley also dig into measuring engagement without chasing vanity metrics, planning for failure to defeat confirmation bias, and why AI is pushing experimentation into a golden era. It's a practical look for product managers, data scientists, and engineers at how a bank operating at massive scale makes better decisions.

Chapters
00:00 Welcome to The Experimentation Edge
01:45 Kevin's role leading experimentation at Chase
04:15 Why Chase invested in experimentation
06:45 A billion dollars and the value of losers
12:45 Plan for failure to beat confirmation bias
14:30 The million-dollar change you can't see
18:45 Sharing learnings and experimentation wrapped
20:45 Engagement without vanity metrics
22:00 Experimentation's golden era with AI
23:30 Why AI needs more experimentation, not less

Takeaways
- Chase estimates over a billion dollars of value from experimentation, and most of the lasting learning comes from the losing tests, not the winners.
- A control group is non-negotiable: at scale, a change worth millions is invisible under noise and seasonality, and no one can spot it by eye.
- Treat engagement carefully. For a bank, more time in the app isn't a win; trust, fast task completion, and healthy repeat engagement are.
- Plan for failure before you run a test. A pre-built playbook for a loss prevents confirmation bias and keeps teams from gaming the metrics.
- AI is ushering in a golden era for experimentation, because shipping faster only compounds mistakes unless you measure what you ship.

Connect with the Guest
LinkedIn: https://www.linkedin.com/in/kevintyang
Website: https://www.jpmorganchase.com

Sponsor
GrowthBook helps you ship features with confidence by bringing experimentation and feature flagging into one open-source platform. No more guessing whether that new checkout flow actually moved the needle, waiting weeks for data team bandwidth, or flying blind on rollouts. GrowthBook gives you a single place to run A/B tests, manage feature flags, and analyze results against your existing data warehouse. With powerful stats built in, it takes the complexity out of experimentation, helps you catch regressions before they hit every user, and makes it easy to test ideas that keep your product improving and your metrics moving in the right direction. See a demo at https://www.growthbook.io/

26 min

2 days ago

False negatives are killing your best product ideas

Summary
How do you make a high-stakes product decision when the safe choice is to never test it at all? In this episode of The Experimentation Edge, host Ashley Stirrup talks with Arun Bodapati, director of data science at Twitch, about the discipline behind trustworthy experimentation. Drawing on his experience at Schwab, Uber, and Twitch, Arun explains why false negatives are the most dangerous result a team can produce, what hygiene to nail before you push play, and how Twitch used geo-fenced experiments and causal inference to finally settle a pricing question it had avoided for years. It's a practical conversation for product managers, engineers, data scientists, and growth leaders who want experiments that hold up and earn executive trust.

Chapters
00:00 Welcome and introduction
01:15 Arun's background and marketing experimentation at Schwab
04:15 Uber's mature, experiment-driven culture
06:30 Coming to Twitch: from Python notebooks to a shared standard
08:30 The pricing problem Twitch had long avoided
10:30 Geo-fenced experiments, matched markets, and elasticity
13:15 The gifted-subs surprise and testing promotions
16:15 The discipline that matters before you push play
18:15 Why false negatives are worse than false positives
20:05 Enrollment triggers and broad explore experiments
22:45 AI, the Kiro tool, and what's next for experimentation

Takeaways
- False negatives are more dangerous than false positives: they get institutionalized as "we tried that, it didn't work" and quietly kill good ideas for years.
- The most valuable experiment work happens before you push play: clear enrollment logic, a plain-English hypothesis, and no optimizing ahead of the test.
- If an intervention sounds weak when you write it out in plain English, don't run the experiment; you're just wasting time.
- Run a broad explore experiment first; small, over-narrowed populations lack power and raise the odds of a false negative. Find the responsive segment with heterogeneous treatment effects afterward.
- Twitch used geo-fenced experiments with matched markets and causal inference to measure true price elasticity, turning a feared pricing decision into a measured, accretive one.

Connect with the Guest
LinkedIn: https://www.linkedin.com/in/abodapati/
Website: https://www.twitch.tv
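
The takeaway about small, over-narrowed populations lacking power can be made concrete with a rough calculation. The sketch below is not from the episode: it assumes a two-sided z-test on conversion rates with a normal approximation, and the baseline rate, lift, and sample sizes are hypothetical numbers chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_treatment, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two conversion rates.

    Uses the normal approximation with unpooled variance. All inputs used
    below are illustrative assumptions, not figures from the episode.
    """
    std_normal = NormalDist()
    se = sqrt(
        p_control * (1 - p_control) / n_per_arm
        + p_treatment * (1 - p_treatment) / n_per_arm
    )
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = abs(p_treatment - p_control) / se
    # Probability that the test statistic lands in either rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


# Hypothetical scenario: a real lift from 10% to 11% conversion.
narrow_power = power_two_proportions(0.10, 0.11, n_per_arm=2_000)
broad_power = power_two_proportions(0.10, 0.11, n_per_arm=50_000)

print(f"narrow segment:   power={narrow_power:.2f}, false-negative risk={1 - narrow_power:.2f}")
print(f"broad population: power={broad_power:.2f}, false-negative risk={1 - broad_power:.2f}")
```

Under these assumed numbers, the narrow test misses the real lift in most runs, while the broad test almost always detects it. That is the mechanism by which an over-narrowed experiment can turn a good idea into "we tried that, it didn't work."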

28 min

3 days ago

Squarespace killed its blank template and built something better

Summary
What do you do when your big launch increases engagement and tanks conversion? On this episode of The Experimentation Edge, host Ashley Stirrup talks with Lina Blackman, Director of Product Analytics at Squarespace, about the blank template launch that flopped, and how its learnings became Blueprint, Squarespace's AI-guided website builder. Lina explains how her embedded analyst team runs 150–200 experiments a year for 3 million customers, the two questions she asks every time a test loses, why teams only need one or two big wins a quarter, how Squarespace calibrates statistical certainty to business stakes, and where AI belongs (and doesn't) in the A/B testing workflow. For product managers, data scientists, and experimentation leaders who want to extract more learning from every test.

Chapters
00:00 Introduction: Lina Blackman, Director of Product Analytics at Squarespace
01:45 Squarespace's business and 3 million website customers
02:30 Decentralized analysts, centralized experimentation program
04:15 150–200 experiments a year: onboarding, mobile, checkout, pricing
04:55 The blank template disaster that became Blueprint AI
07:45 Two questions for every losing test
09:30 Moving ship-first teams up the experimentation maturity curve
12:30 A/B test logs and insights rituals
13:30 North Star metrics and the KPI tree
16:35 AI in the A/B testing workflow, and what stays manual

Takeaways
- Stated preference lies: users asked for a blank canvas, but behavior demanded guided design, and only the experiment could referee.
- Close every losing test with two questions: did it work for a granular segment, and is the idea worth further investment?
- One or two big wins a quarter is a healthy hit rate when you run 150–200 experiments a year.
- Calibrate certainty to stakes: tight bounds on revenue and pricing tests, wider bounds on engagement tests so teams don't spin on noise.
- Hand AI the mundane parts of the workflow (tracking, assignment setup), but if AI runs the brief and the analysis, ask why you're running the test at all.

Connect with the Guest
LinkedIn: https://www.linkedin.com/in/linanguyen
Website: https://www.squarespace.com

23 min

June 17

The "View All" page that made more money by showing less

Summary
Craig Kistler, VP of Experience Design, Personalization, and Experimentation at Signet Jewelers (the parent company of Kay, Jared, Zales, Peoples, and Banter), joins host Ashley Stirrup on The Experimentation Edge to unpack how a hybrid online-and-in-store jewelry retailer runs experimentation at scale. Craig shares the counterintuitive "view all" experiment where his team blocked the product grid, added friction on purpose, and grew revenue; why he optimizes for revenue per visitor instead of conversion rate; and how Signet deliberately traded a high-volume testing program for fewer, higher-value experiments. A practical listen for product managers, designers, and experimentation leaders building programs that compound.

Chapters
00:45 What Signet Jewelers actually is (Kay, Jared, Zales, and more)
01:45 From art school to UX to experimentation: Craig's background
04:45 How experimentation is organized: a centralized model across brands
06:45 From 40–50 tests a quarter to 15–25 value-driven experiments
08:45 The "view all" experiment: adding friction to grow revenue
12:45 One product page, many stakeholders: financing, warranties, chat
15:45 Extracting learnings when an experiment loses
18:45 Why revenue per visitor beats conversion as the north star
20:15 Intent-based personalization and "engagement season is every day"
24:45 Bringing the whole org along by tying insights to dollars

Takeaways
- Friction can increase revenue. Blocking the "view all" grid and forcing a style choice sent shoppers deeper and lifted conversion and revenue, because the extra click added value.
- The three-click rule is conditional. Clicks only hurt when they're empty; a click that narrows thousands of options to dozens is a feature, not a cost.
- Revenue per visitor is the honest north star. Conversion rate can be gamed to 100% by making everything free or cutting bounce-heavy traffic; revenue per visitor can't.
- Fewer, bigger experiments beat high volume. Signet went from 40–50 tests a quarter to 15–25 because complex, value-driven tests produce reusable insights that small tweaks don't.
- Tie every result to dollars. Translating experiment outcomes into revenue is how Craig keeps financing, warranty, and chat stakeholders aligned and gets executives to act.

Connect with the Guest
LinkedIn: https://www.linkedin.com/in/craigkistler/
Website: https://www.signetjewelers.com
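
The point that conversion rate can be gamed while revenue per visitor cannot is easy to check with arithmetic. The sketch below is an illustration only: the `ArmTotals` type and every number in it are hypothetical, not Signet data. It compares a control arm with a deep-discount arm that wins on conversion and loses on revenue per visitor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmTotals:
    """Aggregated totals for one experiment arm (hypothetical schema)."""
    visitors: int
    orders: int
    revenue: float

    @property
    def conversion_rate(self) -> float:
        # Share of visitors who placed an order.
        return self.orders / self.visitors if self.visitors else 0.0

    @property
    def revenue_per_visitor(self) -> float:
        # Revenue spread over every visitor, buyers and non-buyers alike.
        return self.revenue / self.visitors if self.visitors else 0.0


# Made-up numbers: the discount arm sells more orders at a lower price.
control = ArmTotals(visitors=10_000, orders=200, revenue=60_000.0)
deep_discount = ArmTotals(visitors=10_000, orders=300, revenue=45_000.0)

for name, arm in (("control", control), ("deep discount", deep_discount)):
    print(f"{name}: conversion={arm.conversion_rate:.1%}, "
          f"revenue per visitor=${arm.revenue_per_visitor:.2f}")
```

In this made-up case a team reading only conversion rate would ship the discount, while revenue per visitor shows that the business earns less from the same traffic.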

27 min

June 15

DART: The four metrics that actually measure AI agents

Summary
RingCentral's Director of Product Management for AI Products, Mayank Agarwal, joins host Ashley Stirrup to dismantle the metrics most teams use to judge AI agents. Drawing on his background founding an AI-first quantitative trading firm and scaling Groupon's bookable marketplace, Mayank explains why accuracy and thumbs-up/down feedback both mislead, and introduces DART, a four-metric behavioral framework (decay, acceptance, relevance, task completion) ported from how he measured trading strategies. He also breaks down a Groupon flash-discount experiment that backfired and the scarcity pivot that fixed it. Essential listening for product managers, engineers, and data scientists building or measuring AI features.

Chapters
00:00 Welcome and Mayank's path from quant trading to RingCentral AI
02:45 Why experimentation has to be owned cross-functionally
04:55 Small experiments that compounded to a 12% lift at Groupon
06:45 Why accuracy and thumbs-up/down fail for AI agents
08:15 The DART framework, metric by metric
12:45 Applying DART to AI-generated smart notes
14:55 The Groupon flash-sale that dropped conversion
16:45 Swapping price urgency for scarcity and social proof
19:45 North Star metrics, guardrails, and Goodhart's law
26:45 The future: experimenting on, and for, AI agents

Takeaways
- Accuracy is a comfortable lie. It grades a narrow test set and can stay high while the agent fails real users.
- Thumbs-up/down feedback is sparse and skewed. Unhappy users rarely rate; they just quietly stop using the product.
- DART measures behavior, not opinions. Four signals read off logs and transcripts: decay, acceptance, relevance, and task completion.
- Acceptance rate is the trust metric. The share of output users keep without editing is the strongest available proxy for trust.
- A losing experiment is paid-for information. Groupon's flash-sale flop revealed the lever was wrong, not the goal: scarcity beat price-based urgency.

Connect with the Guest
LinkedIn: https://www.linkedin.com/in/mayank-agarwal-6223b04a/
Website: https://www.ringcentral.com
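
Of the four DART signals, the show notes define only acceptance precisely: the share of output users keep without editing. A minimal sketch of that one metric follows. The `AgentOutput` log schema and the sample records are assumptions made for illustration; the episode does not describe RingCentral's actual logging or how decay, relevance, and task completion are computed.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AgentOutput:
    """One AI-generated output as it might appear in a log (hypothetical schema)."""
    output_id: str
    kept: bool    # the user did not discard the output
    edited: bool  # the user changed the output before using it


def acceptance_rate(outputs: Sequence[AgentOutput]) -> float:
    """Share of outputs that users kept without editing."""
    if not outputs:
        return 0.0
    accepted = sum(1 for o in outputs if o.kept and not o.edited)
    return accepted / len(outputs)


# Illustrative sample: two outputs accepted as-is, one edited, one discarded.
sample_log = [
    AgentOutput("note-1", kept=True, edited=False),
    AgentOutput("note-2", kept=True, edited=True),
    AgentOutput("note-3", kept=False, edited=False),
    AgentOutput("note-4", kept=True, edited=False),
]

print(f"acceptance rate: {acceptance_rate(sample_log):.0%}")
```

Counting whole outputs is one possible reading of "share of output"; a team could instead measure the fraction of generated text that survives editing, which would need a diff against the final version.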

32 min

June 2

The 2% close rate increase that turned Ford Credit's product teams into believers

Summary
On this edition of The Experimentation Edge, Ashley Stirrup talks with Geoffrey Bell, Experimentation Product Specialist at Ford Credit, about building an experimentation practice inside a captive auto lender. Geoffrey shares the losing test that earned his program credibility, the "experimentation piggy bank" he picked up at Microsoft, and the breakthrough of connecting online experiments to offline dealership receivables. The throughline: a program proves its worth not just by the wins it ships, but by the expensive mistakes it prevents and the revenue it can finally trace. It's for product managers, data scientists, and growth leaders who want experimentation taken seriously by the business.

Chapters
00:00 Intro
01:15 Geoffrey's path: Lowe's, Microsoft, Ford Credit
07:15 How Ford Credit fits with Ford Motor
10:15 The teams behind every Ford Credit page
15:15 The vehicle selector test that lost on purpose
19:15 Why feature placement beats feature ideas
21:15 Personalization and the shrinking-audience problem
25:15 Telling the story when a test loses
30:45 Connecting an online test to an offline car sale
33:55 The experimentation piggy bank

Takeaways
1. Losing tests often create more value than winners because they stop expensive mistakes before they ship.
2. Measure experimentation two ways: the revenue you earn from wins and the revenue you save by killing bad experiences.
3. A feature that fails early in a flow can succeed later; placement and timing often matter more than the idea itself.
4. Connecting online experiments to offline outcomes like receivables turns a small lift into a number leadership cares about.
5. When you struggle to land a result, lead with the story of what the customer did, then bring the numbers.

Connect with the Guest
LinkedIn: https://www.linkedin.com/in/geoffrey-bell-62a03617/
Website: https://www.ford.com/finance/

37 min

May 13

Atlassian's Andrew Willingham on the Talent Product That A/B Testing Turned Around

This episode of The Experimentation Edge explores how A/B testing, feature flags, and user research transformed Atlassian's talent product after it failed with its first users. Andrew Willingham, who spent 11 years at Amazon and is now Head of Legal and People Products at Atlassian, shares how product experimentation works when you can't test at scale, why your customer and your user are not the same person, and how the metrics you choose decide which experiments you can even run.

Summary
Andrew Willingham, Head of Legal and People Products at Atlassian, spent 11 years at Amazon before joining Atlassian a year ago. His path from running A/B tests on millions of Amazon shoppers to building talent management software for a few hundred thousand employees forced a fundamental shift: when you can't run tests at scale, you have to sit with your actual users and watch them fail. He shares how building a talent review product for Amazon's HR specialists completely flopped when handed to HRBPs, and why that failure taught him more than any winning experiment. Now at Atlassian, he's applying that same rigor to reimagining hiring processes with AI, testing everything from recruiter screens to interview sequences that the industry has run the same way for decades.

Timestamps
03:09 From marketing Amazon's mobile app to building HR software for 1.5 million associates
08:19 Why a talent review product loved by IO psych experts flopped with actual HRBPs
11:11 How A/B testing helps product managers escape opinion-based politics
15:25 Testing copy that changes behavior: "We'll generate that status report for you"
17:20 The two North Star metrics Andrew optimizes: efficiency and quality
19:05 Khan Academy's metric trap: measuring cognitive engagement, not just completion
21:10 Why product managers resist experimentation, and what changes when you admit you don't know

Takeaways
- Your customer and your user may not be the same person: building for HR specialists instead of the HRBPs who actually run talent reviews resulted in a feature nobody could use.
- When you can't test at scale, desk rides replace A/B tests: sitting with users and watching them struggle reveals failures faster than any dashboard.
- Experimentation short-circuits political debates by removing opinion from product decisions.
- Test metrics before you test features: usage time could signal engagement or just mean your product takes too long to do its job.
- The experiments that fail deliver the most valuable learnings, especially when you expected a slam dunk.

Connect with the guest
Andrew Willingham on LinkedIn: https://www.linkedin.com/in/andrewwillingham/
Learn more about Atlassian: https://www.atlassian.com/

Topics: A/B testing, product experimentation, feature flags, user research, talent management, qualitative research, metric design, experimentation at scale, growth experimentation.

23 min

May 12

How DoorDash's Experimentation Platform Saved Millions With One A/B Test

This episode of The Experimentation Edge unpacks how DoorDash's experimentation platform runs 12,000+ A/B tests per year across 42 million monthly active users, and now powers merchant-led testing on menu pricing and promotions. Ilya Izrailevsky, Senior Engineering Manager leading the platform, explains how feature flags, marketplace experimentation, and CEO-level experiment reviews built a multi-million-dollar experimentation culture across consumers, dashers, and merchants.

Summary
Most companies struggle to scale experimentation beyond engineering teams. DoorDash runs over 12,000 experiments per year across 42 million monthly active users, and now they're enabling restaurant owners to run their own A/B tests on menu pricing and promotions. Ilya Izrailevsky, Senior Engineering Manager leading DoorDash's experimentation platform, shares how the company built a three-sided marketplace testing program that balances consumers, dashers, and merchants across 40+ countries. From his time scaling search at Amazon (where offline model evaluation narrowed hundreds of candidates down to 10 for live testing) to preventing DashPass churn at DoorDash, Ilya reveals what happens when experimentation scales beyond product teams, and why CEO-level experiment review emails drive cultural change faster than any training program. One standout learning: expanding delivery radius to 11+ miles increased grocery orders but tanked retail conversions. The lesson wasn't about distance; it was that a single-metric approach breaks in multi-dimensional marketplaces. DoorDash now segments experimentation by vertical, behavior pattern, and regional market, using AI agents to mine institutional knowledge from past tests and auto-generate experiment summaries that ship company-wide within hours of readout.

Timestamps
00:40 From building Wasabi (Intuit's open-source platform) to running ML at Amazon and Uber
03:04 Why product velocity without experimentation creates feature bloat, not impact
05:32 Scaling search at Amazon: billions of products, 10 visible results, 25% win rate
08:22 Offline evaluation as a filter: golden data sets cut model candidates before live traffic
10:23 DoorDash's three-sided marketplace: 300 million feature flag evaluations per second
12:38 CEO Tony Xu reads every experiment email and replies with alternative hypotheses
13:33 Democratization at scale: enabling merchants to A/B test menu pricing and promotions
17:05 DashPass churn experiment uncovered a value perception gap that became a full product area
22:03 Why expanding delivery radius killed retail orders but boosted grocery conversions
24:16 No single North Star metric: balancing consumer quality, dasher earnings, merchant mix
27:29 Four-dimensional scale: democratization, global expansion, new verticals, AI agents
31:03 Agentic experimentation: AI mines past tests to generate hypotheses and debug imbalance

Takeaways
- Win rate matters less than learnings per test: DoorDash ships company-wide experiment summaries (win or lose) that the CEO actively reads and responds to, creating cultural accountability around testing rigor.
- Offline evaluation acts as a pre-filter for model velocity: Amazon's search team used golden data sets to cut hundreds of ML candidates down to 10 for live A/B testing, preventing wasted experiment slots.
- One-size metrics break in multi-dimensional marketplaces: DoorDash balances consumer retention, dasher utilization, and merchant inventory mix across verticals because optimizing one side degrades the ecosystem.
- Democratization requires opinionated templates, not open-ended tools: enabling non-technical users to run tests means embedding success metrics and guardrails into pre-built experiment configs.
- AI scales institutional knowledge, not just analysis speed: mining past experiment readouts to auto-generate new hypotheses turns your testing history into a compounding advantage.

Connect with the guest
LinkedIn: https://www.linkedin.com/in/ilyaizrailevsky/
Learn more about DoorDash: https://www.doordash.com/

Topics: A/B testing, experimentation platform, feature flags, marketplace experimentation, machine learning, growth experimentation, statistical significance, experimentation culture, agentic AI workflows.

31 min
