Developers Who Test

Testery, Inc.

A podcast for developers who ship better software. We talk about all things software testing.

  1. Jun 22

    Building Production-Grade AI: Nikolai Grabner on Testing RAGs, LLMs, and the QA Mindset

    In this episode of Developers Who Test, host Chris Harbert sits down with Nikolai Grabner, a Senior Software Engineer and Technical Lead at Enigma Solutions, to talk about what it actually takes to build and test AI systems that are ready for production. Nikolai opens by demystifying retrieval-augmented generation (RAG), using the analogy of a knowledgeable judge who consults a specialist library (a vector database) when a question falls outside their general expertise. He explains why companies are increasingly building private, in-network RAG systems: to keep proprietary information out of third-party models such as those from OpenAI and Anthropic, while still giving employees a single, instant point of access for things like HR policy questions and onboarding knowledge.

    Nikolai shares the origin story behind his own product, SAP Bot, which grew out of market research he did when founding Enigma Solutions. After hearing that many internal RAG systems were, in his words, "not working properly," his QA instincts kicked in, and he set out to prove that a thoroughly tested private RAG could get close to the quality of the big public models.

    A central theme of the conversation is how testing AI differs fundamentally from traditional pass/fail test cases. Because the same prompt can return a different answer each time, Nikolai built a scoring mechanism rooted in statistics, precision, and coverage to detect hallucination (making things up) and drift (staying on topic but giving wrong answers). Chris draws a parallel to Six Sigma and the idea of variability as the enemy of quality. The two get into the practical realities of building with AI, including using tools like promptfoo to fire the same set of prompts at OpenAI, Anthropic, and Gemini and compare the results, tuning the temperature to control creativity, and learning hard lessons about performance.
    Nikolai recounts how discovering CUDA and offloading the LLM to the GPU cut his response times from five minutes to about fifteen seconds with streamed output. They also swap cautionary tales about AI getting things subtly wrong: a coding tool that inverted passes and fails in test results, and an MCP server with start and end dates reversed that the LLM quietly worked around, leaving a hidden landmine in the system.

    Much of the discussion centers on discipline. Nikolai argues that vibe coding can produce production-grade software, but only with clear requirements, a design spec, a roadmap, phased delivery, and regression testing after every change. He compares vibe coding to managing a junior dev team whose work still needs to be tested. Chris highlights how far modern tooling has come, pointing to Playwright MCP and the Testery MCP for running reliable end-to-end tests at scale and feeding the results back to the LLM, and Nikolai contrasts that with the weeks it once took to script a single test in WinRunner back in 1999.

    The episode closes on continuous quality and staying current. Nikolai makes the case for always-on testing of AI systems (since a single faulty document can skew an entire RAG), dedicated research and development teams, and giving testers room to run proofs of concept on test infrastructure. Both reflect on how quickly organizations can become dinosaurs in the AI era, the value of conferences like TestCon for learning what is genuinely cutting edge, and how that spirit of learning by doing is exactly why the podcast exists.

    44 min
  2. Jun 22

    DORA and the AI Capabilities Model: Nathen Harvey on Why AI Amplifies the Best and Worst of Your SDLC

    In this episode of Developers Who Test, host Chris Harbert sits down with Nathen Harvey, who leads the DORA team at Google Cloud. Nathen has co-authored multiple reports on software delivery performance and was a contributor to and editor of 97 Things Every Cloud Engineer Should Know.

    The conversation starts with what people get wrong about DORA. Nathen explains that the famous four (now five) software delivery metrics are just the surface. Treating them as the whole picture is like stepping on a scale every day and expecting the number to change: the metrics tell you how you are doing, but it is the underlying capabilities and practices that actually move them. He walks through how Accelerate introduced DORA to most of the industry, why so many readers stop at the four keys on page 19, and why the capabilities model in the appendix is where the real value lives.

    Chris and Nathen dig into a decade of findings: throughput and stability move together rather than in opposition, smaller batches lead to better outcomes, and trunk-based development is both one of the most effective and one of the most controversial practices, including its surprising link to burnout on teams new to it. They talk about why alignment across practices matters (you cannot adopt trunk-based development without also addressing test automation, test data management, and CI/CD) and why the goal should be to become an elite improver rather than an elite performer.

    The second half focuses on DORA's new AI Capabilities Model, published in December 2025. With roughly 90 percent of respondents now using AI professionally, the differentiator is no longer whether you use AI but how. Nathen lays out the seven capabilities that amplify AI's benefits: a clear and communicated AI stance, a healthy data ecosystem, AI-accessible data, working in small batches, strong version control, user centricity, and a high-quality internal platform.
    The core 2025 finding is that AI acts as an amplifier across the SDLC: high-performing teams get faster, while teams with bottlenecks feel that pain even more acutely when they push ten times more change into a review or testing process that hasn't scaled. They close on what this means in practice: AI is democratizing who can build software, so investing in platform guardrails, fast feedback, ephemeral environments, and high-parallelism testing becomes more important than ever. Nathen points listeners to dora.dev and the dora.community to assess their own capabilities and start improving.

    Key Topics:
    - Common misconceptions and anti-patterns around the DORA metrics
    - Why the four (now five) keys are only the surface of software delivery performance
    - How Accelerate and the capabilities model fit together
    - Throughput and stability improving together, not in tension
    - Trunk-based development, smaller batches, and the burnout finding
    - Alignment across test automation, test data, and CI/CD
    - Becoming an elite improver instead of an elite performer
    - Contextualizing findings and user centricity
    - The new AI Capabilities Model and its seven capabilities
    - AI as an amplifier of both strengths and bottlenecks across the SDLC
    - Democratized building, platform guardrails, and the renewed importance of fast feedback

    42 min
  3. Apr 6

    Mastering ETL Testing: Data Pipelines, Healthcare QA, and the Future of Cross-Agent Testing with Jitendra Boddapati

    In this episode, Chris Harbert sits down with Jitendra Boddapati, a lead quality engineer with over 10 years of experience specializing in API testing, ETL, big data QA, and accessibility testing across the healthcare, banking, and retail domains. Jitendra shares his journey from college graduate to leading complex ETL projects, including how he taught himself SQL under pressure to deliver a challenging data migration project.

    The conversation dives deep into the often-overlooked world of ETL (Extract, Transform, Load) testing: what it is, why it matters, and how it differs fundamentally from traditional UI testing.

    Key topics covered:
    - Why ETL testing doesn't get the attention it deserves, and how to change that
    - The critical role of Source to Target Mapping documents for ETL testers
    - Real-world production bugs: how a comma in a provider name caused data to shift into the wrong columns
    - Data profiling techniques like pattern and frequency analysis to uncover hidden anomalies
    - Using AI tools to generate SQL queries and automate data validation

    Chris and Jitendra also explore how AI is transforming user interfaces and what that means for testers. When users interact with your product through Claude, ChatGPT, or Cursor instead of your website, who's responsible for testing that experience? The episode wraps up with practical advice for anyone looking to get started with ETL testing: understand your data scope, break pipelines into smaller chunks, and always start with clear requirements. Whether you're a seasoned data engineer or a developer curious about testing data pipelines, this episode offers valuable insights into a testing discipline that's becoming increasingly critical as organizations deal with ever-growing volumes of data.

    47 min
  4. Mar 3

    The Economics of Testing: Making the Business Case for Quality with Vitaly Sharovatov

    In this episode, Chris Harbert sits down with Vitaly Sharovatov, a seasoned developer and engineering manager with over 22 years of experience. Vitaly serves as a developer advocate at Qase, a test case management platform, and has written extensively about AI, testing methodology, and the economics of software quality. The conversation tackles a question every quality advocate faces: how do you convince leadership to invest in testing? Vitaly shares practical frameworks for quantifying the business value of quality and making the case for prevention over firefighting.

    Key topics covered:
    - Why developers implicitly do testing already, and why they should understand it deeply
    - A simpler approach: quantifying the costs of bad quality you're already paying (support calls, lost sales, maintenance overhead)
    - The social dynamics of selling quality ideas: finding allies and helping managers "show off" cost savings
    - When to automate vs. when to test manually: understanding the economic inflection point
    - The hidden costs of poor quality on team morale, burnout, and employee retention

    Vitaly shares real-world examples, including a dating app where automated tests passed but a critical button was hidden below the viewport, and an insurance company that kept 300 people staffed for multiple quarters to work around a poorly tested API. The episode wraps up with a key insight: most quality problems have social roots within organizations. Success requires not just good testing practices but the ability to win allies, understand incentives, and sell ideas to stakeholders who aren't always rational economic actors. Whether you're trying to justify a testing initiative to leadership, optimize your team's approach to quality, or simply understand the true cost of defects, this episode provides a practical economic lens for thinking about software testing. Find Vitaly at beyondquality.org, a non-commercial community focused on collaborative research into testing economics, or connect with him on LinkedIn.

    44 min
  5. Feb 10

    From Broadway Drummer to Senior SDET: Angel Williams on AI-Assisted Testing, Flaky Tests, and the QA Mindset

    In this episode of Developers Who Test, host Chris Harbert sits down with Angel Williams, Senior SDET at CHG Healthcare, one of the largest healthcare staffing companies in the US. Angel's journey into software quality is unlike any other: she started as a percussionist trying to make it on Broadway before discovering a knack for debugging deployment scripts during IT contract work.

    The conversation explores the unique personality traits that draw people to quality engineering. Chris shares his fascinating discovery that every member of one of his QA teams scored high on "restorative" in StrengthsFinder, the same trait that had Angel taking apart the family stereo as a kid just to understand how it worked.

    Angel provides insight into testing in healthcare, where privacy and security aren't just nice-to-haves; they're essential. She explains how protecting both provider and patient data shapes testing strategies at CHG, from scrubbing logs to ensuring sensitive information never travels over live wires.

    The discussion takes a deep dive into AI-assisted testing. Angel shares practical examples of using Claude Code with Playwright's MCP integration to build performance dashboards and analyze code for risks. She emphasizes that AI shines brightest not when writing tests, but when helping SDETs understand unfamiliar code, identify risks, and, perhaps most valuably, keep documentation up to date. "Every time I look at a PR with major changes, I ask AI if the README reflects the new code," she explains.

    Chris and Angel swap war stories about flaky tests, including Angel's mysterious 5 PM failures that turned out to be a timezone shift issue, exactly matching one of the patterns in Chris's "14 Reasons for Flaky Tests" presentation. They discuss infrastructure-related flakiness, load balancer issues, and the critical importance of running tests before merge rather than after.
    The episode wraps with a thought-provoking discussion about leveraging MCP servers not just for automation but for asking questions about quality itself, combining data from Jira, test results, and documentation to get a complete picture of project health.

    Key Topics:
    - The "restorative" personality trait and QA professionals
    - Testing in healthcare: privacy, security, and compliance
    - Practical AI applications for SDETs
    - Running tests before merge vs. after
    - MCP servers as a new layer for quality insights

    46 min
  6. Jan 13

    Developer Productivity Metrics: DORA, SPACE, and What Really Drives Team Performance with Martijn Goossens

    Martijn Goossens is Director of Advisory Services at Cerios, a Dutch QA company with approximately 450 employees. Martijn has about 20 years of experience helping teams improve their quality and implement test automation, and he is a regular speaker at developer and software quality conferences.

    In this episode, Chris talks with Martijn about developer experience, productivity metrics, and what actually drives team performance. Martijn shares insights from his recent conference talk at Hustef and breaks down the key frameworks teams use to measure their effectiveness. The conversation explores the DORA metrics (deployment frequency, lead time for changes, change failure rate, and mean time to recovery) and the SPACE framework (satisfaction/well-being, performance, activity, communication/collaboration, and efficiency/flow). Martijn explains why he prefers DORA for its practical, quantifiable nature, while SPACE tends to be more subjective and developer-focused.

    Key topics include:
    - The Dutch testing community: why the Netherlands has become a hub for software testing innovation and how strong community connections accelerate professional growth
    - Meeting culture and productivity: the value of no-meeting days, the danger of "Swiss cheese calendars," and how to prepare teams for focused work time
    - Hackathons and innovation: different approaches to fostering creativity, from quarterly hackathons to dedicated innovation time, plus Chris's "hackcation" concept
    - Individual vs. team metrics: why metrics should be treated as sensors that provide information rather than as judgment tools, and the cautionary tale of the "Cobra problem," where rewarding the wrong behaviors leads to perverse outcomes
    - The flight level concept: how management can monitor high-level metrics and only drill down when signals indicate a problem

    Martijn emphasizes that metrics don't tell the whole story: they help you know what questions to ask and whom to ask.
A developer with fewer commits might be the team's primary reviewer or architect, while someone with many commits might just be making small edits. Context matters. The episode wraps up with Martijn's experience speaking at Hustef in Hungary (held in a train museum complete with miniature train rides) and his upcoming keynote in Tokyo.

    44 min
