Making of the SRE Omelette

Kevin Yu - Principal SRE, IBM Software

Wondering what an omelette has to do with SRE (Site Reliability Engineering)? The name rests on an analogy: culture is the outcome of what we do, so in the chicken-or-the-egg question, culture is the omelette. That is how this podcast was born: Making of the SRE Omelette. The show explores how the practice of SRE can help organizations achieve positive business and client success outcomes.

Season 1 focused on culture, because reliability isn’t just about systems, it’s about people. Season 2 explored Reliable Sustainability: how SRE practices can help organizations deliver reliability today while building a sustainable future. Season 3 is all about Resilience: not just resilient systems, but resilient teams and resilient ways of working, because resilience is what helps us adapt, respond, and thrive through adversity.

Much like an avid chef who brings their own flair to a classic recipe, the art of SRE lies in thoughtfully adapting industry best practices to fit the distinct culture, needs, and goals of your team and organization. This podcast is designed to inspire thoughtful experimentation and encourage personalization, empowering you to forge your own path toward greater reliability, resilience, and long-term success. Join us on this journey to surface the ingredients that help drive business and client success through SRE.

  1. 3 days ago

    Episode 1 - Resilience Enablement

    Season 3 of Making of the SRE Omelette is here, and it’s all about resilience. Resilience isn’t just about surviving outages. It’s about building systems and cultures that adapt, learn, and thrive under pressure. In our kickoff episode, we sit down with Dr. Jennifer Petoff, co-editor of Site Reliability Engineering: How Google Runs Production Systems and leader of Google’s Global SRE Education. Jennifer shares why resilience starts with people, not just technology, and how psychological safety and confidence are the secret ingredients for reliability at scale.

    You’ll learn:
    * How to scale learning like a production system
    * Why postmortem culture drives improvement
    * How to apply SRE principles beyond infrastructure

    If you’ve ever wondered how to make reliability a business advantage, this episode is for you.

    Check out How to SRE Anything here: https://www.reliablepgm.com/how-to-sre-anything/

    Topics:
    * Origins of SRE and Education at Google: How Google scaled SRE education globally. Why education is treated like a production system (repeatable, reliable, measurable).
    * Psychological Safety and Learning: Why psychological safety is critical for resilience. Creating environments where teams can share mistakes without fear of blame. How this accelerates learning and reliability.
    * Hands-On Experience as a Learning Model: Importance of experiential learning (e.g., game days, simulations). Why theory alone isn’t enough for building confidence under pressure.
    * Scaling Knowledge Across Large Organizations: Strategies Google uses to scale SRE principles globally. Balancing standardization with flexibility for local teams.
    * Resilience Beyond Reliability: How resilience differs from reliability. Building adaptive systems and teams that thrive through adversity.
    * Culture as a Foundation: Why culture is the “secret ingredient” for successful SRE adoption. Encouraging curiosity and collaboration across roles.
    * Future of SRE Education: Trends in learning for distributed teams. How continuous education supports evolving reliability practices.

    42 min
  2. 3 days ago · Video

    Episode 1 - Resilience Enablement (video version)

    42 min
