9 episodes

Engineers have been managing incidents for as long as they’ve been building software. But it’s only in the past few years that incident management has become a specialization in the world of DevOps. Join Robert Ross, an incident-obsessed career responder, as he talks to the folks at the forefront of the movement to better manage, remediate, and learn from incidents. You’ll get stories from the field, great advice, and useful best practices from practitioners managing incidents for companies like Zendesk, Udemy, VMware, and many more. Learn more about the push for a world of Better Incidents.

Better Incidents Podcast
Better Incidents by FireHydrant

    • Technology
    • 5.0 • 1 Rating

    Navigating the SRE Landscape w/ Ricardo Castro

    Join Robert Ross and special guest Ricardo Castro in a dynamic discussion that dives into the world of DevOps and the challenges and career progression of Site Reliability Engineers (SREs). They highlight the ambiguity surrounding the SRE role across different organizations and the difficulty in defining SRE levels. The importance of both technical and communication skills is emphasized, and the hosts address the difficulties in measuring the contribution of SREs, particularly in managing incidents.

    • 36 min
    Alerting, Incident Response, and the SDLC

    In this episode we chat with veteran cloud architect Masaru Hoshi about the challenges of alert fatigue, the importance of effective alerting systems, and fostering ownership in software teams. Masaru shares insights from his 30-year career, emphasizing the need for balance, trust, and collaboration in incident response.

    • 32 min
    Practical Ways to Implement Incident Management with Shannon Schulte

    Engineers have been managing incidents for about as long as they've been building software, but it's only in the past few years that incident management has become a primary focus for software teams. Today I'm talking to Shannon Schulte, an engineering manager of incident response, about practical ways to implement incident management.

    • 28 min
    Focus on Assembly Time with Great Circle's Brent Chapman

    When it comes to resolving an incident there are a number of metrics that can be misleading. Resolution time, for example, can fluctuate wildly. However, there’s one that we have a significant amount of influence over.

    Today, I’m talking to Brent Chapman, Founder at Great Circle, about why engineering teams should ditch metrics like MTTR and instead focus on what we can control: assembly time.

    Brent's Information:

    Website: https://greatcircle.com/

    LinkedIn: https://www.linkedin.com/in/brentchapman/

    Twitter: https://twitter.com/brent_chapman

    WW2 plane improvements

    Book - The Checklist Manifesto

    https://slack.com/events/resolve-incidents-faster-in-slack

    https://slack.com/blog/collaboration/engineers-netflix-pagerduty-slack

    https://slack.com/resources/using-slack/the-modern-incident-response

    https://slack.com/resources/using-slack/slack-for-incident-management

    https://slack.com/blog/transformation/incident-management-slack

    https://slack.com/intl/en-in/events/minimize-incident-response-times

    • 49 min
    The Importance of Retros with CrowdStrike's Chad Todd

    After the dust of an incident settles, it's normal for us to want to move on and get back to less stressful work. But doing so would skip an essential part of the incident management process: the retro.

    Today, I’m talking to Chad Todd, Site Reliability Manager at CrowdStrike, about the importance of retros in avoiding what he calls “incident amnesia.”

    • 36 min
    The hidden costs of incident management with Cowbell's MRZ

    We all understand that incidents cause a loss in revenue, but the camouflaged costs of incidents can cause more damage than the immediate impact to revenue. What does the itemized receipt of an incident really look like? 

    In this episode of the Better Incidents podcast, we talk with MRZ, Sr. Director of Production Engineering at Cowbell, about the hidden costs of incidents and a concept he uses called Mean Time to Clue, or my preferred version, Mean Time to WTF?

    • 36 min
