9 episodes

Engineers have been managing incidents for as long as they’ve been building software. But it’s only in the past few years that incident management has become a specialization in the world of DevOps. Join Robert Ross, an incident-obsessed career responder, as he talks to the folks at the forefront of the movement to better manage, remediate, and learn from incidents. You’ll get stories from the field, great advice, and useful best practices from practitioners managing incidents for companies like Zendesk, Udemy, VMware, and many more. Learn more about the push for a world of Better Incidents.

Better Incidents Podcast by FireHydrant

- Technology
- 5.0 • 1 Rating

- OCT 27, 2023
Navigating the SRE Landscape w/ Ricardo Castro

Join Robert Ross and special guest Ricardo Castro in a dynamic discussion that dives into the world of DevOps and the challenges and career progression of Site Reliability Engineers (SREs). They highlight the ambiguity surrounding the SRE role across different organizations and the difficulty in defining SRE levels. The importance of both technical and communication skills is emphasized, and the hosts address the difficulties in measuring the contribution of SREs, particularly in managing incidents.
- 36 min
- OCT 5, 2023
Alerting, Incident Response, and the SDLC

In this episode, we chat with veteran cloud architect Masaru Hoshi about the challenges of alert fatigue, the importance of effective alerting systems, and fostering ownership in software teams. Masaru shares insights from his 30-year career, emphasizing the need for balance, trust, and collaboration in incident response.
- 32 min
- AUG 18, 2023
Practical Ways to Implement Incident Management with Shannon Schulte

Engineers have been managing incidents for about as long as they've been building software, but it's only in the past few years that incident management has become a primary focus for software teams. Today I'm talking to Shannon Schulte, an engineering manager of incident response, about practical ways to implement incident management.
- 28 min
- MAY 18, 2023
Focus on Assembly Time with Great Circle's Brent Chapman

When it comes to resolving an incident, a number of metrics can be misleading. Resolution time, for example, can fluctuate wildly. However, there’s one metric that we have a significant amount of influence over.

Today, I’m talking to Brent Chapman, Founder at Great Circle, about how engineering teams should ditch metrics like MTTR and instead focus on what we can control: assembly time.

Brent's Information:

Website: https://greatcircle.com/

LinkedIn: https://www.linkedin.com/in/brentchapman/

Twitter: https://twitter.com/brent_chapman

WW2 plane improvements

Book - The Checklist Manifesto

https://slack.com/events/resolve-incidents-faster-in-slack

https://slack.com/blog/collaboration/engineers-netflix-pagerduty-slack

https://slack.com/resources/using-slack/the-modern-incident-response

https://slack.com/resources/using-slack/slack-for-incident-management

https://slack.com/blog/transformation/incident-management-slack

https://slack.com/intl/en-in/events/minimize-incident-response-times
- 49 min
- MAY 9, 2023
The Importance of Retros with CrowdStrike's Chad Todd

After the dust of an incident settles, it's normal for us to want to move on and get back to less stressful work. But doing so would skip an essential part of the incident management process: the retro.

Today, I’m talking to Chad Todd, Site Reliability Manager at CrowdStrike, about the importance of retros to avoid what he calls “incident amnesia.”
- 36 min
- APR 10, 2023
The hidden costs of incident management with Cowbell's MRZ

We all understand that incidents cause a loss in revenue, but the camouflaged costs of incidents can cause more damage than the immediate impact to revenue. What does the itemized receipt of an incident really look like?

In this episode of the Better Incidents podcast, we talk with MRZ, Sr. Director of Production Engineering at Cowbell, about the hidden costs of incidents and a concept he uses called Mean Time to Clue or, in my preferred version, Mean Time to WTF.
- 36 min