Better Incidents Podcast by FireHydrant
-
- Technology
Engineers have been managing incidents for as long as they’ve been building software. But it’s only in the past few years that incident management has become a specialization in the world of DevOps. Join Robert Ross, an incident-obsessed career responder, as he talks to the folks at the forefront of the movement to better manage, remediate, and learn from incidents. You’ll get stories from the field, great advice, and useful best practices from practitioners managing incidents for companies like Zendesk, Udemy, VMware, and many more. Learn more about the push for a world of Better Incidents.
-
Navigating the SRE Landscape w/ Ricardo Castro
Join Robert Ross and special guest Ricardo Castro in a dynamic discussion that dives into the world of DevOps and the challenges and career progression of Site Reliability Engineers (SREs). They highlight the ambiguity surrounding the SRE role across different organizations and the difficulty in defining SRE levels. The importance of both technical and communication skills is emphasized, and the hosts address the difficulties in measuring the contribution of SREs, particularly in managing incidents.
-
Alerting, Incident Response, and the SDLC
In this episode, we chat with veteran cloud architect Masaru Hoshi about the challenges of alert fatigue, the importance of effective alerting systems, and fostering ownership in software teams. Masaru shares insights from his 30-year career, emphasizing the need for balance, trust, and collaboration in incident response.
-
Practical Ways to Implement Incident Management with Shannon Schulte
Engineers have been managing incidents for about as long as they've been building software, but it's only in the past few years that incident management has become a primary focus for software teams. Today I'm talking to Shannon Schulte, an engineering manager of incident response, about practical ways to implement incident management.
-
Focus on Assembly Time with Great Circle's Brent Chapman
When it comes to resolving an incident, a number of metrics can be misleading. Resolution time, for example, can fluctuate wildly. However, there’s one metric that we have a significant amount of influence over.
Today, I’m talking to Brent Chapman, Founder at Great Circle, about why engineering teams should ditch metrics like MTTR and instead focus on what we can control: assembly time.
Brent's Information:
Website: https://greatcircle.com/
LinkedIn: https://www.linkedin.com/in/brentchapman/
Twitter: https://twitter.com/brent_chapman
WW2 plane improvements
Book - The Checklist Manifesto
https://slack.com/events/resolve-incidents-faster-in-slack
https://slack.com/blog/collaboration/engineers-netflix-pagerduty-slack
https://slack.com/resources/using-slack/the-modern-incident-response
https://slack.com/resources/using-slack/slack-for-incident-management
https://slack.com/blog/transformation/incident-management-slack
https://slack.com/intl/en-in/events/minimize-incident-response-times
-
The Importance of Retros with CrowdStrike's Chad Todd
After the dust of an incident settles, it's normal for us to want to move on and get back to less stressful work. But doing so would skip an essential part of the incident management process: the retro.
Today, I’m talking to Chad Todd, Site Reliability Manager at CrowdStrike, about the importance of retros to avoid what he calls “incident amnesia”.
-
The hidden costs of incident management with Cowbell's MRZ
We all understand that incidents cause a loss in revenue, but the hidden costs of incidents can do more damage than the immediate impact on revenue. What does the itemized receipt of an incident really look like?
In this episode of the Better Incidents podcast, we talk with MRZ, Sr. Director of Production Engineering at Cowbell, about the hidden costs of incidents and a concept he uses called Mean Time to Clue, or my preferred version, Mean Time to WTF?