٠٦‏/٠٨‏/٢٠٢٥

Uptime Labs and the Multi-Party Dilemma (Part II)

Watch on YouTube In Part II of the Multi-Party Dilemma (MPD) drill retrospective, we reconvene to dig deeper into the implications and nuances of the simulated incident exercise hosted on the Uptime Labs platform. Eric Dobbs (incident analyst), Alex Elman (deputy IC), and Sarah Butt (incident commander) continue their debrief with Courtney, reflecting on how team behavior evolved under stress, the importance of expertise in managing non-technical aspects of an incident like saturation, and how deeply held assumptions often go unspoken until tested under pressure. This episode emphasizes the complex social and cognitive dimensions of incident response, such as how people coordinate, communicate, and construct shared understanding. It highlights the value of analyzing drills not for failure points, but for what they reveal about real work, adaptation, and human coordination. Key Highlights Incident Analysis as a Practice:Eric Dobbs emphasized understanding how people make sense of unfolding events, rather than judging decisions in hindsight.The goal is to study the “why it made sense at the time,” not what was “right” or “wrong.”Drills Expose Hidden Assumptions:Even experienced responders bring unspoken mental models into incidents.The drill revealed assumptions about communication flows, authority boundaries, and vendor interactions that were not made explicit in planning.The Value of Human Expertise:Everyone involved in this incident brought an unparalleled level of expertise to the work. Often this kind of expertise goes unnoticed or is taken for granted, however this kind of knowledge is precisely what makes for smoother, better coordinated (and sometimes), faster incident response.Importance of Framing:The way questions are asked in retrospectives can shape what is revealed—e.g., “What made that hard?” is more productive than “What did you miss?”Reframing incidents around constraints and tradeoffs leads to deeper insight.Team Learning and Culture:Safe, high-trust environments enable better learning during drills.Psychological safety allows team members to admit confusion or raise alternate interpretations during real incidents.Resources and References Episode IModel of Overload/Saturation as part of the Theory of Graceful ExtensibilityLorin's Law

٥٦ من الدقائق

٢٩‏/٠٧‏/٢٠٢٥

Uptime Labs and the Multi-Party Dilemma (Part I)

Watch on YouTube In this episode I'm joined by a group of seasoned incident response professionals to discuss a simulated incident drill conducted on the Uptime Labs platform. The conversation centers around the Multi-party Dilemma—the challenge of coordinating incident response across teams or organizations with different missions, contexts, or incentives. Eric Dobbs, our incident analyst, joins to break down the drill and provide deep insights into the incident dynamics, team interactions, and what true incident analysis looks like when it's done well. Participants Alex Elman and Sarah Butt, who served as deputy and lead incident commanders respectively during the drill, recount their roles and experiences, highlighting realistic stress responses, decision-making, and coordination failures and successes. Hamed Silatani, CEO of Uptime Labs, provides context and insights into the behind-the-scenes work he and his team provide as the other "characters" driving the narrative of the drill. The episode uniquely showcases the value of structured incident analysis and the benefits of using drills to expose hidden assumptions and improve resilience in complex systems. A few key highlights include: How detailed incident analysis leads to an understanding of the context and rationale behind responders' actions, rather than identifying errors or assigning blame.The real goal is to learn how the system and people actually function, not just fix a broken component.Themes revealed by the analysis and subsequent discussionSaturation and the value of trust in delegation (especially between Sarah and Alex).The role of deep expertise and how it often makes work appear effortless.Importance of recognizing the real work done during incidents—often messy and improvisational.References/Resources What Experts See That the Rest of Us Miss During IncidentsIncident Fest (Uptime Labs event)Law of Fluency Handling the Multi-Party Dilemma (Sarah & Alex paper)Embracing the Multi-Party Dilemma (Sarah & Alex conference talk)

٤٨ من الدقائق

١٤‏/٠٥‏/٢٠٢٥

Canva and the Thundering Herd

Greetings fellow incident nerds, and welcome to Season 2 of The VOID podcast. The main new thing for this new season is we’re now available in video—so if you’re listening to this and prefer watching me make odd faces and nod a lot, you can find us here on YouTube. The other new thing is we now have sponsors! These folks help make this podcast possible, but they don’t have any say over who joins us or what we talk about, so fear not. This episode’s sponsor is Uptime Labs. Uptime Labs is a pioneering platform specializing in immersive incident response training. Their solution helps technical teams build confidence and expertise through realistic simulations that mirror real-world outages and security incidents. When most of investment these days in the incident space goes to technology and process, Uptime Labs focuses on sharpening the human element of incident response. In this episode, we talk to Simon Newton, Head of Platforms at Canva, about their first public incident report. It’s not their first incident by any means, but it’s the first time they chose as a company to invest in sharing the details of an incident with the rest of us, which of course we’re big fans of here at the VOID. We discuss: What led to Canva finally deciding to publish a public incident reportWhat the size and nature of their incident response looks like (this incident involved around 20 different people!)Their progression from a handful of engineers handling incidents to having a dedicated Incident Command (IC) roleAvoiding blame when a known performance fix was ready to be deployed but hadn't yet, which contributed to the incident getting worse as it progressedThe various ways the people involved in the incident collaborated and improvised to resolve it

٣٧ من الدقائق

٢٨‏/٠٢‏/٢٠٢٥

Episode 8: A Tale of A Near Miss

On this episode of the VOID podcast, I’m joined by Nick Travaglini, who is a Technical Customer Success Manager at Honeycomb. Nick wrote up a near miss that his team tackled towards the end of 2023, and I’ve been really wanting to discuss a near miss incident report for a very long time. What’s a Near Miss you might ask, or how is that an incident, or is it? What IS an incident? Keep listening, because we’re going to get into those questions, along with discussing whether or not it’s a good idea to say nasty things about other companies in your incident reports. Related Resources Preempting Problems in a Sociotechnical System (the incident report)Work as Imagined vs Work as DoneResilience in Software FoundationOn the Mode of Existence of Technical ObjectsHitting the Brakes2024 VOID Report

٣٦ من الدقائق

٣٠‏/٠١‏/٢٠٢٥

Episode 7: When Uptime Met Downtime

We took a bit of a hiatus from recording last year, but we're back with an episode that I think everyone is really going to enjoy. Late last year, John Allspaw told me about this new company called Uptime Labs. They simulate software incidents, giving people a safe and constructive environment in which to experience incidents, practice what response is like, and bring what they learn back to their own organizations. For the record, this is not a sponsored podcast. I legitimately just love what they do. And I had the sincere privilege to meet Uptime's cofounder and CEO, Hamed Silatani at SRECon EMEA in November, where he gave a fantastic talk about some of the things they've learned about incident response for running hundreds of simulations for their customers. They recently had their first serious outage of their own platform. And so Hamed is joined by Joe McEvitt, cofounder and director of engineering at Uptime to discuss with me the one time that Uptime met downtime.

٥٢ من الدقائق

٢٥‏/٠٤‏/٢٠٢٣

Episode 6: Laura Nolan and Control Pain

In the second episode of the VOID podcast, Courtney Wang, an SRE at Reddit, said that he was inspired to start writing more in-depth narrative incident reports after reading the write-up of the Slack January 4th, 2021 outage. That incident report, along with many other excellent ones, was penned by Laura Nolan and I've been trying to get her on this podcast since I started it. So, this is a very exciting episode for me. And for you all, it's going to be a bit different because instead of just discussing a single incident that Laura has written about, we get to lean on and learn from her accumulated knowledge doing this for quite a few organizations. And she's come with opinions. A fun fact about this episode, I was going to title it "Laura Nolan and Control Plane Incidents," but the automated transcription service that I use, which is typically pretty spot on (thanks, Descript!), kept changing "plane" to "pain" and well, you're about to find out just how ironic that actually is... We discussed: A set of incidents she's been involved with that featured some form of control plane or automation as a contributing factor to the incident.What we can learn from fields of study like Resilience Engineering, such as the notion of Joint Cognitive SystemsOther notable incidents that have similar factorsWays that we can better factor in human-computer collaboration in tooling to help make our lives easier when it comes to handling incidentsReferences: Slack's Outage on Jan 4th 2021 A Terrible, Horrible, No-Good, Very Bad Day at Slack Google's "satpocalypse" Meta (Facebook) outage Reddit Pi-day outage Ironies of Automation (Lissane Bainbridge)

٢٨ من الدقائق

١٤‏/٠٢‏/٢٠٢٣

Episode 5: Incident.io and The First Big Incident

What happens when you use your own incident management software to manage your own incidents but said incident takes out your own incident management product? Tune in to find out... We chat with engineer Lawrence Jones about: How their product is designed, and how that both contributed to, but also helped them quickly resolve, the incidentThe role that organizational scaling (hiring lots of folks quickly) can play in making incident response challengingWhat happens when reality doesn't line up with your assumptions about how your system(s) worksThe importance of taking a step back and making sure the team is taking care of each other when you can get a break in the urgency of an incident

٣٢ من الدقائق

١٢‏/٠١‏/٢٠٢٣

Episode 4: Emily Ruppe and The Inaugural LFI Conference

In this episode we take a delightful detour from our usual VOID programming to have Emily Ruppe, a Solutions Engineer at Jeli.io and member of the Learning From Incidents (LFI) community, on the program to discuss the upcoming LFI Conference happening in Denver in February. Find out more about the goals and some of the featured speakers for the event, and we hope to see you there! Discussed in this episode: Jeli.io Learning From Incidents The LFI Conference (Feb 15-16, 2023 in Denver, CO)

١٢ من الدقائق

٢٠‏/١٠‏/٢٠٢٢

Episode 3: Spotify and A Year of Incidents

If you or anyone you know has listened to Spotify, you're likely familiar with their year end Wrapped tradition. You get a viral, shareable little summary of your favorite songs, albums and artists from the year. In this episode, I chat with Clint Byrum, an engineer whose team helps keep Spotify for Artists running, which in turn keeps well, Spotify running. Each year, the team looks back at the incidents they've had in their own form of Wrapped. They tested hypotheses with incident data that they've collected, found some interesting results and patterns, and helped push their team and larger organization to better understand what they can learn from incidents and how they can make their systems better support artists on their platform. We discussed: Metrics, both good and badMoving away from MTTR after they found it to be unreliableHow incident analysis is akin to archeologyGetting managers/executives interested in incident reviewsThe value of studying near misses along with actual incidents

٣٢ من الدقائق

٠١‏/١٢‏/٢٠٢١

Episode 2: Reddit and the Gamestop Shenanigans

At the end of January, 2021, a group of Reddit users organized what's called a "short squeeze." They intended to wreak havoc on hedge funds that were shorting the stock of a struggling brick and mortar game retailer called GameStop. They were coordinating to buy more stock in the company and drive its price further up. In large part, they were successful—at least for a little while. One hedge fund lost somewhere around $2 billion and one Reddit user purportedly made off with around $13 million. Things managed to get even weirder from there, when online trading company Robinhood restricted trading for GameStop shares and sent its values plummeting losing three fourths of its value in just over an hour. But that's less relevant to this episode. What matters is that while all this was happening, traffic to a very specific page on Reddit, called a subreddit, r/wallstreetbets went to the moon. Long after the dust had settled, and the team had a chance to recover and reflect, some of the engineers wrote up an anthology of reports based on the numerous incidents they had that week. We talk to Courtney Wang, Garrett Hoffman, and Fran Garcia about those incidents, and their write-ups, in this episode. A few of the things we discussed include: The precarious dynamic where business successes (traffic surges based on cultural whims) are hard to predict, and can hit their systems in wild and surprising ways.How incidents like these have multiple contributing factors, not all of which are purely technicalHow much they learned about their company's processes, assumptions, organizational boundaries, and other "non-technical" factorsHow people are the source of resilience in these complex sociotechnical systemsCreating psychologically safe environments for people who respond to incidentsTheir motivation for investing so much time and energy into analyzing, writing, and publishing these incident reviewsWhat studying near misses illuminated for them about how their systems work Resources mentioned in this episode include: Reddit's r/wallstreetsbets incident anthology, which links to all the reports we discuss."Work as imagined and work as done" by Steven Shorrock (video)

٤٤ من الدقائق

٠١‏/١١‏/٢٠٢١

Episode 1: Honeycomb and the Kafka Migration

"We no longer felt confident about what the exact operational boundaries of our cluster were supposed to be." In early 2021, observability company Honeycomb dealt with a series of outages related to their Kafka architectural migration, culminating in a 12-hour incident, which is an extremely long outage for the company. In this episode, we chat with two engineers involved in these incidents, Liz Fong-Jones and Fred Hebert, about the backstory that is summarized in this meta-analysis they published in May. We cover a wide range of topics beyond the specific technical details of the incident (which we also discuss), including: Complex socio-technical systems and the kinds of failures that can happen in them (they're always surprises)Transparency and the benefits of companies sharing these outage reportsSafety margins, performance envelopes, and the role of expertise in developing a sense for themHoneycomb's incident response philosophy and processThe cognitive costs of responding to incidentsWhat we can (and can't) learn from incident reportsResources mentioned in the episode: Kafka Migration and Lessons Learned by HoneycombManaging the Hidden Costs of Coordination by Laura McGuireAbove the Line, Below the Line by Richard Cook"Those found responsible have been sacked": Some observations on the usefulness of error by Richard Cook and Christopher P. Nemeth Published in partnership with Indeed.

٣٢ من الدقائق

The VOID

الحلقات

Uptime Labs and the Multi-Party Dilemma (Part II)

Uptime Labs and the Multi-Party Dilemma (Part I)

Canva and the Thundering Herd

Episode 8: A Tale of A Near Miss

Episode 7: When Uptime Met Downtime

Episode 6: Laura Nolan and Control Pain

Episode 5: Incident.io and The First Big Incident

Episode 4: Emily Ruppe and The Inaugural LFI Conference

Episode 3: Spotify and A Year of Incidents

Episode 2: Reddit and the Gamestop Shenanigans

Episode 1: Honeycomb and the Kafka Migration

حول

المعلومات