Mike opens by framing “production incidents” with a vivid non-software story. As a teenager he smashed bathroom tile with a dead-blow hammer, drove his pinky knuckle into a jagged shard, and had to manage both the injury and the panic of his little brother who got sick from seeing it. He uses that as the metaphor for on-call life. Bad things happen, reactions vary, and what you do in the first moments matters, especially staying calm, reassuring others, and focusing on the most urgent next step. The group riffs on modern incident response, starting with humor about “just ask the LLM,” but landing on a real point. AI can be excellent at sifting noisy logs, even if you should not blindly trust it mid-emergency. Dave pivots to the idea that the best loyalty, from customers and coworkers, is earned when something goes wrong and support is excellent. He describes jumping into a long outage call ready to tear apart his own recent work with zero ego, because people remember who shows up with “two tow trucks” when everything’s on fire. Mike and Justin emphasize composure and delegation. If you are overwhelmed, hand off to someone with a cool head. Prioritize restoring service, “stop the bleeding,” before deep root-cause analysis. Invest ahead of time in rollback plans, feature flags, staged rollouts, and observability. From there, they broaden into practical triage and long-term resilience. Verify the issue, look at metrics and dashboards to identify symptoms like CPU, disk, network, traffic spikes, and database issues, and narrow the delta between last-known-good and broken. They discuss how constraints differ in mobile, including App Store review delays, crash loops, and reliance on the user’s device and network. They also cover security incidents, where you need monitoring to detect attacks, plus coordinated mitigation like blocking traffic and working with vendors. They stress the importance of having an incident quarterback, a playbook, and a contact list for after-hours escalation. The close focuses on what comes after the band-aid. Do postmortems and cleanup so temporary fixes do not become permanent donuts. Balance realistic risk planning with business needs. Emphasize strong observability and the ability to recover quickly, alongside prevention, echoing practices like Chaos Monkey and the idea that monitoring prevents historical events from re-happening. Transcript: MIKE: Hello, and welcome to another episode of the Acima Development Podcast. I'm Mike, and I'm hosting again today. We've got a good crew here today, and I'm excited about this one. We've got Kyle Archer, Eddy Lopez. We've got Dave Brady. Hello, Justin Ellis, Thomas Wilcox. We've got Ramses Bateman, and Will Archer. So, I think we've all been here before multiple times [chuckles]. We've got a familiar crew to talk about an important topic that's always fresh because [chuckles] there's a constant need. I was racking my brain what story to tell for this, and I ended up going back to...I don't even remember exactly when it was, but it was somewhere in my late teens, early twenties, in that era. So, admission, that's quite a long time ago [laughs]. That's more than halfway back [laughs]. And I was helping out at my parents' house with some remodeling they were doing. They were tearing out the...they were redoing the bathroom. And so, they were tearing out...they had a wall that had some tile on it, and they were tearing out the tile. And they were going to put some new...I don't even remember. They shifted things around, but they were tearing out the tile. That's the important part. And I had my little brother with me nearby. He was too young to really help. He was, like, six. And, you know, he was just hanging out and chatting with me, and I was taking a...they call it a dead blow hammer. It's a hammer with sand in it, so when you hit, it just stops. So, it's a weighted hammer, but it has a soft landing, so it doesn't have a...it doesn't bounce back, right? It just kind of stops, rather than having a strong bounce. It's good for situations where you want to do that, right, where you don't...you really don't want it bouncing back and hitting you in the face. And I was breaking up the tile wall. Context, there I am with, like, a six-year-old breaking up a tile wall. And there was some wire mesh behind it, and I was gradually peeling back. As I broke it, I was peeling back this wire mesh that was embedded in some sort of mortar. And I was pulling out [inaudible 02:26] the cement behind the tile. And so, as I'm banging, I pull back a piece, you know, pull it back because I'm making some progress, and I swing in. And because that broken tile is now hanging out and mounted on that wire behind, with the, you know, the cement that's holding it together, when I swing with that hammer at full force, right after peeling, you know, an extra layer back, I sunk my knuckle of my pinky finger right into a piece of broken tile. And I go, oh, and I look down. And I look down into my knuckle, maybe five eighths of an inch, a couple of centimeters more than you should be looking down into a knuckle [laughs]. Oh [laughs], that moment, that's not good. And then the blood starts, right? A rather remarkable amount of blood, I'll say [laughs], was coming out of the finger. Remember, there's a six-year-old here in the room with me. And he yells, "Mom, dad, come help Mike. He's really hurt bad." And, of course, they're thinking the worst. I'm like, "No, no, no, no, no, it's okay [laughs]," yelling. But, you know, there's the moment of panic there. And so, I had some choices in that moment, right? What do I do? Luckily, I think I handled it pretty well. I comforted the people around me to let them know this isn't a disaster. I'm going to need to do something, but you don't need to, you know, call 911. Unfortunately...so, we got everything up, went to one of those urgent care places. They stitched me up. I could tell some other weird stories about it there. A few weeks later, I noticed a little white mark on my finger, and I started pulling, and it was a piece of the thread from the gauze that had somehow got stuck in my finger. And I pulled out, like, a foot [laughs] of this string out of my finger, and then it snapped down near the bottom, and some of it zipped back in. I've never seen it again, like, oooh [vocalization] [laughs]. And I still, when I touch my knuckle, I feel weird sensations all the way down the rest of my finger. It's a [inaudible 04:21] impact of that one. But my poor little brother [chuckles], he got sick from seeing it, and he was throwing up and just not okay. And I felt bad, and I had to comfort him, "This is really okay. I get some stitches, and it'll be fine [chuckles]. It will be fine." And [chuckles] I felt really bad because I was not really even thinking about it. I didn't realize that he was not okay. So, when I discovered before I left, like, 10 minutes later, he wasn't okay, you know, I gave him a hug, you know, tried to help him feel like things were okay, get a ride over to the urgent care facility. They stitched me up, and I'm fine. Today, we're going to talk about dealing with production incidents. And I bring up this example because it's outside of software, but it's a production incident, right? You've got the bad things happen, and what do you do? What do you do now? And I think that there's some aspects to that story we can riff on as well as others. But it helps set the stage for a lot of what happens when we have these production incidents and what we do in that moment because it matters a lot. And how some of the reactions, you know, there's a variety of reactions to this moment among the various parties in place that had some better, some worse, you know, impact. So, servers are down, you know, how do you keep cool? Things are on fire. And that's our topic today. And I've got definitely some thoughts on this. I've written down some notes, but, as usual, I don't want to...I've told the story, right? I've laid out the context. So, I am really hoping some of you all will have some initial thoughts to lead out with. EDDY: Sorry, is the answer not ask AI to see what's wrong with your server [inaudible 06:02]? MIKE: [laughs] DAVE: How do you think the server went down? EDDY: I was thinking, is that not the go-to answer now? I'm sorry, podcast over. Ask the LLM. [laughter]. WILL: Not not the answer. DAVE: The AI is going to say, "You are absolutely right to be upset that the server is down." JUSTIN: So, related to that -- WILL: I mean, I'm just saying that's not not the answer. Like, AI is great at reading a log. Like, it took me -- DAVE: Yeah, actually. WILL: Years, if not decades, to get, like, pretty decent at reading log vomit, you know what I mean, like, filtering through the chicken innards that [laughter], you know, a log will, like, throw up all over you and just be like, "Oh yeah, that's actually it." AI is actually super duper at that. I don't trust it, especially in an emergency but, like, do that. Sure. Yes. Do it. EDDY: I was literally pairing with someone, and we were looking at a Grafana log, right? And I'm like, "Oh, it's because of this." And they're like, "Where? Where is that?" And I'm like, "Oh, I read it somewhere here. Hold on, let me find it again." And, like, you get so good at ignoring all the clutter, you know, and just filtering everything. But, oh my God, dude, like, AI can sift through, like, raw JSON, like candy. DAVE: I have a thought to throw out. I have a bunch. I always do. But one of the things that...and this is not really a production thing, well, maybe it is: loyalty. The thing that makes somebody loyal, a customer, in particular, is you get this graph of, like, did they have a good time, or did they have a bad time? And then did they receive good support, or did they receive bad support? And the most vehement haters of a