13 episodes

Tom and Jamie talk through the postmortems of outages that have affected high-profile sites.

The Downtime Project Tom Kleinpeter and Jamie Turner

    • Technology
    • 5.0 • 22 Ratings

    7 Lessons From 10 Outages

    After 10 post-mortems in their first season, Tom and Jamie reflect on the common issues they’ve seen. Click through for details! Summing Up Downtime: We’re just about through our inaugural season of The Downtime Project podcast, and to celebrate, we’re reflecting back on recurring themes we’ve noticed in many of the ten outages we’ve poured […]

    • 46 min
    Salesforce Publishes a Controversial Postmortem (and breaks their DNS)

    On May 11, 2021, Salesforce had a multi-hour outage that affected numerous services. Their public writeup was somewhat controversial: it’s the first one we’ve done on this show that called out the actions of a single individual in a negative light. The latest SRE Weekly has a good list of some different articles […]

    • 40 min
    Kinesis Hits the Thread Limit

    During a routine addition of some servers to the Kinesis front-end cluster in US-East-1 in November 2020, AWS ran into an OS limit on the max number of threads. That resulted in a multi-hour outage that affected a number of other AWS services, including ECS, EKS, Cognito, and CloudWatch. We probably won’t do […]

    • 44 min
    How Coinbase Unleashed a Thundering Herd

    In November 2020, Coinbase had a problem while rotating their internal TLS certificates and accidentally unleashed a huge amount of traffic on some internal services. This was a refreshingly non-database-related incident that led to an interesting discussion about the future of infrastructure as code, the limits of human code review, and how many load […]

    • 38 min
    Auth0’s Seriously Congested Database

    Just one day after we released Episode 5 about Auth0’s 2018 outage, Auth0 suffered a 4-hour, 20-minute outage that was caused by a combination of several large queries and a series of database cache misses. This was a very serious outage, as many users were unable to log in to sites across the […]

    • 1 min
    Talkin’ Testing with Sujay Jayakar

    Tom was feeling under the weather after joining Team Pfizer last week, so today we have a special guest episode with Sujay Jayakar, Jamie’s co-founder and engineer extraordinaire. While it’s great to respond well to an outage, it’s even better to design and test systems in such a way that outages don’t happen. As we […]

    • 29 min

Customer Reviews

5.0 out of 5
22 Ratings

Conderoga

Excellent systems advice in every episode!

I think y’all are doing a fantastic job of thoroughly digging into these outages while keeping all of the feedback purely constructive. I’ve been highly recommending this to all of my coworkers since discovering it. Keep up the great work!

ayyjohnsonly

New favorite podcast

As a relatively junior software engineer, there are a ton of “common sense” things that Tom and Jamie review for outages that I’ve never heard about before and that are low-hanging fruit for most companies. This podcast is a must-listen for software engineers everywhere.

gabriel from pittsburgh

Helps me learn about system design

I’ve been learning about system design concepts from The System Design Primer and other resources. This podcast makes for a good pairing; I find it instructive to learn how different components can break down and lead to an outage. It is easier to wrap my head around the components when I have this context.
