SEP 15
EPISODE 19
44 MIN

What Is SRE? Site Reliability Engineering Explained | Episode 19

Most companies are doing SRE wrong.

Hiring SREs doesn’t make you reliable. Metrics dashboards don’t guarantee accountability. And cultural change doesn’t happen because you wrote it on a slide deck.

In this episode, Duncan Mapes and Jason Ehmke push back on these misconceptions. They argue that SRE isn’t a bolt-on team but a systemic shift in how engineering works, and that without shared accountability, meaningful metrics, and cultural buy-in, SRE will fail.

And no, copying Google’s model isn’t the answer.

If you think SRE is just a headcount play, this episode will challenge everything you believe. Got a different perspective? Drop us a review, share your comments, and send your toughest SRE questions our way.

Top Takeaways:

SRE is a complex practice that varies across organizations.
Defining SRE upfront can prevent chaos later.
SRE is not just about taking over responsibilities; it's about collaboration.
The role of SREs is to guide and support application teams.
Key metrics for SRE success include mean time to detect (MTTD) and mean time to restore (MTTR).
Cultural transformation is essential for successful SRE implementation.
Finding early wins can help demonstrate the value of SRE.
Effective communication is crucial for SREs to succeed.
SRE teams should focus on toil reduction and automation.
Building a strong relationship between SREs and app teams is vital.

Mentioned in this Episode:
Site Reliability Engineering: How Google Runs Production Systems - https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/

Connect with us:

Duncan Mapes

Jason Ehmke

DevGrid.io

DevGrid on LinkedIn

DevGrid on X

Show: Tech Council
Frequency: Updated weekly
Published: September 15, 2025 at 9:00 AM UTC
Length: 44 min
Episode: 19
Rating: Clean