Tech Council

What Is SRE? Site Reliability Engineering Explained | Episode 19

Most companies are doing SRE wrong.

Hiring SREs doesn’t make you reliable. Metrics dashboards don’t guarantee accountability. And cultural change doesn’t happen because you wrote it on a slide deck.

In this episode, Duncan Mapes and Jason Ehmke push back against these misconceptions. They argue that SRE isn’t a bolt-on team but a systemic shift in how engineering works. Without shared accountability, meaningful metrics, and cultural buy-in, SRE will fail.

And no, copying Google’s model isn’t the answer.

If you think SRE is just a headcount play, this episode will challenge everything you believe. Got a different perspective? Drop us a review, share your comments, and send your toughest SRE questions our way.

Top Takeaways:

  • SRE is a complex practice that varies across organizations.
  • Defining SRE upfront can prevent chaos later.
  • SRE is not just about taking over responsibilities; it’s about collaboration.
  • The role of SREs is to guide and support application teams.
  • Key metrics for SRE success include mean time to detect (MTTD) and mean time to restore (MTTR).
  • Cultural transformation is essential for successful SRE implementation.
  • Finding early wins can help demonstrate the value of SRE.
  • Effective communication is crucial for SREs to succeed.
  • SRE teams should focus on toil reduction and automation.
  • Building a strong relationship between SREs and app teams is vital.
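
For listeners who want to see what the detect and restore metrics from the takeaways look like in practice, here is a minimal Python sketch. The incident records, field names, and the `mean_minutes` helper are hypothetical and not from the episode. It also assumes both metrics are measured from the moment the fault began; some teams measure time to restore from detection instead.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log: when each fault began, when it was detected,
# and when service was restored. Real data would come from an incident tracker.
incidents = [
    {"started": datetime(2024, 1, 5, 10, 0),
     "detected": datetime(2024, 1, 5, 10, 12),
     "restored": datetime(2024, 1, 5, 11, 0)},
    {"started": datetime(2024, 2, 9, 14, 0),
     "detected": datetime(2024, 2, 9, 14, 4),
     "restored": datetime(2024, 2, 9, 14, 30)},
    {"started": datetime(2024, 3, 2, 9, 0),
     "detected": datetime(2024, 3, 2, 9, 8),
     "restored": datetime(2024, 3, 2, 10, 30)},
]


def mean_minutes(records, start_key, end_key):
    """Average gap in minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# Assumption: both metrics are measured from the start of the fault.
mttd = mean_minutes(incidents, "started", "detected")
mttr = mean_minutes(incidents, "started", "restored")

print(f"MTTD: {mttd:.1f} minutes")
print(f"MTTR: {mttr:.1f} minutes")
```

Tracking these two numbers over time, rather than reading them as a one-off snapshot, is one way to show whether an SRE effort is producing the early wins mentioned above.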

Mentioned in this Episode:
Site Reliability Engineering: How Google Runs Production Systems - https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/

Connect with us:

Duncan Mapes

Jason Ehmke

DevGrid.io

DevGrid on LinkedIn

DevGrid on X