45 episodes

Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more.

read.srepath.com

Reliability Enablers Ash Patel & Sebastian Vietz

- Technology
- 5.0 • 2 Ratings

- MAY 30, 2024
#44 - Making SLOs Matter to Stakeholders


A bonus episode on SLOs, because Sebastian and I felt that we had not covered the why of SLOs or how to make them relevant to stakeholders.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
- 20 min
- MAY 28, 2024
#43 - SLOs: a Deeper Dive into its Mechanics


This episode continues our coverage of Chapter 4 of the Site Reliability Engineering book (2016). In this second part, we take a deeper dive into the mechanics of SLOs.
Here are 5 takeaways from the show:
* Start Small with SLOs: Begin with a limited number of SLOs and iteratively refine them based on experience and feedback. Avoid overwhelming teams with too many objectives at once.
* Defend and Enforce SLOs: Ensure that selected SLOs have real consequences attached to them. If conversations about priorities cannot be influenced by SLOs, reconsider their relevance and enforceability.
* Continuous Improvement: Embrace the idea that SLOs are not static targets but evolve over time. Start with loose targets and refine them as you learn more about the system's behavior. Commit to ongoing maintenance and improvement of SLOs for long-term success.
* Effective Communication Skills: Recognize the importance of effective communication, especially for technology professionals. Develop the ability to translate technical concepts into plain language that stakeholders can understand and appreciate.
* Understanding User Needs: Prioritize understanding and aligning with the expectations of users/customers when defining service level objectives (SLOs) and metrics. User feedback should guide the selection of meaningful SLOs.

- 31 min
- MAY 21, 2024
#42 - Hitting Software SLA Targets through SLOs and SLIs


In this first part of our two-part coverage, Sebastian Vietz and I work out how to meet SLAs through SLOs and SLIs. This episode covers Chapter 4 of the Site Reliability Engineering book (2016).
Here are 7 takeaways from the show:
* Involve Technical Stakeholders Early: Ensure that technical stakeholders, such as SREs, are involved in discussions about SLAs and SLOs from the beginning. Their expertise can help ensure that objectives are feasible and aligned with the technical capabilities of the service.
* Differentiate Between SLAs and SLOs: Understand the distinction between SLAs, which are legal contracts, and SLOs, which are based on customer expectations. Avoid using SLAs as a substitute for meaningful service level objectives.
* Prioritize Meaningful Metrics: Focus on a select few service level indicators (SLIs) that truly reflect what users want from the system. Avoid the temptation to monitor everything and instead choose indicators that provide valuable insights into service performance.
* Align with Customer Expectations: Start by understanding and prioritizing the expectations of your customers. Use their feedback to define service level objectives (SLOs) that align with their needs and preferences.
* Avoid Alert Fatigue: Be mindful of the number of metrics being monitored and the associated alerts. Too many indicators can lead to alert fatigue and make it difficult to prioritize and respond to issues effectively. Focus on a few key indicators that matter most.
* Start Top-Down with SLIs: Take a top-down approach to defining SLIs, starting with customer expectations and working downwards. This ensures that the selected metrics are meaningful and relevant to users' needs.
* Prepare for Deep Dives: Anticipate the need for deeper exploration of specific topics, such as SLOs, and allocate time and resources to thoroughly understand and implement them in your work.
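
To make the SLI, SLO, and SLA distinction from the takeaways above concrete, here is a minimal Python sketch. It is an illustration only, not something taken from the episode: the request counts, the 99.9% SLO, and the 99.5% SLA are hypothetical values, and the helper name `availability_sli` is made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one measurement window (hypothetical data)."""
    good_requests: int   # requests that met the user-facing success criterion
    total_requests: int  # all valid requests seen in the window


def availability_sli(window: Window) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if window.total_requests == 0:
        return 1.0  # no traffic in the window, so nothing failed
    return window.good_requests / window.total_requests


# Assumed targets for illustration. The SLO is the internal objective derived
# from user expectations; the SLA is the looser, contractual commitment.
SLO_TARGET = 0.999
SLA_TARGET = 0.995

window = Window(good_requests=998_200, total_requests=1_000_000)
sli = availability_sli(window)

slo_met = sli >= SLO_TARGET
sla_met = sli >= SLA_TARGET

print(f"SLI (measured availability): {sli:.2%}")
print(f"SLO met: {slo_met}")
print(f"SLA met: {sla_met}")
```

In this made-up window the service misses its SLO while still honoring its SLA, which is the kind of early warning a stricter internal objective is meant to give.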

- 29 min
- MAY 14, 2024
#41 Curbing High Observability Costs


No one wants to get Coinbase’s $65 million observability bill in the future. Sure, observability comes with a necessary cost. But that cost cannot exceed its concrete and perceived value on balance sheets and in the minds of leaders.
Sofia Fosdick shares practical insights on curbing high observability costs. She’s a senior account executive at Honeycomb.io and has held similar titles at Turbonomic, Dynatrace, and Grafana. As always, this is not a sponsored episode!
We tackled the cost issue by covering ideas like aligning cost with value, event-based systems, and dynamic sampling. You will not want to miss this conversation if your observability bill is starting to look dangerous.
You can connect with Sofia via LinkedIn

- 24 min
- MAY 7, 2024
#40 How to Enable Observability for Success


Observability is more than a set of technologies. It’s a practice. Timothy Mahoney is no stranger to this practice, having helped many developer teams adopt better observability practices.
He’s a senior systems engineer at IKEA and is part of its observability enabling team.
Tim highlighted the importance of developing and driving frameworks for observability. He also covered the antipattern of teams having a tool-driven mindset and the challenge of shifting them out of it.
You can connect with Timothy via LinkedIn

- 27 min
- APR 30, 2024
#39 How Chaos Engineering Helps Reduce Incident Risk


Chaos Engineering is no longer a nice-to-have, as Ananth Movva explains in this episode of the SREpath podcast. In his experience, it reduced both the number and the severity of serious incidents and outages.
He’s been at the helm of reliability-focused decision-making at one of Canada’s largest banks, BMO, since 2020. Having completed 12 years at the bank, Ananth has seen banking technology evolve from archaic to user-centric, where incidents are taken seriously.
Ananth highlighted the use of chaos principles and tooling to identify future points of failure well ahead of time. He also talked about the challenges of getting developers to integrate chaos practices into the SDLC. You will not want to miss this conversation!
You can connect with Ananth via LinkedIn

- 24 min