43 episodes

Most advice on Site Reliability Engineering (SRE) is by BigTech for BigTech. SREpath's more realistic approach helps you make SRE work in a "normal" organization. Join your hosts, Ash Patel and Sebastian Vietz, as we demystify SRE jargon, interview experts, and share practical insights. Our mission is to help you boost your SRE efforts to succeed in areas like observability, incident response, release engineering, and more. We're reliability-focused professionals from companies where software is critical, but is not the product itself.

read.srepath.com

S.R.E.path Podcast Ash P

- Technology
- 5.0 • 2 Ratings

- MAY 21, 2024
#42 - Hitting Software SLA Targets through SLOs and SLIs

In this first part of a two-part series, Sebastian Vietz and I work out how to meet SLAs through SLOs and SLIs. This episode covers Chapter 4 of the Site Reliability Engineering book (2016).
Here are 7 takeaways from the show:
* Involve Technical Stakeholders Early: Ensure that technical stakeholders, such as SREs, are involved in discussions about SLAs and SLOs from the beginning. Their expertise can help ensure that objectives are feasible and aligned with the technical capabilities of the service.
* Differentiate Between SLAs and SLOs: Understand the distinction between SLAs, which are legal contracts, and SLOs, which are based on customer expectations. Avoid using SLAs as a substitute for meaningful service level objectives.
* Prioritize Meaningful Metrics: Focus on a select few service level indicators (SLIs) that truly reflect what users want from the system. Avoid the temptation to monitor everything and instead choose indicators that provide valuable insights into service performance.
* Align with Customer Expectations: Start by understanding and prioritizing the expectations of your customers. Use their feedback to define service level objectives (SLOs) that align with their needs and preferences.
* Avoid Alert Fatigue: Be mindful of the number of metrics being monitored and the associated alerts. Too many indicators can lead to alert fatigue and make it difficult to prioritize and respond to issues effectively. Focus on a few key indicators that matter most.
* Start Top-Down with SLIs: Take a top-down approach to defining SLIs, starting with customer expectations and working downwards. This ensures that the selected metrics are meaningful and relevant to users' needs.
* Prepare for Deep Dives: Anticipate the need for deeper exploration of specific topics, such as SLOs, and allocate time and resources to thoroughly understand and implement them in your work.
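
To make the SLI and SLO distinction concrete, here is a minimal sketch that assumes an SLI is measured as the ratio of good events to all events in a window, and an SLO is a target for that ratio. The `Slo` class, the `checkout` example, and the numbers are hypothetical illustrations, not something taken from the episode.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A hypothetical SLO: the target fraction of good events in a window."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of events should be good


def sli(good_events: int, total_events: int) -> float:
    """SLI expressed as the ratio of good events to all events."""
    if total_events == 0:
        # Assumption: a window with no traffic counts as meeting the objective.
        return 1.0
    return good_events / total_events


def meets_slo(slo: Slo, good_events: int, total_events: int) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli(good_events, total_events) >= slo.target


# Hypothetical user-facing objective: checkout requests served in under 300 ms.
checkout = Slo(name="checkout-latency-under-300ms", target=0.999)
print(meets_slo(checkout, good_events=99_950, total_events=100_000))
```

Keeping the indicator tied to one user-facing outcome is in line with the advice above to pick a few meaningful SLIs rather than monitoring everything.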

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
- 29 min
- MAY 14, 2024
#41 Curbing High Observability Costs

No one wants to get Coinbase’s $65 million observability bill in the future. Sure, observability comes with a necessary cost. But that cost cannot exceed the concrete and perceived value on balance sheets and in the minds of leaders.
Sofia Fosdick shares practical insights on curbing high observability costs. She’s a senior account executive at Honeycomb.io and has held similar titles at Turbonomic, Dynatrace, and Grafana. As always, this is not a sponsored episode!
We tackled the cost issue by covering ideas like aligning cost with value, event-based systems, and dynamic sampling. You will not want to miss this conversation if your observability bill is starting to look dangerous.
You can connect with Sofia via LinkedIn

- 24 min
- MAY 7, 2024
#40 How to Enable Observability for Success

Observability is more than a set of technologies. It’s a practice. Timothy Mahoney is no stranger to this practice, having helped many developer teams adopt better observability practices.
He’s a senior systems engineer at IKEA and is part of its observability enabling team.
Tim highlighted the importance of developing and driving frameworks for observability. He also covered the antipattern of teams having a tool-driven mindset and the challenge of moving them away from it.
You can connect with Timothy via LinkedIn

- 27 min
- APR 30, 2024
#39 How Chaos Engineering Helps Reduce Incident Risk

Chaos Engineering is no longer a nice-to-have, as Ananth Movva explains in this episode of the SREpath podcast. In his experience, it reduced both the number and the severity of serious incidents and outages.
He’s been at the helm of reliability-focused decision-making at one of Canada’s largest banks, BMO, since 2020. Having completed 12 years at the bank, Ananth has seen banking technology evolve from archaic to user-centric, where incidents are taken seriously.
Ananth highlighted the use of chaos principles and tooling to identify future points of failure well ahead of time. He also talked about the difficulty of getting developers to integrate chaos engineering into the SDLC. You will not want to miss this conversation!
You can connect with Ananth via LinkedIn

- 24 min
- APR 23, 2024
#38 The Real Cost of Software Reliability & Downtime

This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this second part, we talk about the costs of reliability and of choosing to do it poorly or not at all.
Here are key takeaways from our conversation:
* Prioritize Risk Mitigation: Recognize SRE as a discipline focused on mitigating risks within your organization, including technology, reputation, and financial risks. Allocate resources accordingly to address these risks proactively.
* Consider Cost-Effectiveness: When aiming to improve reliability, consider the cost-effectiveness of incremental improvements. Evaluate the balance between investment in reliability and the value it brings to your organization.
* Advocate Continuously: Continuously advocate for the importance of reliability engineering within your organization. Communicate transparently about the value SRE teams add and the impact of their work on the organization's success.
* Explore Alternative Metrics: Explore alternative availability metrics beyond traditional time-based measurements. Consider event-based metrics to gain a more nuanced understanding of service availability and performance.
* Embrace Regional Focus: Shift from relying solely on global availability metrics to more granular regional metrics. Understand the varying impacts on different customer audiences and prioritize improvements accordingly.
* Navigate Regulatory Challenges: Be mindful of regulatory challenges, such as GDPR, and understand their implications on service availability and reliability. Adapt strategies and solutions to comply with regulations while maintaining operational efficiency.
* Align Reliability with Revenue: Recognize the direct correlation between service availability and revenue generation, particularly for revenue-driven services like ad platforms. Invest in reliability engineering to ensure consistent revenue streams.
* Tier Services Strategically: Implement a tiered approach to prioritize reliability efforts, with revenue-generating services like ad platforms placed in the top tier. Allocate resources based on the criticality of services to the organization's objectives.
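
The difference between time-based and event-based availability, and between global and regional views, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, region names, and request counts below are hypothetical and do not come from the episode.

```python
def time_based_availability(uptime_minutes: float, downtime_minutes: float) -> float:
    """Traditional view: the share of time the service was up."""
    return uptime_minutes / (uptime_minutes + downtime_minutes)


def request_based_availability(successful: int, total: int) -> float:
    """Event-based view: the share of requests that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successful / total


# Hypothetical per-region counts: (successful requests, total requests).
regions = {
    "eu-west": (999_500, 1_000_000),
    "us-east": (1_499_500, 1_500_000),
}

# Global figure: pool every request regardless of region.
global_ok = sum(ok for ok, _ in regions.values())
global_total = sum(total for _, total in regions.values())
print("global:", request_based_availability(global_ok, global_total))

# Regional figures: a good global number can hide a weaker region.
for name, (ok, total) in regions.items():
    print(name, request_based_availability(ok, total))
```

Pooling requests weights each region by its traffic, so a low-traffic region with poor availability barely moves the global number, which is one reason to look at the per-region figures as well.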

- 23 min
- APR 16, 2024
#37 An SRE Approach to Managing Technology Risk

This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this first part, we talk about embracing risk from the SRE perspective.

We'll cover how it's very different to the typical IT risk management mindset.

Here are key takeaways from our conversation:

* Embrace Risk with Velocity: Rather than being hindered by traditional governance models and change approval boards, consider embracing risk while maintaining development velocity. Strive to find a balance between risk management and the speed of innovation.
* Reevaluate Risk Management Approaches: Challenge traditional approaches to risk management, especially in larger organizations with extensive governance procedures. Explore alternative methods that prioritize agility and efficiency without compromising reliability.
* Conceptualize Risk as a Continuum: View risk as a continuous spectrum and assess it based on various dimensions, such as the complexity of changes, the criticality of systems, and the impact on user experience. Continuously evaluate and adjust risk management strategies accordingly.
* Balance Stability and Innovation: Recognize that extreme reliability comes at a cost and may hinder the pace of innovation. Aim for an optimal balance between stability and innovation, prioritizing user satisfaction and efficient service operations.
* Implement Service-Level Objectives (SLOs): Deliver services with explicitly delineated levels of service, allowing clients to make informed risk and cost trade-offs when building their systems. Define SLOs based on the importance and criticality of services to enable better decision-making.
* Visualize Risk Assessment: Utilize visual representations, such as whiteboard diagrams, to assess and communicate different levels of risk within your software systems. Encourage collaborative discussions among team members to determine acceptable risk levels.
* Prioritize Customer Impact: Consider the impact of changes on customer experience and prioritize risk management efforts accordingly. Differentiate between critical user journeys and cosmetic changes to allocate scrutiny appropriately.

- 30 min