41 episodes

Most advice on Site Reliability Engineering (SRE) is by BigTech for BigTech. SREpath's more realistic approach helps you make SRE work in a "normal" organization. Join your hosts, Ash Patel and Sebastian Vietz, as we demystify SRE jargon, interview experts, and share practical insights. Our mission is to help you boost your SRE efforts to succeed in areas like observability, incident response, release engineering, and more. We're reliability-focused professionals from companies where software is critical, but is not the product itself.

read.srepath.com

S.R.E.path Podcast Ash P

- Technology

- 7 MAY 2024
#40 How to Enable Observability for Success

Observability is more than a set of technologies. It’s a practice. Timothy Mahoney is no stranger to this practice, having helped many developer teams adopt better observability practices.
He’s a senior systems engineer at IKEA and is part of its observability enabling team.
Tim highlighted the importance of developing and driving frameworks for observability. He also covered the antipattern of teams having a tool-driven mindset and the challenges of shifting them away from it.
You can connect with Timothy via LinkedIn

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
- 27 min
- 30 APR 2024
#39 How Chaos Engineering Helps Reduce Incident Risk

Chaos Engineering is no longer a nice-to-have, as Ananth Movva explains in this episode of the SREpath podcast. In his experience, it reduced both the number and the severity of serious incidents and outages.
He’s been at the helm of reliability-focused decision-making at one of Canada’s largest banks, BMO, since 2020. Having completed 12 years at the bank, Ananth has seen the evolution of banking technology from archaic to user-centric, where incidents are taken seriously.
Ananth highlighted the use of chaos principles and tooling to identify future points of failure well ahead of time. He also talked about the challenges of getting developers to integrate chaos engineering into the SDLC. You will not want to miss this conversation!
You can connect with Ananth via LinkedIn

- 24 min
- 23 APR 2024
#38 The Real Cost of Software Reliability & Downtime

This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this second part, we talk about the costs behind reliability and the costs of choosing not to do it well, or at all.
Here are key takeaways from our conversation:
* Prioritize Risk Mitigation: Recognize SRE as a discipline focused on mitigating risks within your organization, including technology, reputation, and financial risks. Allocate resources accordingly to address these risks proactively.
* Consider Cost-Effectiveness: When aiming to improve reliability, consider the cost-effectiveness of incremental improvements. Evaluate the balance between investment in reliability and the value it brings to your organization.
* Advocate Continuously: Continuously advocate for the importance of reliability engineering within your organization. Communicate transparently about the value SRE teams add and the impact of their work on the organization's success.
* Explore Alternative Metrics: Explore alternative availability metrics beyond traditional time-based measurements. Consider event-based metrics to gain a more nuanced understanding of service availability and performance.
* Embrace Regional Focus: Shift from relying solely on global availability metrics to more granular regional metrics. Understand the varying impacts on different customer audiences and prioritize improvements accordingly.
* Navigate Regulatory Challenges: Be mindful of regulatory challenges, such as GDPR, and understand their implications for service availability and reliability. Adapt strategies and solutions to comply with regulations while maintaining operational efficiency.
* Align Reliability with Revenue: Recognize the direct correlation between service availability and revenue generation, particularly for revenue-driven services like ad platforms. Invest in reliability engineering to ensure consistent revenue streams.
* Tier Services Strategically: Implement a tiered approach to prioritize reliability efforts, with revenue-generating services like ad platforms placed in the top tier. Allocate resources based on the criticality of services to the organization's objectives.
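
To make the takeaway on alternative metrics concrete, here is a minimal Python sketch contrasting a time-based availability figure with an event-based (request success rate) one. The function names and the sample numbers are illustrative assumptions, not something taken from the episode.

```python
def time_based_availability(uptime_minutes: float, downtime_minutes: float) -> float:
    """Fraction of the measurement window during which the service was up."""
    total = uptime_minutes + downtime_minutes
    if total <= 0:
        raise ValueError("measurement window must be longer than zero")
    return uptime_minutes / total


def request_based_availability(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded, regardless of when they arrived."""
    if total_requests <= 0:
        raise ValueError("need at least one request to compute availability")
    return successful_requests / total_requests


if __name__ == "__main__":
    # Hypothetical 30-day window (43,200 minutes) with 43.2 minutes of downtime.
    by_time = time_based_availability(uptime_minutes=43_156.8, downtime_minutes=43.2)

    # Hypothetical traffic for the same window: 2,500 failures in 2.5M requests.
    by_requests = request_based_availability(
        successful_requests=2_497_500, total_requests=2_500_000
    )

    print(f"time-based:    {by_time:.4%}")
    print(f"request-based: {by_requests:.4%}")
```

The two figures can diverge for the same service: an outage during a traffic peak costs more failed requests than the same number of minutes at a quiet hour, which the time-based figure does not show. Computing the request-based figure per region, rather than globally, is one way to apply the regional-focus takeaway.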

- 23 min
- 16 APR 2024
#37 An SRE Approach to Managing Technology Risk

This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this first part, we talk about embracing risk from the SRE perspective.

We'll cover how it's very different to the typical IT risk management mindset.

Here are key takeaways from our conversation:

* Embrace Risk with Velocity: Rather than being hindered by traditional governance models and change approval boards, consider embracing risk while maintaining development velocity. Strive to find a balance between risk management and the speed of innovation.
* Reevaluate Risk Management Approaches: Challenge traditional approaches to risk management, especially in larger organizations with extensive governance procedures. Explore alternative methods that prioritize agility and efficiency without compromising reliability.
* Conceptualize Risk as a Continuum: View risk as a continuous spectrum and assess it based on various dimensions, such as the complexity of changes, the criticality of systems, and the impact on user experience. Continuously evaluate and adjust risk management strategies accordingly.
* Balance Stability and Innovation: Recognize that extreme reliability comes at a cost and may hinder the pace of innovation. Aim for an optimal balance between stability and innovation, prioritizing user satisfaction and efficient service operations.
* Implement Service-Level Objectives (SLOs): Deliver services with explicitly delineated levels of service, allowing clients to make informed risk and cost trade-offs when building their systems. Define SLOs based on the importance and criticality of services to enable better decision-making.
* Visualize Risk Assessment: Utilize visual representations, such as whiteboard diagrams, to assess and communicate different levels of risk within your software systems. Encourage collaborative discussions among team members to determine acceptable risk levels.
* Prioritize Customer Impact: Consider the impact of changes on customer experience and prioritize risk management efforts accordingly. Differentiate between critical user journeys and cosmetic changes to allocate scrutiny appropriately.

- 30 min
- 9 APR 2024
#36 Avoiding Critical Platform Engineering Mistakes

Platform engineering is replacing SRE and DevOps. Jokes aside, knowing the path to better platforms is key. Abby Bangser is the right person to tell us how to achieve greater maturity in this aspect of software operations.

She's previously held SRE roles and currently works as Principal Engineer at Syntasso, the company behind the popular Kratix platform framework.

Abby highlighted the need for concrete definitions and maturity models in platform engineering, cautioning against equating developer portals with fully functional platforms.

We also dived into the need to understand your socio-technical landscape with an emphasis on the value of frameworks and method-based approaches.

You can connect with Abby via LinkedIn

- 26 min
- 2 APR 2024
#35 Boosting Your Observability Data's Usability

The observability (o11y) data revolution is well underway, but are we getting the most from the data that is being collected?

Richard Benwell thinks we have room for improvement, especially at the usage stage where we query and visualize the o11y data.

He is the founder and CEO of SquaredUp, a dashboard software company based in Maidenhead, UK, with over 10 years of experience in the monitoring space.

Richard highlighted the importance of converging human intuition with technical o11y implementations and moving from a narrow focus on collecting data to leveraging it for actionable insights.

You can connect with Richard via LinkedIn

- 35 min