38 episodes

Most advice on Site Reliability Engineering (SRE) is by BigTech for BigTech. SREpath's more realistic approach helps you make SRE work in a "normal" organization. Join your hosts, Ash Patel and Sebastian Vietz, as we demystify SRE jargon, interview experts, and share practical insights. Our mission is to help you boost your SRE efforts to succeed in areas like observability, incident response, release engineering, and more. We're reliability-focused professionals from companies where software is critical, but is not the product itself.

S.R.E.path Podcast Ash Patel, Sebastian Vietz

- Technology
- 5.0 • 1 Rating

- 16 APR 2024
#37 An SRE Approach to Managing Technology Risk

This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this first part, we talk about embracing risk from the SRE perspective.

We'll cover how it differs from the typical IT risk management mindset.

Here are key takeaways from our conversation:

Embrace Risk with Velocity: Rather than being hindered by traditional governance models and change approval boards, consider embracing risk while maintaining development velocity. Strive to find a balance between risk management and the speed of innovation.

Reevaluate Risk Management Approaches: Challenge traditional approaches to risk management, especially in larger organizations with extensive governance procedures. Explore alternative methods that prioritize agility and efficiency without compromising reliability.

Conceptualize Risk as a Continuum: View risk as a continuous spectrum and assess it based on various dimensions, such as the complexity of changes, the criticality of systems, and the impact on user experience. Continuously evaluate and adjust risk management strategies accordingly.

Balance Stability and Innovation: Recognize that extreme reliability comes at a cost and may hinder the pace of innovation. Aim for an optimal balance between stability and innovation, prioritizing user satisfaction and efficient service operations.

Implement Service-Level Objectives (SLOs): Deliver services with explicitly delineated levels of service, allowing clients to make informed risk and cost trade-offs when building their systems. Define SLOs based on the importance and criticality of services to enable better decision-making.

Visualize Risk Assessment: Utilize visual representations, such as whiteboard diagrams, to assess and communicate different levels of risk within your software systems. Encourage collaborative discussions among team members to determine acceptable risk levels.

Prioritize Customer Impact: Consider the impact of changes on customer experience and prioritize risk management efforts accordingly. Differentiate between critical user journeys and cosmetic changes to allocate scrutiny appropriately.
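To make the SLO takeaway concrete, here is a minimal sketch of error-budget arithmetic, the usual way an availability target is turned into an explicit amount of tolerated failure. The function names, the 30-day window, and the sample targets are illustrative assumptions, not something taken from the episode.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that a time-based availability SLO tolerates.

    slo_target is a fraction such as 0.999; window_days is an assumed
    rolling window, 30 days by default.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Each extra "nine" shrinks the tolerated downtime by a factor of ten.
    for target in (0.99, 0.999, 0.9999):
        minutes = error_budget_minutes(target)
        print(f"{target:.2%} over 30 days: {minutes:.1f} min of downtime allowed")
```

A team could treat a healthy remaining budget as room to ship riskier changes, and a nearly spent one as a signal to slow down. That is the trade-off between stability and innovation described above.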
- 30 min
- 9 APR 2024
#36 Avoiding Critical Platform Engineering Mistakes

Platform engineering is replacing SRE and DevOps. Jokes aside, knowing the path to better platforms is key. Abby Bangser is the right person to tell us how to achieve greater maturity in this aspect of software operations.

She's previously held SRE roles and currently works as Principal Engineer at Syntasso, the company behind the popular Kratix platform framework.

Abby highlighted the need for concrete definitions and maturity models in platform engineering trends, cautioning against equating developer portals with fully functional platforms.

We also dived into the need to understand your socio-technical landscape with an emphasis on the value of frameworks and method-based approaches.

You can connect with Abby via LinkedIn
- 26 min
- 2 APR 2024
#35 Boosting Your Observability Data's Usability

The observability (o11y) data revolution is well underway, but are we getting the most from the data that is being collected?

Richard Benwell thinks we have room for improvement, especially at the usage stage where we query and visualize the o11y data.

He is the founder and CEO of SquaredUp, a dashboard software company based in Maidenhead, UK, with over 10 years of experience in the monitoring space.

Richard highlighted the importance of converging human intuition with technical o11y implementations and moving from a narrow focus on collecting data to leveraging it for actionable insights.

You can connect with Richard via LinkedIn
- 35 min
- 26 MAR 2024
#34 From Cloud to Concrete: Should You Return to On-Prem?

This episode continues our coverage of Chapter 2 of the Site Reliability Engineering book (2016). We talk about the age-old debate of cloud vs on-prem, which is analogous to that other long-running debate in technology: build vs buy.

Here are key takeaways from our conversation:

Adapt your storage solutions to business needs: Understand the diverse storage options available and tailor them to specific business needs, considering factors like data type, access patterns, and scalability requirements.
Optimize your load balancing: Implement global load balancing strategies that direct traffic to the nearest data center, minimizing latency for users while maximizing resource utilization.
Don't hesitate to continuously evaluate your cloud: Assess the suitability of cloud solutions against your organization's needs, considering factors like cost, control, scalability, and security, and be open to reevaluating decisions based on evolving requirements.
Make strategic decisions for your operations footprint: Lean on decisions based on thorough analysis that considers:
Encourage objective evaluation and formal planning processes in decision-making: avoid emotional reactions or being swayed by external influences, to ensure decisions are based on sound analysis and truly aligned with organizational goals.
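As a rough illustration of the load-balancing takeaway above, the sketch below picks the geographically nearest data center for a user by great-circle (haversine) distance. The data center names and coordinates are hypothetical. Distance is only a proxy for latency, and a real global load balancer would also weigh signals such as health and capacity.

```python
import math
from typing import Dict, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude) in degrees

EARTH_RADIUS_KM = 6371.0

# Hypothetical data centers; names and locations are illustrative only.
DATA_CENTERS: Dict[str, Coordinate] = {
    "eu-west": (53.35, -6.26),
    "us-east": (38.90, -77.00),
    "ap-south": (19.08, 72.88),
}


def haversine_km(a: Coordinate, b: Coordinate) -> float:
    """Great-circle distance between two points on Earth, in kilometres."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_data_center(user: Coordinate, centers: Dict[str, Coordinate] = DATA_CENTERS) -> str:
    """Return the name of the data center closest to the user's location."""
    if not centers:
        raise ValueError("at least one data center is required")
    return min(centers, key=lambda name: haversine_km(user, centers[name]))


if __name__ == "__main__":
    london = (51.50, -0.13)
    print(nearest_data_center(london))
```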
- 22 min
- 19 MAR 2024
#33 Inside Google's Data Center Design

This episode covers Chapter 2 of the Site Reliability Engineering book (2016). In this first part, we talk about the intricacies of data center design outlined in the book. One thing is for sure: building a data center for your own needs is HARD work, with many considerations to weigh.

Here are key takeaways from our conversation:

Importance of understanding data center fundamentals: Even if you're not operating at the scale of companies like Google, understanding the fundamentals behind data center infrastructure can help. This knowledge can inform decisions on cloud services, high availability strategies, and the architectural design of systems to ensure resilience and scalability.
The impetus to leverage cloud infrastructure: The transition from traditional on-premises infrastructure to cloud-based solutions is a critical trend. Organizations can learn from how tech giants manage resources efficiently at scale, to improve their resource allocation.
Cyclical trends in technology adoption: Trends in technology are cyclical, and recognizing this can inform strategic decisions. With the current discussion around moving from cloud-centric models back to more traditional data center approaches, understanding the history and evolution of tech infrastructure can prepare organizations to anticipate and adapt to future shifts in the technological landscape.
- 23 min
- 14 MAR 2024
#32 Clarifying Platform Engineering's Role (with Ajay Chankramath) BONUS EP

Will Platform Engineering replace DevOps or SRE or both? I don’t think this is the case at all. Neither does Ajay Chankramath.

He is the Head of Platform Engineering at ThoughtWorks North America, an innovator consulting group. I’d take his word for it since he’s held senior leadership roles in release engineering and more since 2002.

In this bonus episode of the SREpath podcast, Ajay shared his perspective on the debate about SRE vs DevOps vs Platform Engineering.
- 16 min