48 episodes

Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more.

read.srepath.com

Reliability Enablers Ash Patel & Sebastian Vietz

- Technology

- 18 JUN 2024
#47 How to Grow Team Impact Through Learning Culture

The common refrain after an incident is “We could and should learn from this”.
To me, that alludes to the need for a robust learning culture.
We might think we already have a good learning culture because we talk about problems and deep-dive into them in retrospectives.
But how often do we explore the nuances of how we are learning?
Sorrel Harriet is an expert in supporting software engineering teams to develop a stronger learning culture. She was a “Continuous Learning Lead” at Armakuni (software consultancy) and now does the same work under her own banner.
Her work ties in well with the ideas shared by Manuel Pais in episode #45 about how enabling teams can support a continuous learning culture.
We tackled issues like the value of certifications, comparing technical with non-technical skills, and more.
You can connect with Sorrel via LinkedIn
Learn more about what Sorrel does via LaaS.consulting
Here’s a bonus section because you read all this way. It covers 5 public outages and how the affected teams could improve their learning culture:
1. Slack Outage (February 2023)
Slack experienced a global outage disrupting communication for hours due to backend infrastructure issues. Perhaps the team could focus their learning on more robust infrastructure management and resilience improvement.
2. Twitter Algorithm Glitch (April 2023)
A glitch in Twitter's algorithm caused timeline issues, stemming from a problematic software update. Perhaps the team could focus their learning on thorough testing and game days to rectify critical system errors swiftly.
3. Microsoft Azure AD Outage (March 2023)
Azure Active Directory faced a significant outage due to an internal configuration change. Perhaps the team could focus their learning on the importance of rigorous change management and how to address misconfigurations quickly.
4. Google Cloud Platform Networking Issue (May 2023)
Google Cloud Platform experienced widespread service disruptions from a software bug in its networking infrastructure. Perhaps the team could focus their learning on the need for comprehensive testing and preventing disruptions.
5. GitHub Outage (June 2023)
GitHub suffered a major outage caused by a cascading failure in its storage infrastructure. Perhaps the team could focus their learning on robust fault-tolerance mechanisms and ways to address the root causes of failures.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
- 28 min
- 11 JUN 2024
#46 Platform Team Design According to Team Topologies

I continue my conversation with Manuel Pais, co-author of the seminal Team Topologies book, about the team topologies suitable for reliability teams.
In this second part, we will talk about platform teams.
A quick refresher on what platform teams do
In the team topologies context:
Platform teams provide a curated set of self-service capabilities to enable stream-aligned teams (product or feature teams) to deliver work with greater speed and reduced complexity.

They achieve this by abstracting away common infrastructure and operational concerns, which lets stream-aligned teams focus on delivering business value.
Here are the key takeaways from our conversation
For those who don’t have time to listen to this episode (but you’re missing out on a great conversation):
* Focus on User-Centric Design: Prioritize the user experience in platform development. Regularly collaborate with internal teams to ensure the platform meets their needs and reduces their pain points.
* Build and Maintain Trust: Establish and nurture trust with your platform’s users. Trust is crucial for platform adoption and can prevent resistance, thus ensuring sustained use.
* Justify Platform Value: Continuously demonstrate the value of your platform to management and stakeholders, especially during economic downturns. Highlight its contributions to avoid cuts and maintain support.
* Understand Adoption Lifecycle: Recognize that platforms go through different stages of adoption. Identify and support early adopters, and gradually bring in late adopters by showcasing successful use cases.
* Enhance Collaboration: Foster open communication between platform teams and other teams. Avoid rigid roadmaps and be adaptable to changing needs to prevent barriers and build stronger internal relationships.
* Manage Cognitive Load: Be mindful of the cognitive load on your teams. Simplify processes and reduce unnecessary complexities to enhance productivity and efficiency.
* Use Tools to Measure Cognitive Load: Implement tools like Teamperature to assess the cognitive load on your teams regularly. Use the insights to identify and mitigate factors contributing to cognitive overload.
* Leverage Experienced Product Managers: Ensure experienced product managers are part of your platform team. They can balance long-term goals with the flexibility needed to adapt to the evolving needs of internal users.
I think the uncommon takeaway here is #9 in that platform teams should treat their platform as a product. Product managers like Marty Cagan are doing great work in laying out the roadmap for product management.
Did you end up checking out the reliability workstreams map I published last week?
It’s free and can help you stay focused on the right priorities at work.
Check it out via this link

- 24 min
- 4 JUN 2024
#45 How Team Topologies Can Guide Enabling Teams

I got the inside word from Manuel Pais, co-author of the seminal Team Topologies book, in a 2-part series on 2 of the most relevant team topologies for reliability work.
In this first part, we will talk about enabling teams.
A quick refresher on what enabling teams do
In the team topologies context:
Enabling teams help stream-aligned teams (product or feature teams) to overcome obstacles and improve their capabilities in specific areas.

This kind of team is available to provide expertise, guidance, and support to other teams working to adopt new technologies, practices, or skills.
In other news…
This podcast has a new name
What more fitting moment to announce renaming the SREpath podcast to “The Reliability Enablers” podcast?
This name change reinforces our quest to demystify and enable reliability efforts so that more organizations successfully implement SRE principles and beyond.
Before we get to the 8 takeaways
Here’s something relevant to enabling reliability work: a reliability workstreams map I’ve had in my private notes for years, now going public.
What is a workstream? 🤔
You might have heard of “value streams”. They show the end-to-end journey of creating and delivering value to a customer.
Workstreams support your value streams.
They cover the activities carried out to do so. In summary: Value streams are the goals and workstreams are the activities you do to achieve those goals.
Okay, now time for the erudite takeaways that Manuel gave me from our talk.
Takeaways from the episode
Here are the key takeaways from our conversation for those who don’t have time to listen (but you’re missing out on a great audio conversation):
* Create Enabling Teams:
Form SRE-focused enabling teams to facilitate technical training, optimize cloud architecture, improve documentation, and overall help other teams build their capabilities.
* Work to Minimize Cognitive Load:
Minimize the cognitive load on engineers by centralizing complex and repetitive tasks, allowing engineers to concentrate on innovation and high-value work. You can measure cognitive load and manage it through the Teamperature tool.
* Facilitate Learning and Adoption of Best Practices:
Use SRE enabling teams to educate product teams on critical practices like error budgets and service level objectives, making the learning process gradual and manageable.
* Collaborate among Topologies for Effective Tooling:
Enabling teams should work with platform teams to inform their plans to develop and co-evolve tools and services that support reliability and observability practices, like automated dashboards and alerting systems.
* Adapt Approaches Based on Organizational Capacity:
Tailor the mix of enabling and platform support based on the organization’s resources and constraints, ensuring flexibility and efficiency.
* Avoid Traditional Ops Work for SRE Teams:
Ensure SRE teams focus on empowering product teams rather than performing traditional operations tasks, promoting a culture of shared responsibility.
* Build an Effective Learning Culture:
Foster a culture of continuous learning and improvement, integrating learning opportunities into the daily workflow rather than relying solely on formal training programs.
* Scale Capabilities Across the Organization:
When needed, scale enabling efforts to build organization-wide capabilities, ensuring that expertise is distributed and not bottlenecked within specialized departments.

- 25 min
- 30 MAY 2024
#44 - Making SLOs Matter to Stakeholders

Bonus episode on SLOs because Sebastian and I felt that we did not cover the why of SLOs and how to make them relevant to stakeholders.

- 20 min
- 28 MAY 2024
#43 - SLOs: a Deeper Dive into its Mechanics

This episode continues our coverage of Chapter 4 of the Site Reliability Engineering book (2016). In this second part, we take a deeper dive into the mechanics of SLOs.
Here are 5 takeaways from the show:
* Start Small with SLOs: Begin with a limited number of SLOs and iteratively refine them based on experience and feedback. Avoid overwhelming teams with too many objectives at once.
* Defend and Enforce SLOs: Ensure that selected SLOs have real consequences attached to them. If conversations about priorities cannot be influenced by SLOs, reconsider their relevance and enforceability.
* Continuous Improvement: Embrace the idea that SLOs are not static targets but evolve over time. Start with loose targets and refine them as you learn more about the system's behavior. Commit to ongoing maintenance and improvement of SLOs for long-term success.
* Effective Communication Skills: Recognize the importance of effective communication, especially for technology professionals. Develop the ability to translate technical concepts into plain language that stakeholders can understand and appreciate.
* Understanding User Needs: Prioritize understanding and aligning with the expectations of users/customers when defining service level objectives (SLOs) and metrics. User feedback should guide the selection of meaningful SLOs.
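For anyone who wants to see the "start with loose targets and refine them" idea in numbers, here is a minimal sketch. It assumes a request-based SLO and uses the error budget framing, which these takeaways do not name explicitly. The function names `error_budget` and `budget_remaining` are hypothetical and exist only for illustration.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Requests allowed to fail in the window before the SLO is breached."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no room for failure at all.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / budget
```

With 1,000,000 requests and 2,500 failures in the window, a loose 99% target still has three quarters of its budget left, while a 99.9% target is already overspent. That gap is one reason to tighten a target only after you have seen how the system actually behaves.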

- 31 min
- 21 MAY 2024
#42 - Hitting Software SLA Targets through SLOs and SLIs

In this first part of a 2-part series, Sebastian Vietz and I work out how to meet SLAs through SLOs and SLIs. This episode covers Chapter 4 of the Site Reliability Engineering book (2016).
Here are 7 takeaways from the show:
* Involve Technical Stakeholders Early: Ensure that technical stakeholders, such as SREs, are involved in discussions about SLAs and SLOs from the beginning. Their expertise can help ensure that objectives are feasible and aligned with the technical capabilities of the service.
* Differentiate Between SLAs and SLOs: Understand the distinction between SLAs, which are legal contracts, and SLOs, which are based on customer expectations. Avoid using SLAs as a substitute for meaningful service level objectives.
* Prioritize Meaningful Metrics: Focus on a select few service level indicators (SLIs) that truly reflect what users want from the system. Avoid the temptation to monitor everything and instead choose indicators that provide valuable insights into service performance.
* Align with Customer Expectations: Start by understanding and prioritizing the expectations of your customers. Use their feedback to define service level objectives (SLOs) that align with their needs and preferences.
* Avoid Alert Fatigue: Be mindful of the number of metrics being monitored and the associated alerts. Too many indicators can lead to alert fatigue and make it difficult to prioritize and respond to issues effectively. Focus on a few key indicators that matter most.
* Start Top-Down with SLIs: Take a top-down approach to defining SLIs, starting with customer expectations and working downwards. This ensures that the selected metrics are meaningful and relevant to users' needs.
* Prepare for Deep Dives: Anticipate the need for deeper exploration of specific topics, such as SLOs, and allocate time and resources to thoroughly understand and implement them in your work.
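To make "a select few SLIs that reflect what users want" more concrete, here is a rough sketch of two user-facing SLIs computed as the ratio of good events to total events. Everything in it is an assumption for illustration: the `Request` record, the choice of "non-5xx" as a good response, and the latency threshold are hypothetical and would need to come from your own customers' expectations.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time the user waited for the response


def availability_sli(requests: Sequence[Request]) -> float:
    """Share of requests that did not fail with a server error (5xx)."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float) -> float:
    """Share of requests answered within the latency threshold."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return good / len(requests)


def meets_slo(sli: float, target: float) -> bool:
    """An SLO is simply a target that the measured SLI should reach."""
    return sli >= target
```

Two indicators like these are often enough to start a conversation with stakeholders. Each one maps to something a user would notice, which keeps the alert count low.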

- 29 min