Reliability Enablers

Ash Patel & Sebastian Vietz

Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more. read.srepath.com

  1. What the Agentic AI is happening to SRE?

    2d ago


    What if agentic AI makes SRE more important, not less? Bennett Gould explains why autonomous AI systems may create more demand for reliability thinking, not less.

    Everyone seems to think AI is coming for SRE in a hard way. You might have heard the same story: “AI will write the code.” “Agents will handle incidents.” “Copilots will generate the runbooks.” “Automation will reduce operational load.”

    Yes, the job question is real. If AI can write code, summarize incidents, query observability tools, generate runbooks, and operate across systems, then engineers are right to ask what happens to the work.

    But here’s the part that gets missed: AI does not just automate reliability work. It creates more objects and surface areas that need to be made reliable.

    Agentic AI is moving from demos into real workflows. These systems are no longer just answering questions. They are querying tools, pulling context, generating changes, and in some cases taking action around production environments.

    That makes this a Monday morning problem. Teams are already using LLMs for incidents, documentation, observability, infrastructure, and operational decision-making. Somewhere, a team is one demo away from giving an agent access to tools originally designed for humans.

    That is exactly why I wanted to have this conversation. Bennett Gould is currently a solution engineer at Neubird.ai. His career in SRE and SRE-adjacent work spans large enterprises, cloud, industrial technology, and startups, including AWS, IBM, Siemens, and a YC startup. I wanted to ask him a simple question: What in the agentic AI is happening to SRE?

    Here are 3 highlights from our talk:

    1. Agentic AI increases the reliability surface area

    The obvious fear is that AI reduces the need for reliability engineers. Bennett’s view was more nuanced. He was clear that engineers still need to adapt.
    If people do not reskill, stay current, and learn how these systems are forming, there may absolutely be pressure in the job market. But he also argued that AI could create more demand for reliability skills because production complexity is increasing.

    More code is going into production. More AI-generated code is going into production. More systems that people do not fully understand are going into production. And now autonomous agents are starting to enter production workflows too. That means more surface area. More automation. More operational uncertainty. More ways for things to go wrong.

    Bennett compared this to Terraform: Infrastructure as code created enormous efficiency gains. But it also created new ways to make very big mistakes very quickly. Before Terraform, most people could not delete all their production resources with a single command. After Terraform, that became technically possible if the system was designed badly enough.

    Agentic AI follows a similar pattern. With great automation comes great responsibility. Agents can help engineers move faster, query tools, summarize context, and reduce toil. But they can also amplify weak engineering practices, poor boundaries, bad assumptions, and unclear operational ownership.

    That is not the end of reliability work. That is reliability work entering a new phase.

    2. Agents can reduce toil, but context is the ceiling

    One of the strongest parts of the conversation was Bennett’s explanation of where agents can help in incident response. A lot of SRE work involves moving across tools. You may need to query Prometheus, Dynatrace, logs, traces, cloud consoles, ticketing systems, documentation, runbooks, dashboards, and architecture diagrams.

    The problem is not always that the engineer lacks judgment. Sometimes the problem is that the information is scattered across too many tools, each with its own query language and interface.
    Bennett gave a simple example: an engineer might be very good at PromQL and very fast when Prometheus is the source of truth. But if the same engineer has to work in a different observability platform with a different query language, their response time can suffer.

    That is an obvious place where agents can help. The engineer may not need to know every query language perfectly. They need to know what they are looking for and how to reason about the system. The agent can help translate that intent into the right tool calls, queries, and summaries. That could reduce MTTR. It could reduce toil. It could help engineers move faster during incidents.

    But Bennett also made the limitation clear: You are only as good as the context you have. This is where he introduced two useful concepts:

    * Context mining
    * Context distillation

    Context mining means proactively finding the information that might be useful in a given operational situation. Context distillation means taking large amounts of information (runbooks, Confluence pages, diagrams, documentation, prior incidents) and reducing it into the minimum useful context an LLM or agent can use.

    That sounds powerful. But there is a catch. Sometimes the context simply is not there. Many of the largest and most complex organizations still run legacy systems where knowledge lives in people’s heads, stale documentation, tribal memory, and unwritten assumptions. There may not be a clean process for turning that into usable context.

    That matters because agents do not magically understand your system. They work with the context they are given. If the context is missing, outdated, or wrong, the agent’s usefulness maxes out early.

    3. Agentic systems are not just LLM demos

    A basic LLM workflow is relatively easy to demo: You give it a prompt. You connect a few tools. You add some APIs. You get a useful answer. That is impressive, but it is not the same thing as running an agentic system in a meaningful production environment.
    Bennett made a useful analogy here: running your own infrastructure versus using a hyperscaler. Cloud providers removed a lot of undifferentiated heavy lifting. Most companies do not want to spend half their time racking servers, managing data centers, and dealing with low-level infrastructure when they are trying to serve customers.

    Agentic systems create similar questions:

    * What parts of the work should be handled by the system?
    * What parts still need engineering discipline?
    * And what has to exist around the model before it is safe and useful?

    That surrounding structure is where the real work begins. Bennett called this harness engineering. Once you move beyond an LLM demo, you have to think about memory, learning, tool usage, identity, federation, security, evaluations, and guardrails. That is a very different problem from “the model gave a good answer on my laptop.”

    SREs know why that distinction matters. “It works on my machine” is not an acceptable reliability strategy. A runbook that recovers a thousand-node database cannot be non-deterministic, undocumented, and dependent on someone’s local setup. If it is part of the operational backbone, it needs to be reliable. Agentic AI does not remove that requirement. It makes it more important.

    Bonus: Agents expose weak engineering practices

    Agentic AI not only introduces new problems, it also reveals old ones:

    * Weak APIs.
    * Brittle runbooks.
    * Missing context.
    * Poor evals.
    * Unclear tool boundaries.
    * Operational shortcuts.

    Systems that were designed assuming careful human use may behave very differently when AI agents start using them.

    That is why this conversation matters for SRE. Agentic AI is not only a productivity story. It is a reliability story. It forces teams to ask whether their existing practices are strong enough for a world where more actions can be generated, recommended, or executed by autonomous systems.
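
    One way to picture the guardrails and tool boundaries mentioned in these notes is a small policy gate that sits between an agent and the tools it may call. The Python sketch below is illustrative only: the tool names, the `ToolCall` shape, and the approval rule are hypothetical and were not described in the episode.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tool names; a real harness would load these from reviewed config.
READ_ONLY_TOOLS = {"query_metrics", "search_logs", "read_runbook"}
NEEDS_HUMAN_APPROVAL = {"restart_service", "scale_deployment"}


@dataclass(frozen=True)
class ToolCall:
    tool: str    # name of the tool the agent wants to invoke
    target: str  # what it wants to act on, e.g. a service name


def authorize(call: ToolCall, approved_by: Optional[str] = None) -> str:
    """Return "allow", "needs_approval", or "deny" for an agent's tool call."""
    if call.tool in READ_ONLY_TOOLS:
        return "allow"
    if call.tool in NEEDS_HUMAN_APPROVAL:
        # Mutating actions only proceed once a named human has signed off.
        return "allow" if approved_by else "needs_approval"
    # Default deny: anything not explicitly listed is outside the boundary.
    return "deny"


if __name__ == "__main__":
    print(authorize(ToolCall("search_logs", "checkout")))
    print(authorize(ToolCall("restart_service", "checkout")))
    print(authorize(ToolCall("restart_service", "checkout"), approved_by="oncall"))
    print(authorize(ToolCall("delete_database", "orders")))
```

    The default-deny branch is the design choice that matters here: a tool built for careful human use stays out of the agent's reach until someone deliberately adds it to a list.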
    The silver lining for reliability work

    Agentic AI does not remove the need for reliability thinking. It raises the bar for it. The tools will change. The workflows will change. Some tasks will absolutely be automated or reshaped. But the hardest parts of reliability are still the hard parts:

    * understanding the system
    * knowing the trade-offs
    * building reliable operational processes
    * making good judgment calls under uncertainty and
    * owning the outcome when something changes in production

    That is why SRE does not disappear in an agentic AI world. It becomes one of the disciplines that makes the agentic AI world survivable.

    So if your team is already using AI around incidents, observability, runbooks, infrastructure, or production workflows, the question is not whether the future is coming. The future is already in the workflow. The real question is whether your reliability practices are ready for it.

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com

    24 min
  2. 12/02/2025

    You (and AI) can't automate reliability away

    What if the hardest part of reliability has nothing to do with tooling or automation? Jennifer Petoff explains why real reliability comes from the human workflows wrapped around the engineering work.

    Everyone seems to think AI will automate reliability away. I keep hearing the same story: “Our tooling will catch it.” “Copilots will reduce operational load.” “Automation will mitigate incidents before they happen.”

    But here’s a hard truth to swallow: AI only automates the mechanical parts of reliability, the machine in the machine. The hard parts haven’t changed at all. You still need teams with clarity on system boundaries. You still need consistent approaches to resolution. You still need postmortems that drive learning rather than blame.

    AI doesn’t fix any of that. If anything, it exposes every organizational gap we’ve been ignoring. And that’s exactly why I wanted today’s guest on.

    Jennifer Petoff is Director of Program Management for Google Cloud Platform and Technical Infrastructure education. Every day, she works with SREs at Google, as well as with SREs at other companies through her public speaking and Google Cloud customer engagements.

    Even if you have never touched GCP, you have still been influenced by her work at some point in your SRE career. She is co-editor of Google’s original Site Reliability Engineering book from 2016. Yeah, that one!

    It was my immense pleasure to have her join me to discuss the internal dynamics behind successful reliability initiatives. Here are 5 highlights from our talk:

    3 issues stifling individual SREs’ work

    To start, I wanted to know from Jennifer the kinds of challenges she has seen individual SREs face when attempting to introduce or reinforce reliability improvements within their teams or the broader organization.
    She grouped these challenges into 3 main categories:

    * Cultural issues (with a look into Westrum’s typology of organizational culture)
    * Insufficient buy-in from stakeholders
    * Inability to communicate the value of reliability work

    A key highlight from this topic came from her look at DORA research, an annual survey of thousands of tech professionals and the research upon which the book Accelerate is based. It showed that organizations with generative cultures have 30% better organizational performance. In other words, you can have the best technology, tools, and processes to get good results, but culture further raises the bar. A generative culture also makes it easier to implement the more technical aspects of DevOps or SRE that are associated with improved organizational performance.

    Hands-on is the best kind of training

    We then explored structured approaches that ensure consistency, build capability, and deliberately shape reliability culture. As they say, culture eats strategy for breakfast!

    One key example Jennifer gave was the hands-on approach they take at Google. She believes that adults learn by doing. In other words, SREs gain confidence by doing hands-on work. Where possible, training programs should move away from passive listening to lectures toward hands-on exercises that mimic real SRE work, especially troubleshooting.

    One specific exercise that Google has built internally is Simulating Production Breakages. Engineers undergoing that training have a chance to troubleshoot a real system built for this purpose in a safe environment. The results have been profound: Jennifer’s team saw a tremendous gain in confidence in survey results. This confidence is focused on job-related behaviors, which when repeated over time reinforce that culture of reliability.
    Reliability is mandatory for everybody

    Another thing Jennifer told me Google did differently was making reliability a mandatory part of every engineer’s curriculum, not only SREs.

    When we first spun up the SRE Education team, our focus was squarely on our SREs. However, that’s like preaching to the choir. SREs are usually bought into reliability. A few years in, our leadership was interested in propagating the reliability-focused culture of SRE to all of Google’s development teams, a challenge an order of magnitude greater than training SREs.

    How did they achieve this mandate?

    * They developed a short and engaging (and mandatory) production safety training
    * That training has now been taken by tens of thousands of Googlers
    * Jennifer attributes this initiative’s success to how they “SRE’ed the program”: “We ran a canary followed by a progressive roll-out. We instituted monitoring and set up feedback loops so that we could learn and drive continuous improvement.”

    The result of this massive effort? A very respectable 80%+ net promoter score with open text feedback: “best required training ever.”

    What made this program successful is that Jennifer and her team SRE’d its design and iterative improvement. You can learn more about “How to SRE anything” (from work to life) using her rubric: https://www.reliablepgm.com/how-to-sre-anything/

    Reliability gets rewarded just like feature work

    Jennifer then talked about how Google mitigates a risk that I think every reliability engineer wishes could be solved at their organization. That is, having great reliability work rewarded at the same level as great feature work.

    For development and operations teams alike at Google, this means making sure “grungy work” like tech debt reduction, automation, and other activities that improve reliability are rewarded equally to shiny new product features. Organizational reward programs that recognize outstanding work typically have committees.
    These committees not only look for excellent feature development work, but also reward and celebrate foundational activities that improve reliability. This is explicitly built into the rubric for judging award submissions.

    Keep a scorecard of reliability performance

    Jennifer gave another example of how Google judges reliability performance, but more specifically for SRE teams this time. Google’s Production Excellence (ProdEx) program was created in 2015 to assess and improve production excellence (aka reliability improvements) across SRE teams.

    ProdEx acts like a central scorecard that aggregates metrics from various production health domains to provide a comprehensive overview of an SRE team’s health and the reliability of the services they manage.

    Here are some specifics from the program:

    * Domains include SLOs, on-call workload, alerting quality, and postmortem discipline
    * Reviews are conducted live every few quarters by senior SREs (directors or principal engineers) who are not part of the team’s direct leadership
    * There is a focus on coaching and accountability without shame (to foster psychological safety)

    ProdEx serves various levels of the SRE organization through:

    * providing strategic situational awareness regarding organizational and system health to leadership and
    * keeping forward momentum around reliability and surfacing team-level issues early to support engineers in addressing them

    Wrapping up

    Having an inside view of reliability mechanisms within a few large organizations, I know that few are actively doing all (or sometimes any) of the reliability enhancers that Google uses and Jennifer has graciously shared with us. It’s time to get the ball rolling. What will you do today to make it happen?

    28 min
  3. 07/15/2025

    #67 Why the SRE Book Fails Most Orgs — Lessons from a Google Veteran

    A new or growing SRE team. A copy of the book. A company that says it cares about reliability. What happens next? Usually… not much.

    In this episode, I sit down with Dave O’Connor, a 16-year Google SRE veteran, to talk about what happens when organizations cargo-cult reliability practices without understanding the context they were born in. You might know him for his self-deprecating wit and legendary USENIX blurb about being “complicit in the development of the SRE function.”

    This one’s a treat: less “here’s a shiny new tool” and more “here’s what reliability actually looks like when you’ve seen it all.” ✨ No vendor plugs from Dave at all, just a good old-fashioned chat about what works and what doesn’t.

    Here’s what we dive into:

    * The adoption trap: Why SRE efforts often fail before they begin, especially when new hires care more about reliability than the org ever intended.
    * The SRE book dilemma: Dave’s take on why following the SRE book chapter-by-chapter is a trap for most companies (and what to do instead).
    * The cost of “caring too much”: How engineers burn out trying to force reliability into places it was never funded to live.
    * You build it, you run it (but should you?): Not everyone’s cut out for incident command, and why pretending otherwise sets teams up to fail.
    * Buying vs. building: The real reason even conservative enterprises are turning into software shops, and the reliability nightmare that follows.

    We also discuss the evolving role of reliability in organizations today, from being mistaken for “just ops” to becoming a strategic investment (when done right). Dave's seen the waves come and go in SRE, and he's still optimistic. That alone is worth a listen.

    31 min
  4. #66 - Unpacking 2025 SRE Report’s Damning Findings

    07/01/2025


    I know it’s already six months into 2025, but we recorded this almost three months ago. I’ve been busy with my foray into the world of tech consulting and training, and, well, editing these podcast episodes takes time and care.

    This episode was prompted by the 2025 Catchpoint SRE Report, which dropped some damning but all-too-familiar findings:

    * 53% of orgs still define reliability as uptime only, ignoring degraded experience and hidden toil
    * Manual effort is creeping back in, reversing five years of automation gains
    * 41% of engineers feel pressure to ship fast, even when it undermines long-term stability

    To unpack what this actually means inside organizations, I sat down with Sebastian Vietz, Director of Reliability Engineering at Compass Digital and co-host of the Reliability Enablers podcast. Sebastian doesn’t just talk about technical fixes; he focuses on the organizational frictions that stall change, burn out engineers, and leave “reliability” as a slide deck instead of a lived practice.

    We dig into:

    * How SREs get stuck as messengers of inconvenient truths
    * What it really takes to move from advocacy to adoption without turning your whole org into a cost center
    * Why tech is more like milk than wine (Sebastian explains)
    * And how SREs can strengthen, not compete with, security, risk, and compliance teams

    This one’s for anyone tired of reliability theatrics. No kumbaya around K8s here. Just an exploration of the messy, human work behind making systems and teams more resilient.

    30 min
  5. #65 - In Critical Systems, 99.9% Isn’t Reliable — It’s a Liability

    06/17/2025


    Most teams talk about reliability with a margin for error. “What’s our SLO? What’s our budget for failure?” But in the energy sector? There is no acceptable downtime. Not even a little.

    In this episode, I talk with Wade Harris, Director of FAST Engineering in Australia, who’s spent 15+ years designing and rolling out monitoring and control systems for critical energy infrastructure like power stations, solar farms, SCADA networks, you name it.

    What makes this episode different is that Wade isn’t a reliability engineer by title, but it’s baked into everything his team touches. And that matters more than ever as software creeps deeper into operational technology (OT), and the cloud tries to stake its claim in critical systems.

    We cover:

    * Why 100% uptime is the minimum bar, not a stretch goal
    * How the rise of renewables has increased system complexity, and what that means for monitoring
    * Why bespoke integration and SCADA spaghetti are still normal (and here to stay)
    * The reality of cloud risk in critical infrastructure (“the cloud is just someone else’s computer”)
    * What software engineers need to understand if they want their products used in serious environments

    This isn’t about observability dashboards or DevOps rituals. This is reliability when the lights go out and people risk getting hurt if you get it wrong. And it’s a reminder: not every system lives in a feature-driven world. Some systems just have to work. Always. No matter what.
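
    For a sense of scale on the 99.9% figure in this episode's title, the arithmetic below converts an availability target into the downtime it permits per year. This is a back-of-the-envelope Python sketch that assumes a 365-day year; the function name is made up for illustration and nothing here comes from the episode itself.

```python
def allowed_downtime_hours(availability_pct: float, days: int = 365) -> float:
    """Hours of downtime permitted over `days` at the given availability target."""
    total_hours = days * 24
    return total_hours * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Compare a few common availability targets side by side.
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {allowed_downtime_hours(target):.2f} h/year")
```

    At 99.9% this works out to roughly 8.76 hours of permitted downtime per year, which is the kind of margin the episode argues critical energy systems cannot accept.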

    28 min
  6. #64 - Using AI to Reduce Observability Costs

    01/28/2025


    Exploring how to manage observability tool sprawl, reduce costs, and leverage AI to make smarter, data-driven decisions.

    It's been a hot minute since the last episode of the Reliability Enablers podcast. Sebastian and I have been working on a few things in our realms. On a personal and work front, I’ve been to over 25 cities in the last 3 months and need a breather.

    Meanwhile, listen to this interesting vendor, Ruchir Jha from Cardinal, working on the cutting edge of o11y to help keep costs from spiraling out of control. (To the skeptics, he did not pay me for this episode.)

    Here’s an AI-generated summary of what you can expect in our conversation:

    In this conversation, we explore cutting-edge approaches to FinOps, i.e., cost optimization for observability. You'll hear about three pressing topics:

    * Managing Tool Sprawl: Insights into the common challenge of juggling 5-15 tools and how to identify which ones deliver real value.
    * Reducing Observability Costs: Techniques to track and trim waste, including how to uncover cost hotspots like overused or redundant metrics.
    * AI for Observability Decisions: Practical ways AI can simplify complex data, empowering non-technical stakeholders to make informed decisions.

    We also touch on the balance between open-source solutions like OpenTelemetry and commercial observability tools. Learn how these strategies, informed by Ruchir's experience at Netflix, can help streamline observability operations and cut costs without sacrificing reliability.

    21 min
  7. #63 - Does "Big Observability" Neglect Mobile?

    11/12/2024


    Andrew Tunall is a product engineering leader focused on pushing the boundaries of reliability with a current focus on mobile observability. Using his experience from AWS and New Relic, he’s vocal about the need for more user-focused observability, especially in mobile, where traditional practices fall short.

    * Career Journey and Current Role: Andrew Tunall, now at Embrace, a mobile observability startup in Portland, Oregon, started his journey at AWS before moving to New Relic. He shifted to a smaller, Series B company to learn beyond what corporate America offered.
    * Specialization in Mobile Observability: At Embrace, Andrew and his colleagues build tools for consumer mobile apps, helping engineers, SREs, and DevOps teams integrate observability directly into their workflows.
    * Gap in Mobile Observability: Observability for mobile apps is still developing, with early tools like Crashlytics only covering basic crash reporting. Andrew highlights that more nuanced data on app performance, crucial to user experience, is often missed.
    * Motivation for User-Centric Tools: Leaving “big observability” to focus on mobile, Andrew prioritizes tools that directly enhance user experience rather than backend metrics, aiming to be closer to end-users.
    * Mobile's Role as a Brand Touchpoint: He emphasizes that for many brands, the primary consumer interaction happens on mobile. Observability needs to account for this by focusing on user experience in the app, not just backend performance.
    * Challenges in Measuring Mobile Reliability: Traditional observability emphasizes backend uptime, but Andrew sees a gap in capturing issues that affect user experience on mobile, underscoring the need for end-to-end observability.
    * Observability Over-Focused on Backend Systems: Andrew points out that “big observability” has largely catered to backend engineers due to the immense complexity of backend systems with microservices and Kubernetes.
      Despite mobile being a primary interface for apps like Facebook and Instagram, observability tools for mobile lag behind backend-focused solutions.
    * Lack of Mobile Engineering Leadership in Observability: Reflecting on a former Meta product manager’s observations, Andrew highlights the lack of VPs from mobile backgrounds, which has left a gap in observability practices for mobile-specific challenges. This gap stems partly from frontend engineers often seeing themselves as creators rather than operators, unlike backend teams.
    * OpenTelemetry’s Limitations in Mobile: While OpenTelemetry provides basic instrumentation, it falls short in mobile due to limited SDK support for languages like Kotlin and frameworks like Unity, React Native, and Flutter. Andrew emphasizes the challenges of adapting OpenTelemetry to mobile, where app-specific factors like memory consumption don’t align with traditional time-based observability.
    * SREs as Connective Tissue: Andrew views Site Reliability Engineers (SREs) as essential in bridging backend observability practices with frontend user experience needs. Whether through service level objectives (SLOs) or similar metrics, SREs help ensure that backend metrics translate into positive end-user experiences, a critical factor in retaining app users.
    * Amazon’s Operational Readiness Review: Drawing from his experience at AWS, Andrew values Amazon’s practice of operational readiness reviews before launching new services. These reviews encourage teams to anticipate possible failures or user experience issues, weighing risks carefully to maintain reliability while allowing innovation.
    * Shifting Focus to “Answerability” in Observability: For Andrew, the goal of observability should evolve toward “answerability,” where systems provide engineers with actionable answers rather than mere data.
      He envisions a future where automation or AI could handle repetitive tasks, allowing engineers to focus on enhancing user experiences instead of troubleshooting.

    29 min
  8. #62 - Early YouTube SRE shares Modern Reliability Strategy

    11/05/2024


    Andrew Fong’s take on engineering cuts through the usual role labels, urging teams to start with the problem they’re solving instead of locking into rigid job titles. He sees reliability, inclusivity, and efficiency as the real drivers of good engineering. In his view, SRE is all about keeping systems reliable and healthy, while platform engineering is geared toward speed, developer enablement, and keeping costs in check. It’s a values-first, practical approach to tackling tough challenges that engineers face every day.

    Here’s a slightly deeper dive into the concepts we discussed:

    * Career and Evolution in Tech: Andrew shares his journey through various roles, from early SRE at YouTube to VP of Infrastructure at Dropbox to Director of Engineering at Databricks, with extensive experience in infrastructure through three distinct eras of the internet. He emphasizes the transition from early infrastructure roles into specialized SRE functions, noting the rise of SRE as a formalized role and the evolution of responsibilities within it.
    * Building Prodvana and the Future of SRE: As CEO of the startup Prodvana, Andrew is focused on an "intelligent delivery system" designed to simplify production management for engineers, addressing cognitive overload. He highlights SRE as a field facing new demands due to AI, discussing insights shared with Niall Murphy and Corey Bertram around AI's potential in the space, distinguishing it from "Web3" hype, and affirming that while AI will transform SRE, it will not eliminate it.
    * Challenges of Migration and Integration: Reflecting on his experiences at YouTube after its acquisition by Google, Andrew discusses the challenges of migrating YouTube’s infrastructure onto Google’s proprietary, non-thread-safe systems. This required extensive adaptation and “glue code,” offering insights into the intricacies and sometimes rigid culture of Google’s engineering approach at that time.
    * SRE’s Shift Toward Reliability as a Core Feature: Andrew describes how SRE has shifted from system-level automation to application reliability, with growing recognition that reliability is a user-facing feature. He emphasizes that leadership buy-in and cultural support are essential for organizations to evolve beyond reactive incident response to proactive, reliability-focused SRE practices.
    * Organizational Culture and Leadership Influence: Leadership’s role in SRE success is highlighted as crucial, with examples from Dropbox and Google emphasizing that strong, supportive leadership can shape positive, reliability-centered cultures. Andrew advises engineers to gauge leadership attitudes towards SRE during job interviews to find environments where reliability is valued over mere incident response.
    * Outcome-Focused Work Over Titles: Andrew emphasizes assembling the right team based on skills, not titles, to solve technical problems effectively. Titles often distract from focusing on outcomes, and fostering a problem-solving culture over role-based thinking accelerates teamwork and results.
    * Engineers as Problem Solvers: Engineers, especially natural ones, generally resist job boundaries and focus on solving problems rather than sticking rigidly to job descriptions. This echoes how iconic engineers like Steve Jobs valued versatility over predefined roles.
    * Culture as Core Values: Organizational culture should be driven by core values like reliability, efficiency, and inclusivity rather than rigid processes or roles. For instance, Dropbox's infrastructure culture emphasized being a “force multiplier” to sustain product velocity, an approach that ensured values were integrated into every decision.
    * Balancing SRE and Platform Priorities: The fundamental difference between SRE (Site Reliability Engineering) and platform engineering is their focus: SRE prioritizes reliability, while platform engineering is geared toward increasing velocity or reducing costs.
      Leaders must be cautious when assigning both roles simultaneously, as each requires a distinct focus and expertise.
    * Strategic Trade-Offs in Smaller Orgs: In smaller companies with limited resources, leaders often face challenges balancing cost, reliability, and other objectives within single roles. It's advised to sequence these priorities rather than burden one individual with conflicting objectives. Prioritizing platform stability, for example, can help improve reliability in the long term.
    * DevOps as a Philosophy: DevOps is viewed here as an operational philosophy rather than a separate role. The approach enhances both reliability and platform functions by fostering a collaborative, efficient work culture.
    * Focus Investments for Long-Term Gains: Strategic technology investments, even if they might temporarily hinder short-term metrics (like reliability), can drive long-term efficiency and reliability improvements. For instance, Dropbox invested in a shared metadata system to enable active-active disaster recovery, viewing this as essential for future reliability.

    36 min

