Adventures in DevOps

Will Button, Warren Parad

Join us in listening to experienced experts discuss cutting-edge challenges in the world of DevOps, from applying the mindset at your company, to career growth and leadership challenges within engineering teams, to avoiding common antipatterns. Every episode, you'll meet a new industry veteran guest with their own unique story.

  1. DEC 4

    Are we building the right thing?

    Episode Sponsor: Incident.io - https://dev0ps.fyi/incidentio

    Elise, VP and Head of UX at Unleash, joins us to talk all about UX. Self-identifying as probably "the annoying lady in the room", and with a career spanning nearly 30 years that started before "UX" was even a job title, she dismantles the idea that User Experience is just about moving pixels around. Here we debate the friction between engineering, sales, and the customer. We get to the bottom of whether avoiding end-user interaction, understanding, and research is a career-limiting move for staff+ engineers. Or should you avoid forcing a world-class developer to facilitate a call with a non-technical user if it makes them uncomfortable? Warren calls out the "pit of failure" that teams often face as they introduce feature flags: the flags can become a crutch, leading teams to push untested code into production simply because they can toggle it off. Elise also tells a great story from her consulting days, when a company spent a fortune on a branding agency that demanded conflicting "primary colors" for a mainframe application used 8 hours a day. Her low-tech solution to prove them wrong? Listen and find out. This episode is all about bringing UX to Engineering.

    💡 Notable Links: Ladder of Leadership - Book: Turn the Ship Around!
    🎯 Picks: Warren - Growth.Design Case Studies; Elise - Paper on Generative UI: LLMs are Effective UI Generators

    36 min
  2. NOV 20

    Why Your Code Dies in Six Months: Automated Refactoring

    Episode Sponsor: Incident.io - https://dev0ps.fyi/incidentio

    Warren is joined by Olga Kundzich, Co-founder and CTO of Moderne, to discuss the reality of technical debt in modern software engineering. Olga reveals a shocking statistic: without maintenance, cloud-native applications often cease to function within just six months. And from our experience, that's actually optimistic. The rapid decay isn't always due to bad code choices, but rather to the shifting sands of third-party dependencies, which make up 80 to 90% of cloud-native environments. We review the limitations of traditional Abstract Syntax Trees (ASTs) and the introduction of OpenRewrite's Lossless Semantic Trees (LSTs). Unlike standard tools, LSTs preserve formatting and style, allowing for automated, horizontal scaling of code maintenance across millions of lines of code. This fits neatly into a toolchain built from LLMs and the open source ecosystem. Olga explains how this technology enables enterprises to migrate frameworks, such as moving from Spring Boot 1 to 2, without dedicating entire years to manual updates. Finally, we explore the intersection of AI and code maintenance, noting that while LLMs are great at generating code, they often struggle with refactoring and optimizing existing codebases. We highlight that agents are not yet fully autonomous and will always require "right-sized" data to function effectively. Will is absent for this episode, leaving Warren to navigate the complexities of mass-scale code remediation solo.

    💡 Notable Links: DevOps Episode: We read code; DevOps Episode: Dynamic PRs from incidents; OpenRewrite; Larger Context Windows are not better
    🎯 Picks: Warren - Dell XPS 13 9380; Olga - Claude Code

    33 min
  3. OCT 20

    Solving incidents with one-time ephemeral runbooks

    Episode Sponsor: Attribute - https://dev0ps.fyi/attribute

    In the wake of one of the worst AWS incidents in history, we're joined by Lawrence Jones, Founding Engineer at Incident.io. The conversation focuses on the challenges of managing incidents in highly regulated environments like FinTech, where the penalties for downtime are harsh and the response process demands a high level of rigor and discipline. Lawrence details the company's evolution from running a monolithic Go binary on Heroku to a more secure, robust setup in GCP, prioritizing native security primitives like GCP Secret Manager and Kubernetes to meet the obligations of their growing customer base. We spotlight exactly how a system can crawl GitHub pull requests, Slack channels, telemetry data, and past incident post-mortems to dynamically generate an ephemeral runbook for the current incident. Also discussed are the technical challenges of using RAG (Retrieval-Augmented Generation): to ensure fast, accurate search results during a crisis, they rely heavily on pre-processing data with tags and a service catalog rather than solely on less consistent vector embeddings. Finally, Lawrence stresses that frontier models are no longer the limiting factor in building these complex systems; rather, success hinges on building structured, modular systems and doing the hard work of defining objective metrics for improvement.

    💡 Notable Links: Cloud Secrets management at scale; Episode: Solving Time Travel in RAG Databases; Episode: Does RAG Replace keyword search?
    🎯 Picks: Warren - Anker Adaptable Wall-Charger - PowerPort Atom III; Lawrence - Rocktopus & The Checklist Manifesto

    50 min
  4. SEP 17

    The Unspoken Challenges of Deploying to Customer Clouds

    This episode we are joined by Andrew Moreland, co-founder of Chalk. Andrew explains that the company's core business model is to deploy its software directly into customers' cloud environments. This decision was driven by the need to handle highly sensitive data, like PII and financial records, that customers don't want to hand over to a third-party startup. The conversation delves into the surprising and complex challenges of this approach, which include managing granular IAM permissions and dealing with hidden global policies that can block their application. Andrew and Warren also discuss the real-world network congestion issues that affect cross-cloud traffic, a problem they've encountered multiple times. Andrew shares Chalk's mature philosophy on software releases, which prioritizes backwards compatibility to prevent customer churn, a key lesson they took from a competitor. Finally, the episode explores the advanced technical solutions Chalk has built, such as their unique approach to "bitemporal modeling" to prevent training bias in machine learning datasets, and the decision to move from Python to C++ and Rust for performance, using a symbolic interpreter to execute customer code written in Python without a Python runtime. The episode concludes with picks, including a surprisingly popular hobby and a unique take on high-quality chocolate.

    💡 Notable Links: Fact - The $1M hidden Kubernetes spend; Giraffe and Medical Ruler training data bias; SOLID principles don't produce better code?; Veritasium - The Hole at the Bottom of Math; Episode: Auth Showdown on backwards compatible changes
    🎯 Picks: Warren - Switzerland Grocery Store Chocolate; Andrew - Trek E-Bikes

    53 min
  5. SEP 7

    How to build in Observability at Petabyte Scale

    We welcome guest Ang Li and dive into the immense challenge of observability at scale, where some customers generate petabytes of data per day. Ang explains that Observe chose not to build a database from scratch, a decision he says went "against all the instincts" of a founding engineer, and instead built its platform on top of Snowflake, leveraging its separation of compute and storage on EC2 and S3. The discussion delves into the technical stack and architectural decisions, including the use of Kafka to absorb large bursts of incoming customer data and smooth them out for Snowflake's batch-based engine. Ang notes this choice was also strategic: it avoids tight coupling to a single cloud provider's service, such as AWS Kinesis, which would hinder future multi-cloud deployments on GCP or Azure. The discussion also covers their unique pricing model, which avoids surprising customers with high bills by charging a lower price for data ingestion and then using a usage-based model for queries. This is contrasted with Warren's experience with his company's user-based pricing, which can lead to negative customer experiences when limits are exceeded. The episode also explores Observe's "love-hate relationship" with Snowflake: Observe's usage accounts for over 2% of Snowflake's compute, which has helped them discover a lot of bugs but has also caused sleepless nights for Snowflake's on-call engineers. Ang discusses hedging their bets for the future by leveraging open data formats like Iceberg, which can be stored directly in customer S3 buckets to enable true data ownership and portability. The episode concludes with a deep dive into the security challenges of providing multi-account access to customer data using IAM trust policies, and a look at the personal picks from the hosts.

    💡 Notable Links: Fact - Passkeys: Phishing on Google's own domain and It isn't even new; Episode: All About OTEL; Episode: Self Healing Systems
    🎯 Picks: Warren - The Shadow (1994 film); Ang - XREAL Pro AR Glasses

    46 min
Rating: 4.4 out of 5 (18 ratings)

