OpenObservability Talks

SRE at Google: Planet-scale observability - OpenObservability Talks S2E05

Have you ever wondered how services are operated at Google’s scale? Here’s your opportunity to find out. Ramón will share how his SRE team runs Google’s identity services, and the elaborate end-to-end observability they use to do so under a strict SLA. We’ll also get a glimpse of the birthplace of Kubernetes, OpenCensus, Dapper, Monarch and other cornerstones of today’s cloud-native DevOps and observability.

Ramón Medrano Llamas (@rmedranollamas) is a staff site reliability engineer at Google, focused on user identity and authentication. He concentrates on the reliability aspects of new Google products and new features of existing products, ensuring that they meet the same high bar as every other Google service. Before joining Google in 2013, he worked at CERN designing and developing distributed systems for physics. He holds a master’s degree in computer science and is pursuing a PhD in distributed systems.

The episode was live-streamed on 26 October 2021 and the video is available at https://youtube.com/live/jVTZf1SXZrg

Show Notes:

  • scale and size of Google Identity services operation
  • evolution from monitoring to observability
  • telemetry collection
  • SRE job description is changing
  • Google Dapper
  • Google Census
  • operating end-to-end observability at scale
  • flexibility vs. runbook in SRE
  • how SRE at Google is different
  • transition from monolith to microservices architecture (MSA)
  • Linux Foundation launching a DevOps bootcamp
  • Parca OSS launched
  • how to introduce SRE culture

Resources:

  • Dapper paper: Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
  • Borg paper: Large-scale cluster management at Google with Borg
  • Monarch paper: Monarch: Google’s Planet-Scale In-Memory Time Series Database
  • SRE books 
  • Systemantics