59 min

SRE at Google: Planet-scale observability - OpenObservability Talks S2E05

    • Technology

Have you ever wondered how services are operated at Google’s scale? Here’s your opportunity to find out. Ramón will share how his SRE team runs Google’s identity services, and the elaborate end-to-end observability they use to do so under a strict SLA. We’ll also get a glimpse of the birthplace of Kubernetes, OpenCensus, Dapper, Monarch and other cornerstones of today’s cloud-native DevOps and observability.

Ramón Medrano Llamas (@rmedranollamas) is a staff site reliability engineer at Google, focused on user identity and authentication. He concentrates on the reliability aspects of new Google products and new features of existing products, ensuring that they meet the same high bar as every other Google service. Before joining Google in 2013, he worked at CERN developing and designing distributed systems for physics. He holds a master’s degree in computer science and is pursuing a PhD on distributed systems.

The episode was live-streamed on 26 October 2021 and the video is available at https://youtube.com/live/jVTZf1SXZrg



Show Notes:


scale and size of Google Identity services operation
evolution from monitoring to observability
telemetry collection
SRE job description is changing
Google Dapper
Google Census
operating end-to-end observability at scale
flexibility vs. runbook in SRE
how SRE at Google is different
transition from monolith to microservices architecture (MSA)
Linux Foundation launching a DevOps bootcamp
Parca OSS launched
how to introduce SRE culture

Resources:


Dapper paper: Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
Borg paper: Large-scale cluster management at Google with Borg
Monarch paper: Monarch: Google’s Planet-Scale In-Memory Time Series Database
SRE books
Systemantics
