1 hr 18 min

Opsie Daisy! The GCP Outage of 2019 and a whole lotta BGP talk
Ops 'N' Hops

    • Technology

In the inaugural episode of the Opsie Daisy podcast, Derek and Malcolm discuss the June 2019 Google Cloud Platform (GCP) outage, a major networking incident that affected all US regions. The outage lasted from three hours and 20 minutes to four hours and 25 minutes, depending on the region, and caused packet loss and service disruptions across multiple regions. It also affected some G Suite services, highlighting the interconnectedness of Google's services.

The root cause was a combination of misconfigurations and a software bug in a maintenance event, which caused the network control plane jobs to fail and resulted in network congestion and traffic loss. Google's follow-up actions include modifying its cluster management software and extending the time the network can run without the control plane.

The conversation also covers the challenges and limitations of the Border Gateway Protocol (BGP), in particular its lack of security and the difficulty of upgrading the protocol given the diversity and scale of the internet. Malcolm and Derek highlight the importance of public retrospectives in the tech industry and commend companies like Google for being transparent about their outages. They discuss SLA credits and the mixed reactions to Google's process for applying for them, the significance of effective communication during an outage, and the need for human intervention in complex systems. They also touch on the challenges of managing complexity and the trade-offs between preventing outages and adding more processes. Overall, they rate the outage a 3 out of 5 on the Opsie Daisy scale.

