Google Cloud Outage: A lesson in reducing mean time to detect
Everything from G Suite and YouTube to ecommerce platform providers like Shopify was hit by the outage, which lasted roughly 7 hours and 28 minutes.
This past Sunday was a stressful and difficult one for Google engineers and SREs, as multiple Google services were rendered inaccessible across multiple locations around the globe. Everything from G Suite, Hangouts, YouTube, Snapchat, and Vimeo to ecommerce platform providers like Shopify was hit by the outage, which lasted roughly 7 hours and 28 minutes. Some services came back around 5:15 PM PST, but we still recorded errors until 7:18 PM PST.
Catchpoint caught the incident as soon as it started (around 11:50 AM PST). Many of our customers run on Google Cloud, use G Suite, or host images on Google Storage, and we saw a huge increase in alerts and errors across those customers.
The scatterplot below shows the status of Google services when the outage first hit.
The Google Cloud status page was still green, even while Catchpoint was seeing the issue impact multiple Google services. The screenshot below was taken at the same time Catchpoint detected the incident:
The status page flagged the incident roughly 35 minutes after it started: the first update from Google was logged at 12:25 PM PST.
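The timeline can be checked with a little arithmetic. The sketch below only uses the times quoted in this post (detection around 11:50 AM, first status update at 12:25 PM, last recorded errors at 7:18 PM); the variable names are illustrative, and the 11:50 AM start is approximate, so the results are approximate too.

```python
from datetime import datetime

# Times quoted in this post, all on Sunday, June 2, 2019, in the same time zone,
# so no time zone conversion is needed to compute the differences.
DAY = "2019-06-02 "
FMT = "%Y-%m-%d %I:%M %p"

first_detected = datetime.strptime(DAY + "11:50 AM", FMT)       # approximate start
first_status_update = datetime.strptime(DAY + "12:25 PM", FMT)  # first status page update
last_error_seen = datetime.strptime(DAY + "07:18 PM", FMT)      # last errors we recorded


def as_minutes(delta):
    """Convert a timedelta to whole minutes."""
    return int(delta.total_seconds() // 60)


status_page_lag = first_status_update - first_detected
total_impact = last_error_seen - first_detected

print("Status page lag:", as_minutes(status_page_lag), "minutes")
hours, minutes = divmod(as_minutes(total_impact), 60)
print("Total impact:", hours, "hours", minutes, "minutes")
```

With these inputs, the gap between detection and the first status update comes to 35 minutes, and the total window of recorded errors to 7 hours and 28 minutes.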
So, let’s look at what went down. Most Google services were returning 502 errors.
HTTP/1.1 502 Bad Gateway
Date: Sun, 02 Jun 2019 21:06:53 GMT
Content-Type: text/html; charset=UTF-8
Alt-Svc: quic=":443"; ma=2592000; v="46,44,43,39"
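As a minimal sketch of how a monitoring check might treat a response like this one, the snippet below parses the status line and flags any 5xx code as a server-side failure. The function names are hypothetical and this is not how Catchpoint's checks are implemented; it only illustrates the idea.

```python
def parse_status_line(raw_response):
    """Return (http_version, status_code, reason) from a raw HTTP response."""
    status_line = raw_response.splitlines()[0]
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


def is_server_side_failure(status_code):
    """5xx codes mean the server (or a gateway in front of it) failed."""
    return 500 <= status_code <= 599


# Part of the response shown above, reproduced as a raw string.
raw = (
    "HTTP/1.1 502 Bad Gateway\r\n"
    "Date: Sun, 02 Jun 2019 21:06:53 GMT\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
)

version, code, reason = parse_status_line(raw)
if is_server_side_failure(code):
    print("Server-side failure:", code, reason)
```

A 502 specifically means that a gateway or proxy received an invalid response from the upstream server it was forwarding to.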
Many websites were slow to load, while others failed to load altogether, because they depend on fonts and other third-party resources hosted on Google Cloud. The chart below shows the impact on webpage load time compared to the day before.
YouTube also had performance issues, and pages failed to load videos:
A quick analysis of the data made clear that this was an issue with Google infrastructure. The waterfall graph below shows the high load time which resulted in performance degradation across major websites.
Traffic flowing into Google servers was immediately impacted by the issue.
The Sankey chart above illustrates the flow of network traffic. We can see the slowdown and packet loss as it hit the Google network.
Google updated the status page as soon as the issue was resolved, citing network congestion as the root cause. The Catchpoint Sankey chart backs up the network congestion scenario.
Another lesson in cloud monitoring
The incident impacted millions of websites and disrupted businesses—resulting in major revenue loss. Shopify, a popular ecommerce platform, was left to deal with confused customers who took to Twitter to vent their frustration.
Companies relying on G Suite experienced a huge productivity hit, as many were left without access to Hangouts and Gmail.
Google Nest users were also left in the dark, unable to reach the Nest services to check their security cameras or control their air conditioning. This shows how ubiquitous cloud-connected services have become, and how fragile systems are when they rely on a single point of failure.
Catchpoint has always stressed how vital Mean Time to Detect (MTTD) is when it comes to end-user experience. This incident is a case in point. We had already detected and narrowed down the root cause before the first Google status update went up. It then took another hour before Google was able to pinpoint the cause. The delay in updating service status only added to the confusion and frustration of millions of customers worldwide.
If you want to deliver positive end-user experiences, Mean Time to Resolve is the metric to beat. As we have always said, it’s an equation with five components: Detect, Identify, Escalate, Fix, and Validate.
The sooner you detect an issue, the sooner every other step can begin, from identification and escalation to communicating with customers and updating status pages.
Any delay in detecting an issue adds directly to the Mean Time to Resolve (MTTR). The higher the MTTR, the greater the impact on end-user experience. All else being equal, a lower MTTD means a lower MTTR. If you are proactively monitoring MTTD, then rest assured, you have already made a difference in end-user experience.
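To make the five-component equation concrete, here is a minimal sketch that treats the time to resolve a single incident as the sum of its five stages. The class name and all durations are hypothetical, chosen only for illustration; they are not measurements from this outage. A mean would average this total across many incidents.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class IncidentTimeline:
    """Minutes spent in each stage of one incident (illustrative values only)."""
    detect: float
    identify: float
    escalate: float
    fix: float
    validate: float

    def time_to_resolve(self):
        # Time to resolve is modeled as the plain sum of the five stages.
        return self.detect + self.identify + self.escalate + self.fix + self.validate


# Hypothetical incident with slow detection.
slow_detection = IncidentTimeline(detect=35, identify=60, escalate=10, fix=120, validate=15)

# Same incident, assuming only the detection stage gets faster.
fast_detection = replace(slow_detection, detect=5)

saved = slow_detection.time_to_resolve() - fast_detection.time_to_resolve()
print("Slow detection:", slow_detection.time_to_resolve(), "minutes to resolve")
print("Fast detection:", fast_detection.time_to_resolve(), "minutes to resolve")
print("Minutes saved:", saved)
```

Under the assumption that the other four stages stay the same, every minute cut from detection comes straight off the time to resolve.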
The entire digital ecosystem relies heavily on the health of cloud services. Cloud monitoring is no longer an option; it is a necessity. The cloud does not come with a performance safety net. The reality is that no matter where your application is hosted, and no matter how good your optimization strategies are, proactive monitoring from the end-user perspective, at a very high frequency, is key to mitigating the impact on end-user experience.
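Why frequency matters can be seen in a simplified model. The sketch below assumes checks run on a fixed schedule starting at time zero, that every check during the outage fails immediately, and that an alert fires after a set number of consecutive failures. The function and parameter names are hypothetical, and real monitoring systems have more moving parts.

```python
import math


def detection_delay(outage_start_s, check_interval_s, failures_to_alert=1):
    """Seconds from outage start to alert, with checks at 0, interval, 2*interval, ..."""
    # The first check at or after the outage start is the first one to fail.
    first_failing_check = math.ceil(outage_start_s / check_interval_s) * check_interval_s
    # The alert fires once the required number of consecutive checks have failed.
    alert_time = first_failing_check + (failures_to_alert - 1) * check_interval_s
    return alert_time - outage_start_s


# Hypothetical outage starting 61 seconds into the schedule, alerting after 2 failures.
every_minute = detection_delay(61, check_interval_s=60, failures_to_alert=2)
every_15_minutes = detection_delay(61, check_interval_s=900, failures_to_alert=2)

print("1-minute checks:", every_minute, "seconds to alert")
print("15-minute checks:", every_15_minutes, "seconds to alert")
```

In this model the worst-case delay is the check interval multiplied by the number of failures required, so a shorter interval directly shrinks the time to detect.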