Blog Post

Incident Review – Google Outage

Published

September 25, 2020

When something as ubiquitous as Google goes down, there is a lot of online frenzy, with users tweeting and searching for updates on the issue. That’s exactly what we witnessed today between 9/24/2020 17:59:44 and 9/24/2020 18:23:20 PDT. Multiple Google services, including Mail, Drive, Meet, and Hangouts, experienced downtime.

Frustrated users took to Twitter to report the outage, and the tweets were captured by Websee.

Users trying to access Google services got a 502 error screen.

Incident Timeline

At Catchpoint, we synthetically monitor various Google services such as Meet, Hangouts, Calendar, Drive, and Mail. Login to Google was impacted, so users could not access the suite of services.

We received the first alert from our monitoring at 9/24/2020 17:59:51. Looking at the data, we saw that the issue began at 9/24/2020 17:59:44.

Fig 1: Downtime detected across all nodes

Fig 2: Outage Scatterplot
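The detection delay and the overall outage window follow directly from the timestamps quoted in this post. The snippet below is a minimal sketch of that arithmetic; the variable names are illustrative, and the recovery time is taken from the end of the outage window stated in the introduction.

```python
from datetime import datetime

# Timestamp format used in this post, e.g. "9/24/2020 17:59:44".
FMT = "%m/%d/%Y %H:%M:%S"

# Timestamps quoted in the post (all in the same local time zone).
onset = datetime.strptime("9/24/2020 17:59:44", FMT)
first_alert = datetime.strptime("9/24/2020 17:59:51", FMT)
recovery = datetime.strptime("9/24/2020 18:23:20", FMT)

# How long after the onset the first alert fired.
detection_delay = first_alert - onset
# How long the outage lasted from onset to recovery.
outage_duration = recovery - onset

print(f"Detection delay: {detection_delay.total_seconds():.0f} s")
print(f"Outage duration: {outage_duration}")
```

With these values, the first alert fired 7 seconds after the onset, and the outage lasted 23 minutes and 36 seconds.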

The Google status page reported the issue shortly afterward.

Fig 3: Google Status Dashboard

The outage was widespread: the impact was seen from offices in Bangalore, Los Angeles, New York, and Boston.

Fig 4: Global Impact

The issues seen by the end users can be broadly categorized into two groups:

1. The server returning a 502 HTTP response code.

Fig 5: Waterfall and Header details showing 502 Error

2. Connection timeout to the server and high latency.

Fig 6: High Latency
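The two failure groups above can be told apart programmatically from synthetic test results. The following is a rough sketch, not Catchpoint's implementation: the `ProbeResult` type, the `classify` function, the `SLOW_MS` threshold, and the sample values are all hypothetical and only illustrate the grouping.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold above which a response counts as "high latency".
SLOW_MS = 5000.0


@dataclass
class ProbeResult:
    city: str
    status_code: Optional[int]    # None when no HTTP response was received
    response_ms: Optional[float]  # None when the connection timed out


def classify(result: ProbeResult) -> str:
    """Assign a probe result to one of the failure groups, or 'ok'."""
    if result.status_code is None:
        return "connection_timeout"
    if result.status_code == 502:
        return "http_502"
    if result.response_ms is not None and result.response_ms > SLOW_MS:
        return "high_latency"
    return "ok"


# Made-up sample results; the cities come from the post, the values do not.
results = [
    ProbeResult("New York", 502, 310.0),
    ProbeResult("Boston", None, None),
    ProbeResult("Los Angeles", 200, 9200.0),
    ProbeResult("Bangalore", 200, 420.0),
]

counts = Counter(classify(r) for r in results)
print(dict(counts))
```

Checking for a missing response before checking the status code keeps timeouts from being misread as slow successes.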

Looking at the server IP breakdown, we noted that certain IP ranges were impacted. Here is a breakdown of the impacted server IPs per city:

Fig 7: IP Breakdown by City

Interestingly, some of the servers that historically serve certain cities did not receive any requests during the outage.

Fig 8: Change in Servers

Baselines are thus very important, as they help us identify anomalies. For a highly distributed network like Google's, this level of visibility gets you one step closer to the source of the issue.
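One simple way to use such a baseline is to compare, per city, the set of server IPs seen historically against the set seen during the incident. The sketch below assumes that approach; the `silent_servers` function and the sample data are hypothetical, and the addresses are reserved documentation IPs rather than real Google servers.

```python
from typing import Dict, Set


def silent_servers(
    baseline: Dict[str, Set[str]], current: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Return, per city, baseline server IPs that received no requests now."""
    missing = {}
    for city, usual_ips in baseline.items():
        gone = usual_ips - current.get(city, set())
        if gone:
            missing[city] = gone
    return missing


# Placeholder data using documentation address ranges (RFC 5737).
baseline = {
    "New York": {"192.0.2.10", "192.0.2.11", "198.51.100.7"},
    "Boston": {"203.0.113.5", "203.0.113.6"},
}
during_outage = {
    "New York": {"192.0.2.10"},
    "Boston": {"203.0.113.5", "203.0.113.6"},
}

silent = silent_servers(baseline, during_outage)
for city, ips in silent.items():
    print(city, sorted(ips))
```

A city that is absent from the current data is treated as having lost all of its usual servers, which is itself worth flagging.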

The traceroute tests that were running in parallel also offered some useful insights. Before the outage, we noted no packet loss at the Google AS.

Fig 9: Traceroute before Outage

However, during the outage, we saw packet loss at the Google AS.

Fig 10: Traceroute during Outage
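To compare loss before and during an outage, per-hop traceroute results can be aggregated by autonomous system. The sketch below is illustrative only: the `Hop` type, the `loss_by_asn` function, and the packet counts are hypothetical. AS15169 is Google's primary public ASN, though the post does not state which Google AS showed the loss, and 64512 is a private-use ASN standing in for an upstream network.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Hop(NamedTuple):
    asn: int       # autonomous system the hop belongs to
    sent: int      # probe packets sent to this hop
    received: int  # replies received from this hop


def loss_by_asn(hops: Iterable[Hop]) -> Dict[int, float]:
    """Aggregate packet loss (in percent) per autonomous system."""
    sent = defaultdict(int)
    received = defaultdict(int)
    for hop in hops:
        sent[hop.asn] += hop.sent
        received[hop.asn] += hop.received
    return {
        asn: 100.0 * (sent[asn] - received[asn]) / sent[asn]
        for asn in sent
        if sent[asn] > 0
    }


# Made-up hop data for illustration; not measurements from the incident.
before = [Hop(64512, 3, 3), Hop(15169, 3, 3), Hop(15169, 3, 3)]
during = [Hop(64512, 3, 3), Hop(15169, 3, 1), Hop(15169, 3, 2)]

print("before:", loss_by_asn(before))
print("during:", loss_by_asn(during))
```

Grouping by AS rather than by individual hop makes it easier to see whether loss is concentrated inside one network or spread along the path.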

We were actively monitoring the outage on social media as well. Urs Hölzle, senior vice president of technical infrastructure and Google Fellow at Google, tweeted about the root cause: “As has been noticed, several Google services were down for some users from 6:00 to 6:23 p.m. PDT. A pool of servers that route traffic to application backends crashed and users on that particular pool experienced the outage.”

This is the fourth outage or service disruption for Google services in September 2020. Earlier issues occurred on September 18 (Google Chat), September 15 (Google Drive), and September 8 (Google Drive), and this one affected multiple Google services, including Drive. User experience has certainly been impacted, and service reliability remains a big question for every company that provides online services.

Summary

We rely on important services like Google Drive, Mail, Calendar, Maps, and YouTube, which are always in demand and used by almost everyone who has internet access on any device. In the current global situation, with the majority of the workforce working from home, these tools are even more crucial for communication and collaboration. The whole delivery chain and infrastructure have to scale up to cope with the surge in active users accessing these services and to ensure the user experience is not impacted. When services like Google go down, the consequences are immediate. The incident was resolved by 6:23 p.m. PDT, but the impact on end-user experience was significant.

Google has always led the way in reliability and monitoring. A number of SRE practices and philosophies originated at Google. But when it comes to technology, things are bound to break; that is its inherent nature. With Google services dominating a large part of our daily lives, any impact on those services is amplified. Handling an outage of this magnitude is a testament to the reliability, operations, network, and monitoring systems at Google.
