There are multiple protocols and components that keep the complex Internet engine running. And just like any other well-oiled machine, it is important to regularly check whether it is functioning efficiently and delivering optimum performance.
The internet is essentially a network of networks relaying data packets across many different paths. One of the most important processes that keeps it running is IP routing. Several protocols manage the flow of data; the Border Gateway Protocol (BGP) governs how traffic is routed between the autonomous systems (ASes) that make up the network.
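To make the mechanics concrete, here is a minimal sketch (not a real BGP implementation) of how a route announcement accumulates an AS path as it propagates from network to network, and how a router typically prefers the shortest AS path it has heard. The prefix and AS numbers are illustrative.

```python
# Minimal sketch of BGP route propagation and path selection.
# Not a real implementation; prefix and neighbor ASes are illustrative.

def propagate(announcement, via_as):
    """Each AS prepends its own number to the path before re-advertising."""
    prefix, as_path = announcement
    return (prefix, [via_as] + as_path)

def best_route(routes):
    """All else being equal, BGP prefers the route with the shortest AS path."""
    return min(routes, key=lambda r: len(r[1]))

# Google's AS 15169 originates a prefix; neighbors re-advertise it.
origin = ("8.8.8.0/24", [15169])
via_a = propagate(propagate(origin, 3356), 1299)   # heard via two hops
via_b = propagate(origin, 2914)                    # heard via one hop

print(best_route([via_a, via_b]))
# -> ('8.8.8.0/24', [2914, 15169])  (the shorter path wins)
```

Because routers trust the announcements they receive, a mistaken announcement can redirect traffic exactly as a legitimate one would, which is what happened in this incident.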
The need to strengthen security protocols and to implement proactive measures that quickly identify performance degradation across a network cannot be stressed enough. This was highlighted by the BGP routing issue Google faced yesterday. Although the issue was quickly resolved, it still had a significant impact on user experience across multiple platforms.
At 16:30 EST on November 12th, Google noticed connectivity issues across multiple services, including APIs, load balancers, and even its cloud services.
Catchpoint triggered performance alerts as soon as the issue surfaced. The charts below show some of the different Google services that were impacted.
Looking at the performance data from multiple customers, we realized this was a routing problem. For example, in the instance illustrated below, traffic was routed from Germany to Russia.
The RIPEstat data shows the routing path: AS37282 was advertised as the route to Google prefixes. This route information was then accepted by AS4809 (China Telecom) and propagated to AS20485 (TransTelecom Russia).
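One simple heuristic a monitoring tool can apply to BGP data such as RIPEstat's is an origin check: flag any announcement for a Google prefix whose originating AS (the last hop in the AS path) is not Google's AS 15169. A minimal sketch follows; the prefix and AS paths are illustrative, not verbatim captures from the incident.

```python
# Minimal origin-AS sanity check for BGP announcements.
# Prefix and AS paths below are illustrative, not captured data.

GOOGLE_ASN = 15169

def origin_as(as_path):
    # In a BGP AS path, the originating AS appears last.
    return as_path[-1]

def is_suspicious(as_path, expected_origin=GOOGLE_ASN):
    """Flag a route whose origin AS is not the expected owner."""
    return origin_as(as_path) != expected_origin

# Normal announcement: the path ends at Google's AS.
print(is_suspicious([20485, 4809, 15169]))   # False

# Leaked route: another AS appears as the origin instead.
print(is_suspicious([20485, 4809, 37282]))   # True
```

Real-world leak detection is more involved (RPKI origin validation, historical baselines), but even a check this simple would have flagged the anomalous paths quickly.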
Initial reports of the incident, coupled with the suspicious routing paths, pointed to a potential BGP hijack. But a Google representative clarified to Ars Technica that the incident was accidental, not malicious.
MainOne Cable Company, the Nigerian ISP identified as the origin of the issue, also tweeted that the leak was an error that occurred during planned maintenance.
Within 30 minutes, the issue was resolved, and Google issued this statement on its Cloud Status Dashboard:
“Throughout the duration of this issue Google services were operating as expected and we believe the root cause of the issue was external to Google. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.”
Not just another third party
We are constantly discussing the performance tax that comes with integrating third-party tags. Incidents like this are a reminder that third-party monitoring should never be overlooked.
The routing issue brought down Google services, which had an immediate impact on performance; multiple websites had unusually high page load times. This was mainly due to the Google AJAX libraries referenced by many websites. The outage brings the focus right back to third-party tag management and how performance issues introduced by these tags can lead to downtime.
Customers using the AJAX libraries provided by Google (ajax.googleapis.com) saw a noticeable drop in performance throughout the duration of the routing incident. Websites that relied on the Google AJAX library did not load properly, leaving the page blank. For example, this website was blank for over 31 seconds.
The waterfall graph shows the unusually high wait time for the Google APIs resource, which pushed the page load time to 54 seconds.
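A basic form of proactive third-party monitoring is to time requests to each critical dependency against a latency budget, so a hung resource raises an alert instead of silently stalling pages. Here is a minimal sketch using only the Python standard library; the URL, timeout, and budget values are illustrative.

```python
# Minimal sketch of a third-party resource check: fetch a dependency
# with a hard timeout and compare its response time to a latency budget.
# URL, timeout, and budget values are illustrative.

import time
import urllib.error
import urllib.request

def check_resource(url, timeout_s=5.0, budget_s=2.0):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()
            status = resp.status
    except (urllib.error.URLError, TimeoutError) as exc:
        # Unreachable or timed-out dependency: report failure, not a crash.
        return {"url": url, "ok": False, "error": str(exc)}
    elapsed = time.monotonic() - start
    return {"url": url, "ok": status == 200 and elapsed <= budget_s,
            "status": status, "elapsed_s": round(elapsed, 3)}

# A reserved .invalid hostname demonstrates the failure path offline.
print(check_resource("http://nonexistent.invalid/", timeout_s=1.0))
```

In production, a check like this would run continuously from multiple vantage points (as Catchpoint does), since a routing problem can make a dependency slow or unreachable from some regions while it remains healthy from others.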
Multiple features make up an online application, so dependencies on third-party services are inevitable. Proactive, constant monitoring of these services is key to mitigating the impact on performance. It is even more important to be prepared to handle such incidents; we shared tips on how you can do this in our blog post “5 Lessons for Managing a Third Party Outage”.
Performance monitoring is no longer just about the uptime or downtime of an application. Advanced monitoring provides you with the data and tools necessary to identify issues quickly, as well as to predict and prevent potential performance problems.