How Third-Party Services Nearly Grounded Jet.com
Google Compute Engine experienced an 18-minute outage in April 2016 that significantly affected Jet.com's page load speeds.
Earlier this month, Google Compute Engine, Google’s infrastructure-as-a-service offering and Amazon Web Services alternative, went down across all regions for 18 minutes. The outage was reportedly caused by two bugs in Google’s network management software. The bugs were triggered by a configuration error after Google engineers changed GCE’s network configuration in an effort to speed it up.
The outage was barely noticed outside of a handful of tech news sites and blogs, but it had ripple effects throughout the digital business world. Below is my analysis of the impact of this outage on the ecommerce site www.jet.com.
Mobile Testing
Eight test runs were initiated between 19:11 PT and 19:26 PT, and all of them reported web response times at least 333% above the median web response time (about 4.3 times the median). This correlated with a 500% increase in Document Complete times over the median Document Complete time (6 times the median).
Desktop Testing
Eight test runs were initiated between 19:11 PT and 19:26 PT, and all of them reported web response times at least 350% above the median web response time (4.5 times the median). This correlated with a 511% increase in document complete times over the median document complete time (about 6.1 times the median).
Further analysis shows no anomalies in the response time (the time to download the home page) or in the render start (the time taken for the browser to begin painting the page), but huge variations in the document complete (the time taken by the browser to finish downloading and parsing all resources in the page) and in the eventual web response time (the completion of downloading and rendering of all resources on the web page).
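The percentage and multiplier figures quoted for the mobile and desktop tests describe the same change in two ways: an increase of p percent over the median means the slow run took (1 + p/100) times the median. A minimal sketch of that conversion, using only the figures quoted above (the function names are illustrative and do not come from any tool used in this analysis):

```python
def increase_to_multiple(percent_increase: float) -> float:
    """Convert a percentage increase over the median into a multiple of the median."""
    return 1 + percent_increase / 100


def multiple_to_increase(multiple: float) -> float:
    """Convert a multiple of the median back into a percentage increase."""
    return (multiple - 1) * 100


# Percentage increases quoted in the mobile and desktop results.
for label, pct in [
    ("mobile web response", 333),
    ("mobile document complete", 500),
    ("desktop web response", 350),
    ("desktop document complete", 511),
]:
    print(f"{label}: +{pct}% is about {increase_to_multiple(pct):.1f}x the median")
```

Stating both forms side by side avoids the ambiguity of a phrase like "a 6X increase", which could be read either as six times the median or as six times more than the median.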
This finding led our troubleshooting to the third-party services used by Jet.com.
The waterfall charts for the affected tests show a pattern of two specific third-party hosts failing, which inflated the document complete time and the overall web response time.
The culprits, api.lytics.io and c.lytics.io, belong to Lytics, an online ad-targeting company.
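The waterfall check described above amounts to grouping requests by host and flagging hosts outside the site's own domain whose requests failed or ran long. A rough sketch follows, assuming the waterfall has been exported as a list of records with `url`, `status`, and `duration_ms` fields. That shape is hypothetical; real tools export richer formats such as HAR. The sample hosts, the records, and the 5000 ms threshold are invented for illustration and are not measurements from these tests.

```python
from collections import defaultdict
from urllib.parse import urlparse


def flag_third_party_hosts(requests, first_party_domain, slow_ms=5000):
    """Return {host: [reasons]} for third-party hosts with failed or slow requests."""
    problems = defaultdict(list)
    for req in requests:
        host = urlparse(req["url"]).hostname or ""
        # Treat the site's own domain and its subdomains as first party.
        if host == first_party_domain or host.endswith("." + first_party_domain):
            continue
        status = req.get("status")  # None stands for "no response received"
        if status is None or status >= 500:
            problems[host].append(f"failed (status={status})")
        elif req["duration_ms"] >= slow_ms:
            problems[host].append(f"slow ({req['duration_ms']} ms)")
    return dict(problems)


# Hypothetical waterfall records; hosts and values are illustrative only.
sample = [
    {"url": "https://www.shop.example/", "status": 200, "duration_ms": 420},
    {"url": "https://cdn.shop.example/app.js", "status": 200, "duration_ms": 180},
    {"url": "https://api.tracker.example/v1", "status": None, "duration_ms": 30000},
    {"url": "https://c.tracker.example/c.js", "status": 503, "duration_ms": 12000},
]
print(flag_third_party_hosts(sample, "shop.example"))
```

Grouping by host rather than by individual request makes it easier to spot a single vendor as the common factor across several failing resources.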
Like many online services, Lytics outsources its IT infrastructure to an IaaS provider. You guessed it: Google Cloud Platform (https://www.getlytics.com/pdf/LyticsTechOverview.pdf).
This sort of scenario is all too familiar to web performance engineers. The same third-party services that help you target ads, personalize content, and understand customer behavior on your site are also a major performance vulnerability. In this case, the root cause was the infrastructure service supporting the third-party service. This is a good illustration of the complex interdependencies that power online commerce and content today.
What can you do to manage this complexity and prevent it from damaging your end-user experience? For starters, make sure all third-party and marketing resources like these load after the document complete event. And be sure to continuously monitor every third-party service that your web application calls, to make sure none of them is slowing down, or taking down, your site.
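As one way to picture the monitoring advice, here is a minimal sketch of a single availability probe built on Python's standard `urllib`. It is not how Catchpoint's monitoring works. The function name `probe`, the 5-second timeout, and the injectable `opener` parameter (there so the check can be exercised without a network) are all assumptions made for illustration.

```python
import time
import urllib.request


def probe(url, timeout_s=5.0, opener=urllib.request.urlopen):
    """Fetch url once and return (ok, elapsed_seconds, detail)."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as resp:
            resp.read()  # download the body so elapsed time covers the full response
            ok = 200 <= resp.status < 400
            detail = f"HTTP {resp.status}"
    except OSError as exc:
        # URLError, HTTPError, and socket timeouts are all subclasses of OSError.
        ok, detail = False, f"error: {exc}"
    return ok, time.monotonic() - start, detail


# Example use (hypothetical URL; run on a schedule and alert on failures):
#   ok, elapsed, detail = probe("https://third-party.example/health")
#   if not ok or elapsed > 2.0:
#       print("third-party check failed:", detail, f"{elapsed:.2f}s")
```

A probe like this only covers availability and latency from one location; measuring the effect on document complete time still requires loading the full page in a browser.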
Anand Guruprusad is a performance engineer in Catchpoint’s Bangalore, India office.