The following is an analysis of the Amazon Web Services incident on 12/15/2021.
It may be the holiday season for most of us, but for AWS it appears to be Groundhog Day, Bill Murray style. For the second week in a row, the company reported an outage, this time affecting its US-West-2 region in Oregon and US-West-1 in Northern California. The latest AWS outage lasted about an hour and brought down major services such as Auth0, Duo, Okta, DoorDash, Disney, the PlayStation Network, Slack, Netflix, Snapchat, and Zoom.
Catchpoint noted the outage at approximately 07:15 PT, once again considerably earlier than the AWS announcement.
User Sentiment analysis (Catchpoint)
The following screenshots show our analysis of the incident. The timeline below highlights the increase in network latency and packet loss impacting Internet connectivity to the AWS US-West regions.
Multi-layer HTTP, Network view (Catchpoint)
The view below inspects the network from various regional external vantage points on multiple Internet carriers, highlighting the additional latency and packet loss at the edge of the AWS network.
Network view (Catchpoint)
For example, zooming into one of the failed measurements, we can see how the network path record from service provider Telia in Seattle reveals the increase in latency and packet loss occurring after traffic leaves Telia’s network, within Amazon’s network.
Network path record (Catchpoint)
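To make the idea of reading a path record concrete, here is a minimal Python sketch of how one might locate the first hop where latency or packet loss jumps. It is not Catchpoint's implementation: the `Hop` type, the `first_degraded_hop` function, the thresholds, and the sample values are all hypothetical and are not measurements from this incident.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Hop:
    index: int        # position of the hop in the path
    network: str      # network owner label (illustrative)
    rtt_ms: float     # round-trip time to this hop, in milliseconds
    loss_pct: float   # packet loss observed at this hop, in percent


def first_degraded_hop(
    hops: Sequence[Hop],
    rtt_jump_ms: float = 50.0,        # assumed threshold, tune per network
    loss_threshold_pct: float = 5.0,  # assumed threshold, tune per network
) -> Optional[Hop]:
    """Return the first hop whose loss or RTT increase crosses a threshold."""
    prev_rtt: Optional[float] = None
    for hop in hops:
        jump = 0.0 if prev_rtt is None else hop.rtt_ms - prev_rtt
        if hop.loss_pct >= loss_threshold_pct or jump >= rtt_jump_ms:
            return hop
        prev_rtt = hop.rtt_ms
    return None


# Made-up path: healthy through the carrier, degraded inside the cloud network.
path = [
    Hop(1, "Telia", 2.1, 0.0),
    Hop(2, "Telia", 3.4, 0.0),
    Hop(3, "Amazon", 4.0, 0.0),
    Hop(4, "Amazon", 92.5, 18.0),
    Hop(5, "Amazon", 95.0, 20.0),
]

culprit = first_degraded_hop(path)
if culprit is not None:
    print(f"Degradation starts at hop {culprit.index} ({culprit.network})")
```

A real path analysis would also need to account for routers that rate-limit ICMP responses, which can show loss at an intermediate hop that does not affect end-to-end traffic.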
The outage follows last week’s lengthy outage at AWS’ Northern Virginia (US-East-1) region, which inflicted cascading damage on major services such as Amazon, Venmo, Disney+, Tinder, and multiple online gaming sites, not to mention the company’s apps that power warehouse, delivery, and Amazon Flex workers. Although this one was much briefer, it once again caused a flutter on social media, with an extra dose of Log4j agita thrown in.
Major monitoring vendors reported outage-related service degradation
We could once again observe the downstream effect of the AWS outage on several enterprise monitoring vendors that run their monitoring services on the infrastructure that went down.
Datadog reported delays in collecting AWS integration metrics.
ThousandEyes reported degradation to API services.
New Relic reported that their synthetics user interface was impacted, as well as some of the APM alerting and the data ingestion for their infrastructure metrics.
Dynatrace reported that some of their components hosted on an AWS cluster were impacted.
Splunk (Rigor and SignalFX) reported increased error rates on the West Coast of the US and a degradation in the performance of their Log Observer.
The beauty of knowing your job
There is a classic Looney Tunes cartoon about Sam Sheepdog and Ralph Wolf, who punch a time clock together before heading out for a day of adversarial encounters. Sam always bested Ralph because he knew his job: being a sheepdog. Not a sheep.
We mention this because it is difficult to be an effective sheepdog if you are also part of the flock being guarded. (It is a recipe for success for the wolf, however.) In fact, AWS’s Adrian Cockcroft pointed out the issue in a 2018 post on Medium, where he noted,
"The first thing that would be useful is to have a monitoring system that has failure modes which are uncorrelated with the infrastructure it is monitoring."
In other words, we don’t run our business-critical monitoring services, which our enterprise clients rely on, in the same public cloud infrastructure that we monitor. That’s why Catchpoint isn’t impacted by public cloud outages. That compelling logic underscores why four out of the top five cloud providers use Catchpoint’s observability solution. Indeed, we believe it is fundamentally wrong to have business-critical monitoring systems sharing the same infrastructure with the applications and properties that are being monitored.
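As a rough illustration of the "uncorrelated failure modes" principle, the following Python sketch flags monitors that share a provider and region with the service they watch. The `Placement` type, the `correlated_monitors` function, and every name in the sample inventory are hypothetical; a real audit would also need to consider shared dependencies such as DNS, identity, and network transit.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Placement:
    name: str      # monitor or service name (illustrative)
    provider: str  # hosting provider label (illustrative)
    region: str    # provider region or location label (illustrative)


def correlated_monitors(
    monitors: Sequence[Placement],
    targets: Sequence[Placement],
) -> List[Tuple[str, str]]:
    """Return (monitor, target) name pairs hosted in the same provider and region."""
    pairs = []
    for monitor in monitors:
        for target in targets:
            if (monitor.provider, monitor.region) == (target.provider, target.region):
                pairs.append((monitor.name, target.name))
    return pairs


# Hypothetical inventory: one probe shares a region with the service it watches.
monitors = [
    Placement("probe-a", "aws", "us-west-2"),
    Placement("probe-b", "independent-dc", "seattle"),
]
targets = [Placement("checkout-api", "aws", "us-west-2")]

for monitor_name, target_name in correlated_monitors(monitors, targets):
    print(f"{monitor_name} shares infrastructure with {target_name}")
```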
Although public cloud infrastructure has been transformative for the global digital economy, an outage such as this illustrates how severely users and services can be impacted. Building a high-availability architecture and adopting a multi-vendor cloud strategy are steps that every public cloud user should prioritize. Ultimately, your best defense against such outages is a robust observability strategy that emphasizes early and proactive detection of issues. By identifying the issue before users are impacted, you can troubleshoot quickly to prevent loss of revenue or damage to your brand.
Want to learn more?
Want to learn more best practices to prevent, prepare for, and respond to an outage? Download “2021 Internet Outages: A compendium of the year’s mischiefs and miseries – with a dose of actionable insights.”