The following is an analysis of the Amazon Web Services incident on 12/07/2021.
Millions of users were affected by an Amazon Web Services outage that took down major online services such as Amazon, Amazon Prime, Amazon Alexa, Venmo, Disney+, Instacart, Roku, Kindle, and multiple online gaming sites. The outage, which originated in the US-EAST-1 region on Dec. 7, 2021, is still ongoing at the time of blog publication.
AWS published a notice to its Service Health Dashboard at 12:37 PM EST that read:
“We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. We have identified the root cause and are actively working towards recovery.”
The service dashboard also noted issues with Amazon Connect, Amazon DynamoDB, Amazon Elastic Compute Cloud, and AWS Support Center in the US-EAST-1 region. The outage, which hit during prime holiday shopping season, not only affected online shoppers, but also the apps that power warehouse, delivery, and Amazon Flex workers. Naturally, social media lit up with hot takes.
This is not the first significant outage that AWS has experienced in recent months. The US-EAST-1 region reported performance issues in September 2021. In addition, there was a significant outage in November 2020, caused by API (Application Programming Interface) errors in Amazon Kinesis.
Catchpoint detects and alerts on AWS outages first
Here at Catchpoint, we observed connectivity issues for AWS servers starting at 10:33 AM EST, considerably earlier than the AWS announcement. The 504 Gateway Timeout errors indicated that a gateway or proxy did not receive a timely response from an upstream server.
Those 504 error responses also hit Amazon.com, as thousands of users reported that their holiday shopping was abruptly interrupted.
We also saw that AWS APIs were failing.
While observability enables businesses to correlate logs, metrics, and tracing, it might not serve the purpose if your active (synthetic) monitoring tool is hosted on the same cloud that was impacted. Active monitoring from backbone and last-mile networks allows you to detect issues inside and outside the cloud. AWS mentioned in one of their status updates that some of their monitoring tools were impacted by the outage, which further delayed their resolution.
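To make the idea of an active (synthetic) check concrete, here is a minimal Python sketch of a probe that issues one request and labels the result, singling out the 504 responses described above. It is an illustration only, not how Catchpoint's agents work: the function names `classify` and `probe`, the labels, and the thresholds are all assumptions chosen for this example.

```python
import time
import urllib.error
import urllib.request
from typing import Optional


def classify(status: Optional[int], elapsed_s: float, slow_after_s: float = 5.0) -> str:
    """Map one probe result to a coarse health label (labels are hypothetical)."""
    if status is None:
        return "unreachable"        # DNS failure, refused connection, or timeout
    if status == 504:
        return "gateway_timeout"    # gateway or proxy got no timely upstream response
    if status >= 500:
        return "server_error"
    if status >= 400:
        return "client_error"
    if elapsed_s > slow_after_s:
        return "slow"               # succeeded, but slower than the assumed threshold
    return "ok"


def probe(url: str, timeout_s: float = 10.0) -> str:
    """Issue a single GET against url and classify the outcome."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code           # urllib raises 4xx/5xx responses as HTTPError
    except (urllib.error.URLError, OSError):
        status = None               # no HTTP response was received at all
    return classify(status, time.monotonic() - start)
```

A check like this only helps if it runs somewhere that does not share fate with the service under test, for example calling `probe("https://example.com/health")` (a placeholder URL) on a schedule from a machine outside the cloud region being monitored.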
More details about the need for observability from end-user locations or networks are available here.
Nobody is immune to service-impacting outages, not even monitoring vendors that RUN on public clouds!
2021 continues to be a busy year for outages. More specifically to this incident, many cloud providers have recently suffered significant service-impacting outages. Public infrastructure may be getting more reliable, but the only way to minimize the impact of an outage is to detect it early and proactively, before your customers are affected, so that you can troubleshoot quickly and avoid damage to revenue and brand.
It is quite concerning that many of the monitoring vendors that enterprises depend on to keep their services safe run those monitoring systems on public cloud infrastructure. According to their status pages, this particular AWS outage appears to have impacted the following monitoring vendors:
- ThousandEyes reported an impact on the endpoint agent data collection at the same time as the AWS outage
- Datadog reported delays that impacted multiple products
- Splunk (SignalFx) reported that their AWS cloud metric syncer data ingestion was impacted
- New Relic reported that some AWS Infrastructure and polling metrics were delayed in the US
This situation is extremely problematic: you have public cloud infrastructure failing and the monitoring vendors that are supposed to monitor those services are also failing, because they are running their systems in the same public cloud infrastructure that just had a major outage!
Rest assured that Catchpoint DOES NOT USE public infrastructure for its critical systems and cannot be impacted by public cloud outages.
For further information on major incidents in 2021, please check out our new report. You’ll find detailed analysis, as well as a checklist of best practices to prevent, prepare for, and respond to an outage.