Top takeaways and lessons learned from the December 2021 AWS outages
2021’s slew of Internet outages and disruptions shows how interconnected, and how relatively fragile, the Internet ecosystem is. Case in point: December’s trifecta of Amazon Web Services (AWS) outages, which drove home the fact that no service is too big to fail.
The reality is that the next outage is a question not of if, but of when, where, and for how long. Pretending outages don’t exist or won’t happen is not only pointless but harmful to your business. Looking back at the three December outages, we see four key takeaways:
Catchpoint observed all three outages well before they appeared on the AWS status page.
Early detection gives companies the chance to fix problems before they impact customers and to put contingency plans in place for a smooth failover as soon as possible. If the issue persists, it also lets them proactively inform customers with precise details about the situation and assure them that their teams are working on it.
While it may be tempting to leave AWS monitoring to, well, AWS, that could leave you in the dark, observability-wise. A comprehensive digital observability plan should include not only your own technical elements, but also service delivery chain components that are not within your control. For example, you need insight into the systems of third-party vendors such as content delivery networks (CDNs), managed DNS providers, and backbone Internet service providers (ISPs).
While these components might not be your code or hardware, your users will still be impacted by any issues they experience, and so will your business. If you are observing the end-to-end user experience, you can act when factors outside your control start impacting your users.
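To make that concrete, here is a minimal sketch of an end-to-end probe against third-party dependencies, using only the Python standard library. The hostnames are hypothetical placeholders; a real setup would run checks like this on a schedule, from multiple vantage points, the way synthetic monitoring products do.

```python
import socket
import time
import urllib.request

# Endpoints you depend on but do not control (hypothetical placeholders):
DEPENDENCIES = [
    "cdn.example.com",   # e.g., a CDN-fronted asset host
    "api.example.com",   # e.g., an origin reached over a backbone ISP
]

def check_endpoint(host, timeout=5.0):
    """Resolve the host, then time a full HTTPS request against it."""
    result = {"host": host}
    # 1. DNS resolution: a managed DNS outage surfaces here first.
    start = time.monotonic()
    addrs = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    result["dns_ms"] = round((time.monotonic() - start) * 1000, 1)
    result["addresses"] = sorted({a[4][0] for a in addrs})
    # 2. Full HTTPS fetch: covers TCP connect, TLS handshake, and HTTP.
    start = time.monotonic()
    with urllib.request.urlopen(f"https://{host}/", timeout=timeout) as resp:
        result["status"] = resp.status
    result["total_ms"] = round((time.monotonic() - start) * 1000, 1)
    return result

if __name__ == "__main__":
    for host in DEPENDENCIES:
        try:
            print(check_endpoint(host))
        except Exception as exc:  # a failed probe is a signal, not noise
            print({"host": host, "error": repr(exc)})
```

Even a probe this simple separates "our code is broken" from "our CDN or DNS provider is broken," which is exactly the distinction you need when a third party is the root cause.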
It also means continuously observing your systems to detect failures in fundamental components such as DNS, BGP, TCP configuration, SSL/TLS, the networks your data traverses, and any single point of failure in infrastructure that rarely changes.
This issue is exacerbated by the fact that the cloud has abstracted much of the underlying network away from development, operations, and network teams, which can make problems harder to pinpoint.
As a result, failures in these fundamental components can catch us by surprise. If teams are not properly prepared, that adds needless (and costly) time to detect, confirm, and find the root cause of an issue. Therefore, ensure that you continuously monitor these aspects of your system and train your teams on what to do in case of failure.
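As an illustration, the following is a minimal sketch of such a fundamentals check, again using only the Python standard library; example.com stands in for a host you actually depend on. DNS, TCP, and TLS can be probed directly from one script, while BGP reachability generally requires external vantage points or route-collector feeds, so it is left out here.

```python
import socket
import ssl
import time

HOST, PORT = "example.com", 443  # placeholder for a host you care about

# DNS: does the name still resolve, and to what?
addresses = sorted({info[4][0] for info in
                    socket.getaddrinfo(HOST, PORT, proto=socket.IPPROTO_TCP)})
print(f"DNS ok: {addresses}")

# TCP + TLS: can we complete a handshake, and when does the cert expire?
context = ssl.create_default_context()
with socket.create_connection((HOST, PORT), timeout=5) as sock:
    with context.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()  # validated peer certificate

# "notAfter" is the certificate's expiry timestamp.
expires_ts = ssl.cert_time_to_seconds(cert["notAfter"])
days_left = int((expires_ts - time.time()) // 86400)
print(f"TLS ok: certificate expires in {days_left} days")
```

Alerting when days_left falls below a threshold, or when resolution or the handshake fails outright, covers exactly the rarely-changed components that tend to be forgotten until they break.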
The AWS incidents all clearly illustrate the downstream effect an outage at one company can have on others. Digital infrastructure will only continue to grow more complex and interconnected. Enterprises today run systems that span multiple clouds, and they rely on multiple teams, often including a raft of other vendors such as cloud compute providers, CDNs, and managed DNS services. When issues originating with outside entities such as partners and third-party providers can bring down your systems, it is time to build a collaborative strategy designed to support your extended digital infrastructure. For that, comprehensive observability into every service provider involved in delivering your content is crucial.
While there are many monitoring solutions out there, make sure you have a "break glass" system that lets you fail over to a solution outside the environment being monitored. Many observability solutions are themselves hosted in the cloud, which makes them vulnerable when cloud services go down.
This is why ThousandEyes, Datadog, Splunk (SignalFx), and New Relic all reported impacts from the 12/07/21 and 12/15/21 events.
During the first event, Datadog reported delays impacting multiple products, Splunk (SignalFx) reported that data ingestion for its AWS cloud metric syncer was impacted, and New Relic reported that some AWS infrastructure and polling metrics were delayed in the U.S.
The 12/15/2021 event also triggered a number of issues.
Lack of observability is never a good thing, but during an outage it is significantly worse. Interestingly, AWS’s Adrian Cockcroft pointed out the issue in a post on Medium, where he noted,
"The first thing that would be useful is to have a monitoring system that has failure modes which are uncorrelated with the infrastructure it is monitoring."
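In that spirit, here is a minimal "break glass" sketch: a watchdog meant to run from outside the cloud being monitored (a different provider, or on-prem hardware) on a simple cron schedule. Every URL is a hypothetical placeholder; the design point is that the alert path shares no failure modes with the infrastructure under watch.

```python
import urllib.error
import urllib.request

# All URLs are hypothetical placeholders.
PRIMARY_MONITORING_HEALTH = "https://monitoring.example.com/health"
APPLICATION_HEALTH = "https://app.example.com/health"
OUT_OF_BAND_ALERT_WEBHOOK = "https://alerts.other-provider.example.net/notify"

def is_up(url, timeout=5.0):
    """Return True if the URL answers with a 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

def alert(message):
    """Raise the alarm over a channel that does NOT depend on the monitored cloud."""
    req = urllib.request.Request(OUT_OF_BAND_ALERT_WEBHOOK,
                                 data=message.encode("utf-8"))
    urllib.request.urlopen(req, timeout=5.0)

if __name__ == "__main__":
    if not is_up(PRIMARY_MONITORING_HEALTH):
        alert("Primary monitoring is unreachable: we are flying blind.")
    if not is_up(APPLICATION_HEALTH):
        alert("Application health check failing, as seen from outside the cloud.")
```

Even something this small answers the question that matters most during an incident: is the application actually down, or have we merely lost our view of it?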