Incident Review – AWS Outage Led To Spikes In Response Times For Applications Using AWS Services


On Tuesday, August 31, users across large parts of the West Coast (the US-WEST-2 region) were hit by major spikes in response time. Some of AWS’ most critical services were affected, including Lambda and Kinesis.

Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are a core practice for SRE teams. However, unlike Google, where SRE originated, most companies rely on outside providers such as AWS and GCP, along with one or more CDNs, for their infrastructure and services.

This means you need SLIs and SLOs not only for your own applications and services but also for your providers and vendors, because your SLOs and SLIs depend on them. If you are hosted in the cloud, an issue at your cloud vendor can translate directly into you missing your SLOs.

By monitoring your vendor SLOs, you can understand their impact on your SLOs and system architecture in order to properly deliver the level of experience you are aiming for.
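To make the dependency concrete, here is a minimal Python sketch of the arithmetic. It assumes a simple serial chain (your service is up only when the vendor is up) with independent failures, and the SLO targets shown are hypothetical values chosen for illustration, not figures from AWS or Catchpoint.

```python
from math import prod

MINUTES_PER_DAY = 24 * 60


def composite_availability(availabilities):
    """Availability of a chain where every component must be up.

    Assumes failures are independent, which is a simplification.
    """
    return prod(availabilities)


def availability_after_outage(outage_minutes, days=30):
    """Availability over a window of `days` that contains one outage."""
    total_minutes = days * MINUTES_PER_DAY
    return 1.0 - outage_minutes / total_minutes


# Hypothetical targets: your own service and one cloud dependency.
own_slo = 0.9995
vendor_slo = 0.999
best_case = composite_availability([own_slo, vendor_slo])

# A four-hour vendor incident within a 30-day window.
vendor_actual = availability_after_outage(outage_minutes=4 * 60)

print(f"best-case composite availability: {best_case:.4%}")
print(f"vendor availability after a 4-hour outage: {vendor_actual:.4%}")
```

Under these assumptions, a single four-hour vendor incident in a 30-day window puts the vendor at roughly 99.44% for that window, below the hypothetical 99.9% target, and your own service cannot do better than the chain it sits on.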

The AWS incident on Tuesday lasted approximately four hours and created a widespread series of headaches across the entire US-West-2 region, from websites being down to site features being unusable to difficulties logging into applications. Companies, developers, and DevOps teams shared their angst on social media and news sites. Those commenting included The Seattle Times, major gaming company Zwift, and SaaS platform Ubiquiti.

Tweet from Ubiquiti complaining of AWS outage

Catchpoint Detects and Alerts on AWS Outages First

This blog post aims to give you a clearer picture of what happened, when, and how.

Through our proactive monitoring platform, we first detected issues for our customers at 11:00 AM PDT on Tuesday. Our data analysis revealed widespread connectivity failures in the US-WEST-2 region. We immediately triggered our first alert, a full 25 minutes before AWS acknowledged the issue. AWS’ first mention on their status page that they were investigating the issue came at 11:25 AM PDT.

AWS dashboard describing AWS outage at 11:25 AM PDT

It was a similar story with DDoS issues at AWS two years ago. We detected the issue five hours ahead of them. At the time, when one of our top customers reached out to AWS support about the problem straight after being alerted by Catchpoint, AWS support was unaware there was an incident going on.

Unlike other observability platforms, Catchpoint is not hosted on a cloud provider, so when a cloud provider has an incident impacting their solutions, we are not impacted. Our platform will continue to work, alerting you as soon as we detect any problem.

AWS Status Dashboard Showed Increased Latencies And Connectivity Issues

The AWS Service Health Dashboard revealed increased provisioning latencies to Amazon Elastic Load Balancing in Oregon and AWS Internet connectivity issues in the same region.

AWS service health dashboard showing latency and connectivity issues

Impacted AWS Services

Impacted AWS services included Lambda, ELB, Kinesis, RDS, CloudWatch, and ECS.

Catchpoint data revealed a major spike in response times for applications using AWS Services.

Incident Hit US-WEST-2 Region

Only users in US-WEST-2 were impacted, meaning the Oregon region specifically (the AWS region closest to Seattle, where Amazon is headquartered). There are two other AWS Regions on the West Coast: Northern California and AWS GovCloud. Neither of them was affected, however.

Root Cause Identified: Network Connectivity Issues

At 2:26 PM PDT, AWS identified the root cause of the issue affecting network connectivity in the US-WEST-2 region as “a component within the subsystem responsible for the processing of network packets for Network Load Balancer.” This led to impairment of the NAT Gateway and PrivateLink services, which were “no longer processing health checks successfully,” and to further performance degradation.

AWS root cause statement

Going back to Catchpoint's dataset, we can also include additional metrics to validate that the cause of the outage was a network connectivity issue. Catchpoint offers 50+ metrics that allow you to narrow down issues to a specific component. You can then answer the question, “Is it the network or is it the application that is causing the problem?”  

In this case, you can see that the overall response time spiked because of an increase in connect time to the servers, which is driven by the network. However, the load and wait times, which reflect server processing time and hence point to application or server-side issues, are flat with no spikes.

US WEST region showing problems vs US EAST

Comparing this to the US-EAST region, we can clearly see that the issue is concentrated in US-WEST.
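The reasoning above (connect time spikes while wait time stays flat, so the network is the likely culprit) can be sketched in a few lines of Python. This is an illustration only: the `Timing` fields, the `classify` function, the 2x threshold, and the sample numbers are all assumptions made for this example and are not Catchpoint's API or data from the incident.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    """Hypothetical per-test timing breakdown, in milliseconds."""
    connect_ms: float  # time to establish the connection (network-bound)
    wait_ms: float     # time waiting for the first byte (server-bound)


def classify(baseline: Timing, current: Timing, ratio: float = 2.0) -> str:
    """Label a slowdown by which component grew past `ratio` x baseline."""
    connect_spiked = current.connect_ms > ratio * baseline.connect_ms
    wait_spiked = current.wait_ms > ratio * baseline.wait_ms
    if connect_spiked and wait_spiked:
        return "both"
    if connect_spiked:
        return "network"
    if wait_spiked:
        return "application"
    return "normal"


# Made-up sample values: connect time jumps, wait time barely moves.
baseline = Timing(connect_ms=40, wait_ms=200)
during_incident = Timing(connect_ms=900, wait_ms=210)
print(classify(baseline, during_incident))
```

A fixed ratio against a single baseline is the simplest possible rule. In practice you would likely compare against a rolling baseline per region, which is also what makes a side-by-side view of two regions useful.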

Do You Have End-to-End Monitoring In Place To Detect Such Outages?

The latest outage of the summer (read about the June Fastly outage here and Tuesday's Akamai performance degradation issues here) serves as a reminder for organizations to evaluate and verify their own infrastructure setup, including their monitoring, observability, and failover strategies.  

It’s also worth taking a beat to ensure you don’t rely on cloud-only monitoring strategies, which can lead to blind spots.  

Graph showing cloud-only monitoring causes false positives

As you can see from the image above, we took a look at the monitoring data from the Portland AWS node (in the affected region) and observed spikes in response times. When you monitor only from locations in the cloud and the cloud provider has an issue, your monitoring will make it seem that you have a problem. In other words, if you monitored from this region but didn't have any services or infrastructure hosted there, you might still have received alerts on Tuesday telling you that you had a problem. Essentially, these are false positives, which means your on-call teams are getting pinged with unnecessary alerts.

You can reduce the noise and save time and resources by deploying a holistic monitoring and observability strategy. Catchpoint has the industry’s largest network of cloud nodes, but we also simulate the entire end user experience. In other words, we have nodes on local ISPs, the major backbones, mobile networks, and the clouds that your end users connect to when visiting a site or using an app. A holistic monitoring and observability strategy means you can detect outages and performance issues from anywhere, in real time.  
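One way to picture how mixed vantage points cut down on noise is a rule that pages only when failures show up from more than one kind of network. The Python sketch below is a simplified illustration under that assumption. The `should_page` function, the network labels, and the threshold of two are hypothetical and do not describe how Catchpoint's alerting actually works.

```python
def should_page(observations, min_failing_networks=2):
    """Return True only if failures come from enough distinct network types.

    observations: iterable of (network_type, failed) pairs, one per test run.
    """
    failing = {network for network, failed in observations if failed}
    return len(failing) >= min_failing_networks


# Only the cloud vantage point fails: likely the monitoring path, not your site.
cloud_only = [
    ("cloud", True),
    ("isp", False),
    ("backbone", False),
    ("mobile", False),
]

# Failures also seen from a local ISP: end users are probably affected.
user_facing = [
    ("cloud", True),
    ("isp", True),
    ("backbone", False),
    ("mobile", False),
]

print(should_page(cloud_only), should_page(user_facing))
```

With a rule like this, a service that is not hosted in the affected cloud region would stay quiet when only the cloud node reports errors, while a service that is hosted there would still page, because ISP, backbone, or mobile nodes would fail too.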

Prevent Single Points of Failure

Ultimately, Service Level Indicators and Service Level Objectives are not just for your own services; they are also for your third-party providers and for everything in your infrastructure that is a single point of failure. This is why your legal department ensures there are SLA clauses in contracts with cloud providers, including SaaS providers such as Catchpoint.

Check out our On-Demand Webinar, “The Journey to SLO Maturity” to learn more about real-world scenarios in which SLO-based decision making enabled users to meet their business goals.

Published on Sep 03, 2021