On Tuesday, August 31, users across large parts of the US West Coast (the US-West-2 region) were impacted by major spikes in response time. Some of AWS’ most critical services were affected, including Lambda and Kinesis.
Tracking Service Level Indicators (SLIs) and Service Level Objectives (SLOs) is a must for SRE teams. However, unlike Google, where the SRE practice originated, most companies rely on providers like AWS and GCP, along with one or more CDNs, for their infrastructure and services.
This means that not only do you need SLIs and SLOs for your own applications and services, you also need to take a close look at your providers and vendors. For most companies, SLIs and SLOs have a dependency on the vendors and providers being used: if you are hosted in the cloud, an issue at your cloud vendor often translates directly into you missing your SLOs.
By monitoring your vendor SLOs, you can understand their impact on your SLOs and system architecture in order to properly deliver the level of experience you are aiming for.
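To make the dependency concrete, here is a minimal sketch of why vendor availability bounds your own. The function and the figures are illustrative assumptions, not numbers from this incident: serial dependencies multiply, so every provider a request passes through lowers the availability you can actually deliver.

```python
# Illustrative sketch: how vendor availability caps your own SLO.
# All numbers below are hypothetical examples.

def composite_availability(own: float, vendors: list[float]) -> float:
    """Serial dependencies multiply: each vendor a request passes
    through reduces the availability you can actually deliver."""
    result = own
    for v in vendors:
        result *= v
    return result

# Your service alone meets 99.95%, but it sits behind a cloud
# provider at 99.9% and a CDN at 99.9%.
achievable = composite_availability(0.9995, [0.999, 0.999])
print(f"Best-case availability: {achievable:.4%}")  # ~99.75%
```

In other words, a 99.95% in-house target is unreachable in practice if the providers underneath it only deliver 99.9% each, which is exactly why their SLOs need monitoring alongside yours.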
The AWS incident on Tuesday lasted for approximately four hours, creating a widespread series of headaches – from websites being down to site features being unusable to difficulties logging into applications – across the entire US-West-2 region. Companies, developers, and DevOps teams shared their angst on social media and news sites. Those commenting included The Seattle Times, major gaming company Zwift, and SaaS platform Ubiquiti.
![Tweet from Ubiquiti complaining of AWS outage](https://assets-global.website-files.com/5babb9f91ab233ff5f53ce10/61314baa9b9c404305262e38_aws 1.PNG)
Catchpoint Detects and Alerts on AWS Outages First
This blog post aims to give you a clearer picture of what happened, when, and how.
Through our proactive monitoring platform, we first detected issues for our customers at 11 AM PDT on Tuesday. Our data analysis revealed widespread connectivity failures in the US-West region. We immediately triggered our first alert – a full 25 minutes before AWS acknowledged the issue. AWS’ first mention on their status page that they were investigating took place at 11:25 AM PDT.
![AWS dashboard describing AWS outage at 11:25 AM PDT](https://assets-global.website-files.com/5babb9f91ab233ff5f53ce10/61314c00128275aa61fab4d7_AWS 2.png)
It was a similar story with the DDoS issues at AWS two years ago, when we detected the problem five hours ahead of them. At the time, one of our top customers reached out to AWS support straight after being alerted by Catchpoint, and AWS support was unaware of any ongoing incident.
Unlike other observability platforms, Catchpoint is not hosted on a cloud provider, so when a cloud provider has an incident impacting their solutions, we are not impacted. Our platform will continue to work, alerting you as soon as we detect any problem.
AWS Status Dashboard Showed Increased Latencies And Connectivity Issues
The AWS Service Health Dashboard revealed increased provisioning latencies to Amazon Elastic Load Balancing in Oregon and AWS Internet connectivity issues in the same region.
![AWS service health dashboard showing latency and connectivity issues](https://assets-global.website-files.com/5babb9f91ab233ff5f53ce10/61314c39fa322d362e555c90_aws 3.jpg)
Impacted AWS Services
Impacted AWS services included Lambda, ELB, Kinesis, RDS, CloudWatch, and ECS.
![Catchpoint data revealed a major spike in response times for applications using AWS Services. ](https://assets-global.website-files.com/5babb9f91ab233ff5f53ce10/61314c68e8ed7e1d98ca2f23_AWS 4.png)
Catchpoint data revealed a major spike in response times for applications using AWS Services.
Incident Hit US-WEST-2 Region
Only users in US-WEST-2 – AWS’ Oregon region, which also serves nearby Seattle, where Amazon is headquartered – were impacted. There are two other AWS Regions on the West Coast: Northern California and AWS GovCloud (US-West). Neither of them was affected, however.
Root Cause Identified: Network Connectivity Issues
At 2:26 PM PDT, the root cause of the issue affecting network connectivity in the US-WEST-2 region was identified by AWS as “a component within the subsystem responsible for the processing of network packets for Network Load Balancer.” This led to the NAT Gateway and PrivateLink services “no longer processing health checks successfully” and to further performance degradation.
![AWS root cause statement](https://assets-global.website-files.com/5babb9f91ab233ff5f53ce10/6131f98a74d1133e64b4f61a_AWS root cause statement.PNG)
AWS root cause statement
Going back to Catchpoint’s dataset, we can also include additional metrics to validate that the cause of the outage was a network connectivity issue. Catchpoint offers 50+ metrics that allow you to narrow down issues to a specific component, so you can answer the question, “Is it the network or is it the application that is causing the problem?”
In this case, you can see that the overall response time spiked because of an increase in connect time to the servers, which is impacted by the network. However, the load and wait times, which are related to server processing time – and hence indicative of application/server-side issues – are flat with no spikes.
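The reasoning above can be sketched in a few lines: compare each timing component against its baseline, and attribute the spike to the network when connect time grows while wait and load stay flat. The field names and thresholds here are illustrative assumptions, not Catchpoint’s API.

```python
# Illustrative sketch: classify a response-time spike as network-side
# or server-side by comparing timing components to their baselines.
# Field names and the 2x threshold are assumptions for this example.

def diagnose(baseline: dict, current: dict, threshold: float = 2.0) -> str:
    """Flag any component that grew past `threshold` times its baseline,
    then attribute the spike based on which components moved."""
    spiked = {k for k in baseline if current[k] > baseline[k] * threshold}
    if "connect_ms" in spiked and not {"wait_ms", "load_ms"} & spiked:
        return "network issue"
    if {"wait_ms", "load_ms"} & spiked:
        return "application/server issue"
    return "no significant spike"

baseline = {"connect_ms": 30, "wait_ms": 120, "load_ms": 80}
during_outage = {"connect_ms": 900, "wait_ms": 125, "load_ms": 82}
print(diagnose(baseline, during_outage))  # -> network issue
```

With the pattern seen on Tuesday – connect time spiking while wait and load stayed flat – this kind of check would point squarely at the network.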
![US WEST region showing problems vs US EAST](https://assets-global.website-files.com/5babb9f91ab233ff5f53ce10/6131f93ebffcf7ec4055d8c5_aws image replacement for west.png)
Comparing this to the US-EAST region, we can clearly see that the issue was concentrated in US-WEST.
Do You Have End-to-End Monitoring In Place To Detect Such Outages?
The latest outage of the summer (read about the June Fastly outage here and Tuesday's Akamai performance degradation issues here) serves as a reminder for organizations to evaluate and verify their own infrastructure setup, including their monitoring, observability, and failover strategies.
It’s also worth taking a beat to ensure you don’t rely on cloud-only monitoring strategies, which can lead to blind spots.
![Graph showing cloud-only monitoring causes false positives](https://assets-global.website-files.com/5babb9f91ab233ff5f53ce10/61314d0c50f4bc146fa6c247_aws 8.png)
Cloud-only monitoring causes false positives
As you can see from the image above, we took a look at the monitoring data from the Portland AWS node (in the affected region) and observed spikes in response times. When you monitor only from locations in the cloud, and the cloud provider has an issue, your monitoring will make it seem that you have a problem. In other words, if you monitored from this region but didn’t have any services or infrastructure hosted there, you might still have received alerts on Tuesday telling you that you had a problem. Essentially, these are false positives, which means your on-call teams get pinged with unnecessary alerts.
You can reduce the noise and save time and resources by deploying a holistic monitoring and observability strategy. Catchpoint has the industry’s largest network of cloud nodes, but we also simulate the entire end user experience. In other words, we have nodes on local ISPs, the major backbones, mobile networks, and the clouds that your end users connect to when visiting a site or using an app. A holistic monitoring and observability strategy means you can detect outages and performance issues from anywhere, in real time.
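One simple way to express this idea in code: only raise an alert when probes from several independent vantage points agree that the endpoint is down. This is a minimal sketch under assumed names; real probes would run HTTP checks from distributed nodes rather than take booleans.

```python
# Minimal sketch: require a quorum of independent vantage points
# to agree before alerting. Vantage-point names are illustrative.

def endpoint_down(probe_results: dict[str, bool], quorum: float = 0.5) -> bool:
    """probe_results maps vantage point -> probe succeeded.
    Alert only if more than `quorum` of vantage points failed."""
    failures = sum(1 for ok in probe_results.values() if not ok)
    return failures / len(probe_results) > quorum

# A single cloud vantage point failing (e.g., because the cloud
# provider itself is having an incident) does not trigger an alert.
results = {"isp-seattle": True, "backbone-sjc": True,
           "mobile-pdx": True, "aws-us-west-2": False}
print(endpoint_down(results))  # -> False
```

The design choice is the point: when the failing vantage point is the cloud provider itself, probes from ISP, backbone, and mobile nodes keep the signal honest and the pager quiet.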
Prevent Single Points of Failure
Ultimately, Service Level Indicators and Service Level Objectives are not just for your services; they are for your third-party providers – and for everything in your infrastructure that is a single point of failure. This is why your legal department ensures there are SLA clauses in contracts with cloud providers, including SaaS providers such as Catchpoint.
For further information on major incidents in 2021, please check out our new report. You’ll find detailed analysis, as well as a checklist of best practices to prevent, prepare for, and respond to an outage.