Not every Internet outage takes a website down completely. Some impact only a subset of users or a single piece of a site's functionality. Because of this relatively "hidden" nature, organizations may not learn about such incidents right away: fewer users are affected, so fewer complain. Yet the consequences can still be serious, which is why you want to detect these issues as soon as possible so you can quickly mitigate and resolve them.
Detection isn't as easy as it sounds. Many sites rely on basic uptime monitoring, sometimes limited to just the home page, to detect slowdowns and outages. That means a company experiencing intermittent or partial site failures can miss them entirely.
Earlier this month, through our ability to conduct transaction monitoring, we detected just such an issue at Amazon.
Detecting Amazon’s Search failures
Catchpoint systems began detecting failures in Amazon's Search function at 12:51:02 ET on December 5, 2022. As you can see below, the failures were intermittent.
The errors lasted 22 hours, until 11:13:44 ET on December 7, 2022. They remained intermittent throughout, yet continued to impact users worldwide who tried to search for products on Amazon's desktop and mobile sites. According to our Synthetic data, around 20% of worldwide users were impacted across the entire time frame. For a certain percentage of users, Search was completely down and unusable for the full 22 hours. End users were greeted with the error message above when trying to search for items to buy on the site.
Any kind of negative user experience like this (we detected a similar incident several years ago) can have a serious impact on brand reputation and ultimately, the bottom line of the affected company.
Which layer of the Internet stack was responsible?
Catchpoint's Internet Performance Monitoring platform identified precisely which layer of the Internet stack was responsible. In this case, by looking at the response headers (see below), we were quickly able to confirm that the problem was an HTTP 503 being returned by Amazon CloudFront.
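To make the header check concrete, here is a minimal sketch (our illustration, not Catchpoint's implementation) of how a monitoring probe can attribute a failure to the CDN layer: a 5xx status combined with CloudFront's characteristic `Server`, `Via`, or `X-Cache` header values points at the edge rather than the origin.

```python
# Hypothetical sketch: decide whether a failed response was served by a
# CloudFront edge rather than the origin, based on captured headers.
# The header values are ones CloudFront commonly sets; treat this as
# illustrative, not exhaustive.

def is_cdn_error(status_code: int, headers: dict) -> bool:
    """Return True if the response looks like an error served by CloudFront."""
    if status_code < 500:
        return False  # not a server-side failure at all
    server = headers.get("Server", "").lower()
    via = headers.get("Via", "").lower()
    x_cache = headers.get("X-Cache", "").lower()
    return (
        "cloudfront" in server
        or "cloudfront" in via
        or x_cache.startswith("error from cloudfront")
    )

# Example: the kind of header set seen on a failing request
sample = {"Server": "CloudFront", "X-Cache": "Error from cloudfront"}
print(is_cdn_error(503, sample))  # True  -> failure attributed to the CDN layer
print(is_cdn_error(200, sample))  # False -> healthy response, nothing to attribute
```

In a real probe the status and headers would come from the actual HTTP response; the point is that layer attribution can be automated from data the monitor already collects.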
Rapid detection of the problem, along with the ability to quickly pinpoint root cause, lets you see and troubleshoot interruptions as they occur. The issue, which ultimately impacted a fifth of Amazon's global users (imagine the impact on global revenue), could have been resolved far more quickly with an IPM solution.
Challenges with the golden triangle of observability
In recent years, APM has focused heavily on "the golden triangle of observability": logs, traces, and metrics. One major problem with an approach that relies solely on the golden triangle is the delay it introduces in detecting intermittent incidents.
The golden triangle is valuable, but if it is all we depend on for early detection, it has a few major shortcomings.
The most important is the detection delay caused by high error thresholds.
Most systems that depend on logs and traces to detect problems set their error thresholds high, largely to avoid false positives, especially when request volumes are high.
However, as the diagram above shows, high thresholds on the APM side can delay detection of issues after a change (for example, increased errors due to a bad deploy).
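A toy simulation (our illustration, with made-up numbers) shows why this matters for an intermittent incident like Amazon's, where roughly 20% of requests failed: an alert with a high error-rate threshold may never fire at all, while a tighter threshold fires in the first measurement window.

```python
# Toy model: per-window error rates after a bad deploy causes an
# intermittent fault affecting ~20% of requests. An alert fires when a
# window's error rate exceeds the configured threshold. The thresholds
# (25% vs. 5%) are hypothetical examples, not any vendor's defaults.

def first_alert_window(error_rates, threshold):
    """Index of the first window whose error rate exceeds the threshold, or None."""
    for i, rate in enumerate(error_rates):
        if rate > threshold:
            return i
    return None

# Per-minute error rates during the intermittent incident
windows = [0.18, 0.22, 0.19, 0.21, 0.20, 0.17]

print(first_alert_window(windows, threshold=0.25))  # None -> high threshold never fires
print(first_alert_window(windows, threshold=0.05))  # 0    -> detected in the first window
```

With the high threshold, the incident stays invisible for its entire duration; with the tight threshold, detection is immediate.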
Why IPM is important here
This is exactly where a strategy built around Internet Performance Monitoring, specifically proactive targeted monitoring, becomes important. Because the thresholds are set lower, it allows us to:
- Detect issues faster and at a much earlier stage – often before the actual end users are impacted.
- Detect intermittent issues that may not be impacting 100% of traffic (this was the case with Amazon search).
Proactive Synthetic Monitoring is extremely targeted and allows us to set tighter thresholds around the myriad components of the Internet stack:
Why you need a best-of-breed monitoring strategy
When it comes to outages like these, it is extremely important to have a monitoring strategy in place that uses best-of-breed solutions, each targeted at a specific activity, including:
- IPM, which includes:
- Proactive Monitoring to detect issues more quickly, understand which layer of the Internet stack the problem originates from, and determine which functionality is impacted.
- Reactive Monitoring to ascertain the impact from a business standpoint (pageviews, revenue, conversions, and geographic impact).
- APM (logs, traces, and metrics) to better understand which internal components of the system may be impacted.
As Steve McGhee, Reliability Advocate, SRE, Google Cloud, highlighted in his Conclusion for Catchpoint’s 2023 SRE Report, there is a reason why experts never depend on a single solution, tool or platform to accomplish their tasks in the best possible manner. “When it comes to skilled labor, or ‘operations’ perhaps,” writes Steve, “you want teams to be able to reach for the right tool at the right time, not to be impeded by earlier decisions about what they think they might need in the future.”