Here at Catchpoint, we just published a white paper focusing on outages that span the last 18 months, Preventing Outages in 2023: What We Learned from Recent Failures. We take a deep dive into 10 major and intermittent outages that Catchpoint’s Internet Performance Monitoring (IPM) platform identified and examine what the data reveals.
From total failure to intermittent
The outages span the giants of the Internet and some of the biggest failures of IT resilience we were subject to – from AWS’s trifecta of outages in December 2021 to the October 2021 outage that took down Facebook, Instagram, WhatsApp, and interrelated services. We also look at some more intermittent outages that you may have missed. While on a smaller scale, these outages still impacted key functionality, such as Amazon’s Search function being out of use for 20% of its global users for two days in December 2022.
For each of these brands, what’s at stake when a significant incident like these occurs is not just revenue and the bottom line, but the company’s reputation and brand. That’s why we take our responsibility as the eyes and ears of our customers so seriously. No one can truly prevent an outage from occurring, but with the right strategies in place, we’ve repeatedly seen first-hand how companies can significantly mitigate the impact of what could have been another news-making incident.
How to prevent outages from having a major impact
Each article in the white paper provides key takeaways specific to each incident. We also begin the white paper with six critical lessons from our engineers and Chief Product and Technology Officer on how to prevent outages from having a serious impact. This blog will focus on the first two: (i) monitor what matters and (ii) map your Internet Stack.
1. Monitor what matters
This may sound like a truism, but the inspiration to launch Catchpoint was born out of the struggles its co-founders faced due to visibility gaps at other companies. Their vision was to build an IPM platform with unequalled breadth and depth, powered by our continually growing global observability network.
Mehdi Daoudi, our CEO and co-founder, launched the Quality of Services department at DoubleClick. It was responsible for building, deploying, and using internal and external monitoring solutions to keep an eye on its DART infrastructure, which was delivering billions of transactions daily. Two important things he’d witnessed first-hand drove him to launch this department from scratch:
“The impact on our employees (being plunged into crisis, not knowing what was happening or where… even though we had so many monitoring tools).
The impact on our customers, these ad ops teams under time pressure to create insertion orders, ad placements, uploading media… and every time we had an issue, witnessing the impact on their lives!
I had to change the way we did things. I needed to make these people’s lives better. These two fundamental experiences hard-wired into my DNA, the understanding that: YOU MUST ALWAYS MONITOR FROM THE END USER’S PERSPECTIVE!”
When Mehdi founded Catchpoint and the early team built our IPM platform, they didn’t stop at traditional web synthetics. They created Internet Synthetics, which today enables our customers to leverage over 40 out-of-the-box test types and an infinite number of bring-your-own monitor options. This means that whenever one of our customers is having a problem, they can triage, find the root cause, and isolate the fault faster, and understand the problem from the end user’s point of view.
2. Map your Internet Stack
One of our favorite customer quotes shows a great sense of humor: “Catchpoint is probably the closest thing to Nyquil that our operations team can get their hands on.” However, underneath the levity of Carl Levine (at that time, NS1’s Senior Technical Evangelist) was the serious point that visibility gaps cause operations teams enormous amounts of undue stress and slow down the troubleshooting process.
Yes, we need to monitor the areas we can directly control: our containers, VMs, hardware, and code. However, APM will never be sufficient to get in front of outages. IT teams need IPM, i.e., visibility beyond the Application Stack into the Internet Stack.
The Internet Stack comprises all the systems and subsystems that you rely on to deliver your services: apps, services and microservices; external network delivery spanning fiber, broadband, local ISPs and mobile networks; datacenters or cloud providers; third-party APIs; and DNS providers. The Internet is complicated and becoming more so every day, but your monitoring approach can be simple: ensure you monitor every one of these components. Why? To protect your business.
Loading a single page on the Internet involves a complex choreography involving multiple stakeholders across the Internet Stack. Any weakness within it, whether with DNS, CDN or the failure of an API, can cause a ripple effect that leads to an outage or latency issues. All your user can see is the page that isn’t working, but a problem with any one of these components (or an internal issue) may be behind it.
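To make the idea concrete, here is a minimal Python sketch of probing each layer separately, so that a failure can be attributed to DNS, the CDN, or an API rather than just to "the page". It is an illustration only and not how Catchpoint's platform works; the function names, component labels, and the example.com hostnames are all assumptions made for this example.

```python
import socket
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProbeResult:
    component: str   # label for the layer being probed, e.g. "dns"
    ok: bool         # True if the probe finished without raising
    seconds: float   # how long the probe took
    detail: str = "" # error message when the probe failed


def check_dns(hostname: str) -> None:
    """Raise if the hostname cannot be resolved."""
    socket.getaddrinfo(hostname, 443)


def check_http(url: str, timeout: float = 5.0) -> None:
    """Raise if the URL cannot be fetched (urlopen raises on 4xx/5xx)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read(1)


def run_probes(probes: Dict[str, Callable[[], None]]) -> List[ProbeResult]:
    """Run every probe, timing it and recording failures instead of stopping."""
    results = []
    for name, probe in probes.items():
        start = time.perf_counter()
        try:
            probe()
            results.append(ProbeResult(name, True, time.perf_counter() - start))
        except Exception as exc:
            elapsed = time.perf_counter() - start
            results.append(ProbeResult(name, False, elapsed, str(exc)))
    return results


def failing_components(results: List[ProbeResult]) -> List[str]:
    """Names of the components whose probe failed."""
    return [r.component for r in results if not r.ok]


def example_probes() -> Dict[str, Callable[[], None]]:
    """Hypothetical probe set; the hostnames are placeholders, not real services."""
    return {
        "dns": lambda: check_dns("www.example.com"),
        "cdn": lambda: check_http("https://cdn.example.com/app.js"),
        "third_party_api": lambda: check_http("https://api.example.com/health"),
    }
```

Calling failing_components(run_probes(example_probes())) returns the labels of the layers that did not respond. A real setup would also run such probes from many locations and networks, since a check from a single machine only shows what that one vantage point can see.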
The only way for your operations teams to get the sleep they deserve (and to proactively mitigate the impact of incidents with a clear mind!) is to ensure you don’t have monitoring gaps. Thus, you’ll avoid users reporting the issue on Twitter or LinkedIn before you’ve been able to fix it.
The bottom line: Achieve Internet Resilience by monitoring the output and performance of every one of these components just as you monitor your own systems. Whether a failure starts inside your own systems or outside them, your users and your revenue will be impacted either way.
Read the white paper (no registration required).