‘Preventing Outages in 2023: What We Can Learn from Recent Failures’ Provides Analysis of Internet Failures and Key Learnings
New white paper from Catchpoint provides in-depth analysis of key Internet outages across the past 18 months, from AWS to Facebook; includes six critical lessons for IT teams to improve Internet Resilience
(New York, USA) – January 31, 2023 – Catchpoint®, The Internet Resilience Company™, today releases Preventing Outages in 2023, a new white paper comprising six critical lessons and ten in-depth analyses of major and hidden outages from the last 18 months. IT Operations teams, network engineers, and SREs, along with VPs of Infrastructure, CIOs, and CTOs, will be able to draw on the Catchpoint team’s expertise to learn from the failures of the past and inform future approaches to incident management. The full white paper, Preventing Outages in 2023: What We Can Learn From Recent Failures, is available for download here (no registration required).
“What the recent failures from Internet giants demonstrate is that the question of the next outage is not if, but when,” says Dritan Suljoti, Chief Product and Technology Officer of Catchpoint. “Moreover, the downstream effect of major outages to essential Internet infrastructure, such as cloud platforms, CDNs or DNS providers, means that no company is immune, no matter how well prepared they think they are. The white paper demonstrates why it’s so important for all of us to be proactive to reduce Mean Time to Repair (MTTR) when the next outage occurs.”
Key lessons from the past include:
- Develop an Internet Performance Monitoring strategy that allows you to monitor precisely what customers, workforce, and other users expect and build an Experience Score.
- Monitor more than what is under your direct control: map your Internet stack to ensure you are monitoring every component relied on to deliver your content (including DNS, CDN, ISP, BGP, TCP configuration, SSL, and other cloud services).
- Automate intelligently – design and test automation to ensure there are no bugs hiding in the code.
- Be prepared to take fast action to remediate outages as they occur, for example, by switching to a backup solution or dropping the third party causing the issue. Develop runbooks and practice recovery.
- Whenever change is scheduled, ensure your team is ready for any outages that may occur (intentionally or not) with a crisis call plan that includes a communication plan and templates, a plan to mitigate failures from third parties, and a best-practice monitoring and observability plan.
“Given the impact of serious outages on the bottom line, not to mention the long-tail impact on brand and reputation, amidst a landscape of increased Internet reliance alongside ever-growing Internet fragility and greater and greater complexity, it is imperative that community learnings from past failures are shared and that practical advice is disseminated on stemming future major incidents and ensuring Internet Resilience,” says Gerardo Dada, CMO at Catchpoint. “We believe this white paper offers an invaluable deep dive into recent outages and key lessons that all of us can learn from to prevent (or mitigate the consequences of) the next major outage.”
In-depth analysis of ten major recent incidents includes:
- Amazon’s Search issue from December 5-7, 2022, that impacted at least 20% of all global users for 22 hours – Catchpoint’s IPM platform traced the root cause to an HTTP 503 response returned by Amazon CloudFront.
- A $B eCommerce company suffering issues with the authoritative DNS name servers it was using to resolve a critical page on its website in August 2022 – by monitoring the entire DNS resolution chain, Catchpoint was able to identify precisely where the DNS resolution failure was occurring. Learn three best practices for monitoring DNS.
- The AWS December 2021 trifecta of outages – Catchpoint observed all three outages well before they hit the AWS status page and, unlike many of its competitors, was unaffected by them. Learn four key lessons for working with a hosting provider.
- The downstream effect of the Google Cloud outage in November 2021 – “a latent bug in a network configuration service” led to outages across multiple Google Cloud products and failures across many other non-Google companies, from Home Depot to Spotify, whose websites were knocked offline for a prolonged period.
- The BGP misconfiguration at the heart of the Telia outage in October 2021, which affected many other companies, including Cloudflare, Equinix Metal and Fastly.
- The now-notorious mega outage in October 2021 that took down Facebook, WhatsApp, Messenger, Instagram and Oculus VR for five hours, and what Catchpoint’s deep dive into the BGP data revealed.
Catchpoint is the Internet Resilience Company™. The world’s top online retailers, Global 2000 companies, CDNs, cloud service providers, and xSPs rely on Catchpoint to increase their resiliency by catching issues in the Internet stack before they impact their business. The Catchpoint platform offers synthetics, RUM, performance optimization, high-fidelity data, and flexible visualizations with advanced analytics. It leverages thousands of global vantage points (including inside wireless networks, BGP, backbone, last mile, endpoint, enterprise, ISPs and more) to provide unparalleled observability into anything that impacts your customers, workforce, networks, website performance, applications, and APIs.
Learn more at www.catchpoint.com.