If the Internet was already teetering on a fragile edge when we launched our inaugural Internet Resilience Report in June 2024, it’s even more so now. Just consider the CrowdStrike incident—almost a year later, some organizations have yet to fully recover.
But it’s not just outages causing chaos. The industry is waking up to a new reality: slow is the new down. Sluggish websites and applications don’t just frustrate users; they drain revenue and damage reputations.
This escalating urgency is why Catchpoint has researched and authored the second edition of The Internet Resilience Report. Building on last year’s findings, it dives deeper into the critical dimensions of Internet Resilience. From AI’s growing role in mitigating disruptions to the undeniable importance of fast-performing websites and applications, the report offers a roadmap for navigating today’s digital minefield.
In an era where downtime costs millions monthly and slow performance can sink even the most established brands, resilience is no longer optional. It’s a must-have for survival in our interconnected world. And at the center of it all is the Internet Stack—the foundation upon which every digital interaction depends.
Mehdi Daoudi
CEO, Catchpoint
The Internet Stack is the collection of technologies, systems, and services that enable and shape every digital user experience – from foundational protocols like DNS and BGP to third-party APIs and CDN providers.
To ensure resilience across the Internet Stack, organizations must actively safeguard and maintain four core dimensions: availability, performance, reachability, and reliability.
All four are critical, but the emphasis on performance in this year’s report reflects a seismic shift in how organizations perceive resilience. It’s no longer sufficient for websites and applications to merely be "up" – they must also deliver fast, seamless experiences. 42% of respondents said that if their websites or apps are slow, they might as well be down. A recent Forrester study of online retailers reached a similar conclusion, highlighting how widespread the “slow is the new down” mindset is.
Internet Resilience: The capacity to ensure availability, performance, reachability, and reliability of the Internet Stack despite adverse conditions
The findings show resilience is no longer just about uptime. It’s about protecting people, revenue, and performance.
Operational efficiency keeps the lights on – customer experience keeps the business growing
Resilience goals without purpose are no resilience at all. The drive for efficient business operations (73%) acts as a horizontal foundation that improves customer experience, boosts workforce productivity, and grows revenue—like a rising tide that lifts all boats, it improves the likelihood of achieving every other goal.
The real story here is that customer experience and workforce productivity showcase how vital it is to deliver frictionless service to people. Sure, resilient Ops matter; they keep systems humming. But when you anchor resilience in people, you’re not only preventing downtime—you’re also making sure every interaction is fast, seamless, and engaging. That’s what actually cements loyalty and drives long-term growth.
A million-dollar argument for investing in resilience
The negative economic impact of incidents is too significant to ignore. 51% of organizations said they felt a negative economic impact of over $1M, up from 43% a year ago. One way to justify the cost of investing in resilience is to acknowledge the hard truth that this money is already being ‘spent’ whenever incidents occur. Organizations should therefore ensure their Internet Stack is resilient to keep that impact from materializing.
It's not a matter of if incidents will occur, but when. These figures highlight the urgent need for robust resilience measures. By prioritizing a resilient Internet Stack, businesses can mitigate risks, minimize downtime, and protect their financial health, ensuring continuity and stability in an increasingly digital world.
Uptime alone doesn’t cut it anymore
These findings underscore the dire need for organizations to prioritize website performance as part of their internet resilience strategy. At a minimum, an organization’s digital properties should be as fast as – or faster than – its competitors’. Slow-loading websites lead to frustrated users, lost sales, and a tarnished reputation.
To stay ahead, businesses must ensure their websites are optimized for speed and reliability across an ever-expanding edge. This involves regular monitoring from where the user experience actually occurs, the establishment of experience level objectives (XLOs), and a clear understanding that fast web performance does not stop at the origin where your sites and apps are hosted. Making website performance a cornerstone of resilience strategy ultimately safeguards operations, enhances user satisfaction, and drives growth.
Confidence is uneven. Priorities are shifting. The journey to Internet Resilience is far from complete.
Not all parts of your digital stack are built to bounce back
To chart a path to resilience, you must first know where you are. Organizations should use this data to benchmark their own levels of Internet Resilience to identify strengths and improvement areas.
By comparing their resilience against industry standards, they can ensure their digital infrastructure is robust and capable of handling disruptions. Benchmarking helps prioritize investments in critical areas like productivity stacks, cloud infrastructure, and APIs, enhancing overall operational efficiency and customer satisfaction. Additionally, it provides insights into how well they are prepared compared to competitors, enabling strategic planning and risk mitigation. Ultimately, leveraging this data fosters a proactive approach to maintaining a resilient and secure digital environment.
AI gets the buzz, but XLOs are the backbone
With 63% of businesses prioritizing AI investment, the hype around AI is reinforced (and for good reason). Hype aside, however, establishing experience level objectives (XLOs) should be considered the top business priority over the next 18 months (after all, who wouldn’t say ‘invest in AI’?).
While investing in AI capabilities is crucial, XLOs are what truly matter for ensuring business resilience and success. XLOs provide a clear framework for measuring and improving user experiences, which directly impact customer satisfaction and loyalty. By augmenting recovery time objectives (RTOs) and recovery point objectives (RPOs) with XLOs, businesses can better manage disruptions and maintain high service standards. In a world where performance is the new availability, XLOs are how resilience is measured from the customer’s perspective. Don't let the AI hype overshadow the importance of experience level objectives—make them your number one priority.
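To illustrate the concept (the metric name, schema, and threshold below are hypothetical, not a Catchpoint specification), an XLO can be thought of as a target on a user-experienced metric evaluated at a percentile, rather than a simple up/down check:

```python
# A minimal sketch of an XLO: an objective over a user-experience metric,
# checked at a percentile against real measurements. Names and numbers
# here are illustrative placeholders only.
from dataclasses import dataclass

@dataclass
class XLO:
    metric: str          # e.g. page load time as users experience it
    percentile: float    # evaluate at, say, the 95th percentile
    target_ms: float     # an experience target, not just "is it up"

def met(xlo: XLO, samples_ms: list[float]) -> bool:
    """True if the chosen percentile of observed samples meets the target."""
    ranked = sorted(samples_ms)
    idx = min(len(ranked) - 1, int(xlo.percentile / 100 * len(ranked)))
    return ranked[idx] <= xlo.target_ms

checkout_xlo = XLO("checkout_page_load", 95, 2000.0)
print(met(checkout_xlo, [1200, 1500, 1800, 2600]))  # False: p95 is too slow
```

The point of the sketch is the framing: a site can pass every availability check and still fail its XLO, which is exactly the “slow is the new down” gap the report describes.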
Everyone agrees resilience matters. Fewer agree who owns it.
The desire for a resilient Internet Stack that delivers seamless digital experiences can catalyze IT-to-business conversations. That alignment of IT and business on common goals is crucial for ensuring seamless operations and minimizing disruptions.
Without alignment, resilience efforts are likely to fail. Most organizations believe the ultimate responsibility for internet resilience should be part of the technology charter – with 44% saying IT Ops or SREs, and 72% saying Chief Information or Technology Officer – but the diversity of responses highlights the need for a unified approach. When IT and business leaders collaborate, they can create a robust strategy that supports both technological stability and business continuity, driving overall success.
Blind spots create the biggest risks. The results show a strong move toward targeted, best-of-breed monitoring—especially for third parties and the systems users actually experience.
Square pegs can’t fit in round holes.
A resilient Internet Stack relies on a robust observability framework, which is essential for detecting and resolving issues before users notice a disruption. Using purpose-built, best-of-breed Internet Performance Monitoring (IPM) tools is crucial; otherwise, it's like trying to fit a square peg into a round hole.
Non-purpose-built tools can lead to gaps in monitoring and missed critical alerts. 73% of organizations use IPM tools, highlighting their importance. Other tools such as Digital Experience Monitoring (55%), Network Performance Monitoring (63%), and Application Performance Monitoring (64%) also play vital roles, but they provide insights into different stack components. As the next set of findings shows, best-of-breed tools are preferred over broad, general-purpose capabilities by more than two to one – so use IPM for visibility into the Internet Stack.
Targeting best-of-breed for critical apps and digital resilience
Fully leveraging cloud benefits (65%) in an internet-centric fabric is table stakes. Ensuring excellent digital experiences (61%) and prioritizing the resilience of your critical tier 1 applications (55%) is the game. There is even a case to be made for the resilience of your non-critical applications.
For example, regularly testing your resilience preparedness on non-critical applications can yield valuable lessons for ensuring your critical applications always adhere to recovery and resilience objectives – in effect, tabletop exercises that prepare you for live, production, critical systems. Ensuring excellent digital experiences can also reduce costs by helping you calibrate internal monitoring. By focusing on what impacts user experience, you can streamline internet monitoring efforts and cut unnecessary expense. This targeted approach ensures efficient resource allocation, optimizing performance while minimizing the costs associated with broad, non-specific monitoring.
Resilience breaks where visibility ends.
No study of Internet Resilience is complete without considering third-party dependencies, which are critical to digital success. These dependencies must be monitored for service level adherence.
Agent-based Application Performance Monitoring (APM) cannot monitor third parties, but IPM can. Even though third parties monitor their own services, organizations need to monitor those services themselves to proactively ensure reliability and performance, since providers may not be forthcoming about service level-related incidents. 74% of organizations consider third-party providers highly or extremely critical to their resilience success. This underscores the importance of comprehensive monitoring to maintain service quality and prevent disruptions caused by external dependencies.
Organizations are leaning hard on AI. The question is whether their resilience strategies can keep up.
AI can’t fail quietly—and yet, in many organizations, it still does
We now live and work in an “AI or die” environment. This is because AI is essential for business success, ensuring the smooth operation of critical applications. When these applications experience downtime or slow performance, it can disrupt and damage business operations, leading to financial losses and reputational harm.
A proper Internet Performance Monitoring strategy is crucial for AI implementations – 36% of organizations cite it as the mechanism for knowing when their AI is either down or slow – as it helps detect and resolve issues promptly. For instance, 57% of respondents recognize the impact immediately when AI supporting Tier 1 apps becomes unavailable or slower. This highlights the importance of robust monitoring to maintain AI efficiency and prevent operational disruptions, ensuring continuous business success.
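As a minimal sketch of what such proactive monitoring involves (the endpoint URL and threshold below are hypothetical placeholders, not a specific product API), a probe that treats "slow" the same as "down" might look like this:

```python
# Probe a hypothetical AI endpoint and report OK / SLOW / DOWN.
# The URL and the 1500 ms threshold are illustrative assumptions.
import time
import urllib.request

SLOW_MS = 1500.0   # beyond this, treat the service as effectively down

def probe(url: str) -> str:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            elapsed_ms = (time.monotonic() - start) * 1000
            if resp.status != 200:
                return f"DOWN (HTTP {resp.status})"
            return "SLOW" if elapsed_ms > SLOW_MS else f"OK ({elapsed_ms:.0f} ms)"
    except Exception as exc:   # timeouts, DNS failures, HTTP errors
        return f"DOWN ({exc})"

print(probe("https://ai.example.com/health"))  # placeholder endpoint
```

A real IPM deployment would run probes like this from many vantage points and alert on the same "down or slow" signal; the sketch only shows the shape of the check.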
Failing to invest in AI is no longer a neutral decision—it’s a risk.
Organizations should prioritize AI investments over the next 18 months to avoid the negative opportunity cost of inaction. Failing to invest in AI can lead to inefficiencies, competitive disadvantages, and missed growth opportunities.
...with only 4% expecting to decrease their AI spending. This trend underscores the critical role AI plays in enhancing reliability and resilience. By investing in AI, businesses can ensure robust performance, mitigate risks, and drive innovation, securing their position in an increasingly AI-driven market.
Proof that organizations should not go it alone
Further to the critical dependencies on third parties, most organizations rely on third-party AI capabilities – especially third-party AI capabilities for ensuring critical Tier 1 application resilience. This approach is favored over building in-house solutions due to the expertise, scalability, and cost-effectiveness offered by third-party providers.
In contrast, 49% are considering adding their own AI, and 54% have built their own. The reliance on third-party AI allows businesses to leverage advanced technologies without the significant investment and time required for developing proprietary solutions, ensuring faster and more reliable resilience measures.
Let’s call it what it is: resilience is no longer a background process. It’s the main event. The findings of this report speak for themselves. Websites that merely “stay up” don’t cut it anymore.
That’s not downtime. That’s damage.
The lesson? If you’re not actively investing in purpose-built Internet Performance Monitoring, you’re flying blind. If you’re still relying on broad, one-size-fits-all tools, you're solving a modern problem with yesterday’s kit. And if AI isn't part of your resilience story yet, it will be—whether by choice or by consequence.
Building a resilient Internet Stack is not just about preventing downtime; it's about ensuring every digital interaction is seamless and engaging, driving long-term growth and cementing customer loyalty. That means best-of-breed IPM tools, smart use of AI, and a robust Internet Stack that’s monitored from the outside in.
The businesses that invest in Internet Resilience now won't just stay online—they'll stay ahead.
Explicit Congestion Notification (ECN) is a longstanding mechanism in the IP stack that allows the network to help endpoints "foresee" congestion between them. The concept is straightforward: if a close-to-congested piece of network equipment, such as a router along the path, could tell the endpoints, "Hey, I'm almost congested! Can you two slow down your data transmission? Otherwise, I’m worried I will start to lose packets...", then the two endpoints can react in time to avoid the packet loss, paying only the price of a minor slowdown.
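Concretely, RFC 3168 encodes ECN in the two low-order bits of the IP ToS/Traffic Class byte. A minimal Python sketch of reading and setting those bits (illustrative only; the helper names are ours):

```python
# ECN occupies the two least-significant bits of the IP ToS byte (RFC 3168).
ECN_CODEPOINTS = {
    0b00: "Not-ECT (endpoint does not support ECN)",
    0b01: "ECT(1)  (ECN-capable transport)",
    0b10: "ECT(0)  (ECN-capable transport)",
    0b11: "CE      (Congestion Experienced, set by a router)",
}

def ecn_bits(tos: int) -> int:
    """Extract the 2-bit ECN field from the 8-bit ToS byte."""
    return tos & 0b11

def with_ecn(tos: int, codepoint: int) -> int:
    """Return the ToS byte with its ECN field replaced."""
    return (tos & ~0b11) | (codepoint & 0b11)

tos = with_ecn(0x00, 0b10)            # mark a packet as ECT(0)
print(ECN_CODEPOINTS[ecn_bits(tos)])  # -> ECT(0) (ECN-capable transport)
```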
ECN bleaching occurs when a network device at any point between the source and the endpoint clears or “bleaches” the ECN flags. Since you must arrive at your content via a transit provider or peering, it’s important to know if bleaching is occurring and to remove any instances.
With Catchpoint’s Pietrasanta Traceroute, we can send probes with IP-ECN values different from zero to check hop by hop what the IP-ECN value of the probe was when it expired. We may be able to tell you, for instance, that a domain is capable of supporting ECN, but an ISP in between the client and server is bleaching the ECN signal.
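To make that approach concrete, here is a simplified sketch of the same idea using the open-source scapy library. This is an illustration of the technique, not Pietrasanta Traceroute itself, and the destination is a placeholder. Each ICMP Time Exceeded reply quotes the expired probe's IP header, so we can inspect what the ECN bits were when the probe died at that hop:

```python
# Requires root privileges to send raw packets.
from scapy.all import IP, UDP, ICMP, IPerror, sr1

def check_ecn_path(dst: str, max_hops: int = 20, ecn: int = 0b10):
    """Send ECT(0)-marked UDP probes with rising TTLs; report bleaching."""
    for ttl in range(1, max_hops + 1):
        probe = IP(dst=dst, ttl=ttl, tos=ecn) / UDP(dport=33434)
        reply = sr1(probe, timeout=2, verbose=0)
        if reply is None:
            print(f"{ttl:2d}  *")                 # hop did not answer
            continue
        if IPerror in reply:                      # quoted expired probe
            seen = reply[IPerror].tos & 0b11      # ECN bits at that hop
            status = "ok" if seen == ecn else f"BLEACHED (now {seen:#04b})"
            print(f"{ttl:2d}  {reply.src:15s}  ECN {status}")
        if reply.haslayer(ICMP) and reply[ICMP].type == 3:
            break                                 # destination reached

check_ecn_path("example.com")   # placeholder destination
```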
ECN is an essential requirement for L4S since L4S uses an ECN mechanism to provide early warning of congestion at the bottleneck link by marking a Congestion Experienced (CE) codepoint in the IP header of packets. After receipt of the packets, the receiver echoes the congestion information to the sender via acknowledgement (ACK) packets of the transport protocol. The sender can use the congestion feedback provided by the ECN mechanism to reduce its sending rate and avoid delay at the detected bottleneck.
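As a rough intuition for that control loop, the toy model below (illustrative numbers only; not the actual L4S or TCP Prague algorithm) shows a bottleneck marking CE once its queue builds, and the sender trimming its rate on the echoed feedback before any packet is lost:

```python
# Toy ECN feedback loop: mark instead of drop, back off before loss.
LINK_CAPACITY = 100.0   # packets per tick the bottleneck can drain
MARK_THRESHOLD = 20.0   # queue depth that triggers CE marking

rate, queue = 140.0, 0.0
for tick in range(8):
    queue = max(0.0, queue + rate - LINK_CAPACITY)
    ce_marked = queue > MARK_THRESHOLD      # router sets CE in the IP header
    if ce_marked:                           # receiver echoes the mark in ACKs
        rate *= 0.9                         # sender backs off early
    else:
        rate += 2.0                         # otherwise probe for more bandwidth
    print(f"tick {tick}: rate={rate:6.1f} queue={queue:5.1f} CE={ce_marked}")
```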
ECN and L4S need to be supported by the client and server but also by every device within the network path. It only takes one instance of bleaching to remove the benefit of ECN since if any network device between the source and endpoint clears the ECN bits, the sender and receiver won’t find out about the impending congestion. Our measurements examine how often ECN bleaching occurs and where in the network it happens.
ECN has been around for a while, but with the growth in data volumes and rising user experience expectations – particularly for streaming – ECN is vital for L4S to succeed, and major investments are being made by large technology companies worldwide.
L4S aims at reducing packet loss – and hence the latency caused by retransmissions – and at providing as responsive a set of services as possible. In addition, we have seen significant momentum from major companies lately, which always helps push a new protocol toward deployment.
If ECN bleaching is found, this means that any methodology built on top of ECN to detect congestion will not work.
Thus, you cannot rely on the network to achieve what you want, i.e., avoiding congestion before it occurs – potential congestion is marked with the Congestion Experienced (CE = 3) codepoint when detected, and bleaching wipes that information out.
The causes behind ECN bleaching are multiple and hard to identify, from network equipment bugs to debatable traffic engineering choices and packet manipulations to human error.
For example, bleaching could occur from mistakes such as overwriting the whole ToS field when dealing with DSCP instead of changing only DSCP (remember that DSCP and ECN together compose the ToS field in the IP header).
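In code, that mistake looks like the following. The byte layout follows the IP header definition (DSCP in the top 6 bits, ECN in the bottom 2); the function names are ours for illustration:

```python
# DSCP remarking done wrong vs. right. Rewriting the whole ToS byte
# silently bleaches any ECN marking carried in its low two bits.
def remark_dscp_buggy(tos: int, dscp: int) -> int:
    return dscp << 2                   # wipes the ECN bits: bleaching

def remark_dscp_correct(tos: int, dscp: int) -> int:
    return (dscp << 2) | (tos & 0b11)  # preserves the ECN field

tos = (0 << 2) | 0b11                      # best-effort DSCP, CE marked
print(bin(remark_dscp_buggy(tos, 46)))     # 0b10111000 -> CE mark lost
print(bin(remark_dscp_correct(tos, 46)))   # 0b10111011 -> CE mark kept
```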
Nowadays, network operators have a good number of tools to debug ECN bleaching from their end, including Catchpoint’s Pietrasanta Traceroute. The large-scale measurement campaign presented here is an example of a worldwide campaign to validate ECN readiness. Individual network operators can run similar measurement campaigns across networks that are important to them (for example, customer or peering networks).
The findings presented here are based on running tests using Catchpoint’s enhanced traceroute, Pietrasanta Traceroute, through the Catchpoint IPM portal to collect data from over 500 nodes located in more than 80 countries all over the world. By running traceroutes on Catchpoint’s global node network, we are able to determine which ISPs, countries and/or specific cities are having issues when passing ECN marked traffic. The results demonstrate the view of ECN bleaching globally from Catchpoint’s unique, partial perspective. To our knowledge, this is one of the first measurement campaigns of its kind.
Beyond the scope of this campaign, Pietrasanta Traceroute can also be used to determine if there is incipient congestion and/or any other kind of alteration and the level of support for more accurate ECN feedback, including if the destination transport layer (either TCP or QUIC) supports more accurate ECN feedback.