AWS Outage: How do you prepare for the failure of your own safety net?
When AWS’s massive outage struck, it didn’t just take down cloud services, apps, and enterprise platforms. It also knocked out many of the monitoring systems organizations depend on for real-time answers. Observability companies, including Datadog, New Relic, Checkly, Dynatrace, SpeedCurve, and Splunk Observability, lost visibility or functionality precisely when organizations needed them most.
In those first, chaotic 15–20 minutes, before anyone knew the problem was AWS East, most teams were flying blind. They could not see what was really happening, simply because their monitoring tools live in the very cloud that had just failed.
How do you ensure ongoing visibility and rapid response when your cloud-hosted monitoring tools might go down too? What steps can you take today to protect your organization from getting caught out next time a major outage hits? Let’s get into it.
What happens when your monitoring fails?
This question is not rhetorical; it's essential. As Adrian Cockcroft writes, monitoring systems themselves can fail, and when they do, organizations are left without the visibility needed to respond effectively. You need monitoring that is independent of the systems being monitored.
The AWS outage demonstrated this principle. Organizations that relied solely on cloud-hosted monitoring tools found themselves blind to the very outage impacting their operations. The lesson is clear: your monitoring strategy must account for the possibility that your monitoring tools themselves can fail.
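One way to guard against a silent monitoring failure is a dead man's switch: the monitoring system sends a periodic heartbeat to a small watchdog hosted somewhere else, and the watchdog raises the alarm when the heartbeats stop. The Python sketch below shows only the staleness check. The function name, the two-minute threshold, and the timestamps are illustrative assumptions, not part of any vendor's product.

```python
from datetime import datetime, timedelta, timezone


def monitoring_is_silent(last_heartbeat, now, max_silence=timedelta(minutes=2)):
    """Return True when the monitoring system has not checked in within max_silence.

    Hypothetical helper: both timestamps are timezone-aware datetimes, and the
    two-minute default is an arbitrary example threshold.
    """
    return now - last_heartbeat > max_silence


# Example with made-up timestamps: the last heartbeat arrived five minutes ago.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = now - timedelta(minutes=5)

if monitoring_is_silent(last_seen, now):
    print("Monitoring has gone quiet: page on-call through an independent channel")
```

The watchdog only helps if it runs on infrastructure that does not share a cloud provider or region with the monitoring system it watches, and if it pages through a channel that is equally independent.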
How can you prevent massive outages from catching your business off guard?
To avoid getting caught out by future incidents, organizations need to rethink their approach to monitoring and resilience. Here are the key steps:
#1. Your monitoring can’t live in the same cloud you’re trying to monitor
Relying on monitoring tools that are hosted exclusively in the same cloud environment as your critical systems or vendors creates a significant risk. If that cloud experiences an outage, both your production systems and your monitoring can go down together, leaving you without visibility at the most critical moment.
Even as we wrap up this post at 4pm ET, many of the leading monitoring platforms still report incidents on their status pages.
#2. Map your critical third-party dependencies
Most organizations are not fully aware of all the dependencies their digital systems rely on. If you're a retailer, for example, even if your website is architected to survive a cloud provider outage, it still depends on DNS, BGP routing, SSL certificates, CDN services, payment processing systems, third-party APIs, cloud services, and dozens of smaller components. Any of these can fail and impact your operations.
Action: Conduct a thorough audit of your critical systems. Internet dependency mapping tools like Internet Stack Map can help you visualize and document every protocol, service, and third-party vendor your operations rely on, making it easier to pinpoint single points of failure and strengthen your resilience.
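Once the audit exists as data, spotting single points of failure can be partly automated. The following is a minimal Python sketch, assuming a hand-maintained inventory in which every service, layer, and provider name is hypothetical. It flags each layer where a service depends on exactly one provider, and it does not represent how Internet Stack Map or any other tool works.

```python
from collections import defaultdict

# Hypothetical inventory: service -> list of (layer, provider) pairs.
DEPENDENCIES = {
    "storefront": [
        ("dns", "dns-vendor-a"),
        ("dns", "dns-vendor-b"),
        ("cdn", "cdn-vendor-a"),
        ("cloud", "cloud-a"),
    ],
    "checkout": [
        ("dns", "dns-vendor-a"),
        ("payments", "pay-vendor-a"),
        ("cloud", "cloud-a"),
    ],
    "monitoring": [("cloud", "cloud-a")],
}


def single_points_of_failure(deps):
    """Map each (layer, provider) with no alternative to the services it can take down."""
    spof = defaultdict(set)
    for service, entries in deps.items():
        providers_by_layer = defaultdict(set)
        for layer, provider in entries:
            providers_by_layer[layer].add(provider)
        for layer, providers in providers_by_layer.items():
            if len(providers) == 1:  # no second provider for this layer
                (only_provider,) = providers
                spof[(layer, only_provider)].add(service)
    return {key: sorted(services) for key, services in spof.items()}


for (layer, provider), services in sorted(single_points_of_failure(DEPENDENCIES).items()):
    print(f"{layer}/{provider} is a single point of failure for: {', '.join(services)}")
```

In this made-up inventory, the monitoring service shares its only cloud provider with the storefront and checkout, which is exactly the pattern described in step #1.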
#3. Develop a resilience plan
For each critical system, develop a resilience plan that includes multi-cloud or multi-region architectures, redundant systems, and documented fallback procedures. Cloud engineering teams have embraced Chaos Engineering, a philosophy of deliberately testing failure scenarios to understand how systems break and how to respond. Because things will break; it's just a matter of when, and whether you're prepared.
Action: Implement chaos testing, tabletop exercises, and regular drills. Build redundancy where it matters most. Document and practice your incident response playbook.
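As a rough illustration of chaos testing at the smallest scale, the Python sketch below wraps a dependency call so that it fails on purpose, then confirms that the caller falls back instead of crashing. Every name here (with_chaos, fetch_price, the SKU, the price) is hypothetical, and real chaos experiments usually target infrastructure rather than a single function.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate a dependency outage."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that each call fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_price_primary(sku):
    # Stand-in for a call to a primary pricing service.
    return {"sku": sku, "price": 19.99, "source": "primary"}


def fetch_price_cached(sku):
    # Stand-in for a documented fallback, such as a local cache.
    return {"sku": sku, "price": 19.99, "source": "cache"}


def fetch_price(sku, primary, fallback):
    """Try the primary dependency and use the fallback when it fails."""
    try:
        return primary(sku)
    except InjectedFault:
        return fallback(sku)


# Drill: force the primary to fail on every call and check the fallback path.
always_down = with_chaos(fetch_price_primary, failure_rate=1.0, rng=random.Random(0))
result = fetch_price("sku-123", always_down, fetch_price_cached)
print(result["source"])  # cache
```

Passing the random generator in explicitly keeps the drill repeatable, which makes it easier to rerun the same failure scenario after a fix.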
#4. Invest in Internet Performance Monitoring
Many organizations have not invested enough in Internet Performance Monitoring (IPM), a solution that provides visibility into the performance and availability of every aspect of the internet stack, internal and external. Most monitoring tools are not internet-aware, so IPM supplies the awareness you need to detect, diagnose, and respond to outages across the entire stack.
Action: Deploy IPM to ensure you have external visibility into cloud providers, DNS, CDN, and other critical services.
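To make the idea of external visibility concrete, here is a rough Python sketch of outside-in probes for DNS, TLS, and HTTP using only the standard library. It is not an IPM product and not how any vendor implements one. The function names and timeouts are assumptions, and the results are only meaningful when the script runs from a network location outside the cloud being monitored.

```python
import socket
import ssl
import time
import urllib.request


def check_dns(host):
    """Resolve host and return the distinct IP addresses found."""
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def check_tls(host, port=443, timeout=5.0):
    """Complete a TLS handshake and return whole days until the certificate expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expires - time.time()) // 86400)


def check_http(url, timeout=5.0):
    """Fetch url and return the HTTP status code (4xx/5xx raise HTTPError)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def run_checks(checks):
    """Run each named probe, recording success, detail, and elapsed seconds."""
    results = {}
    for name, probe in checks.items():
        start = time.monotonic()
        try:
            detail = probe()
            ok = True
        except Exception as exc:  # any failure is itself a signal worth recording
            detail = f"{type(exc).__name__}: {exc}"
            ok = False
        results[name] = {"ok": ok, "detail": detail, "seconds": time.monotonic() - start}
    return results
```

A caller might pass something like run_checks({"dns": lambda: check_dns("example.com"), "tls": lambda: check_tls("example.com"), "http": lambda: check_http("https://example.com/")}), where example.com stands in for your own endpoints. Because each layer is probed separately, a failure points at DNS, certificates, or the application rather than just reporting that the site is down.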
#5. Make resilience a priority
Resilience should not be an afterthought. Hope is not a strategy: internet systems are interdependent and fragile. Learn about the resilience formula. Invest in the planning, redundancy, runbooks, and processes needed to make it a reality.
It can be hard to invest in resilience amid the push for consolidation, especially as most organizations have observability sticker shock. However, there is a strong argument for shifting the focus from more data to better data, reducing costs while improving actionable visibility.
Forward-thinking organizations are appointing a Chief Resilience Officer, a position that signals the importance of Internet resilience to the business and has the organizational power and focus to make it happen.
How Catchpoint caught the incident 16 minutes ahead of AWS reporting it
Catchpoint's Internet Sonar detected the AWS outage at 06:55 AM UTC — a full 16 minutes before AWS updated its status page at 07:11 AM UTC.


Those 16 minutes are not trivial. In a crisis, every minute counts. Picture the war room: executives demanding answers, teams racing to diagnose the issue, work grinding to a halt, calls flooding in from the CIO and CEO. If you don't know what's happening, those 16 minutes feel like an eternity.
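The 16-minute figure follows directly from the two timestamps quoted above. As a small worked example in Python, assuming both times fall on the same UTC day (the helper name is hypothetical):

```python
from datetime import datetime


def lead_time_minutes(detected, announced, fmt="%H:%M"):
    """Minutes between detection and public announcement on the same UTC day."""
    delta = datetime.strptime(announced, fmt) - datetime.strptime(detected, fmt)
    return delta.total_seconds() / 60


print(lead_time_minutes("06:55", "07:11"))  # 16.0
```

Tracking this number for every incident, for your own detection as well as the provider's status page, gives a simple measure of how much earlier independent monitoring lets you start responding.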
How early detection was possible
Internet Sonar continuously monitors billions of signals in real time, providing alerts and updates on key services worldwide. Take a peek at the Live Internet Outages Map powered by Internet Sonar.
If your monitoring platforms rely on AWS, they most likely lacked visibility into this incident. You probably learned about the outage from frustrated customers, users flooding support lines, or marketing teams monitoring social media. You may have been in the dark for 16 minutes or much longer about what was actually wrong.
Conclusion
AWS’s outage is a reminder of the fragility and complexity of the modern internet. Today it was AWS and DNS; tomorrow it could be Azure and a certificate issue; the day after, Google and BGP routing. Each layer of the Internet Stack, including cloud providers, DNS, CDN, SSL certificates, APIs, and payment processors, introduces risk. The question is not whether outages will happen, but how prepared you are when they do.
Organizations that rely solely on cloud-hosted monitoring tools risk lacking visibility into the very incidents that threaten their operations. The solution is clear: adopt independent, external monitoring; map your dependencies; invest in resilience; and prepare for failure.