Last week AWS experienced an eight-hour outage caused by a DDoS attack on the cloud provider’s Route 53 DNS web service. As discussed in a previous post, Catchpoint detected the incident five hours before AWS acknowledged the problem.
This isn’t the first time an AWS outage has made headlines, nor will it be the last. In 2017, an operator’s typo led to a four-hour AWS S3 outage, costing Amazon a reported $150 million in SLA penalties. In 2011, the Amazon EC2 outage highlighted the problems associated with relying on a single service provider’s hosting environment. In 2010, in one of our first posts, we used a three-hour Amazon outage to illustrate why performance monitoring matters.
What this cursory review of the last decade makes clear is that outages are not a question of if, but rather when. To its credit, AWS had DNS redundancy in place, and packets were eventually re-routed through Neustar ASN. Once more we return to the main issue: if you don’t know there’s a problem, then you don’t know to put these plans into action.
The image above shows AWS re-routing the packets through Neustar ASN.
Let’s take a look at some proactive measures you can take not only to mitigate the impact of an attack, but also to detect when and where an incident occurred.
Don’t miss Nith Mehta, VP of Technical Services at Catchpoint, lead a session on what caused the AWS DDoS attack, how Catchpoint detected the incident, and why AWS was unaware there was an issue for nearly five hours in our new webinar: Anatomy of a DDoS Attack
Step 1: Alert for Anomalies
No surprise here: the first place to start is with your current DNS monitoring strategy. Whether operated internally or by an external managed provider, it must include synthetic monitoring. With synthetic monitoring, you can collect performance data from across multiple geographical regions where end-users reside. More specifically, synthetic tests capture performance and uptime data from:
- DNS records stored on nameservers
- Root nameservers distributed around the world
- Top-level domain (TLD) server records
This data enables setting baselines for performance, which can then inform internal SLOs and alerting systems. Integrations with alerting tools like VictorOps reduce MTTR for DNS issues by supplying actionable data to incident response, using machine learning to identify patterns and escalate accordingly.
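As a rough illustration of how a baseline can feed an alert, the sketch below flags a DNS lookup time that sits well above the historical mean. The function names, the sample latencies, and the three-sigma threshold are all assumptions made for the example; they are not taken from any particular monitoring or alerting product.

```python
from statistics import mean, stdev


def build_baseline(samples_ms):
    """Return (mean, standard deviation) of historical DNS lookup times."""
    return mean(samples_ms), stdev(samples_ms)


def is_anomalous(latency_ms, baseline, threshold_sigmas=3.0):
    """True when a measurement exceeds the mean by threshold_sigmas deviations."""
    avg, sd = baseline
    return latency_ms > avg + threshold_sigmas * sd


# Made-up lookup times in milliseconds, standing in for synthetic test results.
history = [42, 40, 45, 43, 41, 44, 39, 46]
baseline = build_baseline(history)

print(is_anomalous(44, baseline))   # False: within normal variation
print(is_anomalous(250, baseline))  # True: far above the baseline
```

A fixed sigma threshold is only one possible choice; percentile-based or seasonal baselines may suit DNS traffic with strong daily patterns better.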
Over the last twelve months there has been renewed interest in synthetic monitoring. APM vendors Dynatrace, Datadog, and New Relic are promoting “next generation synthetics” that claim to solve the problems of the past by migrating monitoring locations to cloud environments like AWS, Azure, and GCP. In a warning to organizations evaluating managed DNS providers, Drit Suljoti, CPO & Co-Founder at Catchpoint, explained the situation as follows:
If you’re thinking, for instance, that Amazon Route 53 will be your new DNS replacement or the new DNS vendor to put in your multi-DNS strategy, it’s important to realize that monitoring using “Synthetics” from an APM vendor like Dynatrace, NewRelic, AppDynamics, or Datadog would be monitoring the DNS from Amazon’s own datacenters. Latency will be almost zero, and availability will be the highest it possibly could be – as the synthetic monitoring agent does not have to go across the internet, just to the next room down in the same physical location and from the same network… What matters is how reachable the service is for your end users, not how reachable it is from Amazon.
This hypothetical from August 2019 is not too far off from what happened in last week’s DDoS attack. AWS was unaware of DNS failures and latencies, which suggests its own DNS solution as well as its externally managed DNS failsafe operate in a centralized cloud environment. Because Catchpoint monitors DNS performance from geographically distributed endpoints, we were able to capture data from distributed nameservers, authoritative nameservers, and TLDs. This allowed us to detect the initial failures and, upon further analysis, determine that the outage was limited to AWS S3.
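To make the vantage-point argument concrete, here is a minimal sketch that groups test results by where they were measured and reports which locations fall below an availability target. The vantage-point labels, the sample results, and the 99% target are hypothetical; the point is only that an in-cloud probe can look healthy while probes on end-user networks do not.

```python
from collections import defaultdict


def availability_by_vantage(results):
    """Map each vantage point to the fraction of its tests that succeeded."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for vantage, succeeded in results:
        totals[vantage] += 1
        if succeeded:
            successes[vantage] += 1
    return {v: successes[v] / totals[v] for v in totals}


def degraded_vantages(results, min_availability=0.99):
    """Return the vantage points whose availability is below the target."""
    availability = availability_by_vantage(results)
    return sorted(v for v, a in availability.items() if a < min_availability)


# Hypothetical (vantage point, test succeeded) pairs.
results = [
    ("in-cloud", True), ("in-cloud", True), ("in-cloud", True), ("in-cloud", True),
    ("isp-east", True), ("isp-east", False), ("isp-east", False), ("isp-east", True),
    ("isp-eu", True), ("isp-eu", True), ("isp-eu", True), ("isp-eu", True),
]

print(degraded_vantages(results))  # ['isp-east']
```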
Step 2: Be Ready to Scale
DDoS attacks are one of the most commonly seen cybersecurity threats, and for good reason. Since they generate large volumes (aka floods) of traffic, DDoS attacks can be easy to spot, but hard to solve without scaling up capacity to relieve network resources of unexpected strain. There’s a fine balance to strike here, as noted by AWS in a recent article on DDoS Mitigation, insofar as scaling up ensures site availability to your customers but may also be a costly method for filtering out bot traffic.
Pro-tip: check your SLA to see whether your DNS provider offers cost protection for DDoS attacks, and if not, make sure to raise this during your regularly scheduled SLA review session.
Step 3: Decouple Your Infrastructure
To avoid cascading failures across your entire network you need to reduce dependencies between layers and within layers. This is especially true for any mission-critical components. Rebuilding monolithic applications using a microservices architecture is no light undertaking, but the benefits will be worth it.
For instance, loosely coupled infrastructure is easier to scale and more affordable, since you’re increasing capacity for select components through microservices rather than adding traditional servers for an entire network layer.
DNS monitoring in a decoupled architecture has advantages as well. DNS Experience tests can isolate and analyze the health of individual components of your DNS system. DNS Experience testing offers insights into where and why network latency and performance degradation are occurring. During the AWS attack, DNS Experience testing helped determine that the outage was contained to AWS S3.
Step 4: Implement DNS Redundancy
As a single point of entry to the internet, DNS is a popular target for cyber-attacks. In 2018, organizations weathered an average of 9 DNS attacks, a 34% increase from 2017, according to EfficientIP and IDC in the 2019 Global DNS Threat Report. The number of incidents wasn’t the only figure to see a year-over-year rise:
- The average cost of a DNS attack in the United States exceeded $1.27 million
- 48% of all organizations that experience a DNS attack lose more than $500,000
- 10% of all organizations that experience an attack lose over $5 million
DNS redundancy is one way to reduce MTTD and MTTR following a cyber-attack as it provides a failsafe solution should a DDoS attack cause an outage. Deploying multiple DNS servers in different networks lessens the chance that a single DDoS attack can compromise DNS availability for an entire managed provider’s customer base. While expensive, the cost of not preparing a backup DNS system may be much higher.
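The failover idea can be modeled in a few lines. In practice, recursive resolvers do this retrying themselves when a zone is delegated to nameservers from more than one provider; the sketch below only imitates that behavior with hypothetical stand-in lookup functions and a documentation-range IP address, so it should be read as an illustration rather than a deployment recipe.

```python
def resolve_with_failover(hostname, providers):
    """Try each (provider_name, lookup) pair in order and return the first answer."""
    errors = []
    for provider_name, lookup in providers:
        try:
            return provider_name, lookup(hostname)
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{provider_name}: {exc}")
    raise RuntimeError("all DNS providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for two independent DNS providers.
def primary_lookup(hostname):
    raise TimeoutError("no response")  # simulates a provider under attack


def secondary_lookup(hostname):
    return "203.0.113.10"  # placeholder address from a documentation range


providers = [("primary", primary_lookup), ("secondary", secondary_lookup)]
print(resolve_with_failover("example.com", providers))
# ('secondary', '203.0.113.10')
```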
DNS monitoring should be an integral part of any monitoring strategy and you must:
- Monitor availability with DNS Direct Name Server tests
- Monitor latency and performance with DNS Experience tests
- Protect against DNS hijacks and cache poisoning
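For readers who want to see what querying a nameserver directly involves, the following is a minimal standard-library sketch. It is not Catchpoint’s implementation of either test type; the function names, query ID, and timeout are assumptions. It sends a single A-record query over UDP to one nameserver and reports whether a valid answer came back and how long it took.

```python
import socket
import struct
import time

QUERY_ID = 0x1234  # arbitrary ID used to match the response to the query


def build_query(hostname, query_id=QUERY_ID, qtype=1):
    """Build a minimal DNS query packet (qtype 1 = A record, class IN)."""
    # Header: ID, flags (all zero: standard query, recursion not requested),
    # then QDCOUNT=1 and zero answer, authority, and additional counts.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    labels = hostname.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def check_nameserver(nameserver_ip, hostname, timeout=2.0):
    """Query one nameserver directly; return (available, latency_ms or None)."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(query, (nameserver_ip, 53))
            data, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable networks
            return False, None
        elapsed_ms = (time.monotonic() - start) * 1000
    if len(data) < 4:
        return False, elapsed_ms
    response_id, flags = struct.unpack("!HH", data[:4])
    rcode = flags & 0x000F  # 0 means NOERROR
    return response_id == QUERY_ID and rcode == 0, elapsed_ms
```

Calling `check_nameserver("192.0.2.53", "example.com")`, with the placeholder IP replaced by one of your zone’s authoritative nameservers, returns a pair such as `(True, 23.4)`. Running the same check from several networks is what distinguishes a local blip from a provider-wide problem.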
With 2020 right around the corner, it’s the perfect time to reevaluate your current monitoring strategies, especially when it comes to network fundamentals like DNS, Traceroute, and BGP. With the release of Network Insights, there’s no excuse not to have proactive network monitoring in place that provides insight into the experience of your entire user base.