Dyn DNS Outage - a Doomsday for the Internet

Published

October 23, 2016

Mehdi Daoudi

What was supposed to be a quiet Friday suddenly turned into a real “Black Friday” for us (as well as most of the Internet) when Dyn suffered a major DDoS attack. In terms of widespread damage, it was the worst Internet disruption I have ever experienced.

At the core of it all, the managed DNS provider Dyn was targeted in a DDoS attack that impacted thousands of web properties, services, SaaS providers, and more.

The chart below shows the DNS resolution time and availability of twitter.com from around the world. There were three clear waves of outages:

7:10 EST to 9:10 EST
11:52 EST to 16:33 EST
19:13 EST to 20:38 EST

The DNS failures were the result of Dyn nameservers failing to respond to DNS queries within four seconds.
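To make the four-second threshold concrete, here is a minimal sketch of a synthetic DNS probe that sends one query straight to a nameserver and reports a timeout if nothing comes back in time. It is not our production measurement code; the function names `build_query` and `probe` and the result strings are assumptions made for illustration, and it uses only the Python standard library.

```python
import random
import socket
import struct


def build_query(hostname: str, query_id: int) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (0 = standard query), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # Question name: each label is prefixed by its length; a zero byte ends it.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (Internet).
    return header + qname + struct.pack("!HH", 1, 1)


def probe(nameserver_ip: str, hostname: str,
          timeout: float = 4.0, port: int = 53) -> str:
    """Send one query over UDP; return "ok", "timeout", "error", or "bad-reply"."""
    query_id = random.randint(0, 0xFFFF)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname, query_id), (nameserver_ip, port))
            reply, _ = sock.recvfrom(512)
        except socket.timeout:
            # No answer within the threshold: counted as a DNS failure.
            return "timeout"
        except OSError:
            return "error"
    # Accept the reply only if it echoes the query ID we sent.
    if len(reply) >= 2 and struct.unpack("!H", reply[:2])[0] == query_id:
        return "ok"
    return "bad-reply"
```

Running `probe` on a schedule from many locations against each authoritative nameserver is roughly what a synthetic DNS test does; a real one would also parse the answer and record the response time.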

We were impacted in three ways:

Our domain Catchpoint.com was not reachable for a solid 30 minutes until we introduced our secondary managed DNS provider, Verisign. We also brought up and publicized to our customers a backup domain that was never on Dyn, so our customers could log in to our portal and keep an eye on their online services. All of these were in standby mode prior to the incident.
Our nodes could not reliably talk to our globally distributed command and control systems until we switched to IP-only mode, bypassing DNS lookups. This feature had been developed, tested, and deployed to production, but it was not yet active because our engineering teams had planned one more enhancement. Given the situation, we judged that activating it without that enhancement was a lower risk than what we were experiencing.
Many of the third-party vendors our company relies on stopped working: our customer support and online help solution, CRM, office door badging system, SSO, two-factor authentication services, one of our CDNs, a file sharing solution, and the list goes on and on.
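The IP-only mode in the second item can be pictured with a small sketch. This is not our actual implementation; the function `pick_address`, the `ip_only` flag, and the hostname and addresses in the usage note are hypothetical, chosen only to show the idea of skipping or falling back from DNS to a pinned list of addresses.

```python
import socket
from typing import Callable, Sequence, Tuple


def pick_address(
    hostname: str,
    pinned_ips: Sequence[str],
    ip_only: bool = False,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> Tuple[str, str]:
    """Return (ip, source), where source is "dns" or "pinned"."""
    if not ip_only:
        try:
            return resolver(hostname), "dns"
        except OSError:
            # socket.gaierror is a subclass of OSError; fall through to pinned IPs.
            pass
    if not pinned_ips:
        raise RuntimeError(f"no pinned address available for {hostname}")
    # IP-only mode (or a failed lookup): use the first pinned address.
    return pinned_ips[0], "pinned"
```

With `ip_only=True`, a call such as `pick_address("control.example.com", ["192.0.2.10"], ip_only=True)` never touches DNS at all, which is the behavior we needed while the nameservers were unreachable.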

This blog post is not about finger-pointing; the folks at Dyn had a horrible day living through their worst nightmare. They did an amazing job of dealing with it, from notifications to extinguishing the fire. This is about how to deal with a worst-case outage, as a company and as an industry.

As with every outage, it’s important to take the time to reflect on what took place and how this can be avoided in the future.

Here are some of my takeaways from Friday, and the must-have solutions:

DNS is still one of the weakest links in our Internet infrastructure and digital economy. We have to keep learning and sharing that knowledge with each other. Here are several articles we have written on DNS.
A single DNS provider is not an option anymore for anyone. No company, small or large, can rely on a single DNS provider.
DNS vendors should create knowledge base articles about how to introduce secondary DNS providers, and they must be easy to find and follow.
DNS vendors need to make the setup of auto-transfer easier to find. Having to open a ticket in the middle of a crisis to find out the IP of the xtransfer name servers is simply not a viable option.
DNS vendors should not set high TTLs (two days) on the authoritative nameserver records they return in DNS responses, and it should be easy to lower or change that TTL. While a long TTL is great for avoiding record changes at the TLDs, making the nameservers authoritative for two days becomes a headache when trying to switch to or migrate from a backup solution.
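One quick way to check the single-provider risk described above is to look at a zone's NS records and see whether they all belong to the same vendor. The sketch below is a rough illustration only: the function names are hypothetical, the nameserver hostnames in the usage note are placeholders, and grouping by the last two labels of the hostname is a simplifying assumption that does not hold for every vendor or TLD.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_provider(nameservers: Iterable[str]) -> Dict[str, List[str]]:
    """Group NS hostnames by a naive provider key (last two DNS labels)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ns in nameservers:
        labels = ns.rstrip(".").lower().split(".")
        provider = ".".join(labels[-2:])
        groups[provider].append(ns)
    return dict(groups)


def has_single_provider(nameservers: Iterable[str]) -> bool:
    """True when every nameserver maps to the same provider key."""
    return len(group_by_provider(nameservers)) <= 1
```

For example, `has_single_provider(["ns1.dns-a.example", "ns2.dns-a.example"])` flags the zone, while adding `"a1.dns-b.example"` to the list clears it.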

Introducing another DNS vendor does not achieve 100% of the result until you go into the Dyn configuration and add that other solution to the mix.

The community must work together to come up with commercial or open source solutions to make DNS configurations compatible between vendors (this is for complex DNS setups like failover, geo load balancing, etc.). This is no longer a nice-to-have, but a must-have.
There needs to be a way to push registrar configurations faster. We need an emergency reload button at the registrar level, but also at the Root and TLD levels. And this means some way to tell DNS resolvers to purge their cache. Waiting for two days is not going to work after this catastrophic event, just like having multiple DNS vendors active does not scale financially for most companies.
Lastly, it is time for the Internet Engineering Task Force to take a very close look at the DNS standards and figure out how to make this key protocol more redundant and flexible to deal with the challenges we are facing today.
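The two-day wait comes straight from how caching works: a resolver that fetched the old nameserver records just before a change may keep serving them for a full TTL. As a small worked example (the function name and the timestamp in the usage note are illustrative assumptions, not data from the incident):

```python
from datetime import datetime, timedelta, timezone

TWO_DAYS = 2 * 24 * 60 * 60  # 172800 seconds


def worst_case_propagation(change_time: datetime, ttl_seconds: int) -> datetime:
    """Latest moment a resolver could still be serving the old record.

    Worst case: the resolver cached the old record an instant before the
    change, so it holds on to it for one full TTL.
    """
    return change_time + timedelta(seconds=ttl_seconds)
```

For a change made at noon UTC on some day, `worst_case_propagation(change, TWO_DAYS)` lands at noon two days later, while the same call with a 300-second TTL lands five minutes after the change. That gap is the argument for keeping nameserver TTLs easy to lower.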

Some takeaways from a monitoring standpoint:

I had people tell me, “But Mehdi, I am not seeing a problem in my RUM.” When your site isn’t reachable, RUM won’t tell you anything because there is no user activity to show. This is why your monitoring strategy must include both synthetic monitoring and RUM.

DNS monitoring is critical to understand the “why.”
DNS performance impacts web performance.
The impact was so far-reaching that some sites that didn’t rely on Dyn still suffered outages or a bad user experience, because they used third parties that did rely on Dyn.

We interact with many things on a daily basis (cars, cell phones, planes, hair dryers) that have some sort of certification. I urge whoever is responsible to consider the following:

A ban on any Internet-connected device that does not force a change of the default credentials when it is first started. There shouldn’t be Admin/Admin for anything, including cameras, refrigerators, access points, routers, etc.
A ban on accessing such devices from anywhere on the Internet. There should be some limitation: access either through the provider interface or from the local network.
Consumers should also pressure the industry by not buying products that aren’t safe. Maybe we need an “Internet Safety Rating” from a governmental agency or worldwide organization.
A must-have feature on every home and SMB router and access point is the ability to detect abnormal traffic or activity and turn it off or slow it down; sending thousands of DNS requests in a minute is not normal. We should learn from Microsoft and what they did with Windows XP to limit an infected host.
Local ISPs must have capabilities to detect and stop rogue traffic.
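Detecting "thousands of DNS requests in a minute" does not require anything exotic. Here is a minimal sliding-window sketch of what a router could do per connected device; the class name `QueryRateMonitor` and the default threshold of 1,000 queries per 60 seconds are assumptions for illustration, not a recommendation from any vendor.

```python
from collections import deque


class QueryRateMonitor:
    """Sliding-window counter of DNS queries seen from one device."""

    def __init__(self, max_queries: int = 1000, window_seconds: float = 60.0):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now: float) -> bool:
        """Record one query at time `now` (seconds); True means abnormal rate."""
        self._timestamps.append(now)
        # Drop queries that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_queries
```

A router would keep one monitor per device and throttle or block the device once `record` returns `True`; choosing the threshold is the hard part and would need real traffic data.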

The state of cybersecurity is dire. I hope this incident serves as a huge wake-up call for everyone. What happened Friday was a Code Blue event; we rely on the Internet for practically everything in society today, and it’s our job to do everything we can to protect it.

Thank you to Dyn for the prompt responses to our support tickets, to Verisign for answering last-minute questions, to our customers, who were very patient and understanding, to our entire support organization, and to some special friends in major companies who offered a helping hand by providing some amazing advice around DNS.

Mehdi – Catchpoint CEO and Co-Founder

To learn more about how you can handle a major outage like this in the future, join our upcoming Ask Me Anything: OUTAGE! with VictorOps, Target, and Release Engineering Approaches.
