The Importance of DNS for a Healthy Internet
Picking a DNS vendor is one of the most important choices you can make for the reliability of your services. It is also arguably the most important element of web performance to monitor. If DNS infrastructure goes down, every digital service relying on it goes down with it. This became alarmingly clear in 2016, when a massive series of DDoS attacks hit Dyn and many websites and applications on the Internet became inaccessible to huge swaths of users across North America and Europe.
The Domain Name System (DNS) is the Internet’s address system. This foundational technology is a complex web of private and public services that translates the domain name in the URL you type into your browser into the address of the website you want to visit. The DNS lookup and resolution process should take milliseconds, but if something goes wrong along the way, the page may load slowly or not at all, or worse, your request may be hijacked and redirected to another (potentially malicious) site.
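To make the “complex web” a little more concrete, here is a toy model of iterative resolution: a resolver starts at the root, follows referrals down to the domain’s authoritative nameserver, and records which servers it consulted along the way. This is a sketch under heavy assumptions. The server names, zones, and the 192.0.2.10 address (from a documentation-only range) are hypothetical, and real resolvers send wire-format queries over the network, cache answers, follow CNAMEs, and handle glue records, none of which is modeled here.

```python
from typing import Dict, List, Tuple

# Hypothetical, hard-coded zone data: each "server" maps a zone it knows about
# to either a referral (the next server to ask) or a final answer (an IPv4
# address from the 192.0.2.0/24 documentation range). None of these are real.
HIERARCHY: Dict[str, Dict[str, Tuple[str, str]]] = {
    "root": {"com.": ("referral", "tld-com")},
    "tld-com": {"example.com.": ("referral", "ns1.example-dns.net")},
    "ns1.example-dns.net": {"www.example.com.": ("answer", "192.0.2.10")},
}


def in_zone(name: str, zone: str) -> bool:
    """True if `name` equals `zone` or sits below it, compared on label boundaries."""
    return name == zone or name.endswith("." + zone)


def resolve(name: str, start: str = "root", max_steps: int = 10) -> Tuple[str, List[str]]:
    """Follow referrals down the hierarchy until a server returns an answer.

    Returns the answer plus the list of servers consulted (a simple trace).
    """
    server = start
    trace: List[str] = []
    for _ in range(max_steps):
        trace.append(server)
        zones = [z for z in HIERARCHY.get(server, {}) if in_zone(name, z)]
        if not zones:
            raise LookupError(f"{server} has no data for {name}; trace: {trace}")
        # Use the most specific (longest) matching zone, as a real server would.
        kind, value = HIERARCHY[server][max(zones, key=len)]
        if kind == "answer":
            return value, trace
        server = value  # referral: ask the next server down the hierarchy
    raise LookupError(f"too many referrals for {name}; trace: {trace}")
```

Calling `resolve("www.example.com.")` returns `("192.0.2.10", ["root", "tld-com", "ns1.example-dns.net"])`: three hops before a single address comes back, and a failure at any one of them breaks the lookup.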
Most web-scale enterprises use third-party DNS service providers to avoid the headache of managing their own global DNS infrastructure. Many rely on providers such as Dyn, Cloudflare, Neustar, AWS, or NS1, which deliver DNS under a Platform-as-a-Service (PaaS) model commonly referred to as “managed DNS.” Since the 2016 Dyn outage, many businesses have also adopted a multi-DNS strategy, i.e., using a secondary provider alongside a primary one, so that if one DNS vendor has an outage, their services remain available to end users through the other.
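Why does a second provider help? A domain’s delegation lists several nameservers, and a recursive resolver that gets no reply from one can retry another from the same set. The sketch below models that idea with hypothetical nameserver hostnames and provider labels; it assumes both providers serve identical zone data (kept in sync, for example, via zone transfers or each provider’s API) and ignores the extra latency that failed attempts add.

```python
from typing import Dict, List, Set

# Hypothetical delegation for a domain using two managed DNS providers.
# The hostnames and provider labels are placeholders, not real vendors' servers.
NS_RECORDS: List[str] = [
    "ns1.provider-a.example", "ns2.provider-a.example",
    "ns1.provider-b.example", "ns2.provider-b.example",
]
PROVIDER_OF: Dict[str, str] = {
    "ns1.provider-a.example": "A", "ns2.provider-a.example": "A",
    "ns1.provider-b.example": "B", "ns2.provider-b.example": "B",
}


def usable_nameservers(ns_records: List[str], provider_of: Dict[str, str],
                       down: Set[str]) -> List[str]:
    """Return the delegated nameservers whose provider is not in an outage."""
    return [ns for ns in ns_records if provider_of[ns] not in down]


def is_resolvable(ns_records: List[str], provider_of: Dict[str, str],
                  down: Set[str]) -> bool:
    # A resolver that times out on one nameserver can retry another from the
    # same NS set, so a single healthy nameserver is enough to get an answer.
    return bool(usable_nameservers(ns_records, provider_of, down))
```

With both providers delegated, `is_resolvable(NS_RECORDS, PROVIDER_OF, {"A"})` is `True`; with only provider A’s two nameservers delegated, the same outage makes the domain unresolvable.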
DNS Vendor Consolidation
There have been a number of recent changes in the industry among managed DNS providers, including Oracle’s acquisition of Dyn and Neustar’s acquisition of Verisign’s DNS contracts.
Oracle has now announced it is essentially shutting down Dyn, folding its web application security, email delivery, and “best-in-class DNS” into its own native services to be offered only via the Oracle Cloud Infrastructure (OCI) platform. Standalone Dyn services, aside from Internet Intelligence, will no longer be available as of May 2020.
The new Oracle Cloud Infrastructure DNS service won’t support zone transfers to external nameservers (key to one implementation of a multi-DNS architecture), Webhop (HTTP redirect), or DNSSEC. DNSSEC strengthens authentication in DNS, which is otherwise weak, by using digital signatures based on public key cryptography. Earlier this year, ICANN called for maximum deployment of DNSSEC across all unsecured domain names, saying, “In order for the Internet to have widespread security, DNSSEC needs to be widely deployed.” Further, rumors are swirling that a sizeable number of ex-Dyn staff will be let go, potentially leaving behind generalists rather than support experts.
Many people are now wondering whether Oracle’s DNS service will be good enough. All these changes can lead to “tremendous risk for the installed base whose businesses depend on those services,” says Brian Zeman, COO at NS1, in a recent blog post on the Oracle announcement. On the other hand, Verisign’s exit from managed DNS and the move of its contracts to UltraDNS has worked out OK so far. However, some ITOps and NetOps teams still recall the challenges they had with UltraDNS and why they switched to Dyn in the first place.
At any rate, consolidation is happening across the DNS landscape, shrinking the number of managed DNS providers that offer a global Anycast footprint (designed to reduce latency) along with the features most modern global digital services require.
Whether you have a single-DNS or multi-DNS strategy (we highly recommend the latter), you will be asking whether to stay with your current vendor or switch to a new one. We faced the same decision here at Catchpoint and have already implemented a plan to keep a reliable multi-DNS architecture in place.
Synthetic Monitoring is a Must
Most people think they just have to pick one or more reliable DNS vendors and move right on to the next vendor or tool evaluation. In reality, you are far from done: you have just tied your success to the reliability and availability of another company. Now you must also constantly monitor the DNS provider to ensure you are getting reliable service.
Many things can go wrong within the multi-step DNS process. Most DNS vendors use Anycast for their DNS infrastructure to keep DNS resolution latency low. However, Anycast spreads their infrastructure across a large global footprint, and with it comes the risk of micro-outages, in which users in specific regions are unable to resolve your domain and access your service.
One common monitoring strategy is to rely on Real User Monitoring (RUM); however, RUM doesn’t give you sufficient visibility into outages or a steady measurement for SLAs and benchmarking, because it collects data only when users visit your pages. This means you can’t detect issues quickly enough (or at all), nor properly troubleshoot them.
Instead, you need 24/7 synthetic monitoring from as many locations as possible, especially in the key geographies where your end users reside. Catchpoint’s large number of global testing locations (called nodes) is one of our greatest strengths, as it offers vantage points across backbone, broadband, cloud, last mile, and wireless providers. These nodes provide visibility not only from the geographies that matter to you, but also from key transit, broadband, and consumer ISPs, so you can detect those local outages.
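Why do vantage points matter so much for micro-outages? An overall failure rate can look like noise while one region is almost completely down. The sketch below groups synthetic DNS probe results by vantage point and flags any whose failure rate crosses a threshold. The vantage-point names, the sample results, and the 50% threshold are all illustrative assumptions, not Catchpoint data or defaults.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One synthetic DNS probe result: (vantage point, resolved successfully?).
# Vantage-point names are illustrative placeholders.
ProbeResult = Tuple[str, bool]


def failure_rates(results: Iterable[ProbeResult]) -> Dict[str, float]:
    """Fraction of failed DNS probes for each vantage point."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for vantage, ok in results:
        totals[vantage] += 1
        if not ok:
            failures[vantage] += 1
    return {v: failures[v] / totals[v] for v in totals}


def micro_outages(results: Iterable[ProbeResult], threshold: float = 0.5) -> List[str]:
    """Vantage points whose failure rate is at or above `threshold`, sorted by name."""
    return sorted(v for v, rate in failure_rates(results).items() if rate >= threshold)


# Hypothetical sample: two healthy regions and one with a local outage.
results: List[ProbeResult] = (
    [("us-east", True)] * 4 + [("eu-west", True)] * 4
    + [("ap-south", False)] * 3 + [("ap-south", True)]
)
```

Across all twelve sample probes, only 3 fail (25%), but `micro_outages(results)` returns `["ap-south"]`, where 75% of lookups fail. Without probes in that region, the outage would not show up at all.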
Synthetic monitoring is essential for DNS. It allows you to measure the performance of all the nameservers involved and detect errors anywhere along the DNS resolution chain. It also gives you a window into the resource records served by DNS servers (from MX to CNAME records), which enables you to diagnose the specific cause(s) of an error, such as misconfiguration, DNS cache poisoning, or insecure zone transfers.
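Monitoring a nameserver directly means sending it a real DNS query and inspecting the reply. The sketch below builds a single-question query in the RFC 1035 wire format and decodes the fixed 12-byte header of a response (transaction ID, authoritative-answer flag, response code, answer count). It is a minimal illustration, not Catchpoint’s implementation. The function names are hypothetical, it does not parse answer records, and it leaves the recursion-desired bit off by default, on the assumption that you are querying an authoritative server rather than a recursive resolver.

```python
import struct
from typing import Tuple

QTYPE = {"A": 1, "NS": 2, "CNAME": 5, "SOA": 6, "MX": 15, "TXT": 16, "AAAA": 28}
RCODE = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:  # RFC 1035 label length limit
            raise ValueError(f"invalid label {label!r}")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, qtype: str, query_id: int, recursion_desired: bool = False) -> bytes:
    """Build a single-question DNS query message (header + question, class IN)."""
    flags = 0x0100 if recursion_desired else 0x0000  # RD bit
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    question = encode_name(name) + struct.pack("!HH", QTYPE[qtype], 1)
    return header + question


def parse_header(message: bytes) -> Tuple[int, bool, bool, str, int]:
    """Return (id, is_response, authoritative, rcode_name, answer_count)."""
    query_id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", message[:12])
    is_response = bool(flags & 0x8000)   # QR bit
    authoritative = bool(flags & 0x0400)  # AA bit
    rcode = flags & 0x000F
    return query_id, is_response, authoritative, RCODE.get(rcode, f"RCODE{rcode}"), ancount
```

To actually send the query, you would write the bytes to UDP port 53 on the nameserver under test (for example with a `socket.SOCK_DGRAM` socket and `sendto`) and pass the reply to `parse_header`. An `NXDOMAIN` or `SERVFAIL` from one nameserver, but not its siblings, is the kind of misconfiguration signal described above.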
As we’ve recently been discussing, it is also critical to monitor your Service Level Agreements (SLAs), and not simply rely on the vendor’s assessment of their own service, which might not bring the accountability you require as a buyer.
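Holding a vendor to an SLA comes down to simple arithmetic over your own measurements. The sketch below assumes one synthetic probe per minute over a 30-day window (43,200 minutes) and uses 99.99% purely as an illustrative target, not any vendor’s actual commitment. At that target, the permitted downtime is 43,200 × 0.0001 = 4.32 minutes per month.

```python
from typing import Sequence


def availability(probe_ok: Sequence[bool]) -> float:
    """Percentage of successful probes in the measurement window."""
    if not probe_ok:
        raise ValueError("no probes in window")
    return 100.0 * sum(probe_ok) / len(probe_ok)


def allowed_downtime_minutes(sla_percent: float, window_minutes: int) -> float:
    """Minutes of downtime the SLA permits over the window."""
    return window_minutes * (1 - sla_percent / 100.0)


def meets_sla(probe_ok: Sequence[bool], sla_percent: float) -> bool:
    """True if measured availability is at or above the SLA target."""
    return availability(probe_ok) >= sla_percent
```

With one-minute probes, four failed minutes in the month still meets a 99.99% target, while a fifth one (about 99.988% availability) breaches it. That is why it matters where those probes run from: the same failures seen from your users’ networks may never appear in measurements taken next to the vendor’s servers.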
The question of where you measure SLAs from is not resolved simply by using a third party; you need to make sure your monitoring service is not measuring from the wrong vantage point. This is particularly true if you are considering one of the big cloud services as your new DNS vendor.
If you’re thinking, for instance, that Amazon Route 53 will replace your current DNS or become the new vendor in your multi-DNS strategy, it’s important to realize that “Synthetics” monitoring from an APM vendor like Dynatrace, New Relic, AppDynamics, or Datadog would be monitoring the DNS from Amazon’s own datacenters. Latency will be almost zero, and availability will be the highest it possibly could be, because the synthetic monitoring agent does not have to cross the Internet; its queries travel just to the next room down, in the same physical location and on the same network.
What matters is how reachable the service is for your end users, not how reachable it is from Amazon. Monitoring from the cloud simply doesn’t make sense; your end users are not there. If you use Amazon or Azure for DNS, you certainly don’t want to also use them for monitoring.
Furthermore, you can’t monitor DNS solely through DNS metrics inferred from monitoring HTTP URLs, because then you are not truly monitoring the DNS service and the vendor infrastructure providing it. You are instead observing DNS resolution as it passes through multiple layers: authoritative name servers, recursive resolvers, and OS DNS resolution. This means DNS TTL (time to live) comes into play, as does caching of DNS responses by the user’s OS, browser, and recursive resolver, which is shared across users (for more, see our free ebook on the basics of DNS).
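The effect of TTL caching on HTTP-based checks can be shown with a toy cache. In the sketch below, the hypothetical `CachingResolver` stands in for any caching layer (OS, browser, or shared recursive resolver): it only queries upstream when its cached answer has expired. The `upstream` callable represents the authoritative servers and the injectable `clock` makes time controllable; both are assumptions for illustration, and real caches also handle negative answers, TTL caps, and prefetching.

```python
from typing import Callable, Dict, Tuple


class CachingResolver:
    """Toy DNS cache that honors record TTLs.

    `upstream(name)` stands in for a query to the authoritative servers and
    returns (address, ttl_seconds); `clock()` returns the current time in seconds.
    """

    def __init__(self, upstream: Callable[[str], Tuple[str, int]],
                 clock: Callable[[], float]) -> None:
        self.upstream = upstream
        self.clock = clock
        self.cache: Dict[str, Tuple[str, float]] = {}  # name -> (address, expires_at)
        self.upstream_queries = 0

    def lookup(self, name: str) -> str:
        now = self.clock()
        cached = self.cache.get(name)
        if cached and cached[1] > now:
            return cached[0]  # served from cache: authoritative servers never see it
        self.upstream_queries += 1
        address, ttl = self.upstream(name)  # may raise if upstream is down
        self.cache[name] = (address, now + ttl)
        return address
```

With a 300-second TTL, an outage of the authoritative servers that begins right after a lookup stays invisible to anything resolving through this cache for up to 300 seconds, so an HTTP check keeps passing. A monitor that queries the authoritative servers directly would see the failure on its next query.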
That’s why at Catchpoint, we have a monitor specifically focused on DNS. It allows you to trace queries through the complicated DNS hierarchy, pinpoint issues, and resolve problems with your vendor, whoever they may be, amid these changing times.
DNS is the first interaction a customer has with your brand; it’s the most critical and, at the same time, the most fragile, largely because DNS queries usually travel over UDP, which offers no delivery guarantees.
I have always loved this joke: “I would tell you a joke about UDP, but you probably wouldn’t get it.”