3 Lessons from a DNS Resolution Failure Incident

Catchpoint recently helped a major eCommerce customer troubleshoot a DNS resolution failure that offers lessons for all of us on how best to monitor DNS.

Whether you are a Site Reliability or Network Engineer, or simply involved in monitoring a digital service, you know by now that if DNS is not working properly – your users are experiencing an outage. However, despite its importance in ensuring the resilience and availability of the web, DNS is often not monitored correctly, which can mean undetected outages and any associated ripple effects on your business.

Moreover, most companies today do not maintain their own DNS infrastructure; they outsource DNS to third-party companies known as “Managed DNS Providers”. Companies rely on the provider to operate and maintain a reliable DNS infrastructure, and on Service Level Agreements to ensure the provider delivers. Trusting your vendors is important (and an agreement helps), but verifying that they hold up their end of the bargain is even more important – and that is where monitoring DNS correctly becomes key to your success.

To reduce MTTR when a DNS resolution failure happens, you need a scalpel, not a sword

At Catchpoint, monitoring DNS is a core use case for all our customers. We quickly detect any major outage or misconfiguration and help our customers enhance their monitoring strategy for this complex and fragile technology.

We recently helped a major eCommerce customer (a billion-dollar company in the S&P 500, headquartered in the US) troubleshoot an interesting DNS issue that we can all learn from in terms of how to better monitor our services. Through Catchpoint testing, the eCommerce company discovered that they had intermittent DNS resolution failures on a specific website of theirs from specific locations in the US.  

We drilled down into the problem they experienced, looking across the entire resolution chain to pinpoint where the problem occurred, identified how to fix it, and worked together on expanding their monitoring strategy going forward.  

In this blog, we will share three key DNS monitoring and observability lessons that every SRE, DevOps, and NetOps engineer should pay attention to.

Three key lessons

1. Monitor the CNAME to a Third Party, Not Just Your DNS

Companies often rely on third-party SaaS providers for key aspects of their digital presence. At Catchpoint we rely on two CDNs, a hosted CMS, a marketing automation service, and a career and resume management service – all of them sit behind subdomains of “catchpoint.com” and rely on CNAME records to point to the service. We monitor every single one of those services and have playbooks on what to do when any one of them fails.

In this case, our customer relied on a third-party SaaS provider for their career page, and the page’s domain simply had a CNAME pointing to the provider. They were monitoring the service and noticed intermittent DNS failures, which they could see were not caused by their own DNS but by something else.

The company’s end user would visit “career.company.com”, and behind the scenes the domain has a CNAME to “[nameofcompany].phenompeople.com”. The primary domain (managed by the customer) did not rely on DNSSEC, but the CNAME domain did implement DNSSEC.
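To illustrate the idea of watching where a branded hostname actually points, here is a minimal Python sketch. It is not how Catchpoint implements its tests: the function names, the example hostnames, and the injectable `lookup` parameter are all assumptions made for illustration, and the canonical name reported by the system resolver can vary by platform.

```python
import socket


def canonical_target(hostname, lookup=socket.gethostbyname_ex):
    """Return (canonical_name, addresses) for hostname.

    `lookup` defaults to the system resolver. It is injectable (an
    assumption of this sketch) so the logic can run without network access.
    """
    canonical, _aliases, addresses = lookup(hostname)
    return canonical.rstrip(".").lower(), list(addresses)


def points_outside_zone(hostname, own_zone, lookup=socket.gethostbyname_ex):
    """Return True when the canonical name is not inside own_zone.

    A True result suggests the hostname is served through a CNAME to a
    third party, whose DNS then deserves its own monitoring.
    """
    canonical, _ = canonical_target(hostname, lookup)
    zone = own_zone.rstrip(".").lower()
    return not (canonical == zone or canonical.endswith("." + zone))
```

Calling `points_outside_zone("career.company.com", "company.com")` against a live resolver would flag a hostname like the one in this incident, since its canonical name lives under the SaaS provider's domain rather than the customer's.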

DNSSEC is a feature introduced by the IETF to “... provide origin authentication and integrity protection for DNS data, as well as a means of public key distribution” (RFC 4033). It consists of a sequence of DNS requests and replies aimed at verifying the authenticity of the records being returned using cryptography (via RRSIG and DNSKEY records) and a chain of trust (via DS records).

During the resolution of the domain, the end user’s machine would reach out to its recursive DNS resolver, which would recurse through the DNS hierarchy, validate DNSSEC, and finally send back the records for the domain. The recursive resolver would first query a .com GTLD authoritative name server, receive the list of authoritative name servers for phenompeople.com in response, and then query one of those name servers.

As you can see from the following snippet, the response from the GTLD server contains an RRSIG record and two DS records – meaning that phenompeople.com is signed and the resolver is supposed to perform DNSSEC validation.

Response of the GTLD server containing an RRSIG record and two DS records

This is correctly performed by the resolver, which then queries one of the Phenompeople domain authoritative name servers – now identified as pdns10.domaincontrol.com (173.201.78.54) – to request its DNSKEY record, to be able to proceed with the DNSSEC validation.  

2. Monitor TCP not just UDP  

DNS relies primarily on UDP; however, it also uses TCP as a fallback in certain scenarios. Therefore, monitoring DNS by simply sending a DNS query over UDP is not sufficient.

And here comes the problem with this instance of DNS failure.

When the query for DNSKEY is sent, the answer provided by pdns10.domaincontrol.com is too large to fit in a UDP response, so the authoritative name server replies with a message that has the TC (Truncated) flag set, as you can see from the following snippet.

The authoritative name server replies with a message with the TC (Truncated) flag being set
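A monitor can detect this condition by inspecting the TC bit in the 12-byte DNS header. The following is a minimal Python sketch of that check, under a few assumptions: the helper names are illustrative, the query deliberately omits EDNS to stay short, and the server address is whatever authoritative name server you choose to probe.

```python
import socket
import struct

TC_FLAG = 0x0200      # "TrunCation" bit in the 16-bit header flags field
QTYPE_DNSKEY = 48     # DNSKEY record type


def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name, qtype, query_id=0x1234):
    """Build a plain DNS query (no EDNS, recursion not requested)."""
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    return header + question


def is_truncated(message):
    """Return True when the TC bit is set in a DNS message header."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 bytes; message too short")
    (flags,) = struct.unpack("!H", message[2:4])
    return bool(flags & TC_FLAG)


def udp_reply_is_truncated(server_ip, name, qtype=QTYPE_DNSKEY, timeout=3.0):
    """Send one UDP query and report whether the reply asks for a TCP retry."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qtype), (server_ip, 53))
        reply, _ = sock.recvfrom(4096)
    return is_truncated(reply)
```

A truncated reply is not a failure in itself. It simply means the monitor, like a real resolver, must repeat the query over TCP, which is the step that failed in this incident.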

Consequently, the recursive resolver triggers the very same query to the very same server but now using the Transmission Control Protocol (TCP). However, the TCP connection is torn down by the server. See packet number 20 in the following snippet.

The TCP connection is torn down by the server

We observed the same thing happening when we manually ran a DNSSEC-related dig query from that very same machine against the same authoritative name server.

Similar observation when we manually performed a dig related to DNSSEC from that very same machine to the same server

In this case, we can see from the following “tcpdump” trace that the server replied (5) with an initial TCP packet of size 1460 – exactly the MSS value negotiated in the TCP handshake (2). However, before sending the second TCP packet containing the trailing data of the DNS reply, the server tore down the TCP session (7).

tcpdump trace
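DNS over TCP prefixes every message with a two-byte length field, so a client can tell when the connection ends before the announced number of bytes has arrived. The sketch below shows one way a monitor might surface that as a distinct error. The exception and function names are assumptions made for illustration, not part of any existing tool.

```python
import socket
import struct


class IncompleteDNSResponse(Exception):
    """The peer closed the TCP connection before the full message arrived."""


def recv_exact(sock, count):
    """Read exactly `count` bytes or raise IncompleteDNSResponse."""
    chunks = []
    remaining = count
    while remaining > 0:
        try:
            chunk = sock.recv(remaining)
        except ConnectionResetError as exc:
            raise IncompleteDNSResponse(
                f"connection reset with {remaining} of {count} bytes missing"
            ) from exc
        if not chunk:  # orderly close before all bytes were delivered
            raise IncompleteDNSResponse(
                f"connection closed with {remaining} of {count} bytes missing"
            )
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_tcp_dns_message(sock):
    """Read one length-prefixed DNS message from a connected TCP socket."""
    (length,) = struct.unpack("!H", recv_exact(sock, 2))
    return recv_exact(sock, length)


def send_tcp_dns_message(sock, message):
    """Send one DNS message with the two-byte length prefix TCP requires."""
    sock.sendall(struct.pack("!H", len(message)) + message)
```

Reporting "closed after N of M bytes" rather than a generic timeout makes it much easier to recognise the pattern seen in the trace, where the first segment arrives and the rest never does.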

3. Monitor for Anycast, not just for your cloud

DNS is quite complex, and the need for speed has resulted in almost every DNS infrastructure relying on Anycast to quickly route users to the closest DNS servers. The problem with this is that when a managed DNS provider has a bug, misconfiguration, or anything else that could impact their service, the issue may affect only some of their many Anycast locations, resulting in micro-outages for you.

A DNS query from New York might reach different servers than one from Chicago. But this isn’t just about geography: querying from the same city with AT&T versus Verizon as the ISP can land on different DNS servers in different datacenters. This complicates monitoring tremendously, since the most common forms of DNS monitoring rely on probes from a company’s own datacenter or cloud, or on the browser synthetics offerings of APM (observability) platforms, which are located only on AWS and/or Google Cloud and hence provide extremely limited visibility into what could be causing DNS outages.

Going back to our customer’s case, we noticed that the very same query performed from Italy or from Las Vegas leads to a proper reply. The managed DNS provider is relying on Anycast for their name servers and only a subset of their locations (in the US) have DNS servers that can’t properly handle DNS queries over TCP.

DNS query issue
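When the same query is run from many vantage points, the useful signal is the set of locations where UDP works but TCP does not. The following Python sketch shows one way to extract that set from probe results. The tuple layout and the vantage-point labels are placeholders invented for this example, not data from the incident.

```python
from collections import defaultdict


def tcp_only_failures(results):
    """Return vantage points where UDP succeeded but TCP failed.

    `results` is an iterable of (vantage_point, transport, succeeded)
    tuples, where transport is "udp" or "tcp".
    """
    by_vantage = defaultdict(dict)
    for vantage, transport, succeeded in results:
        # A vantage point counts as healthy on a transport only if
        # every probe on that transport succeeded.
        previous = by_vantage[vantage].get(transport, True)
        by_vantage[vantage][transport] = previous and succeeded
    return sorted(
        vantage
        for vantage, status in by_vantage.items()
        if status.get("udp") is True and status.get("tcp") is False
    )


# Placeholder probe results; the labels are purely illustrative.
sample = [
    ("vantage-1", "udp", True), ("vantage-1", "tcp", True),
    ("vantage-2", "udp", True), ("vantage-2", "tcp", False),
    ("vantage-3", "udp", False), ("vantage-3", "tcp", False),
]
print(tcp_only_failures(sample))
```

Grouping by vantage point rather than by name server matters here because, with Anycast, one name server address can be answered by different physical servers depending on where the query originates.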

We ran a few more tests against that name server from one of the nodes showing the issue, and found that dig fails only when the +dnssec option is set. That option sets the DO bit in the DNS request, which asks the responding server to return RRSIG records along with the requested records. The following snippet shows a successful dig request for DNSKEY over TCP.

DNS test
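For readers who want to see what the DO bit looks like on the wire, here is a minimal Python sketch that builds a query with an EDNS0 OPT pseudo-record, with or without the DO flag. The function names and the 1232-byte advertised UDP payload size are assumptions chosen for illustration.

```python
import struct

QTYPE_DNSKEY = 48  # DNSKEY record type
QTYPE_OPT = 41     # EDNS0 OPT pseudo-record type
DO_BIT = 0x8000    # "DNSSEC OK" flag in the EDNS0 flags field


def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_edns_query(name, qtype, dnssec_ok, udp_payload=1232, query_id=0x1234):
    """Build a DNS query carrying an OPT record, optionally with DO set."""
    # ARCOUNT is 1 because the OPT record travels in the additional section.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    edns_flags = DO_BIT if dnssec_ok else 0
    # OPT layout: root name, TYPE=41, CLASS=advertised UDP payload size,
    # then extended RCODE (1 byte), version (1 byte), flags (2 bytes),
    # and RDLENGTH=0 because no EDNS options are attached.
    opt = b"\x00" + struct.pack(
        "!HHBBHH", QTYPE_OPT, udp_payload, 0, 0, edns_flags, 0
    )
    return header + question + opt


def has_do_bit(query):
    """Check the DO bit of a query produced by build_edns_query."""
    # The OPT record is the final 11 bytes; its flags sit just before RDLENGTH.
    (flags,) = struct.unpack("!H", query[-4:-2])
    return bool(flags & DO_BIT)
```

The two variants differ by a single bit in the request, yet the DO variant asks for signatures as well, which is what makes the reply so much larger.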

We believe that the issue on pdns10.domaincontrol.com is somehow related to the size of the reply being generated. All the replies received from that server on the nodes showing the issue are indeed smaller than the MSS set in the TCP session (1440 bytes), and DNSSEC is known to generate large messages: the reply we received while running dig from Italy was 1813 bytes. As further evidence, we found that the same issue occurs when running dig with type ANY from a node seeing the issue – a query type that usually leads to large responses (when enabled).

DNS test issue
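The size argument can be checked with simple arithmetic. The sketch below assumes that the 1813 bytes reported from Italy is the DNS message size, that the two-byte TCP length prefix is added on top, and that no TCP options reduce the per-segment payload. The function name is illustrative.

```python
import math

TCP_LENGTH_PREFIX = 2  # DNS over TCP prepends a two-byte message length


def tcp_segments_needed(dns_message_size, mss):
    """Return how many TCP segments a DNS reply of this size needs."""
    if dns_message_size < 0 or mss <= 0:
        raise ValueError("sizes must be non-negative and MSS positive")
    return math.ceil((dns_message_size + TCP_LENGTH_PREFIX) / mss)


# The 1813-byte reply does not fit in one segment at either MSS value
# mentioned in this post, so a second segment is required.
print(tcp_segments_needed(1813, 1460))
print(tcp_segments_needed(1813, 1440))
```

Under these assumptions, any reply that needs a second segment depends on the server sending it, which is consistent with small replies succeeding and large DNSSEC or ANY replies failing.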

Key takeaways  

What should you be monitoring in relation to DNS? First, we hope that you realize by now that DNS is much more complex than it looks. This incident demonstrates three clear takeaways:

  1. Monitor the key CNAMEs behind or in front of your domains. Your DNS configuration and infrastructure might be fine, but your business relies on SaaS providers which might still have an issue that impacts your users and business. The customer doesn’t know if it’s your CDN or your SaaS provider having the problem. They just see your brand in the affected domain.
  2. Monitor TCP, not just UDP. DNSSEC failures are not just about signatures. You also need to know whether the name servers properly handle DNS over TCP. Make sure to monitor them.
  3. Monitor for Anycast, not just for your cloud/datacenter. Authoritative DNS servers almost always rely on Anycast, so don’t just monitor from a few browser synthetics locations on AWS and assume the rest of the world can reach you. You need to monitor from a set of locations as widely dispersed as your customers.

Learn more

Read more about the role DNS played in the 15-hour Slack outage last year and deepen your understanding of how to resolve DNS issues more quickly.

We’ll be happy to demonstrate exactly how Catchpoint can help your organization if you’d like to try it yourself.
