As a digital society built around data and devices, we have reached a pivotal juncture where the customer experience is the product. Services must be accessible to everyone, anywhere and at any time.
This user experience journey begins with resolving your domain name. DNS has been around for decades; however, the way we use it has grown in complexity over the years, and the need for better monitoring is critical. Not monitoring DNS resolution is like having a Lamborghini without fuel or wheels.
According to common industry guidance, DNS resolution time should be less than 50 ms. High DNS resolution times on your website can create major performance bottlenecks. Going back to the example of the Lamborghini, neglecting high DNS resolution time is akin to neglecting the quality of the fuel in your car, which eventually ruins the entire engine.
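As a rough way to check a hostname against that 50 ms figure, the sketch below times a single lookup through the operating system's resolver. The function names and the injectable `resolver` parameter are assumptions made for illustration; they are not part of any Catchpoint tooling, and a local cache can make repeat lookups look faster than a cold one.

```python
import socket
import time

DNS_THRESHOLD_MS = 50.0  # the figure quoted in the text


def time_resolution(hostname, resolver=socket.getaddrinfo):
    """Return the milliseconds one lookup takes through the given resolver.

    By default this uses the system resolver, so the result includes any
    local or upstream caching. Raises socket.gaierror if the lookup fails.
    """
    start = time.perf_counter()
    resolver(hostname, None)
    return (time.perf_counter() - start) * 1000.0


def exceeds_threshold(elapsed_ms, threshold_ms=DNS_THRESHOLD_MS):
    """True when a measured lookup time is above the threshold."""
    return elapsed_ms > threshold_ms
```

Calling `time_resolution("example.com")` and passing the result to `exceeds_threshold` gives a yes/no answer for one sample; a monitoring setup would collect many samples from many locations before alerting.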
Finding issues before they impact the end user experience is why synthetic monitoring, with appropriate alerting and notification thresholds configured, is an important part of the IT Operations or SRE toolset. It’s not enough to be notified that an issue is occurring; you also need the ability to quickly identify which component, element, or provider is contributing to the incident.
The Catchpoint technical services team often encounters anomalies in performance and provides insights and data to customers to help them quickly resolve incidents. Recently we noticed fluctuations in web page response times for an online payment gateway. Nodes in India were showing higher-than-normal response times.
Drilling down into the data, we identified the issue as being related to high DNS resolution times in India. The chart below shows high DNS resolution times for two different domains. The domain on the left serves auxiliary content on the web page, and the domain on the right is the primary URL used to access the application. Both domains were impacted.
This online payment company uses a content delivery network (CDN) to serve its content across the globe. Some CDNs use DNS redirection to direct users to the appropriate edge servers, which allows web pages and assets to load faster.
To investigate the issue further, additional DNS monitoring was required. We used the DNS experience test to analyse the overall DNS response trend. The DNS experience test allows you to monitor each level of DNS resolution for a domain by emulating a recursive DNS resolver that does not cache records. DNS experience tests operate slightly differently from a real user or our web monitor so that they can identify which nameserver is problematic.
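To make "each level of DNS resolution" concrete, the sketch below lists the names a resolver without a cache may have to ask about, from the root down, and picks out the level to investigate from a set of per-level timings. Both helper names are hypothetical, and the sketch treats every label boundary as a possible delegation point; real zones do not always delegate at every label.

```python
def delegation_levels(hostname):
    """Return the names from the root down to the full hostname.

    Each entry is a point where a delegation (zone cut) may exist, so a
    resolver with no cache may need to query a different set of
    nameservers at each one.
    """
    labels = [label for label in hostname.rstrip(".").split(".") if label]
    levels = ["."]
    for i in range(len(labels) - 1, -1, -1):
        levels.append(".".join(labels[i:]) + ".")
    return levels


def worst_level(timings_ms):
    """Pick the level to look at first.

    `timings_ms` maps a level to its response time in ms, or to None if
    no nameserver answered. The first failed level wins; otherwise the
    slowest level is returned.
    """
    for level, elapsed in timings_ms.items():
        if elapsed is None:
            return level
    return max(timings_ms, key=timings_ms.get)


print(delegation_levels("www.example.com"))
# ['.', 'com.', 'example.com.', 'www.example.com.']
```

Measuring each level separately is what makes it possible to say which nameserver is slow or unresponsive, rather than only that the lookup as a whole was slow.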
The data highlighted high DNS resolution time from two major ISPs in India.
The above screenshot shows a drop in availability across the Airtel and Reliance ISPs. This indicates that DNS resolution failed at some level because one of the authoritative nameservers did not respond to the DNS query.
This analysis corresponds to the failed data points in the scatterplot chart below: the authoritative nameservers in the same IP subnet fail to respond to the query when users access the site from Airtel and Reliance.
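Spotting that failing nameservers share a subnet is easier when their addresses are grouped by network. The following is a minimal sketch using Python's standard `ipaddress` module; the function name and the /24 prefix length are assumptions, and the addresses in the usage lines come from the reserved documentation ranges, not from the incident.

```python
import ipaddress
from collections import defaultdict


def group_by_subnet(nameserver_ips, prefix_len=24):
    """Group IPv4 addresses by the network they fall in.

    Returns a dict mapping a network in CIDR notation to the list of
    addresses inside it, in the order they were given.
    """
    groups = defaultdict(list)
    for ip in nameserver_ips:
        # strict=False lets us pass a host address and get its network back
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[str(network)].append(ip)
    return dict(groups)


failing = ["192.0.2.10", "192.0.2.77", "198.51.100.5"]  # placeholder addresses
print(group_by_subnet(failing))
# {'192.0.2.0/24': ['192.0.2.10', '192.0.2.77'], '198.51.100.0/24': ['198.51.100.5']}
```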
The next step in the troubleshooting process is to query a name server directly with our DNS direct test.
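Outside of a monitoring product, a direct query of this kind can be approximated by sending a DNS question straight to one nameserver's IP address over UDP, with recursion turned off, and timing the reply. The sketch below is a minimal illustration using only the Python standard library; the function names, the 1.5-second default timeout, and the lack of TCP fallback are assumptions, and it is not how the DNS direct test itself is implemented.

```python
import socket
import struct
import time


def encode_name(name):
    """Encode a hostname as DNS wire-format labels, ending in a zero byte."""
    wire = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            wire += bytes([len(raw)]) + raw
    return wire + b"\x00"


def build_query(name, qtype=1, query_id=0x1234):
    """Build a non-recursive query (RD flag clear). qtype 1 = A, 5 = CNAME."""
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class 1 = IN
    return header + question


def query_direct(server_ip, name, qtype=1, timeout_s=1.5):
    """Send one query to server_ip on UDP port 53.

    Returns the elapsed milliseconds, or None if no matching reply
    arrived before the timeout.
    """
    packet = build_query(name, qtype)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        start = time.perf_counter()
        sock.sendto(packet, (server_ip, 53))
        try:
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return None
    if reply[:2] != packet[:2]:  # the reply ID must match the query ID
        return None
    return (time.perf_counter() - start) * 1000.0
```

Running `query_direct` against each authoritative nameserver in turn, from more than one network, shows whether a specific server is silent or whether the problem depends on where the query originates.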
The DNS direct test shows that the authoritative nameserver fails to respond to queries for the CNAME provided by the CDN provider when the users originate from the Airtel and Reliance ISPs.
The high response times from the Airtel and Reliance ISPs indicate that the query was attempted three times at 1.5-second intervals before timing out.
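The cost of that retry pattern is easy to work out. Assuming each attempt waits the full 1.5 seconds before giving up, three unanswered attempts add 4.5 seconds to the lookup. The sketch below simulates this without touching the network; `resolve_with_retries` and `never_answers` are hypothetical names used only for illustration.

```python
def resolve_with_retries(attempt, attempts=3, timeout_s=1.5):
    """Call attempt(timeout_s) up to `attempts` times.

    `attempt` returns the seconds it took to get an answer, or None if
    it timed out. Returns (answered, total_seconds_spent).
    """
    total = 0.0
    for _ in range(attempts):
        elapsed = attempt(timeout_s)
        if elapsed is not None:
            return True, total + elapsed
        total += timeout_s  # a timed-out attempt costs the full timeout
    return False, total


def never_answers(timeout_s):
    """Stand-in for a nameserver that drops every query."""
    return None


answered, waited = resolve_with_retries(never_answers)
print(answered, waited)  # False 4.5
```

Even when a later attempt succeeds, the time spent on the earlier timeouts is still added to the total, which is why a single silent nameserver can dominate DNS resolution time.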
To validate our observations, we tested against public resolvers belonging to Airtel and Tata. The webpage showed a higher DNS resolution time through the Airtel resolver than through the Tata resolver.
The screenshot above clearly shows the high DNS resolution time from Airtel.
Bharti Airtel is one of the leading ISPs and one of the major telecom companies in India. According to Statista, Airtel has 290.11 million wireless subscribers in India.
Since Airtel is among the top service providers in the telecom industry, we wanted to confirm that the issue was on the carrier side, so we ran the dig tool from a mobile phone on the Airtel and Vodafone carrier networks.
The analysis may lead you to think that this is related to the monitoring methodology we use at Catchpoint. But our goal is to provide our customers with a true understanding of their user experience by making sure every layer (ISP, CDNs, Peering…) is working as expected, by monitoring where it matters the most… “Outside the firewall”.
From the above results, we can confidently conclude that this was a REAL problem, impacting the users in India of this major global online payment system. Only they know how many transactions resulted in a poor experience.
As we mentioned earlier, the customer experience is the product. What do you do when so much of that customer experience is delegated to third-party services like CDNs, DNS, Ads…? Very simply:
Trust but Verify and Verify and Verify!