At Catchpoint, our award-winning support team aims to be a partner, not just a gateway to the tool. Earlier this month, when UltraDNS, a major DNS provider, went down, the team received nine support tickets within one hour. Our customers were experiencing outages on their websites and online services. They needed urgent help from Catchpoint to understand what was causing the disruption, so they could quickly resolve the situation or validate their own findings that it was a third-party DNS issue.
I decided to dig into the details to better understand how we supported customers during the May 6th Neustar UltraDNS outage and gave them critical observability when they needed it. To learn more, I spoke to Varun Master, Technical Service Engineer, one of the frontline members of our Customer Service team. He shared what happened and how the support team worked with our customers to diagnose it.
What Happened With UltraDNS?
Many of our clients use external DNS providers, and UltraDNS is one of them. The outages started on May 6th between about 1:45 and 1:50 p.m. EST on the US East Coast and in the US Central and US West regions. Clients that use Catchpoint to monitor their digital services, such as websites and API services, and to ensure they're running as quickly as possible, started to get alerts from Catchpoint about a large number of DNS failures. Our support team quickly got into action to help our customers.
We already knew what was going on since we were monitoring all the major infrastructure providers, including UltraDNS. This was a real outage, happening worldwide with Neustar UltraDNS. We immediately posted information on our status page to inform all our customers and keep them updated as the situation unfolded.
What Did the Velocity and Volume of Incoming Tickets Look Like?
Although we had updated our status page and the Catchpoint alerts clearly showed the error types, nine customer tickets still came in within the hour. DNS outages like this don't happen often, and they can be a significant problem for customers, particularly those with a single DNS provider. The whole ordeal lasted about an hour.
What Did Clients Experience?
The first ticket came from the director of software development at a cybersecurity firm. His first question was whether we had rolled out any changes over the last few hours. He confirmed they were seeing “errors globally” and wanted to find out what the root cause was.
Our agent joined the live chat within a few seconds and sought to clarify what was going on. The firm let him know that they were also talking to their DNS provider (UltraDNS), who then confirmed that they were indeed having issues.
A leading software-as-a-service company was the next client to reach out to us. Their Operations staff asked if there was “something going on here we should be aware of with the nodes” or if it were “a bigger network issue.” They shared the affected test link, which showed DNS failures in New York and Washington, DC. Because we never take anything for granted, our agent said they would need some time to research what was going on and created a ticket. We never assume that because vendor ABC is having issues, all the alerts we are seeing are due to that vendor.
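To illustrate that last point, here is a minimal sketch of how failing tests might be split into those plausibly explained by a known provider incident and those that still need their own investigation. It is not Catchpoint's tooling: the `FailedRun` record, the `triage` helper, the error type labels, and the `.dns-provider.example` suffix are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class FailedRun:
    """One failing test run (hypothetical record, not a Catchpoint data model)."""
    test_name: str
    error_type: str     # assumed labels, e.g. "dns", "timeout"
    nameservers: tuple  # authoritative nameservers of the tested hostname


def attributable_to_provider(run, provider_suffixes):
    """True only for a DNS failure on a hostname served by the affected provider."""
    if run.error_type != "dns":
        return False
    return any(
        ns.rstrip(".").lower().endswith(tuple(provider_suffixes))
        for ns in run.nameservers
    )


def triage(runs, provider_suffixes):
    """Split runs into (likely provider-related, needs separate investigation)."""
    likely, review = [], []
    for run in runs:
        if attributable_to_provider(run, provider_suffixes):
            likely.append(run)
        else:
            review.append(run)
    return likely, review
```

Under these assumptions, a DNS failure on a hostname served by a different provider, or a non-DNS error during the same window, lands in the second list and gets looked at separately rather than being attributed to the known outage.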
How Did the Support Team Work With Clients To Address the Problem?
We always respond immediately to any ticket that comes in. Our first objective is to make sure we understand the issue. We know what customers go through when their websites or services are down, and our support team is on standby to help them resolve the issue.
In any situation, while we are sending out responses, we are also running an investigation to find out exactly what is going on. Then we relay that information as quickly as possible to the customer. If it’s related to a third-party service, as in this instance, while we can’t do anything directly to solve it, we can tell our customers what’s going on so that they understand the root cause and can work with the provider to stay informed. We also try to offer alternatives. For instance, we might suggest they pause the alerts until things are back to normal.
Finally, once we had findings to share, we passed them on in as much detail as possible.
How Do You Ensure Support Relationships Are True Partnerships?
At every stage of the incident response, we seek to partner with the customer and carry some of their load, reducing stress by providing visibility into the network. One of the ways we did this during the UltraDNS outage was to monitor customers' tests for several hours after the situation was resolved. We also sent links to the tests in Catchpoint that we manage, which can provide data above and beyond what the average customer would have, to show that their DNS services were now running as expected.
For further information on major incidents in 2021, please check out our new report. You’ll find detailed analysis, as well as a checklist of best practices to prevent, prepare for, and respond to an outage.