DNS observability is an essential part of any Ops team’s strategy. Looking for proof? It’s happening right now.
It has been a busy week for Ops teams across the globe. Many were forced to urgently rotate SSL certificates after one of Let's Encrypt's root certificates expired.
Collaboration plays a critical role in situations like this, where one or more teams must communicate and work together to resolve an incident quickly and efficiently. Unfortunately, things got more challenging this week when Slack, one of the world's largest collaboration and messaging applications, became inaccessible for many users worldwide during the same period.
DNS misconfiguration is at the core of this issue. When DNS resolution fails, users experience outages like this one. However, you can take action to avoid business impact.
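To see what "DNS resolution fails" looks like in practice, here is a minimal sketch (Python standard library only; the hostname is just an example) of the check an affected user or a monitoring probe would perform: ask the system resolver for a name's addresses and treat a lookup error as the failure mode behind this outage.

```python
import socket

def resolve(hostname: str) -> list[str]:
    """Return the IP addresses a hostname resolves to, or an empty
    list if DNS resolution fails (the failure mode behind this outage)."""
    try:
        results = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Deduplicate addresses while preserving order.
    return list(dict.fromkeys(info[4][0] for info in results))

if __name__ == "__main__":
    ips = resolve("slack.com")
    if ips:
        print("slack.com resolves to:", ", ".join(ips))
    else:
        print("DNS resolution failed -- this is what affected users saw")
```

A probe that runs this check on a schedule and alerts on an empty result is the simplest possible DNS availability monitor.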
Let’s start by breaking down the issue that’s happening currently.
Slack Acknowledges The Issue
Users were not able to access Slack's desktop, mobile, and web applications from 15:30 UTC onwards. The outage was related to a DNS failure, which Slack later acknowledged. At the time of publication, the issue was still ongoing for some users, and at 06:57 UTC Slack announced it may take up to 24 hours to completely resolve the issue for all users.
Users were struggling to work out whether they could not access Slack because of their device, their wireless network, or their ISP's connectivity. Things got more difficult because Slack's status page was down due to the same issue.
Why Monitoring From The Cloud Isn’t Enough
During incidents like this, when operations teams are unable to collaborate efficiently with each other, things can easily get out of control. That can lead to outages that directly impact customers. IT teams in some organizations may already monitor their SaaS applications, but it would not be surprising if none of that monitoring triggered any alarms for Slack.
Most monitoring solutions are hosted on cloud instances. Monitoring applications from cloud instances leaves dangerous blind spots and does not accurately represent end user experience.
A good monitoring and observability strategy must combine observation across backbone and last mile networks. The backbone network has predefined bandwidth and consistent connectivity, which lets you monitor, measure, and benchmark application performance without network fluctuations. The last mile network, in turn, represents the availability and performance that real end users see when they access digital services from their home or office networks.
Catchpoint’s Last Mile Tests Detected DNS Issues As The Root Cause
Catchpoint’s last mile tests detected Slack DNS issues, allowing the platform to proactively notify respective teams.
Even 15 hours into the outage, some users still cannot access Slack. Those who are aware of the issue and its root cause can mitigate it by overriding their default DNS resolver with a public DNS resolver such as Cloudflare's 1.1.1.1 or Google's 8.8.8.8.
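To illustrate what overriding the resolver means at the protocol level, here is a sketch (Python standard library only, not how Catchpoint or Slack implement anything) that builds a DNS A-record query by hand, following the RFC 1035 wire format, and sends it over UDP to an explicitly chosen resolver instead of whatever the operating system is configured to use.

```python
import secrets
import socket
import struct

def build_query(hostname: str) -> bytes:
    """Build a DNS wire-format query for an A record (RFC 1035)."""
    header = struct.pack(
        ">HHHHHH",
        secrets.randbits(16),  # random transaction ID
        0x0100,                # flags: standard query, recursion desired
        1, 0, 0, 0,            # 1 question; no answer/authority/additional
    )
    # QNAME: length-prefixed labels, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

def query_resolver(hostname: str, resolver_ip: str, timeout: float = 2.0) -> bytes:
    """Send the query over UDP to a specific resolver, bypassing the default."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(hostname), (resolver_ip, 53))
        response, _ = sock.recvfrom(512)
    return response
```

For example, `query_resolver("slack.com", "8.8.8.8")` asks Google's public resolver directly, which is exactly the workaround affected users applied (usually by changing their OS network settings rather than crafting packets).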
Slack has now confirmed the outage was "caused by our own change and not related to any third-party DNS software and services." The slow recovery was related to Slack's TTL, which allowed responses to be cached for up to two days.
The lesson here? DNS may be a small service in the delivery chain, but a minor configuration mistake can take hours to recover from if your records carry a large TTL. Read more about how TTL can impact DNS responses.
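The TTL arithmetic is worth making concrete. A resolver that cached the bad record just before the fix went out keeps serving its stale copy until the full TTL expires, so the worst-case recovery time is the fix time plus the TTL. A small sketch, with illustrative numbers rather than Slack's actual configuration:

```python
from datetime import datetime, timedelta, timezone

def worst_case_recovery(fix_time: datetime, ttl_seconds: int) -> datetime:
    """A resolver that cached the bad record immediately before the fix
    keeps serving it until its copy expires: fix time + full TTL."""
    return fix_time + timedelta(seconds=ttl_seconds)

# Hypothetical fix pushed at 07:00 UTC with a two-day TTL on the record:
fix = datetime(2021, 9, 30, 7, 0, tzinfo=timezone.utc)
print(worst_case_recovery(fix, 2 * 24 * 3600))   # stale until 2021-10-02 07:00 UTC
print(worst_case_recovery(fix, 300))             # a 5-minute TTL: stale for minutes
```

This is why operators often lower TTLs to a few minutes before planned DNS changes: the cost of a mistake shrinks from days to minutes.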
Understand How to Resolve DNS Issues More Quickly
DNS is at the core of the Internet. When DNS resolution fails, users experience outages like this one. Observing the DNS of all your essential SaaS services from the cloud, the backbone, and the last mile is the only way to understand its true performance. The fix is easy, but only if you know what needs to be fixed!
Watch our DNS How-To video series to learn how to verify DNS server mapping, along with other DNS-related tips!