Learn from a real-world site outage: the need for proactive monitoring across IPv4/IPv6, DNS accuracy, and why APM tools alone can miss critical micro-outages.
Even after decades of building web applications and troubleshooting live production issues, the thrill of figuring out why some random website is failing never fades.
Last week, a colleague shared a link to ONUG’s website about their upcoming event in NYC this fall.
I clicked on the link and waited, and waited, and waited for the page to load, but it never did. Finally, after about 30 seconds, Chrome greeted me with “ERR_CONNECTION_TIMED_OUT”.
I quickly tried a few other sites (the fastest way to check whether you are connected when you are already in the browser), and every website I could think of worked fine. The link also worked for some of my colleagues, which made it even more interesting.
I was quite surprised by the issue, as you wouldn’t expect it from an organization that has “networking” in its name: ONUG stands for Open Networking User Group and focuses on IT leaders of large enterprises. So I was intrigued as to what could be so tricky that even ONUG would be impacted.
Curious as to what was going on, I launched Developer Tools and went back to ONUG’s site. It again failed to connect, and although Chrome automatically tried to reload the URL, it again greeted me with “This site can’t be reached - onug.net took too long to respond.”
Instinctively I clicked on one of the failing requests in the Developer Tools to look at what IP address the browser was connecting to but saw no IP.
Unfortunately, this reminded me that Chrome doesn’t show the IP address in Developer Tools unless it has received some response from the server. It was a sad reminder that 11 years ago we asked for a Chromium enhancement to add IP addresses to Developer Tools and their APIs for these cases, and now, almost 100 Chrome versions later, a user still cannot see which IP address the browser couldn’t connect to.
But hey, Chrome was nice enough to think that reloading the request over and over would somehow fix the problem of connecting to a server! (Recall when the solution to any failure was “reboot the machine”?)
I moved on to the next handy tools on my desktop, Ping and Traceroute, but everything worked fine getting to “162.159.135.42”, with no packet loss or high latency.
Perplexed as to what was going on, I launched the IPM platform we built at Catchpoint and started monitoring onug.net from the 1,274 locations we have globally. Within seconds the issue became clear: connections to the IPv4 address worked perfectly, but connections to the IPv6 address could not be established.
The data from the network tests clearly showed 100% failure connecting to the IPv6 address for ONUG.
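For readers who want to reproduce this kind of per-address check by hand, here is a minimal Python sketch. It is not how the Catchpoint platform works. It only illustrates the idea of testing each IP address on its own instead of letting the browser choose. The function name `check_addresses`, the default port 443, and the 5-second timeout are assumptions made for illustration.

```python
import socket


def check_addresses(addresses, port=443, timeout=5.0):
    """Try a TCP connection to each IP literal.

    Returns a dict mapping each address to None on success,
    or to a short error description on failure.
    """
    results = {}
    for address in addresses:
        try:
            # Connecting to an IP literal pins the address family, so the
            # OS cannot quietly fall back from IPv6 to IPv4 (or the reverse).
            with socket.create_connection((address, port), timeout=timeout):
                results[address] = None
        except OSError as exc:  # timeouts, refusals, unreachable networks
            results[address] = f"{type(exc).__name__}: {exc}"
    return results
```

Passing a site's IPv4 and IPv6 addresses in one call would, in a failure like this one, be expected to report success for the first and a timeout for the second. A single vantage point only tells you what that one network sees, which is why the same check from many locations is more informative.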
The network path showed that onug.net sends IPv6 traffic to Akamai Linode and IPv4 traffic to Cloudflare. The DNS for the website has both A and AAAA records. Users on IPv4 only therefore see the site working fine, but any user on dual-stack IPv4 and IPv6 is at the mercy of which record the OS and Chrome pick. If the AAAA record (IPv6) is picked, they experience the same issue I did; if the A record is picked, everything works just fine.
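To see which addresses a dual-stack client has to choose from, the A and AAAA answers can be listed side by side. The following is a rough Python sketch using only the standard library. The helper names `group_addrinfo` and `addresses_by_family` are hypothetical, and the result reflects whatever resolver the local machine uses, so it may differ from what other users receive.

```python
import socket

# Map socket address families to readable labels.
FAMILY_NAMES = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}


def group_addrinfo(infos):
    """Group getaddrinfo() result tuples into {"IPv4": [...], "IPv6": [...]}."""
    grouped = {"IPv4": [], "IPv6": []}
    for family, _type, _proto, _canonname, sockaddr in infos:
        name = FAMILY_NAMES.get(family)
        # sockaddr[0] is the IP string for both IPv4 and IPv6 entries.
        if name and sockaddr[0] not in grouped[name]:
            grouped[name].append(sockaddr[0])
    return grouped


def addresses_by_family(host, port=443):
    """Resolve host through the system resolver and group results by family."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return group_addrinfo(infos)
```

Calling `addresses_by_family("onug.net")` on a dual-stack machine should list entries under both keys when A and AAAA records exist. An empty "IPv6" list suggests no AAAA record was returned, in which case a failure like the one described here would not be visible from that client.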
There are some key lessons to learn from this experience:
In situations like this one, many micro-outages, intermittent errors, and regional errors go undetected or get blamed on the users. We must recognize the number of factors, from DNS resolution to ISP performance to routing and dozens of others, that can impact the performance or availability of applications for users in particular regions.
This simple situation highlights the importance of continuous, proactive monitoring from where your users are, and of having the right technology at hand to catch errors like this one and solve issues before users are impacted.