Content delivery networks (CDNs) are tasked with finding the fastest and most efficient path to deliver content. CDN providers maintain an extensive network of points of presence (PoPs) that dictate the routing paths for the requested content.
CDN routing can make all the difference in terms of overall website performance. But there are other points in the path between the end-user and an edge server which could create potential bottlenecks in the network. The issue Catchpoint detected on November 14, 2018, with AT&T and CloudFront is an example of how one of the components in the network can contribute to latency and even downtime.
AT&T and AT&T Wireless users on the East Coast were unable to access content served by CloudFront. There were peering issues between AT&T and CloudFront via Telia.
Amazon CloudFront had network issues on November 14 that began at 11:30 CET. The page load time of numerous websites spiked between 11:30 and 20:30 CET.
Catchpoint was able to detect the issue immediately and notify customers impacted by the issue. We investigated the incident further to identify the root cause.
AT&T was routing its traffic through Telia, but requests were not making it across that network. We can see this clearly in the traceroute below. There was 100% packet loss after the request hit the Telia network, ASN 1299.
RIPEstat data highlights the route withdrawal.
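The pattern described here, where replies stop after a particular hop, can also be spotted programmatically. The following is a minimal sketch, not Catchpoint's implementation: it assumes plain-text output in the common Unix `traceroute` format, and the function name `last_responding_hop`, the hostnames, and the documentation-range IP addresses in the sample are all hypothetical rather than data from the incident.

```python
import re

# Matches lines that start with a hop number, e.g. " 3  host (ip)  12.0 ms ..."
HOP_RE = re.compile(r"^\s*(\d+)\s+(.*)$")


def last_responding_hop(traceroute_text):
    """Return (hop_number, host) for the last hop that replied, or None."""
    last = None
    for line in traceroute_text.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # header or blank line
        hop_number = int(match.group(1))
        rest = match.group(2).strip()
        # A hop whose probes are all "*" never answered.
        if all(token == "*" for token in rest.split()):
            continue
        last = (hop_number, rest.split()[0])
    return last


# Hypothetical sample output; not the traceroute from the incident.
SAMPLE = """traceroute to cdn.example.net (203.0.113.10), 30 hops max
 1  gateway.example.net (192.0.2.1)  1.2 ms  1.1 ms  1.3 ms
 2  isp-core.example.net (198.51.100.7)  8.4 ms  8.6 ms  8.3 ms
 3  transit-edge.example.net (198.51.100.99)  12.0 ms  11.8 ms  12.4 ms
 4  * * *
 5  * * *
"""

print(last_responding_hop(SAMPLE))
```

If every hop after the transit provider's edge times out, the last responding hop points at the network where the loss begins. Silent hops alone are not proof of a fault, since some routers simply do not answer probes, so a result like this is best read together with routing data.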
However, users from the same city but from a different ISP were able to reach CloudFront.
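Comparing ISPs within the same city is what narrows the fault down to one provider's path. A minimal sketch of that comparison follows, assuming test results arrive as `(city, isp, reachable)` tuples; the function name `isolate_failing_isps` and the sample data are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def isolate_failing_isps(results):
    """Map each city to the ISPs that always failed while another ISP succeeded.

    results: iterable of (city, isp, reachable) tuples, reachable being a bool.
    """
    by_city = defaultdict(lambda: defaultdict(list))
    for city, isp, reachable in results:
        by_city[city][isp].append(reachable)

    suspects = {}
    for city, isps in by_city.items():
        failing = sorted(isp for isp, runs in isps.items() if not any(runs))
        healthy = [isp for isp, runs in isps.items() if any(runs)]
        # Flag an ISP only when a different ISP in the same city can reach the target.
        if failing and healthy:
            suspects[city] = failing
    return suspects


# Hypothetical results; ISP names are placeholders.
SAMPLE_RESULTS = [
    ("Boston", "ISP A", False),
    ("Boston", "ISP A", False),
    ("Boston", "ISP B", True),
    ("New York", "ISP B", True),
]

print(isolate_failing_isps(SAMPLE_RESULTS))
```

A city where every ISP fails is deliberately left out, because that pattern suggests a problem at the destination rather than along one provider's route.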
The network issue lasted for almost 9 hours. As the issue escalated, a number of Catchpoint customers raised support tickets with CloudFront citing the network issue.
CloudFront responded by resolving requests originating from AT&T to a different CloudFront IP address that bypassed Telia. During the incident, the AT&T (Boston) IP address was 126.96.36.199; it was then changed to 188.8.131.52, after which the issue was resolved.
The traceroute below shows the change in network path:
Although the routing issue was resolved, the peering issue with Telia has yet to be resolved. A traceroute to the previous CloudFront IP, 184.108.40.206, still shows 100% packet loss.
CloudFront was able to bypass Telia and reroute the traffic. They took swift action as soon as they were notified about the network issue. This is exactly what proactive monitoring aims to deliver: a lower MTTR, achieved by giving you visibility into every process and component in the application delivery chain. The incident is also a reminder that comprehensive monitoring must account for the user’s device type, network type, and even their geolocation. We can only detect real-world issues by testing from where the users are and emulating user journeys from distributed vantage points.