On July 3, users across the globe were brought to a standstill when they could not load photos on Facebook or Instagram, while users of Facebook-owned WhatsApp could not send images or videos.
If Facebook were a standalone CDN, it would probably be in the top three CDNs in the world because of the sheer number of assets and traffic. Running the CDN infrastructure and networks at Facebook cannot be a small task.
Let’s start with what end users saw (or, more accurately, did not see):
The user experience issues were caused by a failure to load images served by the Facebook CDN. Requests for content from hosts matching scontent-.*-\d.xx.fbcdn.net returned either a 503 or 502 HTTP error code, or ended in a connection timeout.
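To make the failure modes concrete, here is a minimal Python sketch that matches hostnames against the pattern above and tallies the three kinds of failure. The `classify` helper and the sample observations are hypothetical and for illustration only; they are not measured data, and a timeout is represented here by a status of `None` as an assumption.

```python
import re
from collections import Counter

# Pattern from the article, anchored and with literal dots escaped.
FBCDN_HOST = re.compile(r"^scontent-.*-\d\.xx\.fbcdn\.net$", re.IGNORECASE)


def classify(host, status):
    """Label one observation as a failure type, or return None.

    `status` is an HTTP status code, or None for a connection timeout
    (an assumption made for this sketch).
    """
    if not FBCDN_HOST.match(host):
        return None  # not a Facebook CDN photo host
    if status is None:
        return "timeout"
    if status in (502, 503):
        return f"http_{status}"
    return None  # any other status is not one of the failures described


# Hypothetical observations, for illustration only.
observations = [
    ("scontent-sin2-2.xx.fbcdn.net", 503),
    ("scontent-sin2-2.xx.fbcdn.net", 200),
    ("scontent-SIN2-2.xx.fbcdn.net", None),
    ("www.example.com", 503),
]

failures = Counter()
for host, status in observations:
    label = classify(host, status)
    if label:
        failures[label] += 1

print(dict(failures))  # {'http_503': 1, 'timeout': 1}
```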
Here’s a quick view into the Facebook CDN domains that were failing:
503 HTTP response code:
502 HTTP response code:
One possible root cause is a configuration issue, because we saw the config version in the headers change during the incident:
During the incident (Dallas – Cogent):
After the incident:
During the incident (London – Cogent):
After the incident:
As we begin to look at the data for the Facebook/Instagram/WhatsApp issues from Wednesday, it is interesting to understand how their CDN infrastructure works.
What we learned:
1. Facebook CDN domains serving your photos are scontent-.*-\d.xx.fbcdn.net
2. The same object is served by different servers based on the user’s location.
Take 25994550_10156466395424714_5937507471042938431_n.jpg for example. Based on the city from which the request originates, a different server serves the object:
3. The hostnames have a code for the CDN edge serving the content. For example, scontent-SIN2-2.xx.fbcdn.net is an edge server in Singapore.
4. These hosts map to a static IP and are not using an Anycast network:
As mentioned before, some servers can serve requests from multiple cities:
Network traces to scontent-sin2-2.xx.fbcdn.net show a Unicast network design:
5. It was interesting to see that requests coming from a particular city weren’t necessarily served from a CDN location in the vicinity. In some cases, the request was served from a CDN location in a different country altogether:
- Request from Atlanta served from CDN server in Hong Kong
- Request from Bangalore served from CDN server in Singapore
- Request from Seattle served from CDN server in Stockholm
That helps us understand why the response time is high for a number of requests in these cities: the requests are doing a world tour!
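Points 3 and 5 above can be combined into a small check: pull the edge code out of the hostname and look up where that edge sits. The Python sketch below is only an illustration. The hostname layout is inferred from the single example in this post, the helper names are hypothetical, and the lookup table contains only the Singapore mapping because that is the only one stated here.

```python
import re

# Assumed layout: scontent-<edge code>-<digit>.xx.fbcdn.net
EDGE_HOST = re.compile(r"^scontent-([a-z0-9]+)-\d\.xx\.fbcdn\.net$", re.IGNORECASE)

# Only "sin" -> Singapore is given in the article; extend with care.
EDGE_PREFIX_TO_PLACE = {"sin": "Singapore"}


def edge_code(host):
    """Return the lowercase edge code (e.g. 'sin2'), or None if no match."""
    match = EDGE_HOST.match(host)
    return match.group(1).lower() if match else None


def edge_place(host):
    """Return the known location for the host's edge, 'unknown', or None."""
    code = edge_code(host)
    if code is None:
        return None
    prefix = code.rstrip("0123456789")  # 'sin2' -> 'sin'
    return EDGE_PREFIX_TO_PLACE.get(prefix, "unknown")


print(edge_code("scontent-SIN2-2.xx.fbcdn.net"))   # sin2
print(edge_place("scontent-SIN2-2.xx.fbcdn.net"))  # Singapore
```

Comparing the returned location with the city that issued the request is then enough to flag requests that left the region.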
Looking at it slightly differently, we found that the commonality between the cities served by the same CDN server was the ISP.
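One way to look for that commonality is to group the test results by CDN server and intersect the ISPs seen in each city. The Python sketch below assumes each record is a `(server, city, isp)` tuple; the function name and the placeholder records are hypothetical and do not reflect the actual measurements.

```python
from collections import defaultdict


def common_isps(records):
    """Map each server that serves more than one city to the ISPs
    observed in every one of those cities.

    `records` is an iterable of (server, city, isp) tuples.
    """
    isps_by_city = defaultdict(lambda: defaultdict(set))
    for server, city, isp in records:
        isps_by_city[server][city].add(isp)

    result = {}
    for server, cities in isps_by_city.items():
        if len(cities) > 1:  # only servers serving multiple cities
            result[server] = set.intersection(*cities.values())
    return result


# Placeholder records, for illustration only.
records = [
    ("edge-a", "City 1", "ISP X"),
    ("edge-a", "City 1", "ISP Y"),
    ("edge-a", "City 2", "ISP X"),
    ("edge-b", "City 3", "ISP Y"),
]

print(common_isps(records))  # {'edge-a': {'ISP X'}}
```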
CDN servers serving multiple cities:
Common ISPs between the cities:
The one that definitely needs fixing is NTT IPv6, where the traffic is getting routed to Hong Kong. The underlying cause for this is the peering between NTT and Tata Communications, as you can see from the network path visualization below:
The rollout of the fix was gradual across the CDN servers, with some recovering faster than others:
Things got back to normal across the globe around 22:06 UTC (6:06 pm EDT).
Incidents and outages are scary! We all need to ensure we have the right set of tools and processes in place to help us avoid them, and when they do happen, help us reduce MTTR.
But don’t miss out on the amount of learning that happens in the process of understanding what went wrong! You might even uncover an underlying problem that has been around for a very long time as you triage the current incident. This knowledge becomes key when you are faced with the next incident.