How Facebook Outage Affected Other Sites
On May 31, 2012, many sites that rely on the Facebook plugin experienced massive performance and usability issues.
Back in October 2011 we wrote about web pages turning into airports without air controllers. Web pages have become a very complex ecosystem relying on many third-party tags for advertising, behavioral targeting, social plugins, etc. This is all fine and many of these are great solutions.
However, very few websites have adopted a policy of ensuring that their performance and usability are unaffected when one of these third-party tags has a hiccup, which is bound to happen!
Yesterday, May 31, 2012, from 19:00 EDT until 21:50 EDT, and today, June 1, 2012, from 2:30 am to 6:00 am, many websites relying on the Facebook plugin experienced massive performance and usability issues. The social giant was having issues, and its users could not log in to wish happy birthday to their relatives and friends. At the same time, the outage affected users visiting sites like Urban Outfitters, L.L. Bean, HSN, JCPenney, Teleflora, 1800 Flowers…!
The issue mostly affected websites that had placed the Facebook code inline in their pages. From an end-user perspective, the “spinning hourglass” never stopped while the page loaded, because the browser waited and waited for requests to www.facebook.com to complete, which they never did. In the worst case, users saw pages hang or found page functionality impaired. On a few websites, the developers had followed WPO (web performance optimization) techniques and loaded the Facebook tags asynchronously so that they did not block the content, and their users did not even notice the issue. Sadly, very few websites implement such techniques for third-party tags. For these techniques to be widely adopted, the vendors should be the ones implementing them in the tags they provide to sites.
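As a rough illustration of the asynchronous approach, here is a minimal TypeScript sketch. It is not Facebook's actual snippet: the `ScriptLike` and `DocumentLike` types and the `injectAsyncScript` function are names made up for this example, and the types only mirror the handful of DOM members the function touches so the logic can also be exercised outside a browser.

```typescript
// Hypothetical minimal types for this sketch; they mirror the few DOM
// members used below rather than the full browser API.
interface ScriptLike {
  src: string;
  async: boolean;
}

interface DocumentLike {
  createElement(tagName: string): ScriptLike;
  head: { appendChild(node: ScriptLike): void };
}

// Inject a third-party script from code instead of placing a blocking
// <script src="..."> tag inline in the page markup.
function injectAsyncScript(doc: DocumentLike, src: string): ScriptLike {
  const script = doc.createElement("script");
  script.async = true; // fetch in parallel; do not hold up HTML parsing
  script.src = src; // vendor URL supplied by the caller
  doc.head.appendChild(script);
  return script;
}
```

In a browser, the caller would hand in something backed by the real `document`. The point of the design is that a hung request to the vendor's host no longer stops the rest of the page from being parsed and rendered.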
We had plenty of sites affected, but one example is enough to show the impact. I picked the site of a major retailer; its charts show that the Facebook glitch was not local to one region, as all of our nodes worldwide detected the issue.
Retailer’s Site Document Complete Scatterplot
Performance Impact & Bottleneck Time by Zone (Grouping of URLS)
Extracted the Facebook Hosts – Only www.facebook.com was having issues.
The lesson learned once again is that everyone is prone to failures (Google, Facebook, Twitter, etc.): some will experience downtime and others slow performance. As a site owner (or developer), you need to be prepared for these kinds of failures and build your site knowing that failure will happen. Some things to consider:
- Wait for your content to load before executing these third-party calls.
- Have a mechanism for detecting and quickly removing the offending tag(s).
- Test your site for single points of failure.
- Demand accountability from third-party providers, and require SLAs.
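The "detect and remove" item above could be sketched roughly as follows, again in TypeScript. Everything here is an assumption made for illustration: the `ThirdPartyTag` and `TagTiming` shapes, the function names, and the idea of a time budget per tag. In practice the timings might come from your monitoring data and the kill list from a first-party configuration you control.

```typescript
// Hypothetical shapes for this sketch.
interface ThirdPartyTag {
  id: string;
  src: string;
}

interface TagTiming {
  id: string;
  loadMs: number | null; // null means the request never completed
}

// Detection: flag tags that exceeded the budget or never finished loading.
function findOffendingTags(timings: TagTiming[], budgetMs: number): string[] {
  return timings
    .filter((t) => t.loadMs === null || t.loadMs > budgetMs)
    .map((t) => t.id);
}

// Removal: drop every tag on the kill list before anything is injected.
function selectTagsToLoad(
  tags: ThirdPartyTag[],
  killList: string[]
): ThirdPartyTag[] {
  return tags.filter((tag) => killList.indexOf(tag.id) === -1);
}
```

The surviving tags could then be injected from a handler that runs after the page's own content has loaded, which covers the first item in the list as well.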
And to the Facebook team (and other third parties), my humble suggestion, based on what we learned in the DoubleClick days: we had separate infrastructure to deliver 1×1 pixels so that we did not mess with our clients’ users’ experience when our systems had problems. Good luck everyone!
Mehdi – Catchpoint
Some additional resources about dealing with Single Points of Failure:
Steve Souders (Google) 2010 – http://www.stevesouders.com/blog/2010/06/01/frontend-spof/
Patrick Meenan (Google) 2011 – http://blog.patrickmeenan.com/2011/10/testing-for-frontend-spof.html
Ironic – Video of Steve Souders talking about SPOF a day before this outage at Fluent 2012: