This morning, beginning at 8:40 am UTC (4:40 am EDT), customers of the telecommunications giant Level 3 experienced two hours of serious connectivity problems that had a dramatic impact on their sites’ performance. To make matters worse, because Level 3 is a major backbone ISP, the impact was felt not just by its own customers but by any web traffic that passed through its network. The issue originated with Telekom Malaysia, whose prefix hijacking caused a route leak that resulted in global routing problems.
The greatest impact was in Oceania, but there was also significant impact in Europe, along with repercussions in Asia and North America. In our tests of the Level 3 network, the majority of problems in Oceania manifested as timeouts and connection failures, accompanied by packet loss and greatly increased round-trip times because certain routers were unavailable.
Here is a traceroute from Sydney showing significant packet loss from Global Crossing (Level 3) onward: latency jumps sharply at hop 7, and every hop after it times out.
1 2 ms 1 ms 1 ms xxx.xxx.xxx.xxx
2 * * * Timed Out
3 1 ms <1 ms <1 ms lag30.sglebinte01.aapt.net.au[220.127.116.11]
4 1 ms 1 ms 2 ms po41.sglebbrdr11.aapt.net.au[18.104.22.168]
5 <1 ms <1 ms <1 ms 203-219-106-153.tpgi.com.au[22.214.171.124]
6 2 ms 3 ms 3 ms syd-gls-har-int2-be-20.tpgi.com.au[126.96.36.199]
7 272 ms 316 ms 272 ms globalcrossing1-10g.hkix.net[188.8.131.52]
8 * * * Timed Out
9 * * * Timed Out
10 * * * Timed Out
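For readers who want to spot this pattern automatically, here is a minimal sketch (not the tooling used for the tests above) that parses tracert-style lines like the ones in this trace and flags hops that either time out or show a sudden latency jump. The function names and the 100 ms spike threshold are assumptions made for illustration, and "<1 ms" is treated as 1 ms.

```python
# Hypothetical helper names; the 100 ms threshold is an arbitrary assumption.

def parse_hop(line):
    """Split one tracert-style line into (hop, [rtt_ms or None, ...], host)."""
    tokens = line.split()
    hop = int(tokens[0])
    rtts, i = [], 1
    while len(rtts) < 3 and i < len(tokens):
        if tokens[i] == "*":
            rtts.append(None)          # probe got no reply
            i += 1
        else:
            rtts.append(float(tokens[i].lstrip("<")))  # "<1" becomes 1.0
            i += 2                     # skip the "ms" unit
    host = " ".join(tokens[i:])
    return hop, rtts, host


def flag_hops(lines, spike_ms=100.0):
    """Return (hop, reason, host) for hops that time out or jump in latency."""
    findings = []
    prev_best = None                   # best RTT of the last hop that replied
    for line in lines:
        hop, rtts, host = parse_hop(line)
        replies = [r for r in rtts if r is not None]
        if not replies:
            findings.append((hop, "timeout", host))
            continue
        best = min(replies)
        if prev_best is not None and best - prev_best >= spike_ms:
            findings.append((hop, "latency spike", host))
        prev_best = best
    return findings


# Three hops copied from the Sydney trace above.
TRACE = [
    "6 2 ms 3 ms 3 ms syd-gls-har-int2-be-20.tpgi.com.au[126.96.36.199]",
    "7 272 ms 316 ms 272 ms globalcrossing1-10g.hkix.net[188.8.131.52]",
    "8 * * * Timed Out",
]

if __name__ == "__main__":
    for hop, reason, host in flag_hops(TRACE):
        print(hop, reason, host)
```

Run against those three hops, the sketch flags hop 7 as a latency spike and hop 8 as a timeout. A single timed-out hop is not proof of an outage, since some routers simply do not answer probes (hop 2 in the full trace is likely an example), so a run of consecutive timeouts after a spike is the more telling signal.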
Meanwhile, Europe was experiencing more problems with DNS than anything else, as seen in this DNS traversal from Frankfurt, which shows increased RTT and packet loss that in turn resulted in DNS resolution timeouts.
Stepping back from the details for a moment, this issue shows the potential butterfly effect of global networks. An ISP flaps its wings in Malaysia, causing a chain reaction that results in site failures and timeouts on the other side of the world. It also highlights the fact that despite all of the precautions we can take, sometimes our sites’ web performance is completely out of our hands. Even if Level 3 customers had redundancy in place, the company’s network is so vast that there is no guarantee their backup ISP wouldn’t peer with Level 3 at some point along the path. The end result is an outage that could only have been prevented at the source.
UPDATE: Corporate responses from both carriers: