Another major outage on the Internet has taken place today. Telia, a major backbone carrier in Europe, suffered from a network routing issue between 16:00 and 17:05 UTC. This had a significant ripple effect with several other companies reporting outages at around the same time.
Companies affected include:
It’s always arresting to see the secondary and tertiary effects that a major outage can have. In this instance, it briefly caused some of the biggest infrastructure and content delivery networks to have serious performance issues.
There is a common theme in the slew of outages 2021 has brought us so far: the added – and painful – impact that an outage in one part of the Internet delivery chain has on all third-party dependencies.
Moreover, a number of the outages (this one included) have involved BGP. This is worth noting because not every business has insight into BGP activity, and that lack of visibility has resulted in some very real business and reputation losses.
What We Saw At Catchpoint
Many of our customers saw huge increases in webpage response times at the time of the Telia incident. This increase in response time occurred across portions of the US East Coast and Europe.
The Mystery Is Solved: BGP Misconfiguration
What caused today’s outage to occur?
At 21:17 UTC, Telia Carrier posted the following statement:
Ultimately, only the postmortem that Telia may or may not release can shed light on exactly what happened today. However, from digging into the BGP data available, we can uncover some very interesting facts…
Let’s Return To The Time Of The Cri(me)sis
Earlier in the evening, Telia shared this email with their customers (which we accessed from the outages.org mailing list):
"We regret to inform you that your services were affected by an incident occurred at 16:00 UTC during a routine update of a routing policy for aggregated prefixes in Telia Carrier IP Core network. This caused traffic to prefixes contained within the aggregates to be blackholed, resulting in a impact on some parts of the network.
When the underlying problem source was traced, the configuration was rolled back to the earlier working version of the routing policy (17:05 UTC). Affected services started to recover gradually after this operation was applied. No further disturbances related to this incident are expected.
Due to the wide impact on several customer services, a number of complaints was received by our Customer Support Centre, resulting in delays in the communication with customers via email and/or phone. We apologize for any inconvenience this incident has caused to your services."
The 16:00 UTC start time given in the email is exactly when we saw the first large batch of BGP events.
We dug a little deeper and focused our attention on rrc01, the RIS route collector that RIPE NCC deployed at the London Internet Exchange (LINX). This collector receives data directly from one IPv4 peer and one IPv6 peer belonging to Telia (AS1299).
As the following graphs show, at around 16:00 UTC the number of networks being announced and withdrawn spiked on both peers.
The two AS1299 peers generated BGP events related to about 500k IPv4 networks and 32k IPv6 networks. In other words, more than 50% of the full set of IPv4 routes were affected, as were more than 30% of the full set of IPv6 routes shared by the peers. This gives us a rough initial idea of how widely the outage affected the entire Internet.
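For readers who want to reproduce this kind of count, here is a minimal Python sketch of the idea: collect the distinct prefixes that a given peer announced or withdrew, split them by IP version, and express the count as a share of a reference table size. The `(peer_asn, kind, prefix)` record layout, the function names, the sample data (documentation prefixes) and the table size are all assumptions made for illustration. This is not the RIS data format, and it is not necessarily how the figures above were produced.

```python
import ipaddress
from collections import defaultdict


def affected_prefixes(updates, peer_asn):
    """Return the distinct prefixes touched by one peer, keyed by IP version.

    `updates` is an iterable of (peer_asn, kind, prefix) tuples, where kind is
    "A" (announcement) or "W" (withdrawal). This layout is hypothetical.
    """
    touched = defaultdict(set)
    for asn, kind, prefix in updates:
        if asn != peer_asn or kind not in ("A", "W"):
            continue
        network = ipaddress.ip_network(prefix)
        # A set ensures a prefix that flaps many times is counted only once.
        touched[network.version].add(network)
    return touched


def affected_share(touched_count, full_table_size):
    """Fraction of a reference routing table that saw at least one event."""
    return touched_count / full_table_size


# Toy data only: documentation prefixes, not real incident data.
sample_updates = [
    (1299, "W", "192.0.2.0/24"),
    (1299, "A", "192.0.2.0/24"),     # same prefix again: counted once
    (1299, "W", "198.51.100.0/24"),
    (1299, "W", "2001:db8::/32"),
    (64496, "A", "203.0.113.0/24"),  # different peer: ignored
]

touched = affected_prefixes(sample_updates, peer_asn=1299)
ipv4_count = len(touched[4])
ipv6_count = len(touched[6])
# The table size of 4 is an arbitrary toy value, not a real table size.
ipv4_share = affected_share(ipv4_count, full_table_size=4)
```

Counting distinct prefixes rather than raw messages matters here, because a single unstable prefix can generate many updates during an incident.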
What Can You Do To Tackle BGP Misconfigurations?
Every network operator has a duty to avoid misconfigurations in the routers they manage. However, no one is perfect, and BGP misconfigurations happen! When they do, it is critical to react to the issue as quickly as possible to minimize the disruption for end users.
It is easy to see from the BGP data that as soon as the BGP instability at Telia started, several operators decided to temporarily switch off their peering with AS1299 and/or attempted to route the traffic on alternative routes.
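One way to look for this kind of rerouting is to compare AS paths from before and during the incident. The sketch below is a simplified illustration in Python: it assumes you already have, for each prefix, the AS path seen before and during the event, stored as plain dictionaries. The function name, the data layout, and the sample paths (which use documentation prefixes and reserved documentation AS numbers alongside AS1299) are hypothetical.

```python
def rerouted_prefixes(before, during, asn):
    """List prefixes that crossed `asn` before the incident but avoid it now.

    `before` and `during` map a prefix string to its AS path (a list of AS
    numbers). Prefixes missing from `during` were withdrawn rather than
    rerouted, so they are skipped.
    """
    moved = []
    for prefix, old_path in before.items():
        new_path = during.get(prefix)
        if new_path is None:
            continue  # no longer reachable at all: not a reroute
        if asn in old_path and asn not in new_path:
            moved.append(prefix)
    return sorted(moved)


# Toy data only: these paths are invented for illustration.
before = {
    "192.0.2.0/24": [64496, 1299, 64500],
    "198.51.100.0/24": [64496, 1299, 64501],
    "203.0.113.0/24": [64496, 64497, 64502],
}
during = {
    "192.0.2.0/24": [64496, 64497, 64500],   # now avoids AS1299
    "203.0.113.0/24": [64496, 64497, 64502],  # never used AS1299
}

moved = rerouted_prefixes(before, during, asn=1299)
```

In this toy data, only the first prefix counts as rerouted; the second disappeared entirely and the third never depended on AS1299.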
While it is impossible to avoid BGP misconfigurations completely, network operators should follow common-sense rules and apply the best practices advocated by MANRS. Doing so minimizes the chances of a BGP misconfiguration happening.
In addition, businesses need a strong observability and monitoring structure in place to alert them as soon as anything like this begins to spread in the wild.
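As a rough illustration of what such an alert could look like, the Python sketch below flags any minute in which the number of BGP events jumps far above the recent baseline. The function name, the per-minute input, and the thresholds (`window`, `factor`, `min_events`) are assumptions chosen for the example. A production monitoring system would need more careful tuning and is not described by this sketch.

```python
from statistics import median


def find_spikes(counts, window=5, factor=10, min_events=100):
    """Return indexes of samples that exceed `factor` times the recent median.

    `counts` is a list of BGP event counts per minute. The baseline for each
    sample is the median of the preceding `window` samples, which keeps one
    earlier spike from inflating the baseline. `min_events` suppresses alerts
    on very quiet feeds. All thresholds are illustrative assumptions.
    """
    spikes = []
    for i in range(window, len(counts)):
        baseline = median(counts[i - window:i])
        if counts[i] >= min_events and counts[i] > factor * max(baseline, 1):
            spikes.append(i)
    return spikes


# Toy data only: a quiet feed followed by two minutes of heavy churn.
per_minute = [12, 9, 15, 11, 10, 13, 4800, 5200, 14]
spikes = find_spikes(per_minute)
```

A median baseline is used instead of a mean so that the second minute of the burst is still flagged even though the first minute is already inside its window.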
It is worth noting that not every observability solution has the capability to pick up on BGP-related issues. At Catchpoint, we are able to offer our customers the unparalleled perspective of more than 500 peers both in IPv4 and IPv6 from public repositories (RIPE NCC RIS and University of Oregon's Route Views), in addition to our own private infrastructure.
For further information on major incidents in 2021, please check out our new report. You’ll find detailed analysis, as well as a checklist of best practices to prevent, prepare for, and respond to an outage.