Post Incident Review: Comcast Outage

Published
July 2, 2018

We get upset and frustrated when a site we are trying to access is unavailable. That frustration multiplies when we can’t reach any site on the internet. On Friday, June 29th, Comcast experienced a major outage caused by two unrelated fiber cuts at its network backbone providers. Media reports indicate the outage began a little before 10 AM Pacific Time. The Comcast Cares Twitter account confirmed the outage around 11 AM, and at 11:37 it indicated the outage was due to a fiber cut.

Tweet from ComcastCares

Later in the day, Comcast updated its statement to indicate two separate, unrelated fiber cuts at its network backbone providers, Level3 and Zayo, as reported by Wired.

Examining the data

Catchpoint monitoring locations include nodes on the Comcast network. Looking at tests in our benchmark account that run on Comcast nodes, the earliest error we saw was at 12:27 Eastern Time.

Scatter plot of Comcast test runs

Many of the failed tests were caused by connection failures and timeouts:

Connection error

We also saw communication issues between our nodes and our command and control systems. One custom visualization we use when troubleshooting outages such as this is a Sankey diagram, which helps to visualize the flow of data. The width of each arrow is proportional to the flow quantity.

When we analyze the traceroutes between our Comcast nodes in Chicago, Los Angeles, New York, Philadelphia, Seattle, San Francisco, and Washington DC with a Sankey diagram, we see that the path analysis stops near Hop 8 and Hop 9 at the bottom of the chart. This indicates the point where the traceroutes dropped off in Comcast’s network.
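To make the idea concrete, here is a minimal sketch of how traceroute paths could be turned into the weighted links a Sankey diagram draws. It is not the tooling used for the chart above; the function name `sankey_links`, the node names, and the sample paths are all hypothetical.

```python
from collections import Counter


def sankey_links(paths):
    """Count how many traceroutes cross each (hop number, source, target) link.

    The count is the "flow quantity" that sets the width of a Sankey arrow.
    """
    links = Counter()
    for path in paths:
        # Pair each hop with the hop that follows it: (a, b), (b, c), ...
        for hop, (src, dst) in enumerate(zip(path, path[1:]), start=1):
            links[(hop, src, dst)] += 1
    return links


# Hypothetical traceroute paths (not real measurement data).
paths = [
    ["chicago-node", "cr1.example", "bb1.example", "dest"],
    ["seattle-node", "cr2.example", "bb1.example", "dest"],
    ["newyork-node", "cr1.example", "bb1.example"],  # stopped before the destination
]

links = sankey_links(paths)
for (hop, src, dst), count in sorted(links.items()):
    print(f"hop {hop}: {src} -> {dst} (width {count})")
```

In this sketch, a path that stops early simply contributes no link beyond its last hop, which is why the flow in a Sankey diagram thins out at the point where traceroutes drop off.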

Another way to look at the same data is with a force-directed visual. Force-directed graphs use algorithms that try to lay out a network graph in a visually appealing way. Data is presented in two or three dimensions, with edges of roughly equal length and as few crossing edges as possible.

The force-directed visual of the traceroutes is displayed below. The blue circles show the termination points of the traceroutes. These circles typically sit on the right-hand side of the chart, which represents the end of the flow. Blue circles near the center show traceroutes that did not complete.
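The same termination points can also be found without a chart. The sketch below, under assumed data, reports the last hop that answered in each traceroute and lists the traces that never reached the destination. The helper names, the use of `None` for an unanswered hop, and the sample traces are assumptions for illustration only.

```python
from collections import Counter


def last_responding_hop(hops):
    """Return (hop number, address) of the last hop that answered, or (0, None)."""
    for number in range(len(hops), 0, -1):
        if hops[number - 1] is not None:
            return number, hops[number - 1]
    return 0, None


def summarize(traces, destination):
    """Count termination hops and list traces that did not reach the destination."""
    ends = Counter()
    incomplete = []
    for name, hops in traces.items():
        number, address = last_responding_hop(hops)
        ends[number] += 1
        if address != destination:
            incomplete.append(name)
    return ends, incomplete


# Hypothetical traces; None marks a hop that timed out.
traces = {
    "chicago": ["a", "b", "c", None, None],
    "seattle": ["a", "d", "c", None, None],
    "washington-dc": ["a", "b", "c", "e", "dest"],
}

ends, incomplete = summarize(traces, "dest")
print("termination hop counts:", dict(ends))
print("incomplete traces:", incomplete)
```

A cluster of traces ending at the same hop number, as with hop 3 in this made-up data, is the tabular equivalent of the blue circles gathering near the center of the chart.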

Force directed visual of traceroute

Impact of the outage

Outages like the one Comcast experienced can sometimes lead to other outages. One of the sites we use to confirm the impact of an outage is DownDetector.com, which collects passive user data from various sources.

The charts below show the widespread nature of the outage. This wasn’t localized to a single city or state. Customers across the US were affected.

Comcast outage map from DownDetector

Trend of reported problems over 24 hours

Many news articles about the outage referenced charts like these. As more people started searching for answers about the outage, DownDetector itself experienced a brief outage, presumably due to the Slashdot effect (a sudden surge of traffic that overwhelms a site).

What started out as a means of research, communication, and collaboration has evolved. We have come to rely on the internet for day-to-day tasks. Remote employees, like me, rely on the internet to stay connected to co-workers. Small businesses rely on the internet to communicate with customers via e-mail, online chat, or VoIP phone services. Outages like these are more than an inconvenience when they impact our lives and businesses in major ways. It is easy to get frustrated.

In times like these, we have to remember the internet is fragile. Outages will occur, and given the interdependencies and complex nature of the internet, there can be a ripple effect when one service provider experiences an outage. As a consumer, there isn’t much you can do when outages occur. But as a business, you can monitor your application to identify when a vendor or provider is experiencing an issue.

_I would like to thank my colleagues Dhanush Suhas and Zachary Henderson for their help in gathering the data for this blog._
