Discover how ServiceNow mitigated a major network outage using Catchpoint’s IPM data, turning a potentially severe disruption into a valuable learning experience.
ServiceNow is undoubtedly one of the leading players in the fields of IT service management (ITSM), IT operations management (ITOM), and IT business management (ITBM). When they experience an outage or service interruption, it impacts thousands.
The indirect and induced impacts have a multiplier effect on the larger IT ecosystem.
Think about it. If a workflow is disrupted because of an outage, the ripple effects spread far and wide, touching every team, process, and integration that depends on that workflow. The list of downstream impacts goes on.
Unfortunately, ServiceNow recently experienced such an incident.
We conducted an analysis using Catchpoint’s Internet Performance Monitoring (IPM) data, and it shows that ServiceNow took proactive steps to limit the duration and impact of what could have been a much larger, more disruptive incident.
Let’s dissect.
On 15th Aug 2024 at 14:15 ET, ServiceNow’s core services went down, with reports highlighting intermittent success depending on connectivity with upstream providers. Failures were reported until 16:18 ET, a window of 2 hours and 3 minutes. The outage impacted not only ServiceNow’s portal resources but their client integrations as well.
Catchpoint’s Internet Sonar started triggering alerts as it correlated the outage signal with threshold breaches on the existing synthetic tests. The Internet Sonar dashboard dynamically populated outage data showing Response and Connection Timeout errors from major geographic locations. Observing the outage trend, we found resources to be intermittently reachable, with the majority of requests facing high connect times.
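For readers who want to reproduce the symptom outside of Catchpoint, here is a minimal sketch (not Catchpoint’s implementation; the hostname and threshold are placeholders) of how a simple probe could surface the same pattern of connection timeouts and high connect times:

```python
# Minimal sketch: measure TCP connect time to a hypothetical endpoint and flag
# timeouts or unusually slow connects, the two symptoms observed during the outage.
import socket
import time

HOST = "example.service-now.com"   # placeholder instance hostname
PORT = 443
CONNECT_THRESHOLD_S = 2.0          # assumed threshold for "high connect time"
TIMEOUT_S = 10.0

def probe(host: str = HOST, port: int = PORT) -> None:
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=TIMEOUT_S):
            elapsed = time.monotonic() - start
        if elapsed > CONNECT_THRESHOLD_S:
            print(f"WARN  high connect time: {elapsed:.2f}s to {host}:{port}")
        else:
            print(f"OK    connected in {elapsed:.2f}s to {host}:{port}")
    except socket.timeout:
        print(f"ERROR connection timeout after {TIMEOUT_S}s to {host}:{port}")
    except OSError as exc:
        print(f"ERROR connection failed to {host}:{port}: {exc}")

if __name__ == "__main__":
    probe()
```

Run on a schedule from several locations, a probe like this would show exactly the intermittent mix of successes, timeouts, and slow connects described above.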
This outage resulted from instability in ServiceNow's connectivity with its upstream providers, particularly AS 6461 | Zayo.
We observed this behavior in the Catchpoint portal with our Traceroute monitor.
15th Aug 11:00 EST to 14:00 EST, before the outage:
15th Aug 14:15 EST to 16:20 EST, during the outage:
15th Aug 16:20 EST to 18:00 EST, after the outage:
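To give a sense of what the traceroute views above are surfacing, here is a rough, hypothetical sketch: it runs the system traceroute, maps each responding hop to an origin ASN via RIPEstat’s public network-info endpoint (field names assumed from its documentation), and reports whether Zayo (AS 6461) appears on the path. The target hostname is a placeholder.

```python
# Rough sketch: approximate the AS path of a traceroute and check for AS6461 (Zayo).
import json
import re
import subprocess
import urllib.error
import urllib.request

TARGET = "example.service-now.com"   # placeholder destination hostname
RIPESTAT = "https://stat.ripe.net/data/network-info/data.json?resource={ip}"

def hop_ips(target: str) -> list[str]:
    # -n prints bare IPs, -q 1 sends one probe per hop (flags per common traceroute builds)
    out = subprocess.run(["traceroute", "-n", "-q", "1", target],
                         capture_output=True, text=True, timeout=120).stdout
    return re.findall(r"\d{1,3}(?:\.\d{1,3}){3}", out)

def origin_asns(ip: str) -> list[str]:
    # RIPEstat returns the ASN(s) originating the covering prefix; field name assumed.
    # Lookups that fail (e.g. private hop IPs) are simply skipped.
    try:
        with urllib.request.urlopen(RIPESTAT.format(ip=ip), timeout=10) as resp:
            data = json.load(resp)
        return [str(a) for a in data.get("data", {}).get("asns", [])]
    except (urllib.error.URLError, ValueError):
        return []

if __name__ == "__main__":
    path_asns: list[str] = []
    for ip in hop_ips(TARGET):
        path_asns.extend(a for a in origin_asns(ip) if a not in path_asns)
    print("Approximate AS path:", " -> ".join(path_asns) or "(none resolved)")
    print("Zayo (AS6461) on path:", "6461" in path_asns)
```

Running a sketch like this before, during, and after the incident would show Zayo dropping out of the path, mirroring the three traceroute snapshots above.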
ServiceNow uses multiple ISPs for its datacenter locations, as listed in this article: https://support.servicenow.com/kb?id=kb_article_view&sysparm_article=KB0547560. Of the listed ISPs, the core providers with a direct BGP neighbor relationship with ServiceNow are Lumen (3356), Cogent (174), Zayo (6461), Level3 (3356), AT&T (7018), and Verizon (6167).
Before the outage, AS 6461 | Zayo was one of the preferred paths for traffic inbound to ServiceNow:
But as soon as Zayo started having major issues, paths began to fluctuate, ultimately taking the ServiceNow team through multiple BGP events (announcements, re-announcements, and withdrawals) along the way.
Let’s break down the BGP activity highlighted above from RIPEstat into three scenarios:
Before the outage (Aug 14, between 00:00 – 23:59 UTC), we observe a total of 178 events for AS 16839. These events give us a baseline view of BGP activity within the ServiceNow (SNC) ASN, including neighbor changes:
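As an illustration, a quick query against RIPEstat’s public asn-neighbours endpoint (field names assumed from its documentation) can show which of these providers currently appear as BGP neighbors of ServiceNow’s AS 16839:

```python
# Sketch: list which of the documented ISPs show up as BGP neighbours of AS16839.
import json
import urllib.request

SNC_ASN = 16839
LISTED = {3356: "Lumen / Level3", 174: "Cogent", 6461: "Zayo",
          7018: "AT&T", 6167: "Verizon"}
URL = f"https://stat.ripe.net/data/asn-neighbours/data.json?resource=AS{SNC_ASN}"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

# each neighbour entry is assumed to carry an "asn" field per the endpoint docs
neighbours = {n["asn"] for n in data["data"]["neighbours"]}
for asn, name in LISTED.items():
    status = "neighbour" if asn in neighbours else "not seen"
    print(f"AS{asn:<6} {name:<15} {status}")
```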
During the outage on 15th Aug, compared with the day before, the number of events increased drastically to 491, with a lot of route withdrawals and re-announcements.
This abnormal spike highlights the network volatility: essentially, the changes made by the ServiceNow team, manually or through automation, to keep their services reachable from the internet. While those changes were being made, the ServiceNow portal and partner integrations continued to have connectivity issues.
Keeping a close eye on the affected network, we observe that after the outage the ServiceNow ASN is no longer connected to, or receiving traffic directly through, Zayo, which hints at issues specific to the Zayo-ServiceNow link during the incident. BGP did its job, and traffic found a reliable way to its destination via ServiceNow’s other providers.
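If you want to run a similar before-and-after comparison yourself, here is a rough sketch using RIPEstat’s bgp-updates endpoint (parameter and field names assumed from its documentation) to count announcements and withdrawals seen for AS 16839 over each day:

```python
# Sketch: compare BGP update volume for AS16839 on the baseline day vs. the outage day.
import json
import urllib.request
from collections import Counter

def count_bgp_updates(resource: str, start: str, end: str) -> Counter:
    url = ("https://stat.ripe.net/data/bgp-updates/data.json"
           f"?resource={resource}&starttime={start}&endtime={end}")
    with urllib.request.urlopen(url, timeout=60) as resp:
        data = json.load(resp)
    # each update is assumed to carry a type: "A" (announcement) or "W" (withdrawal)
    return Counter(u["type"] for u in data["data"]["updates"])

if __name__ == "__main__":
    before = count_bgp_updates("AS16839", "2024-08-14T00:00", "2024-08-14T23:59")
    during = count_bgp_updates("AS16839", "2024-08-15T00:00", "2024-08-15T23:59")
    print("Aug 14 (baseline):", dict(before))
    print("Aug 15 (outage):  ", dict(during))
```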
Eventually, the issue with the Zayo-ServiceNow link was resolved roughly 10 hours after the first hints of the incident (15th Aug, 20:25 UTC – 16th Aug, 06:28 UTC), and traffic was once again routed through the originally preferred links:
In addition to the Internet Sonar and synthetic test alerts, here is what we saw in the Catchpoint portal:
At 15:51 EST, though still at a reduced level, we saw services returning to normal as the ServiceNow team, guided by the BGP events, rolled back routes and redirected traffic to alternate IPs:
As highlighted in the snippet above, we can observe the traffic routing before, during, and after the issue. We can see the redirection in real time as requests are routed to a new IP, 149.95.29.217 (part of the 149.95.16.0/20 subnet), instead of the original IP, 149.95.45.217. This was done as part of the mitigation, deprioritizing traffic via Zayo using BGP updates.
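The subnet relationship is easy to verify with Python’s ipaddress module: the new IP sits inside 149.95.16.0/20, while the original IP falls outside it, confirming the redirection moved traffic onto a different prefix.

```python
# Quick check: which of the two IPs falls inside the 149.95.16.0/20 prefix?
import ipaddress

net = ipaddress.ip_network("149.95.16.0/20")
for label, ip in [("new", "149.95.29.217"), ("original", "149.95.45.217")]:
    inside = ipaddress.ip_address(ip) in net
    print(f"{label:<8} {ip:<15} in {net}: {inside}")
```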
Even though we have limited control over the Internet, there are lessons to take from this outage. When BGP events take place, failing to take proactive, necessary action will eventually lead to large, prolonged outages. The ServiceNow team took the necessary steps based on the network fluctuations they observed and restored connectivity to core resources and client integrations.
This outage can serve as a great learning experience for many other organizations.
In today's distributed environment, the application delivery chain is made up of numerous disparate but interdependent parts, and incidents like this demonstrate the impact a network outage can have on your infrastructure - DNS, load balancers, CDN, cloud infrastructure, datacenters, and so on - but most importantly on your end-user experience and overall business.