Microsoft Cloud Outage Causes Global Workforce Disruptions
Were you impacted by the Microsoft Services outage? We were. Fortunately, our IPM platform was able to immediately diagnose the problem. Find out what our telemetry unearthed and our 6 key takeaways.
Many of us (indeed 1 billion plus users worldwide) rely on Microsoft for essential work activities and were impacted yesterday (Wednesday January 25, 2023) when the cloud service provider experienced a prolonged outage. Internet Resilience is a business priority because when critical workforce services like Microsoft go down, global teams are hugely disrupted. The sooner IT teams can detect the problem, isolate its cause, and troubleshoot, the sooner your workforce can resume its normal operations and limit impact to the business.
Multiple Microsoft services started to fail across the globe at 07:08 UTC (02:08 EST) on January 25, 2023, and the issue continued to impact users for around five hours. The outage impacted not only Microsoft 365 services such as Teams, Outlook, and SharePoint Online, but also Microsoft games such as HALO and security features like Microsoft Defender for Identity, along with the company’s crown jewel cloud offering, Azure. Based on incident MO502273 on the Service health dashboard, the preliminary root cause was identified as “a wide-area networking (WAN) routing change [that] resulted in users being unable to access multiple Microsoft 365 services.”
Microsoft expands on root cause
Redmond elaborated on the root cause on January 27, saying the global outage was caused by a single router IP address change that led to packet forwarding issues between all other routers in its WAN. "As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables," Microsoft said. "During this re-computation process, the routers were unable to correctly forward packets traversing them."
While the change was planned, the “command” given to the router was clearly an error – indeed a costly one, which led to the widespread service impact, hitting in waves that peaked approximately every thirty minutes, as shared on the Microsoft Azure service status page (which itself was impacted, intermittently displaying “504 Gateway Time-out” errors).
Over Twitter, Microsoft kept users informed as they troubleshot the issue.
January 25th MS Twitter updates
An hour later, they rolled back the change to mitigate the issue.
Five hours after that, Microsoft reported on a return to normal for almost all its services.
Immediate detection by Catchpoint’s IPM platform
Catchpoint’s IPM platform detected the issue as soon as it started, at 02:08 ET on January 25, 2023, for the Bing, Teams, and Outlook applications. Thanks to our continually growing global observability network, we were able to observe increased connect and wait times, along with availability drops, around the world.
The first errors users saw would have been HTTP 503 (Service Unavailable) responses. Below, the response header from Catchpoint’s tests indicates a 503 Service Unavailable error for Microsoft’s search engine, Bing.
Fortunately for the US and most of EMEA, the outage largely occurred during the early hours of the morning before anyone had started their working day.
At Catchpoint, however, as at many global businesses, a large percentage of our workforce is based in India (where it was early afternoon), and our employees described their initial confusion at finding core Microsoft services unavailable. One of our QA Managers in Engineering reported that Teams was almost impossible to use for two and a half hours, that she felt disconcerted on seeing “Status Unknown” on colleagues’ statuses, and that she was abruptly ejected from a Teams call and then unable to rejoin. However, once alerted to the outage by our Managed Services team, who were observing the telemetry our IPM platform was detecting, she was able to switch over to the mobile version of Teams and resume the call.
Likewise, one of our System Administrators found that his Outlook had simply stopped working. He wasn’t receiving any emails (including no doubt concerns about the outage itself), nor could he send them. When he became aware of the data Catchpoint was unearthing, he too understood there was a global outage and instead of switching off until service resumed, began to proactively communicate with colleagues to find workarounds until service normalized.
Sign up for a free trial to find out how our IPM solutions can help you enable Internet Resilience for your distributed employees and global customer base.
Now, the networking aspect (and supporting BGP data)!
Our telemetry matched what Microsoft announced. However, in addition, we decided to dig into the BGP data since it’s the natural starting point for a networking-based investigation and Microsoft said it was a networking/WAN issue. At Catchpoint, we collect BGP data in real-time from more than 50 partners distributed all over the globe – and the number is growing! From a quick look at the BGP events collected by a few of our peers, it is easy to identify the exact moment when the problem arose (and probably when Microsoft’s customers started complaining). In the graph below, we analyzed all the BGP updates generated by our peer selection, which were carrying a route towards networks originally announced by one of the 5 AS numbers (ASNs) mostly used by Microsoft.
As you can see, there was a spike in BGP events starting at 02:08 AM ET/07:08 UTC, when the routing activity of each of the 5 ASNs spiked upwards. This was recorded by all the peers we analyzed. Most of the events recorded were network announcements, but we also caught several network withdrawals – meaning that any service running behind the withdrawn networks was not reachable from the selected peers – and likely not by a lot more people around the world.
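The spike described above can be found by bucketing BGP updates per origin ASN and per minute, then flagging minutes that stand far above that ASN's usual activity. The following is a minimal sketch under stated assumptions, not Catchpoint's actual pipeline: the record layout, the function names, the threshold factor, and the sample data are all hypothetical, and the ASN comes from the range reserved for documentation rather than from Microsoft.

```python
from collections import Counter
from datetime import datetime, timezone
from statistics import median

# Hypothetical record layout: (UTC timestamp, origin ASN, update kind).
# 64500 is a documentation-reserved ASN, not one of Microsoft's.

def count_updates_per_minute(updates):
    """Count BGP updates per (origin ASN, minute)."""
    counts = Counter()
    for ts, origin_asn, _kind in updates:
        minute = ts.replace(second=0, microsecond=0)
        counts[(origin_asn, minute)] += 1
    return counts

def find_spikes(counts, factor=5.0):
    """Return (asn, minute, count) for minutes exceeding factor x the ASN's median.

    The median is taken over minutes that saw at least one update, because
    silent minutes never appear in the counter.
    """
    per_asn = {}
    for (asn, minute), n in counts.items():
        per_asn.setdefault(asn, []).append((minute, n))
    spikes = []
    for asn, series in sorted(per_asn.items()):
        baseline = median(n for _, n in series)
        for minute, n in sorted(series):
            if n > factor * baseline:
                spikes.append((asn, minute, n))
    return spikes

def make_sample():
    """Synthetic data: two updates per minute, with a burst of 40 at 07:08 UTC."""
    base = datetime(2023, 1, 25, 7, 0, tzinfo=timezone.utc)
    updates = []
    for m in range(10):
        burst = 40 if m == 8 else 2
        for i in range(burst):
            kind = "withdrawal" if i % 10 == 9 else "announcement"
            updates.append((base.replace(minute=m, second=i), 64500, kind))
    return updates

for asn, minute, n in find_spikes(count_updates_per_minute(make_sample())):
    print(f"AS{asn}: {n} updates in the minute starting {minute.isoformat()}")
```

A median baseline is used here because a single burst barely moves it, whereas a mean would be pulled upward by the very spike being looked for.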
To confirm that, we can double-check the telemetry that Catchpoint produced across the duration of the outage, which revealed many connection timeouts, as can be seen in the following Waterfalls and Sankey charts.
Our Synthetic Traceroute tests were actively monitoring the network from our employees to Microsoft on a continuous basis. This allowed us to track the impact of the outage before and after at a global scale, as is demonstrated in the following three Sankey charts.
From Figure 1, we can see that the test failures were due to a connection timeout. Most monitoring tools would only indicate there is a connection issue, but as you can see, Catchpoint helps network teams determine at which level the issue is happening. Figure 2 clearly indicates global packet loss at hops managed by Microsoft. Figures 3 to 5, meanwhile, show that before and after the outage there is no packet loss, whereas during the outage we see increased packet loss for the destination IP. During a critical outage, this kind of granular information helps network teams under pressure quickly narrow down the root cause of the issue.
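The before/during comparison described above comes down to aggregating packet loss per hop in each time window and reporting the hops whose loss rose sharply. The following is a rough sketch of that comparison, assuming a hypothetical data layout in which each traceroute run is a list of (hop address, probes sent, probes answered) tuples. The addresses come from documentation-reserved ranges, and the 10-point threshold is an arbitrary illustration.

```python
# Hypothetical layout: one traceroute run = list of
# (hop_address, probes_sent, probes_answered) tuples.

def loss_by_hop(runs):
    """Aggregate packet loss (in percent) per hop across several runs."""
    sent, answered = {}, {}
    for run in runs:
        for hop, n_sent, n_answered in run:
            sent[hop] = sent.get(hop, 0) + n_sent
            answered[hop] = answered.get(hop, 0) + n_answered
    return {
        hop: 100.0 * (sent[hop] - answered[hop]) / sent[hop]
        for hop in sent
        if sent[hop] > 0
    }

def compare_windows(before, during, threshold=10.0):
    """Return {hop: (loss_before, loss_during)} where loss rose by more
    than `threshold` percentage points between the two windows."""
    loss_before = loss_by_hop(before)
    loss_during = loss_by_hop(during)
    return {
        hop: (loss_before.get(hop, 0.0), loss)
        for hop, loss in loss_during.items()
        if loss - loss_before.get(hop, 0.0) > threshold
    }

# Synthetic example: the first hop stays healthy, the last two degrade.
BEFORE = [
    [("192.0.2.1", 3, 3), ("198.51.100.7", 3, 3), ("203.0.113.9", 3, 3)],
    [("192.0.2.1", 3, 3), ("198.51.100.7", 3, 3), ("203.0.113.9", 3, 3)],
]
DURING = [
    [("192.0.2.1", 3, 3), ("198.51.100.7", 3, 1), ("203.0.113.9", 3, 0)],
    [("192.0.2.1", 3, 3), ("198.51.100.7", 3, 2), ("203.0.113.9", 3, 0)],
]

for hop, (was, now) in compare_windows(BEFORE, DURING).items():
    print(f"{hop}: loss went from {was:.0f}% to {now:.0f}%")
```

Reporting the first hop at which loss appears, rather than only the failure at the destination, is what lets a team say whether the problem sits in its own network, at its ISP, or inside the provider.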
In today’s competitive landscape, organizations can’t afford to see workforce productivity levels come to a standstill by not being able to send instant messages, emails, or access key documents. This kind of outage also means hundreds of calls to the Help Desk team from distributed employees to understand the situation. Without a proper monitoring tool, it will take a considerable amount of time for the helpdesk teams to even realize it is not their fault and offer clarity and workarounds to their workforce.
Initially, based on the patterns observed (503 errors turning into connection timed out errors), we suspected this was a network outage caused by changes following a services failure. However, the same pattern is generated by planned network changes that move traffic from one pod/datacenter to another by stopping application services at one location and routing users to a different one, which, based on Microsoft’s initial postmortem, is what happened here. This is why analyzing telemetry data is extremely important: know your architecture and evaluate the difference caused by any changes performed before and during the incident.
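The error pattern mentioned above (503 responses giving way to connection timeouts) can be picked out of a chronological series of probe results by collapsing it into phases. The sketch below is illustrative only: the outcome labels and function names are assumptions, not a Catchpoint data format. As the paragraph above notes, finding the pattern does not by itself tell you which of the two causes produced it.

```python
def summarize_phases(outcomes):
    """Collapse consecutive identical outcomes into (outcome, count) runs."""
    phases = []
    for outcome in outcomes:
        if phases and phases[-1][0] == outcome:
            phases[-1] = (outcome, phases[-1][1] + 1)
        else:
            phases.append((outcome, 1))
    return phases

def shows_503_then_timeout(outcomes):
    """True if a run of 503s is immediately followed by a run of timeouts."""
    labels = [label for label, _count in summarize_phases(outcomes)]
    return any(
        first == "http_503" and second == "timeout"
        for first, second in zip(labels, labels[1:])
    )

# Hypothetical probe results in chronological order.
SAMPLE = ["ok", "ok", "http_503", "http_503",
          "timeout", "timeout", "timeout", "ok"]

print(summarize_phases(SAMPLE))
print(shows_503_then_timeout(SAMPLE))
```

Setting phases like these next to a change log (what was deployed or re-routed, and when) is what separates a planned traffic move from a failure-driven one.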
Our six key takeaways
There are various ways to mitigate the damage caused by this type of extensive outage, depending on whether you are the unfortunate service provider experiencing it or its downstream, inadvertent casualty. Our six key takeaways:
- Communicate, communicate, communicate – when an outage occurs, be prepared to tell your workforce and customers what is going on as quickly as possible so they can find workarounds.
- Make sure your communications are fast and accessible for every user. Not everyone will go on Twitter or in this instance, be able to access the MS admin page with more detailed notes. As a matter of course, service providers should route people to a down page that is on a different infrastructure from the core service so that ALL your users know exactly what is happening and won’t blame it on their own WiFi or ISP and experience further disruption (and anxiety) as a result.
- For service providers and those reliant on them, implement an IPM strategy that allows you to monitor precisely what your distributed employees and global customers are experiencing 24/7/365.
- Monitor the entire Internet Stack you rely on to deliver your content (including DNS, CDN, ISP, BGP, TCP configuration, SSL, and other cloud services), even if you assume much of it is outside your control.
- The leading cause of failure is config changes, so whenever change is planned, be prepared for any outages that may follow.
- Prepare, prepare, prepare - ensure your team is ready for any outages that may occur (intentionally or not) by practicing recovery and developing runbooks, including a crisis call plan that includes a communication strategy and templates, a plan to mitigate failures from third-parties, and a best practices approach to monitoring and observability.
In its PIR, Microsoft acknowledged its own key takeaways:
How are we making incidents like this less likely or less impactful?
- We have blocked highly impactful commands from getting executed on the devices (Completed)
- We will require all command execution on the devices to follow safe change guidelines (Estimated completion: February 2023)
One thing that’s painfully clear from the last eighteen months is that no matter how big a service provider is, they can still fail. Failures happened to the giants of the Internet: from Microsoft to Facebook. Our experts have carried out a deep dive analysis into six of these incidents and put them into a new white paper, along with six critical takeaways. Get Catchpoint’s Preventing Outages in 2023 white paper now (no registration required).
Updated on 2/2/2023