Slack Outage of 2/22/22 – Good Morning! Here’s 16 Minutes of Stress!

Published

February 23, 2022

For some time now, people have understood the importance of early warning systems, whether for detecting earthquakes and tsunamis, military defense, or business and financial crises. Why should service providers, especially those delivering software as a service (SaaS), be any different? In a world where time is money and minutes mean millions, it is vital for organizations to keep a very close eye on the supply and delivery chain of their service to their end users, both business and consumer.

According to TechJury, Slack has more than 10 million daily active users, of which 3 million are paying subscribers who spend an average of 9 hours logged in to Slack each day. So yes, it is important for Slack to fully understand how their service is performing and what their users are experiencing. However, if you’re one of the 600,000 companies worldwide that rely on Slack to keep communication flowing throughout the enterprise, it’s critical for you to have your finger on that pulse as well. As a customer, the only thing worse than having a productivity application go down is wasting even more time thinking it might be you and trying to fix something you can’t.

This is not without precedent. In 2019, according to TechTarget, Slack watered down its cloud service level agreement (SLA) after outages forced the vendor to issue $8.2 million in credits in a single quarter. “Compounding the financial impact of the downtime was an exceptionally generous credit payout multiplier in our contracts dating from when we were a very young company,” Slack’s finance chief, Allen Shim, told analysts on a conference call at the time. “We’ve adjusted those terms to be more in line with industry standards, while still remaining very customer friendly.” So, while subsequent outages should not have quite the same bottom-line impact, the intangible effects of issues such as damage to the brand, resolution efforts, loss of productivity, and vulnerability to competition remain and are harder to quantify.

Now, looking at it from a customer’s perspective: when the service started having trouble yesterday around 9 AM ET, Slack gave no indication of an issue. However, if you were a Catchpoint user, you wouldn’t have needed one, as you would have been alerted right away (see below).

A chart showing failed tests for slack — Failures for Slack tests from Feb 22, 2022 starting at 09:09:48 ET (Catchpoint)
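For readers curious how a run of failed tests turns into an alert, here is a minimal sketch of one common rule: alert once a number of consecutive runs have failed. It is not Catchpoint's alerting logic; `CheckResult`, `first_alert_time`, and the default threshold of 3 are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class CheckResult:
    """One synthetic test run (hypothetical shape, not a Catchpoint data model)."""
    started_at: datetime
    succeeded: bool


def first_alert_time(runs: Iterable[CheckResult], threshold: int = 3) -> Optional[datetime]:
    """Return the start time of the run that completes `threshold` consecutive failures.

    Returns None if the failure streak never reaches the threshold.
    """
    streak = 0
    for run in sorted(runs, key=lambda r: r.started_at):
        # A success resets the streak; a failure extends it.
        streak = streak + 1 if not run.succeeded else 0
        if streak >= threshold:
            return run.started_at
    return None
```

A lower threshold alerts sooner but is noisier; a higher one trades minutes of lead time for fewer false alarms.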

Once alerted, the organizational reaction to this event varies by user type and position. If I’m the head of IT at a company that uses Slack, my first step is to check Slack’s service site to see what they report: several minutes into the failure there was no update from Slack. In my role, I would need to verify the error before I start to raise the alarm internally. Fortunately, I don’t even have to log in to Slack myself to do that; the Catchpoint tests capture the error notifications that end users see, such as this one right after successfully logging in:

An image of an error notification stating the server is having trouble loading — Error notification (Catchpoint)

And this one when users try to fetch conversations:

An error notification saying that the thread couldn't be loaded — Error notification (Catchpoint)

Not only that, but Catchpoint waterfall reports also show the error request on the page, which seems to be posting back the error captured in the application (this request is seen only in failed test runs).

An image of Catchpoint's Waterfall chart identifying an error from a failed test run — Waterfall chart (Catchpoint)

Houston, we have a problem! At this point I would have successfully navigated the “detect” stage of the incident lifecycle and moved on to the next stage, “identify,” confirming the event as a bona fide incident. Then comes the hard part: triage. Whom do I have to notify – or worse, wake up? Whose breakfast am I going to interrupt to help diagnose this?

Let’s start with the folks who are guilty until proven innocent: the network team. The chart below shows where in the many steps the test is failing:

An image of an Availability chart showing where in the steps the test is failing — Availability chart (Catchpoint)

Okay, so it’s not the network. However, let’s make doubly sure and look at which requests are failing. We can do that because Catchpoint can emulate user activity, which shows that users were able to log in and take quite a few actions before getting an error from the Slack servers.

An image of an emulated activity chart showing users were able to log in and take a few actions before getting an error from the Slack servers — Emulated user activity (Catchpoint)
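The reasoning behind clearing the network team can be sketched in a few lines: if a step never reaches the server, suspect the network; if it connects and gets an error back, suspect the application. This is only an illustration of that triage rule. `StepResult` and `triage` are hypothetical names, and treating any HTTP status of 400 or above as an application error is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StepResult:
    """Outcome of one step in a multi-step test (hypothetical shape)."""
    name: str
    connected: bool              # connection to the server was established
    http_status: Optional[int]   # None when no response arrived
    app_error: bool = False      # an error message was rendered to the user


def triage(steps: Sequence[StepResult]) -> str:
    """Report the first failing step and which layer it points to."""
    for step in steps:
        if not step.connected or step.http_status is None:
            return f"network: step '{step.name}' could not reach the server"
        # Assumption: a 4xx/5xx status or a rendered error means the server answered badly.
        if step.http_status >= 400 or step.app_error:
            return f"application: step '{step.name}' reached the server but got an error"
    return "ok: all steps passed"
```

In the outage described here, the early steps (such as logging in) passed and a later step failed after connecting, which is the "application" branch.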

This incriminates the “<root>/api/client.boot?_x_id=noversion-1645542869.027&_x_version_ts=noversion&_x_gantry=true&fp=e3” request.

The chart below, showing the before and after, points to it being one of the key calls responsible for conversations and messages page functionality.

Image of a request data chart showing data as a key call responsible for conversations and messages page functionality — Request data chart (Catchpoint)
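The before-and-after comparison can be approximated by hand if you have request lists from a passing run and a failing run. The sketch below assumes each request is recorded as a URL plus a pass/fail flag (a hypothetical format, not a Catchpoint export) and strips the query string so that per-request values such as `_x_id` do not make one endpoint look like many.

```python
from typing import Iterable, Set, Tuple
from urllib.parse import urlsplit

# (url, ok) where `ok` is whatever the test counted as success (assumed format).
Request = Tuple[str, bool]


def endpoint(url: str) -> str:
    """Keep only the path so per-request query parameters are ignored."""
    return urlsplit(url).path


def suspect_endpoints(passing: Iterable[Request], failing: Iterable[Request]) -> Set[str]:
    """Endpoints that succeeded in the passing run but failed in the failing run."""
    ok_before = {endpoint(url) for url, ok in passing if ok}
    bad_now = {endpoint(url) for url, ok in failing if not ok}
    return ok_before & bad_now
```

Endpoints that fail in both runs are excluded on purpose, since they cannot explain a change between "before" and "after".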

Confirmation… I can let the network team go on with their day (they are Mountain Dew guys anyway)!

However, if I’m the head of the Help Desk, right about now I’m spitting out my very hot Dunkin’. I can see that while users are able to log in, they can’t do a whole lot else, which means – particularly because this is first thing in the morning – I’m going to have a whole lot of people fumbling around trying to figure out what’s going on and flooding my support staff with tickets and calls! I catch my breath and send out an email letting my user community know that Slack is having an outage and that they should seek other means of communication until further notice.

I would have been able to do all of this in a matter of minutes, which makes a difference because it took Slack approximately 16 minutes to inform the world about what Catchpoint users would have already known:

Image of a Slack notification error taking approximately 16 minutes to inform users — Incident status report (Slack)

Can your organization afford to lose 16 minutes of productivity per employee, and to have your employees opening tickets or calling the help desk over situations beyond your organization’s control? If you’re like most enterprises and the answer is no, and you’re not already invested in an industry-leading observability solution like Catchpoint, the good news is that it’s not too late. Don’t wait until the next outage to start on your observability journey.
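To put a rough number on that question, the arithmetic is simply headcount times minutes lost. The sketch below is a back-of-the-envelope upper bound: it assumes every employee is blocked for the full 16 minutes, and `lost_person_hours` is a hypothetical helper, not a figure from Slack or Catchpoint.

```python
def lost_person_hours(employees: int, outage_minutes: float = 16.0) -> float:
    """Upper-bound estimate of person-hours lost if everyone is blocked for the whole window."""
    if employees < 0 or outage_minutes < 0:
        raise ValueError("employees and outage_minutes must be non-negative")
    return employees * outage_minutes / 60.0
```

For an illustrative company of 1,500 people, `lost_person_hours(1500)` comes to 400 person-hours, before counting any help desk tickets.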

Catchpoint is ready to partner with you, so contact us and speak to one of our experts today.

