Network outages are both common and expensive – usually far more expensive than people realize. Yes, the network is down and the organization is losing money, but do you really appreciate how much money? How much an outage actually costs on a per-minute basis? It’s not only more than most people think – it’s also something that can be mitigated fairly easily.
Let’s consider the recent FAA outage. In the early morning of January 11, 2023, the FAA’s NOTAM system went down. NOTAM isn’t one of the absolutely critical FAA systems like Air Traffic Control, but its failure still brought air traffic to a near standstill. In an example of just how interconnected (and interdependent) modern networks are, the outage prevented almost all planes in the USA from taking off, because NOTAM provides safety information that pilots are trained to review as part of the departure process. It may not be a mission-critical system, but it has a significant impact on flight safety and needs to be available. Worse, it’s nationwide rather than regional like Air Traffic Control, so a failure isn’t limited to one region – it impacts the entire country at once.
The outage was resolved fairly quickly (in about 90 minutes) but still caused chaos across the nation’s airports. Reports indicate that about 7,000 flights were delayed and 1,100 canceled, severely impacting the 1.7 million people who travel by air on any given day. The direct cost of those canceled and delayed flights can probably be determined, but the economic cost of the knock-on effects of disrupted travel is effectively incalculable. It’s a safe bet that this 90-minute outage cost hundreds of millions of dollars – which works out to well over a million dollars a minute.
The average cost of an outage
That may seem high, but it’s not as unusual as you might think. An oft-quoted (and now dated) 2014 Gartner analysis put the average cost of an outage at $5,600/minute – roughly $6,700 after adjusting for inflation. The same report points out that large enterprises are more likely to see costs closer to $9,000/minute, or about $11,000 in today’s dollars. And again, none of this accounts for the long-tail effects of lost productivity, lost reputation, etc. Now bear in mind that, per Dun & Bradstreet, 59% of Fortune 500 companies experience a minimum of 1.6 hours of downtime every week. That works out to an average cost of $643,200 to $1,056,000 per week.
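The weekly figures above follow directly from the per-minute rates. A quick back-of-the-envelope check, using the inflation-adjusted Gartner estimates quoted here (illustrative averages, not a forecast for any specific organization):

```python
# Inflation-adjusted Gartner estimates quoted above, in USD per minute.
COST_PER_MINUTE = (6_700, 11_000)          # low and high estimates
WEEKLY_DOWNTIME_MINUTES = 96               # 1.6 hours/week (Dun & Bradstreet)

low, high = (rate * WEEKLY_DOWNTIME_MINUTES for rate in COST_PER_MINUTE)
print(f"Weekly cost of downtime: ${low:,} to ${high:,}")
# → Weekly cost of downtime: $643,200 to $1,056,000
```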
So, what does this all mean? Very simply:
When you’re losing tens of thousands of dollars each minute, fixing outages is important – but fixing outages fast is essential. And being able to proactively prevent them is the Holy Grail not just for IT, but for the organization’s bottom line. The Internet is fragile, complex and interconnected. Our systems, networks, and applications need to be resilient and able to rebound quickly in the face of an outage – because outages, let’s face it, are a fact of life. The way IT teams deal with them, however, can result in savings of literally millions of dollars.
The MTTR calculation
The process of fixing a network issue once it’s found will vary depending on the network and the problem. And, frankly, there’s not a lot of time to be saved there – it’s going to take as long as it takes. There may be some best practices for resolving issues, but if you have to restart an application or reconfigure a routing table, it’s generally not a process that can be accelerated. But remember that one of the biggest parts of the MTTR calculation is making the IT team aware that there’s a problem so they can diagnose it in the first place. These Alerting and Diagnosis steps can and should be accelerated as much as possible.
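The point above can be made concrete: MTTR is essentially the sum of detection (alerting), diagnosis, and repair time. Even if repair time is fixed, compressing the first two phases translates directly into dollars. The phase durations and the $10,000/minute rate below are illustrative assumptions, not measurements:

```python
def mttr_minutes(detect, diagnose, repair):
    """MTTR is simply the sum of the three phases, in minutes."""
    return detect + diagnose + repair

COST_PER_MINUTE = 10_000  # illustrative rate, in the range discussed above

# Hypothetical incident: repair takes 15 minutes either way, but fast
# alerting and diagnosis shrink the other two phases dramatically.
baseline = mttr_minutes(detect=20, diagnose=25, repair=15)   # 60 minutes
improved = mttr_minutes(detect=2,  diagnose=8,  repair=15)   # 25 minutes

savings = (baseline - improved) * COST_PER_MINUTE
print(f"Cutting alerting and diagnosis saves ${savings:,} per incident")
# → Cutting alerting and diagnosis saves $350,000 per incident
```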
Four ways to make alerting fast and effective
Alerting is all about conveying the fact that something’s wrong to the people who can do something about it. This phase is sometimes measured as Mean Time To Identification (MTTI), and it’s all about understanding what the problem is and which team is best suited to address it. That’s why some groups read the acronym as Mean Time To Innocence instead – not because they’re trying to avoid blame, but because if it’s an issue they literally AREN’T responsible for or able to resolve, the best thing they can do is quickly get the alert to the team that IS. This is often where the majority of those precious minutes are wasted. There are a number of things you can do to ensure that alerting is fast and effective:
1. Set thresholds that see problems coming before they impact service. If traffic is slowing down, don’t wait until it grinds to a halt before alerting IT. The earlier you can show them what’s starting to go wrong, the more likely they’re going to be to fix it before it causes an outage. And that will save ALL the money!
2. Send alerts to the right people, i.e., those who can fix things. The nature of the outage will determine who gets alerted, but it’s absolutely essential that the right people are notified. If there’s one team that does nothing but the website, alert them – and not everyone in IT – for website issues! Alerting the wrong people causes delays while they try to find the right people to pass the alert to – assuming they even have the expertise to understand the alert in the first place. Many an outage has escalated unnecessarily because someone ignored an alert that “didn’t seem important at first.” But don’t make the mistake of sending the alert to only one individual (you never know who might be in the bathroom or taking a smoke break). There should be a team of people available to respond to urgent alerts immediately.
3. Send those alerts via the medium(s) the fixers are using RIGHT NOW. That means the things they’re likely looking at while working – Slack, Teams, a ticketing system, whatever. If they’re not currently working, send a text, an automated phone call, a page (yes, pagers are still in use) or some other form of communication that will be treated as urgent. Email does NOT cut it in an emergency! This usually requires an Internet performance monitoring or observability solution that can integrate with a wide range of applications for notification, collaboration, configuration and analytics. Use the tools your people already know and use whenever possible.
4. Send alerts with ACTIONABLE data. Make it clear what the impending issue is as well as its potential severity. The recipient needs to be able to understand exactly what’s going wrong and what they’re expected to do about it. You do NOT want your people puzzling over an unclear alert at $10K/minute.
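The four practices above can be sketched as a single alert-building step: warn before the critical threshold is crossed, route to the owning team, and include enough context to act on. Everything here – the thresholds, service name, channel, and runbook URL – is a hypothetical illustration, not a real endpoint or integration:

```python
import json

# Illustrative thresholds: warn well before the 2,000 ms critical limit
# is breached, per best practice #1 above.
WARN_THRESHOLD_MS = 800
CRITICAL_THRESHOLD_MS = 2_000

def build_alert(service, latency_ms, owner_channel):
    """Build an actionable alert: what is wrong, how bad it is, and who
    should act. Returns None when there is nothing worth alerting on."""
    if latency_ms < WARN_THRESHOLD_MS:
        return None
    severity = "critical" if latency_ms >= CRITICAL_THRESHOLD_MS else "warning"
    return {
        "severity": severity,
        "summary": (f"{service} latency at {latency_ms} ms "
                    f"(warn >= {WARN_THRESHOLD_MS}, "
                    f"critical >= {CRITICAL_THRESHOLD_MS})"),
        "route_to": owner_channel,  # the team that owns the fix, not all of IT
        "runbook": "https://wiki.example.com/runbooks/latency",  # placeholder
    }

# Latency is climbing but service hasn't failed yet: fire a warning early.
alert = build_alert("checkout-api", 1_150, "#web-oncall")
print(json.dumps(alert, indent=2))
```

A payload like this would then be handed to whatever notification integration the team actually watches (Slack, Teams, a ticketing system), per practice #3.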
Five ways to accelerate diagnosis
Of course, alerting is just the first part of the resolution process. It is even more important to accelerate diagnosis: determining where the problem is and what actions need to be taken to fix it. Obviously, this can vary wildly depending on the network, but it’s also the step that can be most effectively shortened:
1. Use an Internet Performance Monitoring solution that covers the entire Internet Stack. Too many solutions cover only the network (NPM) or applications (APM), severely limiting your visibility. Often, however – and particularly when it comes to getting ahead of outages – the problem can’t be diagnosed without understanding the entire path your users take through the Internet. After all, the Internet is almost as critical a utility today as electricity. It’s not enough to monitor your app stacks with APM; you need IPM to monitor how your users reach those apps via the Internet. Moreover, switching between multiple solutions is well known to inflate MTTR and slow down the process of understanding what’s going wrong.
2. Use a solution that can show all the required information in one interface – preferably in one dashboard. Humans lose focus easily, no matter how well trained. Switching between interfaces can slow down diagnosis almost as badly as switching between applications. If you’re trying to troubleshoot a traffic lag that’s threatening to turn into an outage, you need to be able to see everything impacting that traffic in one intelligent (and preferably customizable) dashboard. This can make a huge difference when it comes to pinpointing the source of the problem.
3. USE REAL-TIME DATA! Not all solutions provide real-time data, and that lack can be incredibly expensive at $10K/minute. Even a 2-minute delay can drag out the troubleshooting process and obscure details you need to see. While sampling rates are often customizable, the most crucial details may still be limited by the monitoring application. For example, some well-known observability solutions provide data from BGP sources only every 15 minutes. That 15-minute-old BGP data is going to cost you when every minute counts! And it means not only that your data is 15 minutes old, but that you have to wait another 15 minutes to see whether your changes had an impact. Real-time BGP peer data (or even just 5-minute-delayed data) can save you tens of thousands of dollars.
4. Use granular data from as many sources as possible. As mentioned above, fast diagnosis requires visibility into as much of your network as possible and these days your network is the entire Internet. Just because a system is accessible from New York doesn’t mean it’s not down for Dubai – and you’re not going to be able to figure that out unless you have visibility into the user experience in that region. Furthermore, the data gathered needs to be as granular as possible. The devil is usually in the details and data points five minutes apart are usually not detailed enough for accurate troubleshooting. Finally, that data shouldn’t exist in a vacuum. At any time, you should be able to compare your current data with data from the past week or month or year or several years. You need context to pick out anomalies or trends and your past data needs to be instantly available for analysis. Having to hunt for archived data or wait for refreshed data from your vantage points is only going to slow you down and cost you more money.
5. War rooms should be a last resort. Gathering a group of people together is not only inefficient but also expensive in a number of ways. It takes time to notify everyone, time to get them all into a meeting, time to explain the problem – and every minute of that time is a minute the outage continues to cost you thousands of dollars. There’s also the cost of the time taken out of these people’s day: IT salaries usually aren’t cheap, and war rooms can be very expensive meetings in terms of lost productivity. If there’s any way at all to start diagnosing the problem WITHOUT getting a bunch of people together, attempt it.
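Point 3 above is worth quantifying. In the worst case, a problem starts just after a sample is taken, so you wait one full interval to see it and another full interval to confirm your fix worked. A rough sketch of that avoidable cost, using the article’s illustrative $10,000/minute rate:

```python
def stale_data_penalty(sample_interval_min, cost_per_minute):
    """Worst-case avoidable cost of stale monitoring data: one full
    interval to see the problem, plus one more to confirm the fix."""
    return 2 * sample_interval_min * cost_per_minute

COST_PER_MINUTE = 10_000  # illustrative rate from the discussion above

for interval in (15, 5, 1):
    penalty = stale_data_penalty(interval, COST_PER_MINUTE)
    print(f"{interval:>2}-minute data: up to ${penalty:,} "
          "of avoidable cost per incident")
```

With 15-minute BGP data the worst-case penalty is $300,000 per incident under these assumptions; real-time data drives it toward zero.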
Reduce time to minimize costs
If you follow the above best practices, you’ll be optimizing your response to those inevitable outages: the right people will have the right information they need to find the problem and diagnose it – FAST. MTTR will be reduced along with the duration of the outage. In fact, you may be able to eliminate the outage entirely. And, most importantly, the reduction in time spent means the overall cost is minimized. Make no mistake – the goal here isn’t to be so fast you make errors! The goal is to be cognizant of the value of each minute spent trying to resolve the outage and to make your response as efficient as possible.
Note that not only will the above approach help improve your MTTR, it also puts you in an excellent position to prove the value of your IT team. By tracking your MTTR in dollars instead of minutes, you can easily show your executive team how much you’ve saved them. A report showing that IT fixed five outages this month is of academic interest to most C-levels. But showing how your fast response saved the company millions of dollars each month is guaranteed to make them pay attention!
Learn more about how we can help
Catchpoint can help you mitigate the cost of outages with all the capabilities described above. Not only do we have the world’s largest global observability network, we also have the platform to make troubleshooting fast and easy – all while integrating seamlessly into the applications you already use. Give Catchpoint a try for free or get in touch with us for details.