On July 8, 2022, the Canadian telecom company Rogers Communications suffered a major outage that affected most of Canada for almost two days. This wasn’t completely unprecedented (an outage in 2021 had taken down their wireless services for several hours), but the breadth and severity of this one is going to end up costing them far more than it seems at first glance.
Two weeks after the outage, Rogers submitted a document to the CRTC (the Canadian Radio-television and Telecommunications Commission, Canada’s telecommunications regulator) detailing the error and how it happened. In brief: someone screwed up a code update and took down their core network at about 4 AM on Friday, July 8. Their wired and wireless networks are combined on that core, so internet and mobile service to most of Canada went down and stayed down for about a day and a half. Rogers wasn’t able to start restoring service for at least 15 hours, and some services and customers weren’t back online for up to four days after the incident.
This disrupted service not only to private customers, but also to businesses and parts of the 911 emergency system. Several smaller carriers also relied upon the Rogers backbone for all or part of their traffic, so they were affected as well. The results emphasized not only how reliant upon the internet modern life has become, but also how dangerous it can be to rely upon only one provider – especially one that’s prone to outages. Many Canadians were still unhappy about the 2021 outage. Reportedly, during this outage, staff at Rogers’ retail outlets were quietly asked to remove the window signs proudly proclaiming them “Canada’s Most Reliable 5G Network!”
A Plethora of Impacts
Most Canadians use the Interac debit card system for retail payments and those stopped when Interac went offline. This left most of the population suddenly scrambling for cash – an issue when most ATMs also relied upon that same Interac network.
Rogers-reliant services like Fido, Chatr and TekSavvy went fully or partially down, disrupting mobile and fixed internet for millions of subscribers.
Emergency services, including both 911 and other lines, were intermittently unavailable to many Canadians, especially via mobile phones. The problem was exacerbated by uncertainty over who was affected: most police and fire departments said their 911 call centers were fully operational, but no one could tell whether they were actually reachable until they tried to call. Many also reported issues with their non-emergency lines.
Government agencies including the Canada Revenue Agency (Canada’s equivalent of the IRS) and (ironically) the CRTC reported their phone lines were down and multi-factor authentication wasn’t working. The Canada Border Services Agency reported that due to the outage, the ArriveCAN app that allows border crossers to show proof of COVID-19 vaccination wasn’t working for some travelers, meaning they’d be denied entry if they couldn’t provide paper copies. Even the courts were disrupted, as they were unable to let anyone testify via Zoom.
Major events, like a concert by Toronto native The Weeknd, were canceled. Others like the Calgary Stampede were forced to go cash-only as point-of-sale systems were impacted. Major venues across the country advised patrons to print out their tickets and arrive early as delays were expected.
An Incalculable Cost
The big question is what this all means in terms of costs. The easiest ones to calculate are the costs directly to Rogers itself. After promising a rebate of 5 days of service charges (from an earlier promise of 2 days), Rogers is on the hook for that lost revenue from almost 11 million subscribers or around $170 million (all figures in Canadian dollars). However, the fact that this works out to approximately $3.80 per subscriber is outraging their customers and Rogers is facing a class-action suit demanding $400 per subscriber. That could be over $4 billion right there.
Rogers has also promised to build out their infrastructure to separate “its wireless and wireline networks through adjustments to its core network.” This could take up to two years at a cost of up to $500 million, according to financial experts. There’s also the cost of lost subscribers, but that’s probably the least of their concerns, thanks to Canada’s oligopoly that’s ensured 90% of telecom traffic goes through just three providers: Rogers, Bell and Telus. Most Canadians have access to only one or two of those providers, meaning competition isn’t really a factor; if you don’t like Rogers, where are you going to go? Most Canadians don’t have a backup plan for themselves or their businesses because there simply isn’t an alternative, meaning Rogers doesn’t care because Rogers doesn’t have to. But that may be about to change, as outlined below.
The impact on Canada’s economy is a lot harder to estimate. The Weeknd’s canceled concert meant refunding tickets to everyone at the 50K+-seat Rogers Centre in Toronto (yes, Rogers owns it as well), but that’s just one of the events canceled across the country. Loss of sales not only from retailers but also from workers in internet-dependent gig-economy jobs could be anywhere from tens to hundreds of thousands of dollars. An unknown number of remote workers simply couldn’t work during the outage, which represents an unknowable loss in productivity and wasted salaries. There’s literally no way to know what the outage cost the country overall.
But Rogers may end up paying for that economic hit anyway. Their planned $26 billion takeover of Shaw Communications was already facing resistance from consumers and regulators, and it’s now both highly public and highly endangered. The public is NOT happy, and politicians are threatening to derail the merger, which would waste all the time and money put into it so far. Worse, this outage has thrown the telecom oligopoly into sharp relief and many Canadians are furious about Rogers’ near-monopoly. In the weeks since the outage, multiple editorials in leading newspapers have called for the nationalization of Canada’s telecom industry or the creation of a new crown corporation to provide reliable internet across the country. Any politician calling for this is guaranteed support from angry Canadians and Rogers should be terrified: if this actually happens, their future will either be competition with the federal government or being legislated out of existence. At this point, Rogers is in deep trouble no matter what.
Are You the Next Rogers?
Every other telco, internet provider or large organization with a network should be sweating while looking at all of this. Even if you’re not a major telco responsible for connecting nearly everyone in a country of roughly 38 million people, a major outage like this represents a nightmarish cost. What kind of bill will you run up if you go down for two to four days?
Fortunately for everyone (except Rogers), all of this is pretty easily avoidable. While you can never prevent human error, you can easily implement a lot of procedures to minimize it and – much more importantly – deal with the resulting issues quickly. Here are a few tips IT professionals should consider to avoid sharing Rogers’ fate:
Test before going live – yes, it’s common sense and yes, Rogers says they did that. They obviously didn’t do it properly though, did they? Not all testing procedures are equal and not all testing procedures are effective. Ideally you want to have a system or solution that will let you thoroughly test exactly how network changes are going to impact your end users before you put it into production. You then also need to be able to run that same test immediately after going live to ensure that it worked the way you thought.
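As a rough illustration of running the same test before and after a change, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names `tcp_check`, `run_suite`, `regressions` and `guarded_change` are hypothetical, and the `apply_change` and `roll_back` callables stand in for whatever deployment tooling you actually use. It is not a description of how Rogers or any particular product works.

```python
import socket
from typing import Callable, Dict, List

Check = Callable[[], bool]  # a check returns True when the service looks healthy


def tcp_check(host: str, port: int, timeout: float = 3.0) -> Check:
    """Build a check that passes if a TCP connection to host:port opens."""
    def run() -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False
    return run


def run_suite(checks: Dict[str, Check]) -> Dict[str, bool]:
    """Run every named check once and record pass/fail."""
    return {name: check() for name, check in checks.items()}


def regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Names of checks that passed before the change but fail after it."""
    return sorted(
        name for name, ok in before.items()
        if ok and not after.get(name, False)
    )


def guarded_change(
    checks: Dict[str, Check],
    apply_change: Callable[[], None],
    roll_back: Callable[[], None],
) -> List[str]:
    """Run the suite, apply the change, rerun the same suite, roll back on regressions."""
    before = run_suite(checks)
    apply_change()
    after = run_suite(checks)
    broken = regressions(before, after)
    if broken:
        roll_back()
    return broken
```

In practice the checks would be built from real user-facing endpoints (for example `tcp_check("www.example.com", 443)`, a placeholder host), and an empty list returned from `guarded_change` would mean nothing that worked before the change stopped working after it.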
Monitor from the outside in – just because it all looks fine on your side doesn’t mean your users can actually reach you. Remember all those 911 call centers that couldn’t tell which citizens could and couldn’t get through to them? By testing your network from your users’ perspective (instead of your own) you can better understand which users, in which regions and on which equipment, are having problems. Which means you can then…
Diagnose & resolve problems fast – it took Rogers about 2 hours to realize there was a problem and almost 15 hours to start resolving it. You need to be alerted immediately with detailed information about what’s wrong and where exactly the problem is occurring – ideally before your end users even notice. If Rogers had resolved their bad update with only an hour’s downtime, their subscribers wouldn’t be howling for blood. If they had resolved it with no downtime, we wouldn’t be discussing this at all and that merger with Shaw Communications would be right on track.
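The two tips above (probe from the user’s side, then alert quickly and say where the problem is) can be sketched together in a few lines of Python. This is only an illustrative sketch under stated assumptions: the `ProbeResult` and `OutsideInMonitor` names, the region labels and the "three consecutive failures" threshold are all hypothetical, and the probe results would come from agents running outside your own network rather than being hard-coded.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProbeResult:
    region: str  # where the probe ran, e.g. a city or access network (hypothetical label)
    ok: bool     # True if the user-facing check succeeded from that location


class OutsideInMonitor:
    """Track recent probe results per region and flag sustained failures."""

    def __init__(self, failures_to_alert: int = 3) -> None:
        self.failures_to_alert = failures_to_alert
        # Keep only the most recent N results for each region.
        self._recent = defaultdict(lambda: deque(maxlen=failures_to_alert))

    def _is_failing(self, window: deque) -> bool:
        # A region is failing once its window is full and every result in it failed.
        return len(window) == self.failures_to_alert and not any(window)

    def record(self, result: ProbeResult) -> bool:
        """Store a result; return True if this region should be alerting now."""
        window = self._recent[result.region]
        window.append(result.ok)
        return self._is_failing(window)

    def failing_regions(self) -> List[str]:
        """Regions whose last N probes all failed, sorted by name."""
        return sorted(
            region for region, window in self._recent.items()
            if self._is_failing(window)
        )
```

Requiring several consecutive failures is one simple way to avoid alerting on a single dropped probe; the trade-off is a slightly later alert, so the threshold and the probe interval need to be chosen together.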
Understand your redundancies and how they work – Rogers thought they had built redundancy into their network:
"We don't understand how the different levels of redundancy that we build across the network coast to coast have not worked," said Kye Prigg, Rogers' senior vice-president of access networks and operations, on CBC's Power & Politics.
You need to understand how redundant your systems are before they are tested under fire. That means having a solution that can tell you where you’re covered and where you’re exposed so you can shore up weak spots (or at least know where to start looking when things go wrong). Note that this isn’t just something Rogers needed to do to prevent this outage, it’s something all of their downstream partners needed to do as well. Your third parties may not be as reliable as you think and you should have a plan in place to mitigate the impact if they have outages.
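One simple way to find where you’re exposed is to model your dependencies as a graph and ask which single component, if removed, cuts users off from the service. The Python sketch below does that by brute force. The topology in the usage note is a made-up example that loosely mirrors the "wired and wireless on one core" situation described earlier; it is not Rogers’ actual architecture, and the function names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Sequence


def reachable(
    graph: Dict[str, Sequence[str]],
    start: str,
    goal: str,
    removed: Optional[str] = None,
) -> bool:
    """Breadth-first search from start to goal, pretending `removed` is down."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(
    graph: Dict[str, Sequence[str]], start: str, goal: str
) -> List[str]:
    """Components whose individual failure disconnects start from goal."""
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs}
    return sorted(
        node for node in nodes - {start, goal}
        if not reachable(graph, start, goal, removed=node)
    )


# Hypothetical topology: two access networks that both depend on one core.
shared_core = {
    "customer": ["wireless_access", "wireline_access"],
    "wireless_access": ["core"],
    "wireline_access": ["core"],
    "core": ["internet"],
}

# Same access networks, but each with its own core.
split_core = {
    "customer": ["wireless_access", "wireline_access"],
    "wireless_access": ["wireless_core"],
    "wireline_access": ["wireline_core"],
    "wireless_core": ["internet"],
    "wireline_core": ["internet"],
}
```

Calling `single_points_of_failure(shared_core, "customer", "internet")` flags `core`, while the same call on `split_core` returns an empty list. The same approach works for third-party dependencies: add your upstream providers as nodes and see which ones show up.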
Rogers ran into a perfect storm of problems that seriously hampered their ability to resolve their network outage and made the fallout particularly severe (and expensive). But they have provided an object lesson in what can go wrong when the stars align. Again, you can’t eliminate human error, but you CAN mitigate it with a solution like Catchpoint. By monitoring the digital experience of your users, testing and measuring the impact of network changes from their perspective and learning immediately when they’re having issues, you can stop outages before they start.
We’ll be happy to explain exactly how Catchpoint can help your organization if you’d like to try it yourself.
You should also check out the Internet Outage Prevention Checklist our engineers put together to help you develop or refine your outage prevention and mitigation strategy.