5 lessons from the Google Cloud Outage

The Google Cloud outage was yet another reality check for enterprises and businesses, raising serious concerns over the reliability of the cloud and the vulnerabilities in its architecture. The incident had a huge impact on performance and SLAs, and it was a textbook example of what can go wrong when an application depends wholly on a single cloud provider.

Catchpoint detected increased page load and response times along with a dip in availability across popular websites in the e-commerce, media, and gaming industries. Multiple Google services, including G Suite, Google Compute Engine, and Google Nest, as well as third-party services hosted on Google Cloud such as Snapchat, Discord, and Shopify, suffered during the outage. The issue occurred across different ISPs in different locations.

Read the complete breakdown of the incident in our blog post here or watch the outage analysis webinar here.

Building the right cloud monitoring strategy

The complexity of the internet makes it all the more unpredictable, so incidents such as the Google Cloud outage are inevitable. These incidents provide insight into what could be done better and offer an opportunity to re-examine the existing processes and strategies within your organization, so you are well prepared the next time a sudden outage strikes. Here are some lessons to remember:

1. Do not trust blindly

No matter how popular, resourceful, or process-driven your vendor is, expect failures and lapses. If your digital services need to be 100% reachable and reliable, your architecture and strategy must support that goal. Once you have architected and built for that level of reliability, the health of your application depends on how well you track performance and on the processes you have in place to manage major incidents.

2. Avoid putting all your eggs in one basket

Deploying all your services, support tools, and collaboration/monitoring tools on a single cloud provider, or relying on a single ISP for connectivity, is a recipe for disaster. For example, if your application is hosted on a specific cloud service and your monitoring tools run on that same cloud, an outage there takes down both at once, so you will not receive alerts or be able to troubleshoot the issue.

After a few such failures, your teams might start suffering from “Fear of Missing Outages” (FOMO). Build application resiliency by ensuring that critical services, monitoring tools, and communication tools run on different platforms and share no single point of failure.
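One simple pattern that avoids a shared single point of failure is a “dead man's switch”: a small watcher that lives outside your primary cloud and alerts through an independent channel when your main monitoring stack goes quiet. The sketch below is a minimal Python illustration of the idea, not a Catchpoint feature; the heartbeat URL, alert webhook, and thresholds are hypothetical placeholders.

```python
# Hypothetical dead man's switch: run this on infrastructure *outside* the
# cloud that hosts your application and your primary monitoring stack.
import time
import urllib.request

HEARTBEAT_URL = "https://monitoring.example.com/heartbeat"    # primary monitoring stack (placeholder)
ALERT_WEBHOOK = "https://alerts.other-provider.example/hook"  # independent alert channel (placeholder)
CHECK_INTERVAL_SECONDS = 60
MAX_MISSED_BEATS = 3

def is_alive(url: str, timeout: float = 10.0) -> bool:
    """Return True if the primary monitoring stack answers its heartbeat."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def raise_alert(message: str) -> None:
    """Notify on-call through a channel that does not share the failed cloud."""
    req = urllib.request.Request(ALERT_WEBHOOK, data=message.encode("utf-8"), method="POST")
    urllib.request.urlopen(req, timeout=10.0)

missed = 0
while True:
    if is_alive(HEARTBEAT_URL):
        missed = 0
    else:
        missed += 1
        if missed >= MAX_MISSED_BEATS:
            raise_alert("Primary monitoring heartbeat missing -- possible shared outage.")
            missed = 0  # reset so on-call is not paged on every subsequent cycle
    time.sleep(CHECK_INTERVAL_SECONDS)
```

The point is not the specific script but the separation: the watcher, the alert channel, and the monitored stack do not depend on the same provider.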

3. Invest in monitoring tools

If you are relying on the vendor’s status page or Twitter to detect outages or bad user experience, then you are gambling with end-user experience and your brand image. You need to monitor actively from outside the cloud infrastructure: without dedicated synthetic or black-box monitoring, you cannot baseline performance, nor can you prepare for sudden outages.

The ideal monitoring strategy gives you end-to-end visibility, so you can track the health of your IT infrastructure and network. It should allow you to triangulate issues across the various providers your digital services rely on, such as DNS providers, CDNs, partner/vendor APIs, and cloud providers. The performance and reachability data provide useful insights that help you optimize the application, evaluate the vendors you rely on, and reduce the risk of any negative impact on end-user experience.
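As a toy illustration of what a black-box probe measures (a full synthetic monitoring platform does far more, from many managed vantage points), the sketch below times DNS resolution and a complete HTTP request from outside the cloud and records availability. The target hostname is a placeholder, not a real endpoint from the incident.

```python
# Minimal black-box probe: run from vantage points outside your cloud provider.
import socket
import time
import urllib.request

TARGET_HOST = "www.example.com"          # placeholder for your application's hostname
TARGET_URL = f"https://{TARGET_HOST}/"

def probe() -> dict:
    result = {"timestamp": time.time(), "available": False,
              "dns_ms": None, "response_ms": None}

    # Time DNS resolution separately -- DNS is a common, overlooked failure point.
    start = time.perf_counter()
    try:
        socket.getaddrinfo(TARGET_HOST, 443)
        result["dns_ms"] = (time.perf_counter() - start) * 1000
    except socket.gaierror:
        return result  # resolution failed; the service is unreachable from here

    # Time the full HTTP request to approximate what an end user experiences.
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=15) as resp:
            resp.read()
            result["response_ms"] = (time.perf_counter() - start) * 1000
            result["available"] = resp.status == 200
    except Exception:
        pass  # leave available=False; the failed sample still counts toward the baseline

    return result

if __name__ == "__main__":
    print(probe())
```

Run on a schedule from several external locations, even a probe this simple yields a baseline against which a sudden degradation stands out.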

4. Test and test again

The single biggest cause of outages is configuration changes that go haywire. More than 90% of the time, outages can be traced to code or configuration changes that were not well tested or were implemented incorrectly. So it is important to follow a stringent process when deploying code or configuration changes, including steps that validate whether the changes have the desired effect and a plan for rolling back if they do not.

Conduct robust testing of any configuration change in both QA environments and limited production environments to catch errors or performance regressions. Roll such changes out when your service carries the least business-impacting traffic, such as over the weekend or late at night, so that you minimize the effect on end-user experience.
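One way to exercise the “limited production” step is a canary gate: push the change to a small slice, verify it against explicit error and latency budgets, and only then promote it. The sketch below is a generic, hypothetical example rather than a specific deployment tool; the canary endpoint, sample count, and thresholds are assumptions you would tune for your own service.

```python
# Hypothetical canary gate: push the config change to a small slice first,
# verify it behaves within budget, and only then promote it -- otherwise roll back.
import time
import urllib.request

CANARY_URL = "https://canary.example.com/healthz"   # placeholder canary endpoint
CHECKS = 10                                         # number of validation samples
MAX_ERRORS = 0                                      # any failed sample blocks promotion
MAX_P95_MS = 800                                    # latency budget for the change

def sample_latency_ms(url: str):
    """Return response time in ms, or None if the request failed."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            resp.read()
            if resp.status != 200:
                return None
        return (time.perf_counter() - start) * 1000
    except Exception:
        return None

def canary_passes() -> bool:
    latencies, errors = [], 0
    for _ in range(CHECKS):
        latency = sample_latency_ms(CANARY_URL)
        if latency is None:
            errors += 1
        else:
            latencies.append(latency)
        time.sleep(5)
    if errors > MAX_ERRORS or not latencies:
        return False
    latencies.sort()
    p95 = latencies[min(len(latencies) - 1, int(len(latencies) * 0.95))]  # conservative p95
    return p95 <= MAX_P95_MS

if __name__ == "__main__":
    if canary_passes():
        print("Canary healthy: promote the configuration change.")
    else:
        print("Canary failed: roll the change back and investigate.")
```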

5. Monitor SLAs and hold vendors accountable

When a vendor scrambles to fix an outage, they are under pressure trying to find a resolution before SLAs are breached. You do not want to stress out vendors in the middle of an outage; they know what it means. However, ongoing communication with the vendor and alignment over what happened and how to recover is key to ensuring trust on both sides.

The service provider is expected to compensate for SLA breaches, but it is your responsibility to bring those breaches to the vendor's attention. If all your monitoring data comes from the same vantage point, you have no way to validate its veracity or determine the exact compensation owed. Monitoring from multiple vantage points (ISPs, network types, etc.) gives you unbiased performance data, so you can hold your vendor accountable for every second that impacted end-user experience.
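To put numbers on this, here is a small sketch that combines availability samples from several vantage points and compares the result against a monthly SLA target. The 99.95% target, vantage-point names, and sample counts are illustrative assumptions, not real measurements.

```python
# Hypothetical SLA check: combine availability samples from several vantage
# points so no single network path biases the measurement.
SLA_TARGET = 0.9995              # e.g., a 99.95% monthly availability commitment
MINUTES_PER_MONTH = 30 * 24 * 60

# Each vantage point reports (successful samples, total samples) for the month,
# assuming one probe per minute. These numbers are made up for illustration.
vantage_points = {
    "isp_a_new_york": (43150, 43200),
    "isp_b_london":   (43180, 43200),
    "isp_c_tokyo":    (43120, 43200),
}

ok = sum(success for success, _ in vantage_points.values())
total = sum(count for _, count in vantage_points.values())
measured_availability = ok / total

allowed_downtime_min = (1 - SLA_TARGET) * MINUTES_PER_MONTH
measured_downtime_min = (1 - measured_availability) * MINUTES_PER_MONTH

print(f"Measured availability: {measured_availability:.4%}")
print(f"Allowed downtime:  {allowed_downtime_min:.1f} minutes/month")
print(f"Measured downtime: {measured_downtime_min:.1f} minutes/month")
if measured_availability < SLA_TARGET:
    print(f"SLA breached by {measured_downtime_min - allowed_downtime_min:.1f} minutes; raise it with the vendor.")
else:
    print("Within SLA for this period.")
```

Aggregating across vantage points is what makes the conversation about service credits concrete rather than anecdotal.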

Run your own show

Digital transformation has forced us to forgo the traditional application architecture in which we had control and visibility over the critical components in the delivery chain. Today, most of these critical components are outsourced to cloud providers. This shift to the cloud results in limited control and reduced visibility, which puts end-user experience at greater risk.

So, in a cloud environment, it is the cloud provider that runs the show. You can only sit back and hope the provider upholds the SLA. But is this the best way to manage your application performance? Certainly not. The risk involved in this approach cannot be ignored – it impacts everything from your revenue, efficiency, and productivity to the brand’s reputation itself.

On average, IT teams spend 46 hours per month handling outages and performance issues. This is due to ineffective and siloed monitoring strategies that result in a higher Mean Time to Detect/Innocence (MTTD/I), which in turn inflates Mean Time to Repair (MTTR). During a crisis, the different teams within the organization, such as SRE, Ops, and IT, end up finger-pointing and are unable to maintain an acceptable MTTR.

The right service architecture, coupled with the proper monitoring strategy, lets you run your own show: you take back control and regain visibility. Implementing and maintaining the right monitoring strategy helps insulate your application from performance degradation. It hands the reins back to your IT teams and boosts the confidence of those tasked with handling performance crises whenever they strike.

To learn more about how Catchpoint can help you detect outages in seconds, sign up for a free trial.
