One of the biggest nightmares for any service provider is to find themselves in SLA (service level agreement) hell due to poor performance.
An issue that negatively impacts end users’ experience will inevitably affect a company’s business metrics, and when that happens, the company is going to look for someone to blame and, more importantly, someone to compensate it for the lost revenue.
The reasoning behind having comprehensive SLAs in place is not a difficult concept to grasp. Protecting one’s brand image and revenue stream(s) is obviously of paramount importance. Yet as the landscape of digital architecture grows more and more complex, companies are forced to outsource more functionality to third-party vendors, which in turn creates additional places where performance can go bad.
An SLA is designed to mitigate the risk of that outsourcing by using objective SLA monitoring, grading, and governance to hold vendors financially accountable for any performance degradations that affect end users. According to the 2017 State of SaaS report conducted by Tech Target, over 25 percent of respondents acknowledged that they had incurred financial penalties for failing to meet their SLAs, with the average amount in penalties rising above $350K.
With that much money on the line, the simple truth is that vendors cannot afford to be the cause of their customers’ poor performance.
To make matters worse, more than 10 percent of the respondents admitted that service disruptions led to the loss of a customer, illustrating how much poor performance can erode the trust that’s necessary for a customer-vendor relationship.
No business can afford to allow their brand to be harmed by poor customer experiences, so having strict SLAs in place along with diligent SLA monitoring practices becomes an absolute necessity.
The latter part of that strategy – diligent SLA monitoring practices – is dependent upon having a powerful synthetic monitoring solution in place that can replicate the end user experience while measuring from both backbone and last mile locations. The backbone tests, which eliminate noise that is out of the vendor’s control (e.g. local ISP or user hardware issues), are the most valuable for SLA monitoring and validation, while last mile and real user measurements provide additional context by showing the actual end-user experience.
A two-pronged approach to monitoring
Meanwhile, SaaS vendors themselves must also have end user experience monitoring strategies in place, with a two-pronged approach: one prong is to ensure the health of their digital supply chain, and the other is to validate their SLA compliance by proving that they are not the cause of any disruptions in their clients’ customer experiences. These two complementary goals ultimately serve the underlying purpose of SLA monitoring: minimizing the penalties that a vendor must pay its customers.
This is the approach taken by Zscaler, the world’s largest cloud security platform, which helps some of the biggest companies and government agencies around the world securely transform their networks and applications.
Given their service offering, Zscaler’s security applications obviously must be placed in the path between the end users and whatever application they’re using (e.g. video conferencing software, banking software, etc.). This means that should Zscaler’s own digital supply chain suffer a service disruption, it will likely cause a negative digital experience for the end user as well.
The need for synthetic SLA monitoring
The prevalence of both first- and third-party services within everyone’s digital supply chain emphasizes the need for a complete outside-in view of the end user experience. Viewing solely from within one’s own network is incomplete, and relying only on real user monitoring will still leave gaps in visibility when trying to determine the root cause of an issue (i.e. who ultimately bears responsibility for the disruption).
By being able to synthetically test every step of the digital supply chain, a SaaS vendor such as Zscaler is able to spot potential performance degradations before they have an impact on the end user experience, and then drill down into the analytics to pinpoint the root cause of the issue and troubleshoot a solution. This aspect of SLA monitoring is crucial, as it allows Zscaler to head off any problems before they trigger an SLA breach. After all, the best way to avoid paying penalties on your performance is to always have great performance.
There are a number of different ways that Zscaler obtains the real-time, actionable insights that allow them to detect and fix issues as quickly as possible. One crucial aspect is testing from as close as possible to the physical location of the end user(s).
Many performance degradations are localized in specific geographies due to problems with single servers or datacenters, or peering issues with local networks and ISPs. When that’s the case, a performance test run from a different country or on a different ISP isn’t going to give you data that you can act on. So, a testing infrastructure that provides a wide array of locations, ISPs, and cloud networks is vital to ensuring the end user experience.
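To make the idea of localized degradations concrete, here is a minimal Python sketch that groups synthetic test results by country and ISP and flags the groups whose median response time is too slow. The record fields (`country`, `isp`, `response_ms`), the function name, and the 500 ms threshold are all assumptions made for illustration; they are not taken from Zscaler or from any particular monitoring product.

```python
from collections import defaultdict
from statistics import median


def localized_degradations(samples, threshold_ms, min_samples=3):
    """Return {(country, isp): median_ms} for groups slower than threshold_ms.

    Groups with fewer than min_samples results are skipped, because a
    single slow test is not enough evidence of a localized problem.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["country"], sample["isp"])].append(sample["response_ms"])

    flagged = {}
    for key, values in groups.items():
        if len(values) < min_samples:
            continue
        group_median = median(values)
        if group_median > threshold_ms:
            flagged[key] = group_median
    return flagged


# Hypothetical results from synthetic tests run in three locations.
samples = [
    {"country": "DE", "isp": "ISP-A", "response_ms": 180},
    {"country": "DE", "isp": "ISP-A", "response_ms": 200},
    {"country": "DE", "isp": "ISP-A", "response_ms": 190},
    {"country": "FR", "isp": "ISP-B", "response_ms": 900},
    {"country": "FR", "isp": "ISP-B", "response_ms": 950},
    {"country": "FR", "isp": "ISP-B", "response_ms": 1000},
    {"country": "US", "isp": "ISP-C", "response_ms": 210},
    {"country": "US", "isp": "ISP-C", "response_ms": 2200},
]

flagged = localized_degradations(samples, threshold_ms=500)
print(flagged)
```

In this made-up data only the FR / ISP-B group is flagged. The US / ISP-C group contains one very slow result but has too few samples to be judged, which is one reason broad and frequent test coverage matters.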
Another important aspect of diagnosing and fixing performance issues is having access to a wide range of test types and metrics. Once a performance alert goes off, an IT Ops engineer or SRE must drill deeper into the data to pinpoint the root cause, often by running different test types depending on the nature of the issue. For example, when an API fails, an API-specific test is in order, whereas pinpointing a network peering issue requires a traceroute test.
However, effective SLA monitoring is about more than just ensuring that your own services are performing up to standards – it’s also about proving that you’re not responsible for other people’s failures.
SLA monitoring through validation
Anyone who grew up with at least one sibling knows the value of passing the buck when something breaks. You know your little brother was the one who broke that lamp, but of course he doesn’t want to be punished, so he’s going to go out of his way to push the blame onto you. And unless you can prove it, it’s your word against his.
The same principle applies to business and digital performance, albeit with consequences much more severe than an early bedtime. When a company suffers a performance issue that results in loss of revenue and/or brand prestige, they’re naturally going to look for the culprit that’s responsible and tie it to an SLA breach in order to recoup some of that money. They’re going to be armed with data in these attempts, so vendors must be equally well armed through their own SLA monitoring efforts. The name of the game, as it was when you were a kid, is to prove that it wasn’t your fault.
Once again, the answer lies with deployment of a thorough synthetic monitoring solution that can clearly and definitively articulate the root cause(s) of any performance problems during the post-mortem analysis.
When a vendor such as Zscaler is tasked with proving that they were not the source of a performance problem, one of the most important aspects is to be able to do so through data and charts that are easy to share and understand. Remember that these analyses and the business decisions that result are often being performed by people who don’t have the technical proficiency of a network engineer or SRE, so clear and obvious visual evidence is crucial.
Another helpful tactic for SLA monitoring is the ability to isolate first- and third-party content, and to identify exactly who is responsible for the performance of each of those third parties. For example, if a social sharing tag causes excessive delays in the loading of a website page, your synthetic monitoring solution should be able to pinpoint exactly what the tag is, who hosts it, and how much of a delay it caused.
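As a rough illustration of separating first- and third-party content, the Python sketch below totals request time per host and labels each host by whether it belongs to the site's own domain. The request records, the `example.com` and `example.net` hostnames, and the function name are hypothetical. Note also that summing request durations overstates the real page delay when requests load in parallel, so treat the totals as a ranking aid rather than a precise measurement.

```python
from collections import defaultdict
from urllib.parse import urlparse


def time_by_host(requests, first_party_domain):
    """Return (host, party, total_ms) rows, slowest host first."""
    totals = defaultdict(int)
    for request in requests:
        host = urlparse(request["url"]).hostname or "unknown"
        totals[host] += request["duration_ms"]

    report = []
    for host, total_ms in totals.items():
        # A host is first-party if it is the domain itself or a subdomain of it.
        is_first_party = (
            host == first_party_domain
            or host.endswith("." + first_party_domain)
        )
        party = "first-party" if is_first_party else "third-party"
        report.append((host, party, total_ms))
    return sorted(report, key=lambda row: row[2], reverse=True)


# Hypothetical requests captured during one synthetic page-load test.
page_requests = [
    {"url": "https://www.example.com/index.html", "duration_ms": 120},
    {"url": "https://static.example.com/app.js", "duration_ms": 80},
    {"url": "https://widgets.social-share.example.net/tag.js", "duration_ms": 1400},
    {"url": "https://www.example.com/style.css", "duration_ms": 40},
]

report = time_by_host(page_requests, "example.com")
for host, party, total_ms in report:
    print(f"{host:40} {party:12} {total_ms} ms")
```

Here the made-up social sharing tag sits at the top of the report as a third-party host, which is the kind of evidence a vendor would want to hand over during a post-mortem.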
Finally, the ability to filter out extraneous noise through synthetic tests is vital to ensure accurate SLA monitoring. The simple fact is that some performance degradations are out of our hands; they can be caused by a weak home WiFi network, a damaged ISP cable, or something as simple as inclement weather that disrupts a mobile network. Here again, we see the importance of a synthetic “clean-room environment” that just looks at the customer-critical elements in the digital supply chain.
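The "clean-room" idea can be sketched in a few lines of Python: compute availability only from backbone checks, so that last mile failures outside the vendor's control do not count against the SLA. The `vantage` and `ok` fields, the sample data, and the 99.9 percent target are assumptions chosen for illustration, not values from any real contract or tool.

```python
def availability(checks, vantage="backbone"):
    """Return the fraction of successful checks from one vantage type.

    Returns None when there are no checks for that vantage, so that
    "no data" is never mistaken for "100 percent available".
    """
    relevant = [check for check in checks if check["vantage"] == vantage]
    if not relevant:
        return None
    successes = sum(1 for check in relevant if check["ok"])
    return successes / len(relevant)


# Hypothetical check results: every backbone check passed, while one
# last mile check failed (for example, because of weak home WiFi).
checks = [
    {"vantage": "backbone", "ok": True},
    {"vantage": "backbone", "ok": True},
    {"vantage": "backbone", "ok": True},
    {"vantage": "backbone", "ok": True},
    {"vantage": "last_mile", "ok": True},
    {"vantage": "last_mile", "ok": False},
    {"vantage": "last_mile", "ok": True},
]

SLA_TARGET = 0.999  # assumed target for this example

backbone = availability(checks)
last_mile = availability(checks, vantage="last_mile")
sla_met = backbone is not None and backbone >= SLA_TARGET
print(f"backbone={backbone:.3f} last_mile={last_mile:.3f} sla_met={sla_met}")
```

The last mile figure is still worth reporting as context for the real user experience, but in this sketch only the backbone figure is compared against the target.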
Don’t get blamed for someone else’s mistake
The ultimate goal behind any vendor’s SLA monitoring strategy is to minimize the penalties that you have to pay to your clients.
With a strong synthetic monitoring platform in place, you should be able to catch issues as soon as they arise and fix them quickly, as well as demonstrate the root cause of issues that lie beyond your control and for which you are therefore not responsible. This two-pronged approach to SLA monitoring will save your company money in both the short and long term, and protect your brand’s prestige at the same time.