An advanced end-user experience monitoring strategy spans many applications and layers. The complexity of modern IT architectures, combined with increasingly demanding customer expectations, means that organizations must develop a proactive, customer-centric mindset to deliver excellent digital experiences. The way to do this is with an end-user experience monitoring solution that’s designed to provide a complete outside-in view of the customer experience and to detect and fix issues as quickly as possible.
A crucial part of that end-user experience monitoring solution is a thorough and efficient alerting strategy. Anyone who has been involved in a DevOps, IT Operations, or SRE role (essentially anyone who’s had to receive and respond to system alerts) knows that this is one of the most stressful jobs within an IT organization, as well as one of the most susceptible to operational inefficiencies.
IT organizations have traditionally tied their alerts to the most obvious metrics and pages within their service delivery. Obvious metrics like availability (is the site up or down?) and page load speed on their key pages (did the page take more than XX seconds to load?) should always have alerts tied to them so that the operations pros can respond and investigate any problems as soon as they’re detected.
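As a rough illustration of these two basic checks, here is a minimal Python sketch. The `CheckResult` type, the `basic_alerts` function, and the `max_load_seconds` parameter are hypothetical names chosen for this example; they are not part of any monitoring product’s API, and the threshold is left as a parameter for each team to set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    """One synthetic test result for a key page (hypothetical structure)."""
    page: str
    available: bool      # did the page respond successfully?
    load_seconds: float  # measured page load time in seconds


def basic_alerts(results: List[CheckResult], max_load_seconds: float) -> List[str]:
    """Return one alert message per page that is down or too slow."""
    alerts = []
    for r in results:
        if not r.available:
            # Availability alert: the page did not respond at all.
            alerts.append(f"{r.page}: DOWN")
        elif r.load_seconds > max_load_seconds:
            # Speed alert: the page responded but exceeded the limit.
            alerts.append(
                f"{r.page}: slow ({r.load_seconds:.1f}s > {max_load_seconds:.1f}s)"
            )
    return alerts
```

Calling `basic_alerts(results, max_load_seconds=...)` with a batch of results returns the list of messages to route to whoever is on call; an empty list means nothing needs attention.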
Tackling the Shortcomings of Basic Alerting
Even this most basic alerting strategy is often difficult to implement and act upon efficiently. One of the biggest challenges faced by teams charged with end-user experience monitoring is making sure that the alerts they receive are accurate and actionable. The last thing anyone wants is to be woken up in the middle of the night by an alert that turns out to be a false positive, triggered either by faulty data or by an issue with the testing agent itself.
Furthermore, even when an alert is accurate, it can take an operations team a long time (sometimes several hours) to confirm its veracity and find the root cause before they can actually fix the problem.
There are a few ways to overcome these obstacles, but the most efficient method is by deploying an end-user experience monitoring solution that’s purpose-built with IT professionals in mind, and designed to give them accurate, actionable data that allows them to focus their time and resources on the most important issues.
In the case of Catchpoint’s integrated platform, this is done through a two-pronged approach to monitoring and alerting:
- Stateless Nodes
Catchpoint’s monitoring agents, which are strategically deployed in the geographies and on the ISPs that matter most to customers, are stateless. This means that if there’s an issue with the testing environment, the test will neither run nor deliver data, so a problem with the agent itself does not produce misleading results.
- Automated Debugging
The Catchpoint platform automatically runs diagnostics such as DIG and DIG+Trace. This takes the most time-consuming and mundane tasks out of the hands of the operations team, allowing them to investigate the root cause and ultimately fix the issue in a timely manner.
Tying Alerts to Long-Term Trends
You’ve likely heard of the fable about boiling a frog. Put a frog in a pot of water that’s already boiling, and it will jump right out. But if you put it in a pot of room temperature water and slowly heat it, the frog will boil to death without ever realizing that something is wrong.
The story, while both barbaric and inaccurate (a healthy frog will in fact try to get out of the water long before it begins to boil), still serves as a valuable metaphor for effective end-user experience monitoring. Trend shifts are a common basis for performance monitoring and alerting, as they notify teams when performance degrades relative to recent historical data rather than when a metric crosses a fixed value. However, if you’re not careful about what you’re monitoring and which metrics you’ve tied your alerts to, you can miss gradual performance degradations over time, never realizing that there’s a problem until it’s too late and your end-user experience has already suffered severely.
Therefore, in addition to trend shift-based alerts, it’s important to also have some that are tied to hard-and-fast KPIs informed by historical data. This is critical to avoiding the “performance creep” described above, as it provides both the long-term perspective and the threshold(s) necessary to detect those issues even without a dramatic spike in load times or drop in availability.
The chart below, as provided by priceline.com, shows one of those long-term performance creeps and how it can get away from you. As you can see, this metric nearly doubled over a six-month period; had they not had an alert tied to it, the degradation would have gone unnoticed until customers started being affected.
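To make the difference between the two alert types concrete, here is a small Python sketch. It assumes a trend-shift alert that compares each value with the mean of the previous few samples, which is only one simple way to define a trend shift and not necessarily how any particular product does it. The function names, the 7-sample window, the 20% tolerance, and the synthetic series are all hypothetical; the data is invented for illustration and is not Priceline’s.

```python
from statistics import mean
from typing import List


def trend_shift_alerts(series: List[float], window: int = 7,
                       max_increase: float = 0.20) -> List[int]:
    """Indices where a value exceeds the mean of the previous
    `window` values by more than `max_increase` (0.20 = 20%)."""
    hits = []
    for i in range(window, len(series)):
        baseline = mean(series[i - window:i])
        if series[i] > baseline * (1 + max_increase):
            hits.append(i)
    return hits


def threshold_alerts(series: List[float], limit: float) -> List[int]:
    """Indices where a value exceeds a fixed KPI limit."""
    return [i for i, value in enumerate(series) if value > limit]


# Synthetic daily metric (seconds) that creeps linearly from 2.0 to 4.0
# over 180 days. Invented data, for illustration only.
creep = [2.0 + 2.0 * day / 179 for day in range(180)]

# Each day is only slightly worse than the week before it, so the
# rolling comparison stays quiet while the fixed limit eventually trips.
trend_hits = trend_shift_alerts(creep)
kpi_hits = threshold_alerts(creep, limit=3.0)
```

Because the rolling baseline drifts upward along with the metric, the trend-shift check never sees a large enough jump, while the fixed limit is crossed partway through the period. In practice the limit would be chosen from historical data, as described above.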
Integrate with Other Tools to Maximize Efficiency
While Catchpoint’s trustworthy data and advanced alerting system are key advantages for operations teams looking to make the most of their time and resources, a complete end-user experience monitoring strategy will likely include other tools as well, all of which must work together cohesively. For example, a separate alerting tool can collect the alerts from all of your external tools and distribute them across teams and channels within your organization (e.g. email, SMS, messaging tools, etc.).
Your end-user experience monitoring tool therefore needs to integrate with alerting platforms such as OpsGenie, PagerDuty, and VictorOps, as well as with communication tools such as Slack and Microsoft Teams, so that employees can communicate and share performance data with each other quickly and easily. Catchpoint’s public URL feature is also designed to make this easier, as any test result can be shared with a simple link that can be viewed internally or externally.
Priceline takes advantage of several of these integrations, pushing data from Catchpoint into Splunk via an add-on that’s easy to set up. They also use the Webhook API to push alerts into Slack, allowing everyone on the channel to see the full collection and history of alerts in one place.
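As a sketch of what pushing an alert into Slack can look like, the Python example below posts a message to a Slack incoming webhook, which accepts a JSON body with a `text` field. The shape of the `alert` dictionary (`severity`, `test_name`, `metric`, `value`, `threshold`) is a hypothetical stand-in, not the actual payload format of any monitoring tool’s webhook, and the article does not describe how Priceline’s own integration is built.

```python
import json
import urllib.request


def build_slack_payload(alert: dict) -> dict:
    """Turn a (hypothetical) alert dict into a Slack incoming-webhook body."""
    text = (
        f"[{alert['severity'].upper()}] {alert['test_name']}: "
        f"{alert['metric']} = {alert['value']} "
        f"(threshold {alert['threshold']})"
    )
    # Slack incoming webhooks accept a JSON object with a "text" field.
    return {"text": text}


def post_to_slack(webhook_url: str, alert: dict) -> int:
    """POST the alert to a Slack incoming webhook and return the HTTP status."""
    body = json.dumps(build_slack_payload(alert)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the message formatting in `build_slack_payload` separate from the network call makes the formatting easy to check without a live webhook URL, and the same alert dictionary could be reformatted for other channels.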
Alerting is Just the First Step
Ultimately, your end-user experience monitoring solution needs to be optimized not only to detect and fix issues as quickly as possible, but also to communicate them to anyone inside or outside of your organization who needs to be informed. The first step in that process is to have the proper individuals alerted right away with accurate and trustworthy data, at which point they can collaborate with everyone else involved.