Alerting is an integral part of performance monitoring. A monitoring tool can maintain performance effectively only if it can send out alerts and notifications the moment an issue is detected. Alerts are your first line of defense against performance degradation. They notify you of a potential issue, allowing you to investigate and fix it quickly.
Because alerts are such an essential feature, we tend to go overboard when setting up monitoring tests and configuring alerts. The resulting alerts create a lot of noise, which makes monitoring complex and issues difficult to troubleshoot.
In this blog, we discuss some of the best practices to follow when configuring alerts.
Start with the tests
Alerting strategies mainly aim to reduce the MTTD (mean time to detect), which has a direct impact on MTTR (mean time to resolve). Quickly detecting a problem reduces the impact on the end-user experience. An effective alerting strategy has two prerequisites: a sound test methodology and a response strategy.
Every monitoring test should yield actionable results. The first step in the process is to decide on a testing methodology keeping in mind the primary goal of the monitoring process. The tests should evaluate metrics that are relevant to the overall goal of your monitoring strategy. The tests should run as frequently as possible to catch an issue as soon as it happens. Hence, the ideal test strategy should focus on selecting the correct test type, choosing the right locations to monitor from, and deciding the frequency of the tests.
The second prerequisite is having a well-defined response strategy in place. When an issue is detected, an alert should be triggered only if the data signals a real incident and not an anomaly or outlier. How many tests need to fail before sending out a notification? Having a response workflow that assigns specific alert types to specific response teams is important. If you do not have the right people to troubleshoot and fix a particular issue, for example, DNS- or CDN-related performance issues, then it doesn’t make sense to send out alerts for these issues.
Configure and trigger alerts based on these two criteria. Setting up advanced or complicated alerts when you don’t have the resources to handle them will result in a weak alerting strategy and hurt the MTTD.
Alerting Strategy Basics
Alerting strategies should focus mainly on delivering the right information to the right people at the right time. Before we start configuring alerts for every test in the monitoring tool, there are a few points to consider.
Not all alerts are high priority by default. An assignment and escalation process needs to be established to prevent alert fatigue.
Alerts can be set up for several different parameters that track performance degradation and sudden outages. But not every alert requires immediate attention. Categorizing alerts based on urgency helps you sort the incoming notifications and focus on those that need immediate remediation.
Alert priority allows you to cut down time spent fixing issues that don’t have a major impact on end-user experience or the business itself. You can manage alerts more efficiently when each alert notification highlights priority.
In this example, Catchpoint triggers an alert when a test fails from four different locations. The critical alert indicates an issue that needs immediate attention because there were multiple failures and the issue is impacting end-user experience.
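The idea of tying priority to the number of failing locations can be sketched in a few lines of Python. This is only an illustration, not Catchpoint's configuration or API: the `Priority` levels, the `classify_alert` function, and the warning threshold of two locations are assumptions made for the example. Only the critical threshold of four locations comes from the example above.

```python
from enum import Enum


class Priority(Enum):
    """Hypothetical priority levels for illustration."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


def classify_alert(
    failed_locations: set[str],
    critical_threshold: int = 4,  # four failing locations, as in the example above
    warning_threshold: int = 2,   # assumed value, not from the original setup
) -> Priority:
    """Map the number of distinct failing locations to a priority."""
    count = len(failed_locations)
    if count >= critical_threshold:
        return Priority.CRITICAL
    if count >= warning_threshold:
        return Priority.WARNING
    return Priority.INFO


print(classify_alert({"New York", "London", "Tokyo", "Sydney"}))
print(classify_alert({"New York"}))
```

Including the resulting priority in every notification lets recipients decide at a glance whether to act now or later.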
Minimizing false positives with valid alerts
Alerts can create a lot of noise, especially if there are too many false positives. Eventually, the teams handling the alerts become desensitized and start to ignore them, making monitoring less effective. Proactive monitoring is only as good as the alerts it triggers. Alerts must evaluate the right metrics within reasonable thresholds.
It is important to configure the alert so that it is triggered only after anomalies have been excluded. The alert parameters should confirm a real incident. For example, an alert can be set up to trigger only after four consecutive failed data points within a set time threshold.
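A minimal sketch of the consecutive-failure rule might look like the following. The `DataPoint` type, the `should_trigger` function, and the 15-minute window are assumptions chosen for illustration. Only the count of four consecutive failures comes from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DataPoint:
    """One test run (hypothetical structure)."""
    timestamp: datetime
    failed: bool


def should_trigger(
    points: list[DataPoint],
    required_failures: int = 4,
    window: timedelta = timedelta(minutes=15),  # assumed time threshold
) -> bool:
    """Trigger only if the latest `required_failures` runs all failed
    and together span no more than `window`."""
    recent = sorted(points, key=lambda p: p.timestamp)[-required_failures:]
    if len(recent) < required_failures:
        return False  # not enough data to confirm an incident
    if not all(p.failed for p in recent):
        return False  # a passing run breaks the streak, so treat it as an outlier
    return recent[-1].timestamp - recent[0].timestamp <= window
```

A single failed run never fires this rule, which filters out one-off anomalies at the cost of a slightly later detection. The failure count and window are the two knobs to tune against your test frequency.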
Include relevant data in the triggered alert notification. Data should include a summary of the failed data point and other valid details, so it is easy for the recipient to interpret and quickly understand the issue at hand. The notification should indicate:
- The severity of the issue.
- The current impact of the issue and the locations affected.
- Details of the failed data points.
- Details of additional tests that confirm the issue, such as a traceroute.
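As a rough sketch, the fields listed above could be gathered into one structure and rendered as a plain-text message. The `AlertNotification` class and its field names are hypothetical and do not reflect any particular tool's notification format.

```python
from dataclasses import dataclass, field


@dataclass
class AlertNotification:
    """Hypothetical notification carrying the details listed above."""
    test_name: str
    severity: str
    impact: str
    locations: list[str]
    failed_points: list[str]
    supporting_tests: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Build a short plain-text summary for email or chat."""
        lines = [
            f"[{self.severity.upper()}] {self.test_name}",
            f"Impact: {self.impact}",
            f"Locations affected: {', '.join(self.locations)}",
            "Failed data points:",
        ]
        lines += [f"  - {point}" for point in self.failed_points]
        if self.supporting_tests:
            lines.append("Supporting tests:")
            lines += [f"  - {test}" for test in self.supporting_tests]
        return "\n".join(lines)


alert = AlertNotification(
    test_name="Homepage availability",
    severity="critical",
    impact="Page fails to load for users in two regions",
    locations=["London", "Frankfurt"],
    failed_points=["12:01 UTC timeout", "12:06 UTC timeout"],
    supporting_tests=["traceroute from London"],
)
print(alert.render())
```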
Notify the relevant teams
The third most important aspect of alert configuration deals with the recipients of the alerts. You may have multiple teams maintaining the application but not all the teams need to be looped into an incident notification. There should be a hierarchy that decides which teams receive notifications.
Assigning the right teams will cut down the time taken to resolve an issue and eliminate any confusion and noise that critical alerts create.
There should also be a clear escalation policy in place so that those working on the issue know whom to reach out to when the troubleshooting is complicated or if the resolution requires implementing major changes within the application.
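One way to picture the routing hierarchy and escalation policy is a small lookup table. Everything in this sketch is assumed for illustration: the team names, the `ROUTING` and `ESCALATION_CONTACT` tables, and the 30-minute escalation delay.

```python
# Hypothetical routing table: alert type -> teams equipped to fix it.
ROUTING = {
    "dns": ["network-team"],
    "cdn": ["network-team"],
    "application": ["backend-team"],
}

# Hypothetical escalation contacts per team.
ESCALATION_CONTACT = {
    "network-team": "infrastructure-lead",
    "backend-team": "engineering-manager",
}


def recipients_for(
    alert_type: str,
    priority: str,
    minutes_unacknowledged: int = 0,
    escalate_after: int = 30,  # assumed escalation delay in minutes
) -> list[str]:
    """Return who should be notified for this alert right now."""
    teams = list(ROUTING.get(alert_type, []))
    # Escalate only critical alerts that nobody has picked up in time.
    if priority == "critical" and minutes_unacknowledged >= escalate_after:
        for team in ROUTING.get(alert_type, []):
            contact = ESCALATION_CONTACT.get(team)
            if contact and contact not in teams:
                teams.append(contact)
    return teams


print(recipients_for("dns", "critical"))
print(recipients_for("dns", "critical", minutes_unacknowledged=45))
```

An alert type with no entry in the table returns an empty list. That is deliberate: if no team is equipped to handle a class of issue, sending a notification only adds noise.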
Once you have the list of relevant recipients, the next step is to figure out the best way to deliver the alert notification. Delivering alerts as quickly as possible to the people who need them makes a big difference to the time taken to resolve the issue. Alerts may be sent out via email, Slack messages, or both. Catchpoint offers advanced alerting features such as an alert API to push alerts to a defined endpoint. Catchpoint also integrates with third-party alerting tools such as OpsGenie and PagerDuty.
It is also a useful practice to analyze alert trends and patterns to understand common issues and MTTD values. This will also help figure out if the assigned teams are resolving issues effectively. Such an analysis will help identify what is working and what isn’t in the existing alerting strategy.
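As a simple starting point for this kind of review, MTTD can be computed from logged incidents as the average gap between when an issue began and when it was detected. The `mean_time_to_detect` function and the sample timestamps below are made up for illustration.

```python
from datetime import datetime
from statistics import mean


def mean_time_to_detect(incidents: list[tuple[datetime, datetime]]) -> float:
    """Average minutes between an incident starting and being detected.

    Each incident is a (started_at, detected_at) pair.
    """
    delays = [
        (detected - started).total_seconds() / 60
        for started, detected in incidents
    ]
    return mean(delays)


# Illustrative data only, not real incident records.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 5)),
    (datetime(2024, 1, 9, 14, 30), datetime(2024, 1, 9, 14, 45)),
]
print(mean_time_to_detect(incidents))  # 10.0
```

Passing (detected_at, resolved_at) pairs to the same function gives a comparable resolution-time figure, which helps show whether the assigned teams are resolving issues effectively.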
To summarize, alerting strategies should focus mainly on delivering the right information to the right people at the right time. An effective alerting system should:
- Set up the right type of alerts and reduce false positives
- Notify the relevant teams
- Notify via the right channels
- Classify alerts based on priority
- Have an escalation policy in place to handle high-priority alerts
- Constantly review alerts and logged incidents to better evaluate the current alerting strategy