Site Reliability Engineering and Monitoring Tools
In this article, we discuss the tools used by Site Reliability Engineers (SREs), based on our recent survey.
When we crafted the questions and analyzed the data from our recent Site Reliability Engineer (SRE) survey, we tried to present a balanced view of the organizational environments and challenges that SREs face.
It was no surprise that some of the top job responsibilities of an SRE pertain to monitoring, but the answers revealed some glaring gaps that required further investigation. Specifically, when asked what their job functions are, 69% of respondents said they are responsible for proactively monitoring and reviewing application performance, and 71% listed logging and diagnostics as a job function. With these percentages, I expected monitoring tools to top the list when respondents were asked which tools they could not live without. Spoiler alert: they didn’t. So where is their data coming from?
In the question about tools SREs can’t live without, we asked about application performance management, synthetic monitoring, real user monitoring, and alerting and notification. With the exception of alerting and notification, I was surprised at how few respondents considered these essential to their toolset. While 90% consider alerting and notification tools a must-have, the percentages plummet after that:
- 33% say synthetic monitoring is a must-have
- 26% consider application performance management a must-have
- 17% say real user monitoring is a must-have
This made me wonder what tools are being used to generate the alerts and notifications. Did we miss a category of tools, or is it a difference in how tools are classified?
Here’s how I define the three terms:
Application Performance Management provides insight into back-end and front-end performance by tracking all the components and sub-components with which a request interacts to deliver an application. This can help organizations detect and diagnose disruptions in availability and service to maintain a high quality of service.
Synthetic Monitoring, sometimes referred to as active monitoring, simulates user actions or paths taken through an application on a scheduled basis. Synthetic monitoring provides detailed information on availability and performance of applications. It allows organizations to be proactive and preempt potential performance incidents before they impact customers.
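To make the idea of a scripted, scheduled check concrete, here is a minimal Python sketch. The names `StepResult` and `run_synthetic_check` are hypothetical and not taken from any monitoring product, and each step is a plain callable standing in for a real HTTP request or browser action. The sketch assumes an external scheduler (cron, for example) would invoke the check at a fixed interval.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    """Outcome of one step in a simulated user path (hypothetical type)."""
    name: str
    ok: bool
    elapsed_s: float


def run_synthetic_check(
    steps: List[Tuple[str, Callable[[], bool]]],
) -> List[StepResult]:
    """Run each (name, action) step in order and time it.

    Stops at the first failing step, because later steps in a user
    path usually depend on the earlier ones having succeeded.
    """
    results: List[StepResult] = []
    for name, action in steps:
        start = time.perf_counter()
        try:
            ok = bool(action())
        except Exception:
            # Assumption: any exception raised by a step counts as a failure.
            ok = False
        results.append(StepResult(name, ok, time.perf_counter() - start))
        if not ok:
            break
    return results


# Illustrative path: the lambdas are placeholders for real requests.
example_path = [
    ("load home page", lambda: True),
    ("log in", lambda: True),
    ("view dashboard", lambda: True),
]
example_results = run_synthetic_check(example_path)
```

Each `StepResult` carries both a pass/fail flag and a duration, so the same run can feed an availability measure and a performance measure.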
Real User Monitoring passively collects data on how actual users are interacting with and experiencing your application. This is achieved through instrumenting the application or injecting code on the page to collect engagement metrics.
From my perspective, alerts and notifications are generated by monitoring tools. So if they aren’t coming from application performance management, synthetic monitoring, or real user monitoring, how are they being generated?
The Mystery Deepens
When generating the questions, we assumed monitoring data played a large part in informing decisions made by SREs. And we were right about that; logging, monitoring, and observability ranks as one of the top technical skills needed by SREs. When asked what the primary uses of monitoring and observability data were, only four respondents (out of 416) said they do not collect monitoring or observability data.
Monitoring and observability tools are important whether they are open source, a vendor solution, or built in-house. The metrics and data collected are needed to inform SREs when applications are not responding properly.
The more I reflected on this, the more I thought that the discrepancy in responses may stem from the way the question was worded. We asked which tools they couldn’t live without. From the SRE perspective, alerting and notification is a critical job function, so they may well classify the monitoring tools they use as alerting and notification solutions.
The service level indicator SREs care about the most is end-user availability. Given this, it makes sense that the primary use of monitoring and observability data is to alert on issues; therefore the alerting and notification tools are the most important. When the metrics most often used to define success of the SRE are number of incidents and mean time to resolve, you need to be notified quickly when something goes wrong.
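As a rough illustration of how these measures relate, the Python sketch below computes mean time to resolve and availability from a list of incident timestamps. The function names and the sample data are hypothetical. It also makes two simplifying assumptions: incidents do not overlap, and every incident counts as a full outage for its whole duration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# An incident is a (detected_at, resolved_at) pair of timestamps.
Incident = Tuple[datetime, datetime]


def mean_time_to_resolve(incidents: List[Incident]) -> timedelta:
    """Average time from detection to resolution across incidents."""
    if not incidents:
        return timedelta(0)
    total = sum((resolved - detected for detected, resolved in incidents),
                timedelta(0))
    return total / len(incidents)


def availability(incidents: List[Incident],
                 window_start: datetime,
                 window_end: datetime) -> float:
    """Fraction of the window with no incident in progress.

    Assumes incidents are non-overlapping, fall inside the window,
    and each one is a complete outage.
    """
    downtime = sum((resolved - detected for detected, resolved in incidents),
                   timedelta(0))
    return 1.0 - downtime / (window_end - window_start)


# Made-up data: one 30-minute and one 90-minute incident in a 10-day window.
sample_incidents = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 0, 30)),
    (datetime(2024, 1, 2, 0, 0), datetime(2024, 1, 2, 1, 30)),
]
mttr = mean_time_to_resolve(sample_incidents)   # 1 hour
uptime = availability(sample_incidents,
                      datetime(2024, 1, 1), datetime(2024, 1, 11))
```

Because downtime is the sum of the time spent resolving each incident, every minute saved in detection and notification shows up directly in both numbers.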
Given how important availability is, it was disappointing that we didn’t include it as a metric used to define success at the individual, team, and organizational level. Twelve respondents wrote in that availability or uptime was a key performance indicator for their team; I will venture a guess that if we had included availability as an option, that number would have been much higher.
It’s not simply about being notified quickly; you also need relevant diagnostic information readily available. The majority of time spent when resolving issues isn’t necessarily on the detection and notification side of things, but in the identification of what needs to be fixed.
Monitoring tools need not only to alert you when a problem is occurring, but also to make it easy to extract the insights and information needed to remedy it quickly. Data needs to be analyzed quickly to determine whether an issue is due to the network, the system, third-party content, or some combination of these.
Being notified that a problem exists, and that availability or reliability for customers may be impacted, is definitely a top priority. But what happens when a problem occurs and you don’t receive an alert? You monitor systems for known problems or to ensure that metrics stay within appropriate thresholds. Graphs and dashboards are created to show when data points trend in an unexpected way; this can help to identify problem areas before failures occur and availability is impacted.
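The difference between a threshold alert and a trend on a dashboard can be sketched in a few lines of Python. The function names, the sample latencies, and the 200 ms limit below are all made up for illustration. The trend is estimated with an ordinary least-squares slope, which is one simple choice among many.

```python
from typing import List


def breaches(values: List[float], upper: float) -> List[int]:
    """Indexes of samples above the upper threshold (these would alert)."""
    return [i for i, v in enumerate(values) if v > upper]


def trend_slope(values: List[float]) -> float:
    """Least-squares slope of the samples, in units per sample.

    A steadily positive slope can flag a metric drifting toward its
    threshold even though no individual sample has crossed it yet.
    """
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Hypothetical response times in milliseconds, one sample per interval.
latencies_ms = [100.0, 110.0, 120.0, 130.0, 140.0]
alerting_samples = breaches(latencies_ms, upper=200.0)  # nothing alerts yet
slope = trend_slope(latencies_ms)                       # rising 10 ms/sample
```

In this made-up series no sample exceeds the limit, so a threshold-only setup stays silent, while the slope shows the metric climbing by 10 ms per interval.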
If the focus is solely on alerting and notification, what happens when a problem occurs and no alert is triggered because the situation has not been encountered before (also known as an unknown-unknown)? Having data and diagnostic tools available to understand performance, regardless of whether an alert was generated, should be an important part of the SRE toolbox.