At Catchpoint, our mission is to provide customers with actionable data that will help them reduce MTTR and maintain a positive digital experience. We measure "from where the users are" to ensure the data reflects real end-user experience.
As someone who's part of the Catchpoint on-call chain, this is extremely important to me. I do not want to be woken up at 2 AM because a server is misbehaving, only to find out that the application failed over gracefully and no users were impacted. By measuring "from where the users are," I know that a Catchpoint alert means our teams need to look at it - even at 2 AM.
Being woken up at 2 AM is annoying, but it's not the end of the world. What about the opposite scenario: not being alerted because the tracing or APM system didn't detect an issue? That's much worse. This article is a cautionary tale of one such scenario.
When Beta Testing Does Not Go To Plan
While we put the finishing touches on new features here at Catchpoint, we often deploy a beta version and give select customers a chance to try them out. In this case, we were working with a couple of customers to perfect a specific feature while our Operations team got our infrastructure ready for prime time.
We hadn't yet built out the full monitoring suite for this functionality. We had the synthetic monitoring set up but didn't have the alerts enabled. What we did have was a very detailed APM. We could tell exactly how long each portion of the feature took to execute, statistics on how often it was executing, error logs, everything!
I went to sleep knowing that everything was great. If there were any issues, we'd get alerts. Well, when I woke up there weren't any alerts, but there were emails from our support team saying that the functionality wasn't working properly - all of a sudden, it had gotten slow. Hmmm... that was not part of the plan.
Root Cause Analysis Using Synthetic Monitoring
We were aware that there was a problem since customers were experiencing it in real-time, but the APM data was not helpful. The team looked through all of our traces and statistics and found no issues. Everything was performing just as it had been several hours earlier.
Then we looked at the synthetic tests. The graphs below illustrate performance around the time when the issue occurred.
These tests were running from 196 Catchpoint backbone nodes, mostly in the U.S. At first glance, the data did not indicate anything noteworthy: the scatterplot didn't look particularly different, and the line graph looked consistent. Certainly not worth a support ticket.
The support team, on the other hand, was handling queries from customers who were experiencing very slow performance. So we dug in further.
Granular Data Breakdown For Better Root Cause Analysis
This application's infrastructure is geographically distributed, handled by servers in different datacenters with different IP addresses. When we broke the data down by IP address, we saw that most looked like the graph below, on the left. But some server performance looked like the graph on the right.
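The key move here is slicing aggregate timings by server IP instead of looking at the overall average. A minimal sketch of that breakdown, using made-up IPs and response times rather than real Catchpoint data or its API, might look like this: group samples by server, then flag any server whose average stands well above the fleet-wide average.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical synthetic-test samples: (server_ip, response_ms).
# Values are illustrative only; a real breakdown would come from
# the monitoring platform's data export.
samples = [
    ("10.0.0.1", 70), ("10.0.0.1", 68), ("10.0.0.1", 72),
    ("10.0.0.2", 69), ("10.0.0.2", 71), ("10.0.0.2", 73),
    ("10.0.0.3", 180), ("10.0.0.3", 210), ("10.0.0.3", 195),  # the misbehaving server
]

# Group response times by the server that handled the request.
by_ip = defaultdict(list)
for ip, ms in samples:
    by_ip[ip].append(ms)

overall = mean(ms for _, ms in samples)
for ip, values in sorted(by_ip.items()):
    avg = mean(values)
    flag = "  <-- outlier" if avg > 1.5 * overall else ""
    print(f"{ip}: avg {avg:.0f} ms{flag}")
```

In the blended view the outlier is diluted by the healthy servers; only the per-IP grouping makes the spike visible, which matches what the graphs showed.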
The graph above indicates that there was a spike in the webpage response time!
Flipping the breakdown to see which end users might be impacted, we saw that most users were not impacted – Miami and Montreal (on the right), in this example, had no spikes. But cities on the U.S. West Coast (Las Vegas and Los Angeles, on the left) did experience slow performance!
Now we knew the where; we still had to figure out the what.
We switched from the statistical view to a scatterplot view and enabled all the different request components in the view. The problem became immediately obvious.
The send and DNS times weren't impacted at all. However, all the other metrics, including connect, wait, load, and response times, were. The average wait time went from about 65ms before the issue started to about 122ms after!
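The same slicing idea applies per request component: comparing each timing component's average before and after the incident window isolates which phases of the request were affected. A sketch with invented numbers (chosen to mirror the roughly 65ms to 122ms wait-time jump described above, not actual measurements):

```python
from statistics import mean

# Hypothetical per-request component timings (ms) from synthetic tests,
# split into samples before and after the incident started.
before = [
    {"dns": 20, "connect": 30, "send": 1, "wait": 64, "load": 40},
    {"dns": 22, "connect": 28, "send": 1, "wait": 66, "load": 42},
]
after = [
    {"dns": 21, "connect": 55, "send": 1, "wait": 120, "load": 80},
    {"dns": 20, "connect": 58, "send": 1, "wait": 124, "load": 85},
]

def component_averages(samples):
    """Average each timing component across a list of samples."""
    return {k: mean(s[k] for s in samples) for k in samples[0]}

avg_before = component_averages(before)
avg_after = component_averages(after)

# Flag components whose average grew by more than 25%.
for k in avg_before:
    if avg_after[k] / avg_before[k] > 1.25:
        print(f"{k}: {avg_before[k]:.0f} ms -> {avg_after[k]:.0f} ms")
```

With these numbers, connect, wait, and load are flagged while DNS and send pass through unchanged, the same signature the scatterplot revealed.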
The tests were still succeeding but every packet took longer. There was the root cause:
When users from the west coast connected to specific systems, the bandwidth they saw was suddenly lower.
We contacted our provider and quickly resolved the issue.
Monitoring What Matters From Where It Matters
We cannot overstate the value a comprehensive monitoring strategy brings to organizations. Understanding the real end-user experience is central to maintaining performance. This requires implementing a monitoring strategy that does not depend on APM tools alone.
As seen from the incident we handled, ultimately it was the range of data from different sources and locations that made the difference. For me and my team, there were a couple of major lessons learned (well, confirmed) here:
- We must not rely solely on APM. It has its place in tracing issues within the software, but it doesn't account for everything else in the content delivery pipeline.
- Where you monitor from matters. Just as monitoring from inside the application isn't sufficient, monitoring from only a handful of locations isn't either. Imagine if our synthetic monitoring in the example above had run only from Miami: we wouldn't have seen the issue. Synthetic nodes need to be located where users are.
- Even if you’re running the right tests from the right locations, unless you can slice and dice the data in different ways until the problem becomes evident, the data isn’t very useful.
You must implement advanced monitoring solutions that go beyond traditional APM tools to achieve true observability. These lessons are essential to any incident management strategy and can significantly reduce the time it takes to detect and resolve an issue. A holistic approach to monitoring that also captures the end user's perspective will greatly help ITOps, DevOps, and SRE teams deal with highly distributed, complex systems.