
External Performance Monitoring: How Do You Draw a Sharp Picture?

Monitoring is not just about detecting failures; it is about watching an application 24×7, collecting and understanding the data to optimize performance.

External monitoring tools draw a very powerful picture of your website's performance, availability, and reliability from the end user's perspective. That said, the picture can fall short at times in answering your main question: what happened, and where? Things can get blurry!

The main reason for the shortcoming is that, from the outside, these tools can only see one server and one request; internally, however, a site or application is performing many tasks and relying on other systems and services hidden from the end user. The picture becomes more powerful when you add additional context to it: internal application context!

Let me illustrate with a real case. We instructed Catchpoint to measure how long it takes to retrieve Google's search results for the keyword “Google” using Internet Explorer.

The chart below displays the performance of the test over the last 30 days.

The chart displays Response, the time to get just the HTML from the server, and Webpage Response, the time to also load any requests referenced in the HTML (images, stylesheets, JavaScript, etc.). You can clearly see spikes in both metrics, which means the slowness was in receiving the HTML from the server and not in one of the child requests on the page. However, it is unclear what caused the spikes in Response. Was it the Internet/connectivity? Our monitoring nodes? Or Google itself?
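To make the distinction concrete, here is a minimal sketch, not Catchpoint's implementation, of how the two metrics differ: “Response” stops once the base HTML is downloaded, while “Webpage Response” also fetches the assets referenced in it. The URL, the regex-based asset extraction, and the serial fetching are simplifying assumptions; a real browser fetches child requests in parallel.

    # Rough illustration of "Response" (base HTML only) vs. "Webpage Response"
    # (HTML plus referenced assets). Illustrative sketch only.
    import re
    import time
    from urllib.parse import urljoin

    import requests

    URL = "https://www.google.com/search?q=Google"  # example target

    start = time.monotonic()
    base = requests.get(URL, timeout=10)
    response_time = time.monotonic() - start  # "Response": HTML only

    # Crude extraction of child requests (scripts, stylesheets, images).
    asset_urls = re.findall(r'(?:src|href)="([^"]+\.(?:js|css|png|gif|jpg))"', base.text)

    for asset in asset_urls:
        try:
            requests.get(urljoin(URL, asset), timeout=10)
        except requests.RequestException:
            pass  # ignore failed child requests in this sketch

    webpage_response_time = time.monotonic() - start  # "Webpage Response": HTML + assets

    print(f"Response:         {response_time * 1000:.0f} ms")
    print(f"Webpage Response: {webpage_response_time * 1000:.0f} ms")

Because Webpage Response includes everything in Response, a spike that shows up in both series points back to the base HTML request itself.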

When using external performance monitoring services like Catchpoint, you can overlay additional metrics to help understand where the slowness occurred. In this case we added the following, both of which are approximated in the code sketch below the list:

  • Wait – the time it takes the server to respond to the request
  • Load – the time it takes to get the entire response from the server
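
If you want to approximate these two timings yourself, here is a minimal sketch assuming the Python requests library and a streamed HTTP request: “Wait” is approximated as the time until the server starts responding (headers received), and “Load” as the time until the entire body has been read. The URL is illustrative only, and this is not how Catchpoint measures the metrics internally.

    # Approximating "Wait" (server starts responding) and "Load" (full
    # response received) with a streamed request. Illustrative sketch only.
    import time

    import requests

    URL = "https://www.google.com/search?q=Google"

    start = time.monotonic()
    with requests.get(URL, stream=True, timeout=10) as resp:  # returns once headers arrive
        wait = time.monotonic() - start                       # ~ "Wait"

        for _ in resp.iter_content(chunk_size=8192):          # drain the whole body
            pass
        load = time.monotonic() - start                       # ~ "Load"

    print(f"Wait: {wait * 1000:.0f} ms, Load: {load * 1000:.0f} ms")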

In the new chart you can now see a corresponding increase in both Wait and Load, a possible sign that the Response time was driven by some kind of “internal” bottleneck. However, these two metrics can also be affected by a slow connection: the slower the connection between the client and the server, the longer it takes to get the data from the server. So we are still not 100% sure what the cause is. It could be Google, or it could be the Internet.

In the case of the Google search results page, Google gives us one extra piece of information: how long it takes its “backend” system to process the search request before sending the results to the user. This metric is clearly tied to how their internal systems perform; it is not impacted by external factors like Internet connectivity.

Thanks to our Catchpoint Insight product, we can capture and overlay this “internal” context with the external performance data we collect during each test.
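
The exact mechanism Insight uses is not shown here, but the idea of overlaying internal context can be sketched as follows: extract a server-reported processing time from the page content and record it alongside the externally measured response time. The URL, and the assumption that the backend time appears in the page as a value like “(0.29 seconds)”, are purely illustrative.

    # Pairing an externally measured response time with a server-reported
    # "backend" time found in the page content. Illustrative sketch only.
    import re
    import time

    import requests

    URL = "https://www.google.com/search?q=Google"

    start = time.monotonic()
    resp = requests.get(URL, timeout=10)
    external_response_ms = (time.monotonic() - start) * 1000

    # Assumes the page reports its own processing time, e.g. "(0.29 seconds)".
    match = re.search(r'\(([\d.]+)\s*seconds?\)', resp.text)
    backend_ms = float(match.group(1)) * 1000 if match else None

    # Recording both series over many runs is what makes the overlay useful:
    # if they jump together, the bottleneck is likely behind the server.
    print(f"external response: {external_response_ms:.0f} ms, reported backend: {backend_ms} ms")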

The new chart clearly shows that the spikes were in fact tied to the performance of the Google backend, and not to connectivity or the monitoring nodes!

One interesting observation is that when the internal Google backend time jumped from 100ms to 290ms, the response time jumped from 180ms to 489ms. In other words, a 190ms increase in backend processing went hand in hand with a 309ms increase in overall response time, so the backend accounted for much, though not all, of the slowdown.

Monitoring is not just about detecting failures; it is about watching an application 24×7, collecting all the data, and understanding it in order to optimize your performance and avoid failures in the future.

The Catchpoint product is like a phoropter: you can overlay different lenses on the data until you get a clear picture of the problem.

