End-user experience monitoring: Catchpoint and Google Lighthouse #2

Published

August 28, 2019

Nilabh Mishra

Click here for Part 1, in which we cover the tenets of Digital Experience Monitoring and how Catchpoint and Google Lighthouse each work independently.

Catchpoint and Google Lighthouse are powerful tools that complement each other, allowing users to take focused action to improve end-user experience.

Running Lighthouse + Catchpoint in parallel

This setup requires you to run synthetic tests from Catchpoint’s global network of 800+ vantage points in parallel with Lighthouse tests against the same URLs.

From a monitoring standpoint, Catchpoint adds a lot of value because it has access not just to front-end metrics but also to telemetry from the entire delivery chain, including DNS, Content Delivery Network (CDN), and ISP-level data. Let’s look at some charts to understand this better.

In the screenshot below, we chart “Time to Interactive,” the same metric from Part 1 of this series that we also capture from the Lighthouse reports.

Based on the chart, the metric is trending around the 22-23 second mark, which is very high.
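To trend the same metric outside the Lighthouse report viewer, we can read it from the JSON output. Below is a minimal sketch, assuming the reports were saved as JSON and that Time to Interactive is stored under the `interactive` audit as `numericValue` in milliseconds; check this against the Lighthouse version you run. The function names are illustrative, not part of either tool.

```python
import json
from statistics import median
from typing import Iterable


def tti_seconds(report: dict) -> float:
    """Return Time to Interactive in seconds from a parsed Lighthouse report."""
    # Assumption: the audit id is "interactive" and numericValue is in ms.
    return report["audits"]["interactive"]["numericValue"] / 1000.0


def median_tti(report_paths: Iterable[str]) -> float:
    """Median Time to Interactive (seconds) across saved JSON reports."""
    values = []
    for path in report_paths:
        with open(path, encoding="utf-8") as fh:
            values.append(tti_seconds(json.load(fh)))
    return median(values)
```

Using the median rather than a single run smooths out the run-to-run variation that lab tests usually show.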

Time to Interactive is dependent on multiple factors that can impact how soon the end user can click around on the page. Some of these factors are:

Time spent mapping the Domain name to an IP Address
Time spent establishing TCP Connection with the server
Time taken by the server/CDN to serve the first byte of data
Overall load time for the base page (HTML)
Time taken by every other request on the page, including its DNS lookup, TCP connect, and load time
Is the critical rendering path optimized for the page?
How many render-blocking resources are on the page?
Are the best practices for page development being followed?
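The first few factors in this list can be made concrete by timing a single request phase by phase. The sketch below is an illustration only: it handles plain HTTP (no TLS, no redirects), uses the first resolved address, and its phase names are our own labels, loosely analogous to the DNS, Connect, and Wait metrics rather than a reproduction of how any monitoring tool computes them.

```python
import socket
import time
from urllib.parse import urlsplit


def phases(t0: float, t1: float, t2: float, t3: float, t4: float) -> dict:
    """Convert five timestamps (seconds) into per-phase durations in ms."""
    def ms(start: float, end: float) -> float:
        return round((end - start) * 1000, 1)

    return {
        "dns_ms": ms(t0, t1),       # name resolution
        "connect_ms": ms(t1, t2),   # TCP handshake
        "wait_ms": ms(t2, t3),      # request sent until first byte
        "download_ms": ms(t3, t4),  # first byte until last byte
        "total_ms": ms(t0, t4),
    }


def time_plain_http_get(url: str, timeout: float = 10.0) -> dict:
    """Time one plain-HTTP GET. Hypothetical helper, not a tool API."""
    parts = urlsplit(url)
    host, port = parts.hostname, parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    t0 = time.perf_counter()
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    t1 = time.perf_counter()  # DNS resolved

    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(addr)
        t2 = time.perf_counter()  # TCP connection established

        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        sock.recv(1)
        t3 = time.perf_counter()  # first byte of the response received

        while sock.recv(65536):   # drain until the server closes
            pass
        t4 = time.perf_counter()

    return phases(t0, t1, t2, t3, t4)
```

The remaining factors in the list (critical rendering path, render-blocking resources, development best practices) happen in the browser and cannot be observed from a raw socket, which is why browser-level tooling is needed alongside network timings.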

The chart above indicates that “Time to Interactive” is very high for this page. What we do not yet know is what is slowing the page down:

Network components
Page components
Both

Now comes the critical step: putting telemetry from multiple tools to good use. The telemetry allows us to reduce the time taken to detect performance degradations, to build a focused optimization effort, and to set up a strategy for continuous monitoring.

When we look at a page taking 20+ seconds to become interactive, we immediately understand that there is a problem. Our next step is determined by the type of telemetry we have access to.

Analyzing the Lighthouse report, we get a lot of details from an optimization perspective.

What we have here is gold! We have multiple recommendations and optimization efforts which we can focus on to improve the page’s performance and end-user experience.

What is missing here is the visibility into details regarding some of the other components in the delivery chain, which also have an impact on “Time to Interactive.”

This is where Catchpoint adds a lot of value.

Now let’s look at how some of the metrics captured by Catchpoint give insight into other important perspectives. These metrics are helpful not only because they speed up the entire process of triaging a performance issue, but also because they provide additional details about the other layers involved in loading a page.

A. Backend components (including Network)

In the chart above, we immediately see a correlation between the “Wait” metric (time taken to serve the first byte of data) and “Response” (time to load the base HTML).

We see a “Wait” time of more than 4 seconds, which is not ideal.
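A visual correlation like this one can also be quantified once the two series are exported. Here is a minimal sketch that computes the Pearson correlation coefficient between two equally long series, such as per-run “Wait” and “Response” values; the function name and the idea of exporting the series as plain lists are our assumptions.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

A value close to 1 would support the reading that the time to load the base HTML is dominated by the time spent waiting for the first byte.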

Now, let’s go back to the point I brought up in the previous post about how the new landscape of digital architecture affects how we monitor and troubleshoot. If we were analyzing this chart ten years ago, we all would have known what to do. A slow wait time would have pointed directly to one of the following:

Latency between the end users and the server
Issues with the server itself, leading to an increase in the time taken to serve content to the user

But alas, it is 2019, and we have a lot more complexity to deal with. The URL we were testing uses a CDN to serve its content. Some questions that immediately come to mind are:

Is it my CDN that is slow?
Is there a problem with my caching configuration? Are the requests being served from the origin because of stale cache?
How is my origin performing? Is it taking more than expected time for the CDN to fetch content from origin?

These are some important questions for which we need answers. Let’s look at some charts again to see if we can find our answers in the data.

1. Is it my CDN that is slow?

Recall that we saw a “Wait” time of 4+ seconds when loading the base HTML page. From the chart above, we see that all the HTTP requests for the base HTML page are being served from the origin and not from the CDN cache (all requests are TCP_MISS). In this case, the CDN is not even in the picture, as it is not serving the requests.
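The same check can be scripted against collected response headers. The sketch below assumes the CDN reports its cache status in a response header with Squid-style values such as TCP_HIT or TCP_MISS (often exposed as `X-Cache`, but the header name and value format vary by CDN, so treat both as assumptions).

```python
from collections import Counter
from typing import Iterable


def classify_cache_status(header_value: str) -> str:
    """Map a cache-status header value to 'hit', 'miss', or 'other'."""
    stripped = header_value.strip()
    if not stripped:
        return "other"
    # Only the first token carries the status; the rest may name the edge node.
    token = stripped.split()[0].upper()
    if "HIT" in token:
        return "hit"
    if "MISS" in token:
        return "miss"
    return "other"


def cache_summary(header_values: Iterable[str]) -> Counter:
    """Count hits, misses, and unrecognized values across many responses."""
    return Counter(classify_cache_status(value) for value in header_values)
```

If every base-page request lands in the `miss` bucket, as in the chart, the origin is effectively serving every request.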

Now the second thing to check is how the origin servers are performing as they serve the base page.

TCP connect and wait time metrics from tests running directly against the origin servers:

2. Is there a problem with my caching configuration?

Quite possible! Looking at the HTTP headers for the request, we see that the request is set to not be cacheable by the CDN.

It is quite possible that this was done on purpose, but what is important here is having visibility into this detail.
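To make this kind of header check repeatable, we can flag the `Cache-Control` directives that typically stop a shared cache such as a CDN from serving a response without going back to the origin. This is a simplified sketch based on standard HTTP caching semantics: it ignores quoted directive arguments that contain commas, and real CDNs often apply their own override rules on top, so the result is a hint rather than a verdict. The function names are illustrative.

```python
def parse_cache_control(value: str) -> dict:
    """Parse a Cache-Control header into {directive: argument} (simplified)."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"')
    return directives


def shared_cache_blockers(headers: dict) -> list:
    """List reasons a shared cache would likely go back to the origin."""
    lowered = {key.lower(): value for key, value in headers.items()}
    cc = parse_cache_control(lowered.get("cache-control", ""))

    reasons = []
    # no-store and private forbid storing in a shared cache; no-cache allows
    # storing but forces revalidation with the origin before each reuse.
    for directive in ("no-store", "private", "no-cache"):
        if directive in cc:
            reasons.append(directive)

    # For shared caches, s-maxage takes precedence over max-age.
    ttl = cc.get("s-maxage", cc.get("max-age"))
    if ttl is not None and ttl.isdigit() and int(ttl) == 0:
        reasons.append("zero freshness lifetime")
    return reasons
```

An empty list does not prove the response is cached, only that these common blockers are absent.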

3. How is my origin performing? Is it taking more than expected time for the CDN to fetch content from the origin?

A couple of important CDN metrics worth highlighting here are edge-to-origin latency and edge-to-mid-mile latency, which bring in a lot more value when debugging issues related to the delivery chain.

B. Front-End Metrics

The second set of important metrics after Network is the Front-End metrics. This set of metrics correlates directly with how end users perceive performance and complements the recommendations/optimizations suggested by Lighthouse.

The “Client” time metric highlighted above indicates how much time is spent parsing the HTML and executing the JavaScript, whereas the “Wire” time represents the time HTTP requests spend on the wire. Here we see a direct correlation between the “Wire” time and the “Document Complete” (onLoad event) metric.

C. Page-Level Metrics

The third set of metrics is focused on detecting changes in the number and size of important page-level components such as HTML, Images, CSS, Scripts, and Fonts.

Optimization efforts prompted by Lighthouse scores often lead to changes in these metrics, so it is extremely important to capture not just the changes in the Lighthouse scores, but also the page-level metrics mentioned above. This allows us to correlate the two datasets and understand how the optimization efforts are improving end-user experiences.
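As an illustration of what these page-level metrics contain, the sketch below counts requests and bytes per component type from a HAR file. It assumes the page's waterfall is available as a HAR 1.2 export (whether your tooling provides one is an assumption), and the mapping from MIME type to category is a rough heuristic of our own.

```python
from collections import defaultdict


def categorize(mime_type: str) -> str:
    """Map a MIME type to a coarse page-component category (heuristic)."""
    mime = mime_type.split(";")[0].strip().lower()
    if mime == "text/html":
        return "HTML"
    if mime.startswith("image/"):
        return "Images"
    if mime == "text/css":
        return "CSS"
    if "javascript" in mime or "ecmascript" in mime:
        return "Scripts"
    if "font" in mime:
        return "Fonts"
    return "Other"


def page_level_metrics(har: dict) -> dict:
    """Return {category: {"count": n, "bytes": total}} for one HAR log."""
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for entry in har["log"]["entries"]:
        content = entry["response"]["content"]
        bucket = totals[categorize(content.get("mimeType", ""))]
        bucket["count"] += 1
        # content.size is the decoded body length; treat unknown (-1) as 0.
        bucket["bytes"] += max(content.get("size", 0), 0)
    return dict(totals)
```

Storing this summary next to each Lighthouse score makes it possible to see, for example, whether a score improvement coincided with fewer or smaller scripts.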

Example of a chart showing Lighthouse metrics trending over time:

In all the examples discussed above, we saw how to better understand end-user experiences using different sets of metrics from both Lighthouse and Catchpoint. It is extremely difficult to depend on a single metric to represent “the state of end-user experience.”

One of the other areas where Catchpoint and Lighthouse add a lot of value is in the “CI/CD pipeline.” The tools and the metrics from these tools can be used for “Regression detection and testing” in the pre-production environment.
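A regression check of this kind can be as simple as comparing the metrics of a new build against a stored baseline. The sketch below is one possible shape for such a gate; the metric names, the 10% tolerance, and the assumption that lower values are better for every metric are ours and would need to be adapted to your pipeline.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.10) -> dict:
    """Return {metric: (baseline, current)} for metrics that got worse.

    Assumes lower is better for every metric (timings, byte counts,
    request counts). Metrics missing from `current` are skipped.
    """
    regressions = {}
    for name, base_value in baseline.items():
        if name not in current:
            continue
        limit = base_value * (1 + tolerance)
        if current[name] > limit:
            regressions[name] = (base_value, current[name])
    return regressions
```

A pipeline step could call this with the latest pre-production measurements and fail the build when the returned dictionary is not empty.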

To summarize,

Our digital architectures are very complex, and different monitoring tools were developed to tackle different issues and challenges.
When our objective is to improve end-user experience, we need to have a look at the data from multiple perspectives, including telemetry from different tools.
In this blog series, we focused on how two different monitoring tools can be used for detecting and triaging performance-related issues for web applications. However, this methodology applies to a lot of other tools. For example, synthetically generated data and data from actual real users (RUM) can be used together to better understand user experiences and the challenges being faced by end users.
There is no “single metric” that can present the complete picture of how users perceive performance. We have to look at different sources, identify the strengths that each system brings, and ensure we have a sound strategy aimed at improving the experience end users have when interacting with our applications.
When there are problems or challenges faced by the user, our strategy should be capable of detecting them and ensuring the relevant data for relevant metrics are passed on to the right teams and stakeholders.