An expert shares real-world scenarios where DEM can be leveraged to improve service reliability.
The move to hybrid, distributed application architectures has pushed cloud providers to offer ever-higher availability, and availability has become the key differentiator among competitors. But the focus on availability alone comes at the cost of other vital performance factors, such as service reliability.
This blog covers the key takeaways from our recent on-demand webcast on improving service reliability. We look at Digital Experience Monitoring's (DEM) simplified approach to reliability and some real-world scenarios where it can be applied.
High availability is not the sole indicator of high performance. Monitoring methodologies rest on four major performance pillars: Reachability, Availability, Performance, and Reliability. Reliability, simply put, measures performance consistency. It is a critical metric across the application delivery chain, and guaranteeing it comes with multiple challenges.
Those challenges do not make reliability impossible to monitor. DEM tools measure the metrics relevant to all four performance pillars, but there are ways to make reliability monitoring more effective. The monitoring methodology should account for each of these challenges and then be fine-tuned by working through three commonly overlooked aspects of reliability.
The digital world is highly distributed: networks, microservices, DNS, CDNs, and even cloud services. This distribution is designed to improve speed, performance, and efficiency while eliminating single points of failure, but all of these distributed components add to the complexity of monitoring reliability.
To understand the impact of such a distributed system, take CDN mapping as an example. End users are mapped to a CDN PoP based on their location, and this mapping determines how quickly the application loads for them. How quickly the critical components of the application load, in turn, determines the end user's perceived reliability of your service. Figure 1 compares the impact of CDN performance: there is a significant difference in how the page renders, and to the end user that difference translates to poor reliability.
Figure 1
CDN geo-mapping plays a crucial role in determining end-user experience. For instance, a user in Boston, MA on AT&T is served content from Ashburn, VA. This raises several questions: was the mapping efficient? Does the same mapping pattern apply to traffic from other providers?
Understanding how the CDN mapping works will help you evaluate different CDN providers and improve your current service. It will also help you identify and resolve incidents faster. Working with your CDN partner to optimize content distribution paths will greatly improve service reliability for all end users.
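As a practical spot check, the sketch below resolves a CDN-fronted hostname and inspects the debug headers many CDNs attach to responses, which often identify the serving PoP and cache status. The hostname and header names here are assumptions for illustration; they vary by provider.

```python
# Minimal sketch: spot-check which CDN edge a given vantage point is mapped to.
# Assumptions: www.example.com is a hypothetical CDN-fronted hostname, and the
# debug headers (x-served-by, x-cache, cf-ray, x-amz-cf-pop) differ by provider.
import socket
import requests

HOSTNAME = "www.example.com"

def check_cdn_mapping(hostname: str) -> None:
    # Resolve the hostname from this vantage point; the answer reflects
    # the CDN's geo-mapping decision for the local resolver.
    edge_ips = sorted({info[4][0] for info in
                       socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)})
    print(f"{hostname} resolves to: {', '.join(edge_ips)}")

    # Fetch the page and look for provider-specific debug headers that
    # reveal the serving PoP and cache status.
    response = requests.get(f"https://{hostname}/", timeout=10)
    for header in ("x-served-by", "x-cache", "cf-ray", "x-amz-cf-pop"):
        if header in response.headers:
            print(f"{header}: {response.headers[header]}")
    print(f"approx. time to first byte: {response.elapsed.total_seconds():.3f}s")

if __name__ == "__main__":
    check_cdn_mapping(HOSTNAME)
```

Running the same check from vantage points in different cities and on different providers quickly shows whether the geo-mapping behaves consistently.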
The end-user perspective is crucial to reliability. End users are unaware of the complexities involved in delivering an application; to them, reliability is a perception of how the system worked, not a figure derived from any formal quantifier. To ensure reliability, you need insight from real users, which is possible with Real User NEL (Network Error Logging). NEL captures and reports the errors (DNS errors, TCP timeouts, HTTP errors) encountered by real users, and the data it captures is very helpful when evaluating the actual end-user experience.
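Enabling NEL is a matter of HTTP response headers: `Report-To` declares a reporting endpoint and `NEL` opts the origin into network error reporting. Below is a minimal sketch of both sides, assuming Flask purely for illustration and a hypothetical collector URL.

```python
# Minimal sketch of enabling Network Error Logging (NEL) and collecting reports.
# Assumptions: Flask is used only for illustration, and
# https://reports.example.com/nel is a hypothetical collector endpoint.
import json
from flask import Flask, request, make_response

app = Flask(__name__)

@app.after_request
def add_nel_headers(response):
    # Tell supporting browsers where to send network error reports.
    response.headers["Report-To"] = json.dumps({
        "group": "network-errors",
        "max_age": 2592000,
        "endpoints": [{"url": "https://reports.example.com/nel"}],
    })
    # Enable NEL itself: report DNS, TCP, and HTTP failures for this origin.
    response.headers["NEL"] = json.dumps({
        "report_to": "network-errors",
        "max_age": 2592000,
        "include_subdomains": True,
    })
    return response

@app.route("/nel", methods=["POST"])
def collect_reports():
    # Browsers batch reports as a JSON array; NEL reports have type "network-error".
    for report in request.get_json(force=True):
        body = report.get("body", {})
        print(report.get("type"), body.get("type"), body.get("status_code"), report.get("url"))
    return make_response("", 204)
```

Because the browser itself records the failure, these reports capture DNS and connection errors that never reach your servers or your synthetic checks.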
For example, when you push a new release into production intending it to be a 'no-downtime' release, you can compare the performance baselines of the build and verify whether users experienced any errors. Data from real users gives direct answers to such reliability questions.
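A minimal sketch of that comparison might look like the following, where the deploy timestamp and report records are hypothetical stand-ins for what a NEL collector would store.

```python
# Minimal sketch: compare real-user error rates before and after a release to
# verify a "no-downtime" deploy. All timestamps and outcomes are hypothetical.
from datetime import datetime, timezone

RELEASE_TIME = datetime(2024, 3, 12, 14, 0, tzinfo=timezone.utc)  # hypothetical deploy time

# Each record: (timestamp, outcome) where outcome is "ok" or a NEL error type.
reports = [
    (datetime(2024, 3, 12, 13, 50, tzinfo=timezone.utc), "ok"),
    (datetime(2024, 3, 12, 13, 55, tzinfo=timezone.utc), "ok"),
    (datetime(2024, 3, 12, 14, 5, tzinfo=timezone.utc), "http.error"),
    (datetime(2024, 3, 12, 14, 10, tzinfo=timezone.utc), "tcp.timed_out"),
    (datetime(2024, 3, 12, 14, 15, tzinfo=timezone.utc), "ok"),
]

def error_rate(records):
    # Share of real-user reports that ended in a network error.
    errors = sum(1 for _, outcome in records if outcome != "ok")
    return errors / len(records) if records else 0.0

before = [r for r in reports if r[0] < RELEASE_TIME]
after = [r for r in reports if r[0] >= RELEASE_TIME]
print(f"error rate before release: {error_rate(before):.0%}")
print(f"error rate after release:  {error_rate(after):.0%}")
```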
Combining performance data from synthetic monitoring with real user monitoring provides complementary viewpoints, which you can use to improve your synthetic monitors and, in turn, service reliability.
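One simple way to compare the two viewpoints is to put their latency percentiles side by side; the sample numbers below are hypothetical placeholders for data exported from your synthetic and RUM tooling.

```python
# Minimal sketch: compare synthetic and real-user latency percentiles to spot
# divergence. The sample values are hypothetical placeholders.
import statistics

synthetic_ms = [310, 295, 330, 305, 320, 300, 315]          # scripted checks from fixed agents
real_user_ms = [290, 850, 310, 1400, 305, 930, 300, 1200]   # load times reported by real users

def percentile(values, pct):
    # Simple nearest-rank percentile; fine for a sketch.
    ordered = sorted(values)
    index = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[index]

for label, series in (("synthetic", synthetic_ms), ("real users", real_user_ms)):
    print(f"{label:>10}: median={statistics.median(series):.0f}ms  p95={percentile(series, 95):.0f}ms")

# A large gap between the two p95 values suggests the synthetic monitors are not
# exercising the networks, devices, or regions your real users come from.
```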
The third commonly overlooked aspect is historical reliability patterns. Analyzing day-to-day data may not give you the true reliability picture. Consider an example: measuring the page render time for google.com. Figure 2 shows performance data spanning more than a year. If we analyzed reliability for a single day or week of any month, we would conclude that it is mostly above average; looking at the historical trend, however, there is a clear pattern of performance degradation. How we judge the reliability of a service is directly related to the time window we observe.
Figure 2
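To see how the observation window changes the verdict, the sketch below simulates a year of render times with a slow, steady degradation: the last week looks close to the recent average, while the year-long trend clearly worsens. The data is synthetic and purely illustrative.

```python
# Minimal sketch: how the observation window changes the reliability verdict.
# The render-time series is simulated: a stable baseline plus slow degradation.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range("2023-01-01", periods=365, freq="D")
# ~900 ms baseline, small daily noise, plus ~0.8 ms/day of gradual degradation.
render_ms = 900 + rng.normal(0, 25, size=365) + np.arange(365) * 0.8
series = pd.Series(render_ms, index=days)

# Viewed over a week, performance looks close to the recent (90-day) average.
print(f"last 7 days mean:  {series.iloc[-7:].mean():.0f} ms")
print(f"last 90 days mean: {series.iloc[-90:].mean():.0f} ms")

# Viewed over the year, a 30-day rolling mean exposes the steady degradation.
trend = series.rolling(30).mean()
print(f"30-day trend, start of year: {trend.iloc[29]:.0f} ms -> end of year: {trend.iloc[-1]:.0f} ms")
```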
We discussed three important aspects you need to reconsider when trying to improve reliability. If you want to monitor reliability effectively, start with a checklist built around them: distributed components such as CDN mapping, real-user insight through NEL, and historical reliability patterns.
The webcast offers many more insightful use cases to help you improve service reliability. Watch the webcast here to learn more about Digital Experience Monitoring's simplified approach to service reliability.