
Mastering IPM: The essential customer experience monitoring framework

How can you ensure a great customer experience? We cover the pillars of Internet Resilience, what your test setup should include, and a case study to see the framework in action.

In the previous installment of our Internet Performance Monitoring (IPM) Best Practices Series, we explored the critical importance of monitoring what matters, from where it matters.  

Now, we pivot to a core aspect of Internet Resilience: Customer Experience (CX).  

This blog explores the critical role of IPM in achieving faster Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).  

First, let’s dive into the foundational elements that are vital for any digital service – reachability, availability, performance, and reliability.  

The four pillars of monitoring for a flawless Customer Experience  

Delivering flawless CX is non-negotiable in today’s digital landscape. It demands consistency in content delivery, availability, and performance across all markets and niches. Users have high expectations for application responsiveness and will quickly switch to alternative solutions if expectations aren’t met.  

In the same way that Maslow's Hierarchy of Needs outlines the fundamental requirements for human well-being, we can apply a similar framework to the services we provide.  

The 4 Pillars of Internet Resilience

Specific to IPM, we've distilled the essentials into four pillars that ensure seamless CX:  

  • Reachability: The starting point of digital interaction is ensuring that user requests make it to your server without a hitch, setting the initial tone for the user experience.  
  • Availability: True availability extends beyond an "HTTP 200 OK" response; it involves ensuring all functionalities of an application are working as intended. For example, if an eCommerce site's images or descriptions fail to load, it's not fully available to the user.  
  • Performance: This pillar focuses on the speed and responsiveness of your application. It’s about benchmarking your performance against industry leaders to provide a competitive user experience.  
  • Reliability: The hallmark of reliability is the application's ability to consistently deliver on reachability, availability, and performance, ensuring all users receive a uniform experience, no matter the circumstances.  
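
To make the availability pillar concrete, here is a minimal sketch of a check that treats "HTTP 200 OK" as necessary but not sufficient. The function name `is_fully_available` and the marker strings are hypothetical illustrations, not part of any Catchpoint API.

```python
def is_fully_available(status_code: int, body: str, required_markers: list[str]) -> bool:
    """Return True only if the response is 200 AND every expected fragment is present."""
    if status_code != 200:
        return False
    return all(marker in body for marker in required_markers)


# Hypothetical product page: the description rendered, but the image tag did not.
page = '<h1>Blue Kettle</h1><p class="description">Boils fast.</p>'
markers = ['class="description"', '<img class="product-image"']

print(is_fully_available(200, page, markers))  # False: 200 OK, yet the image is missing
```

A real check would look for whatever content matters to your users; the point is only that the status code alone says little about what the customer actually sees.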

At a foundational level, your service must be up and running. Once it’s available, users expect it to be fast. Next, you must enable your users to access the service from anywhere in the world or at least make it accessible from all regions within your target market. And finally, your website must be consistent in its availability, performance, and reachability.  

Each of these pillars is crucial in shaping the overall customer experience. By monitoring these aspects meticulously, we can pre-emptively identify potential incidents and optimize user interactions, ensuring a smooth and satisfying digital journey for every user.  

Implementing a Customer Experience suite with IPM  

Let's look at a practical example of an IPM CX suite.  

This framework provides continuous visibility into the user experience, promptly identifying and resolving any potential issues. By implementing such a setup, businesses can avoid problems that could negatively impact the customer journey, such as slow website performance or security certificate errors.  

With 15 years of unrivaled experience, Catchpoint is the go-to expert in IPM. We've perfected best practices and methodologies, ensuring top-tier digital experiences for global industry leaders.  

Here's a snapshot of what a robust CX monitoring setup might look like:  

  • Single Object Test: Monitors individual elements like images or scripts to ensure they load correctly.
  • DNS Experience Test: Ensures the DNS resolution process functions efficiently, connecting users to your website without delay.
  • Trace Route Test: Provides insights into the path network traffic takes to reach your server, helping to pinpoint potential delays.
  • SSL Test: Checks the validity and expiration of security certificates to prevent security warnings that could deter users.
  • Chrome Transaction Test: Simulates user interactions to test complex transactions and application flows.  

Each test runs at a specific frequency, represented in minutes, to balance thoroughness with resource efficiency:  

Test Type          | Node Type | Frequency (in minutes)
Single Object      | Backbone  | 5
DNS Experience     | Backbone  | 5
Trace Route        | Backbone  | 5
SSL                | Backbone  | 1,440 (once a day)
Chrome Transaction | Backbone  | 60
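
As a rough illustration, the suite above can be written down as plain data, which makes it easy to see how many runs per day each frequency implies. The `MonitorSpec` class is a hypothetical structure for this sketch and does not reflect how Catchpoint stores test configuration.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60  # 1,440


@dataclass(frozen=True)
class MonitorSpec:
    test_type: str
    node_type: str
    frequency_minutes: int

    def runs_per_day(self) -> int:
        # Number of runs per day, per node, at this frequency
        return MINUTES_PER_DAY // self.frequency_minutes


# Values taken from the table above
CX_SUITE = [
    MonitorSpec("Single Object", "Backbone", 5),
    MonitorSpec("DNS Experience", "Backbone", 5),
    MonitorSpec("Trace Route", "Backbone", 5),
    MonitorSpec("SSL", "Backbone", 1440),
    MonitorSpec("Chrome Transaction", "Backbone", 60),
]

for spec in CX_SUITE:
    print(f"{spec.test_type:<20}{spec.runs_per_day():>4} runs/day per node")
```

At a 5-minute frequency each node runs a test 288 times a day, while the hourly transaction test runs 24 times and the SSL check once.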

It's essential to note that these frequencies and test types are tailored to typical use cases. Depending on the unique needs of a service, the frequency could be increased to sub-minute intervals, and additional tests may be incorporated for a more granular view.  

By integrating such a suite into your monitoring strategy, you can gain comprehensive visibility into your application's performance and availability, DNS resolution, TCP layer connectivity, certificate health, and user transaction functionality, ensuring a superior customer experience.  
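
To give a feel for the certificate health part of that list, here is a small sketch of the expiry arithmetic using Python's standard `ssl` module. The expiry date, the "current" time, and the 30-day warning threshold are all made-up values for illustration; a real SSL test would read the date from the live certificate.

```python
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # hypothetical alert threshold


def days_until_expiry(not_after: str, now: datetime) -> float:
    """not_after uses the format found in the 'notAfter' field of getpeercert()."""
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch (UTC)
    return (expires_at - now.timestamp()) / 86400


# Hypothetical certificate and check time
not_after = "Jan 15 12:00:00 2025 GMT"
now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

remaining = days_until_expiry(not_after, now)
print(remaining)              # 14.0
print(remaining < WARN_DAYS)  # True: time to renew
```

Because certificates change slowly, a once-a-day check like this is usually enough to catch an upcoming expiry long before users see a browser warning.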

Now that we’ve considered an ideal CX setup, let’s explore a recent incident involving a well-known file-sharing company.  

We’ll consider how a similar setup contributed to rapid outage detection, faster MTTD, and ultimately aided in quickly resolving the issue.  

Outage case study: What happened?    

A significant outage struck a prominent file-sharing company on December 15, 2023, lasting from 6:00 AM to 9:11 AM Pacific Time. This incident affected numerous critical services, such as the files tool, APIs, and user logins, and compromised the core functionalities of uploading and downloading. As a result, countless users found themselves unable to share files or access their accounts—a clear disruption to business and personal operations.  

Early detection: The role of IPM

One key factor determining how long an outage lasts is how early you detect it.  

In this case, proactive IPM played a crucial role. Our IPM platform raised the alarm at 4:37 AM PST, 83 minutes before the officially reported start time of 6:00 AM, when it detected the first signs of critical API failures.
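
The timeline is easy to quantify. The sketch below uses only the times quoted in this post and assumes that "PST" and "Pacific Time" refer to the same local clock, so no timezone conversion is needed.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Times as quoted in this post (all Pacific, December 15, 2023)
first_alert = datetime.strptime("2023-12-15 04:37", FMT)
reported_start = datetime.strptime("2023-12-15 06:00", FMT)
reported_end = datetime.strptime("2023-12-15 09:11", FMT)

lead_minutes = int((reported_start - first_alert).total_seconds() // 60)
outage_minutes = int((reported_end - reported_start).total_seconds() // 60)

print(lead_minutes)    # 83  -> alert fired 1 h 23 min before the reported start
print(outage_minutes)  # 191 -> reported outage lasted 3 h 11 min
```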


Assessing the alert: Distinguishing false alarms from genuine threats

Upon receiving an alert, any team's next questions are obvious: Is it worth waking someone up, or is it a false positive? Is it a network error or an application error?

In this case, a consistent pattern of 5XX errors emerged in the logs, indicating a genuine and substantial issue requiring urgent attention.    
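
One simple way to frame that triage is sketched below. It separates "no HTTP response at all" (a likely network problem) from 5XX responses (the server answered, but with an error), and only treats a run of consecutive 5XX results as a genuine incident. The function names and the threshold of three consecutive failures are assumptions for illustration, not Catchpoint's alerting logic.

```python
from typing import Optional


def classify(status: Optional[int]) -> str:
    """Bucket a single test run by its HTTP status (None = no response received)."""
    if status is None:
        return "network"       # timeout, DNS failure, connection refused, ...
    if 500 <= status <= 599:
        return "server_error"  # the application responded, but with a 5XX
    if status >= 400:
        return "client_error"
    return "ok"


def is_genuine_incident(statuses: list[Optional[int]], min_consecutive: int = 3) -> bool:
    """True if the most recent `min_consecutive` runs were all 5XX."""
    recent = statuses[-min_consecutive:]
    return len(recent) == min_consecutive and all(
        classify(s) == "server_error" for s in recent
    )


# Hypothetical run history, oldest first
print(is_genuine_incident([200, 200, 503, 500, 502]))  # True: sustained 5XX pattern
print(is_genuine_incident([200, None, 200, 503]))      # False: a single blip
```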


How widespread is the outage?   

A critical step for any response team is determining the outage's extent. Is it only the API, or are other services also impacted? This step is vital to developing an appropriate response strategy.  

In this scenario, concurrent alerts from multiple services—including APIs, uploads, downloads, and logins—painted a picture of a far-reaching outage, informing the strategy for response and recovery.  
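
The scoping question can be reduced to a small set comparison, as in the sketch below. The service names mirror the ones mentioned above, while the `outage_scope` function and its labels are hypothetical.

```python
def outage_scope(alerting: set[str], monitored: set[str]) -> str:
    """Label the outage by how many monitored services are currently alerting."""
    affected = alerting & monitored
    if not affected:
        return "none"
    if len(affected) == 1:
        return "isolated"
    return "widespread"


monitored = {"api", "uploads", "downloads", "logins"}

print(outage_scope({"api"}, monitored))                                     # isolated
print(outage_scope({"api", "uploads", "downloads", "logins"}, monitored))   # widespread
```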


Analyzing the potential impact

Outages like the one described here can disrupt operations and carry severe financial repercussions. A 2023 study by Forrester Consulting on behalf of Catchpoint revealed that eCommerce companies lose millions of dollars annually due to Internet disruptions.  

Forrester found that 88% of respondents estimated that their companies lost over $100,000 to disruptions in the month leading up to the survey. If that rate held for a full year, losses would exceed $1.2 million. Moreover, 51% reported losing over $500,000 in the previous month alone.  
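
The annual figure is a straightforward extrapolation. The sketch below assumes the surveyed month is typical and that losses repeat at the same rate every month, which the survey itself does not claim.

```python
MONTHS_PER_YEAR = 12

monthly_loss_floor = 100_000  # "over $100,000" in the month before the survey
annual_loss_floor = monthly_loss_floor * MONTHS_PER_YEAR

print(f"${annual_loss_floor:,}")  # $1,200,000
```

Since the monthly figure is a lower bound, the annual figure is one too.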

Having a robust IPM strategy for CX is indispensable.  

The outage, lasting just over 3 hours, could have extended much further without the early detection afforded by proactive IPM.  

The 4 pillars of reachability, availability, performance, and reliability form the backbone of a stellar CX. The recent file-sharing company outage illustrates how IPM's early detection is instrumental in curtailing potential losses by finding and fixing issues before the business is impacted.  

Join us in the next update of our IPM Best Practices Series, where we’ll dive into the critical role of neutral, third-party data in service level objective tracking and how to navigate the challenges of service level agreements.
