
Accelerating detection to resolution: A case study in Internet Resilience

Published

January 25, 2024

Moiz Khan

Today, any revenue-generating website is like a house of cards: it has multiple points of failure, and any one of them can bring it down. The modern service delivery chain relies on intricate multi-step transactions and third-party API integrations, which makes the system more complex and interconnected.

Figure: High-Level Application Architecture (source: LITSLINK Blog, "Web Application Architecture: A Guide Through the Intricate Process of Building an App")

A single point of failure in the architectural diagram above can lead to slowdowns and outages with tangible consequences for your bottom line. Therefore, reducing the time it takes for you to detect and resolve incidents is a mission-critical capability for Internet-dependent businesses – which, let’s face it, is every business out there.

Key contributors to outage costs: Unpacking MTTD, MTTI and MTTR

There are three major factors that impact the cost of outages:

#1 - MTTD – Mean Time to Detection

MTTD, or Mean Time to Detection, is the average time between the start of an incident and the moment you become aware of it. This phase is of paramount importance: the sooner you detect the issue, the sooner you can start resolving it.

#2 - MTTI – Mean Time to Identification

MTTI, or Mean Time to Identification, covers the next crucial step: identifying the root cause of the outage. This step is often overlooked but equally essential. A short MTTI expedites the resolution process by pinpointing the specific component or issue responsible for the outage.

#3 - MTTR – Mean Time to Resolve

Lastly, MTTR, or Mean Time to Resolve, represents the time taken to fully address and fix the incident once it has been identified. This step encompasses all the actions and measures needed to bring the system back to normal operation. Reducing MTTR minimizes downtime and, with it, the impact on your business.
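As a rough illustration of how the three metrics relate, the sketch below computes them from a list of incident records. The `Incident` class, its four timestamp fields, and the sample data are hypothetical, and the phase boundaries follow the definitions above (MTTR is measured from identification to resolution). Some teams measure MTTR from the start of the incident instead, so treat this as one possible convention rather than a standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # All four timestamps are hypothetical fields for this sketch.
    started: datetime     # when the outage actually began
    detected: datetime    # when someone (or something) noticed it
    identified: datetime  # when the root cause was pinned down
    resolved: datetime    # when normal operation was restored


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def outage_metrics(incidents):
    """Return MTTD, MTTI and MTTR in minutes, one phase per metric."""
    return {
        "MTTD": mean_minutes([i.detected - i.started for i in incidents]),
        "MTTI": mean_minutes([i.identified - i.detected for i in incidents]),
        "MTTR": mean_minutes([i.resolved - i.identified for i in incidents]),
    }


# Made-up sample data: two incidents starting at the same reference time.
t0 = datetime(2024, 1, 25, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=20), t0 + timedelta(minutes=50)),
    Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=40), t0 + timedelta(minutes=90)),
]
print(outage_metrics(incidents))  # {'MTTD': 10.0, 'MTTI': 20.0, 'MTTR': 40.0}
```

Keeping the phases separate makes it easier to see where the time goes: slow detection, slow diagnosis, or a slow fix.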

Consider, for instance, a DNS outage. Any number of issues could be at play:

All nameservers are down
DNS hijack
Only a few nameservers are down
Performance degradation
Changes to your nameservers
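To make the distinction concrete, here is a minimal sketch of how per-nameserver probe results might be mapped onto the cases listed above. Everything in it is an assumption for illustration: the `NameserverProbe` structure, the `classify_dns` function, the order in which the cases are checked, and the 200 ms latency threshold. It does not perform any DNS queries itself; it only classifies results that some probe has already collected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NameserverProbe:
    # Hypothetical result of querying one authoritative nameserver.
    nameserver: str
    answer: Optional[str]  # IP returned for the record, or None if no response
    latency_ms: float = 0.0


def classify_dns(probes, expected_nameservers, expected_answers, slow_ms=200.0):
    """Map probe results onto the failure modes listed above."""
    observed = {p.nameserver for p in probes}
    if observed != set(expected_nameservers):
        return "nameserver set changed"
    responding = [p for p in probes if p.answer is not None]
    if not responding:
        return "all nameservers down"
    if any(p.answer not in expected_answers for p in responding):
        return "possible DNS hijack"
    if len(responding) < len(probes):
        return "some nameservers down"
    if any(p.latency_ms > slow_ms for p in responding):
        return "performance degradation"
    return "healthy"


# Made-up example: one of two nameservers stops answering.
expected_ns = ["ns1.example.com", "ns2.example.com"]
probes = [
    NameserverProbe("ns1.example.com", "192.0.2.10", 35.0),
    NameserverProbe("ns2.example.com", None),
]
print(classify_dns(probes, expected_ns, {"192.0.2.10"}))  # some nameservers down
```

The point of the sketch is that each case needs different evidence: a hijack only shows up if you know which answers are legitimate, and a partial outage only shows up if every nameserver is probed individually.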

Identifying the specific component responsible can be extremely challenging. In this case study, we explore how a leading tech company’s proactive monitoring approach with Catchpoint yielded substantial improvements in its incident response times and overall Internet Resilience.

A case study in reducing MTTR

Problem

The company faced a significant challenge with some of its products and services sold through an eCommerce site. This particular business group was just getting started and had no monitoring systems in place. Their posture was reactive, often relying on customers to inform them when issues arose, such as the inability to complete purchases. This approach resulted in millions of dollars in revenue losses.

Solution

The company was already using Catchpoint for other aspects of its business, but its newly established eCommerce site required support. Catchpoint extended its suite of Internet Performance Monitoring (IPM) services to cover the company’s eCommerce platform. Within this framework, we established tests designed to help the business promptly detect and respond to failures.

Results

The outcomes were significant. After our tests were deployed, the company’s MTTD dropped from 2-4 hours to just 5 minutes. We also configured real-time dashboards to provide continuous updates. These dashboards facilitated immediate alerts, offering precise information about the nature and cause of each failure. The alerts were directed to the relevant team, enabling them to investigate promptly and cut their Mean Time to Resolve (MTTR) from 8-10 hours to a mere 30 minutes.
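For a sense of scale, the before and after figures can be turned into percentage reductions. The helper below is only a worked calculation on the numbers quoted above; the function name is ours, and the hour ranges are converted to minutes.

```python
def percent_reduction(before_minutes, after_minutes):
    """Percentage by which a duration shrank."""
    return (before_minutes - after_minutes) / before_minutes * 100


# MTTD: 2-4 hours before, 5 minutes after.
mttd_low = percent_reduction(2 * 60, 5)
mttd_high = percent_reduction(4 * 60, 5)
print(f"MTTD reduced by {mttd_low:.0f}% to {mttd_high:.0f}%")  # 96% to 98%

# MTTR: 8-10 hours before, 30 minutes after.
mttr_low = percent_reduction(8 * 60, 30)
mttr_high = percent_reduction(10 * 60, 30)
print(f"MTTR reduced by {mttr_low:.0f}% to {mttr_high:.0f}%")  # 94% to 95%
```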

Unless your monitoring tool can provide this crucial information, you'll spend a lot of time troubleshooting, which can be both exasperating and time-intensive. Such inefficiencies directly influence your Mean Time to Identification (MTTI) and Mean Time to Resolve (MTTR) metrics. The example above pertains to DNS, but numerous other components within your system can trigger outages, each with its own set of potential failure causes.

Best practices to reduce MTTD and MTTR

Comprehensive Monitoring Strategy. Establish a robust monitoring strategy that covers all aspects impacting User Experience, including availability and performance at all levels.

Optimal Test Configuration. Configure tests appropriately and ensure they run at optimal frequencies. Avoid running availability tests infrequently and develop strategies to minimize false positives by distinguishing between warning signs and critical issues.

Data Utilization. Make the most of the data collected by your monitoring system. Create customized real-time dashboards with shareable links and integrate data with Observability tools and other components in your monitoring stack, such as APM and Tracing.
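The trade-off behind test configuration can be sketched in a few lines. The example below assumes a simple rule, chosen only for illustration: one failed run raises a warning, and three consecutive failed runs escalate to critical. The function names and thresholds are hypothetical and do not describe any particular product's alerting logic.

```python
def alert_level(recent_runs, warn_after=1, critical_after=3):
    """Classify a check from its recent runs (True = passed, oldest first).

    Counts consecutive failures at the end of the history, so a single
    blip raises a warning and only a sustained failure escalates.
    """
    streak = 0
    for passed in reversed(recent_runs):
        if passed:
            break
        streak += 1
    if streak >= critical_after:
        return "critical"
    if streak >= warn_after:
        return "warning"
    return "ok"


def worst_case_detection_minutes(interval_minutes, critical_after=3):
    """Upper bound on the time from failure start to a critical alert.

    Assumes the failure begins just after a passing run and ignores
    how long each test run itself takes.
    """
    return interval_minutes * critical_after


print(alert_level([True, True, False]))          # warning
print(alert_level([True, False, False, False]))  # critical
print(worst_case_detection_minutes(1))           # 3
print(worst_case_detection_minutes(30))          # 90
```

Requiring several consecutive failures cuts false positives, but it multiplies the detection delay by the test interval. That is why infrequent availability tests are costly: with the rule above, a 30-minute interval can leave an outage unconfirmed for up to 90 minutes, against 3 minutes at a 1-minute interval.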

Here's a checklist to ensure comprehensive monitoring for reducing MTTD and MTTR:

Ensure that all critical pages are available and reachable.
Verify that all essential functions are working correctly to enable users to complete transactions.
Monitor page load times to ensure they are within acceptable limits.
Confirm that microservices are functioning correctly and at optimal performance.
Set up monitoring at high frequencies to facilitate timely actions.
Monitor from your user's perspective and experience.
Include third-party components in your monitoring strategy, as they can impact website availability and performance.
Maintain 24/7 monitoring to detect issues as soon as they arise.
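The first and third checklist items (availability and response time) could be sketched as a minimal synthetic check like the one below. It uses only the Python standard library. The function names, the 3-second budget, and the example URLs are assumptions for illustration. The default fetcher times how long the server takes to start responding, which is not the same as a full page load in a real browser.

```python
import time
import urllib.request


def http_fetch(url, timeout=10):
    """Default fetcher: return the HTTP status code for a GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def check_page(url, max_seconds=3.0, fetch=http_fetch):
    """Check one page for availability and response time."""
    start = time.perf_counter()
    try:
        status = fetch(url)
    except Exception as exc:  # DNS failure, timeout, reset, HTTP 4xx/5xx, ...
        return {"url": url, "available": False, "error": str(exc)}
    elapsed = time.perf_counter() - start
    return {
        "url": url,
        "available": 200 <= status < 400,
        "within_budget": elapsed <= max_seconds,
        "seconds": elapsed,
    }


# Hypothetical critical pages; a stub fetcher keeps the example offline.
critical_pages = ["https://shop.example.com/", "https://shop.example.com/checkout"]
for page in critical_pages:
    result = check_page(page, fetch=lambda url: 200)
    print(result["url"], result["available"], result["within_budget"])
```

Dropping the `fetch` argument makes `check_page` issue a real request. A production setup would also run such checks from multiple locations, on a schedule, and against complete transactions rather than single pages.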

Reduce MTTD and MTTR with Catchpoint

Catchpoint offers over 40 test types and seamless integrations with industry-leading incident management tools, facilitating swift incident detection, identification, and resolution. This streamlined approach significantly reduces outage costs. Our advanced alerting mechanism empowers you to send essential data for customizing your automation processes, enabling the resolution of many incidents with minimal human intervention. Additionally, playbooks can be incorporated into the alert payload to effectively handle incidents. Explore our guided product tour, or contact us to learn more.  
