Glossary of Terms

Mean Time to Resolve (MTTR)

What is Mean Time to Resolve (MTTR)?

MTTR, or mean time to resolve, is a metric that equals the total time spent from the start of each issue to its resolution, divided by the number of issues.

Total resolution time / number of issues = MTTR

Example: If an IT team measures a total of 5 issues, with individual resolution times of:

  • 5 minutes
  • 4 minutes
  • 3 minutes
  • 2 minutes
  • 1 minute

Then the total time from start to resolution, 15 minutes, is divided by the total number of issues, 5.

15 min. total resolution time / 5 total incidents = 3 min. MTTR.
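The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the formula above, not a standard tool: the function name `mean_time_to_resolve` is made up for this example, and it assumes each resolution time is already available as a duration.

```python
from datetime import timedelta


def mean_time_to_resolve(resolution_times):
    """Return total resolution time divided by the number of issues."""
    if not resolution_times:
        # MTTR is undefined when no issues were recorded.
        raise ValueError("at least one resolution time is required")
    total = sum(resolution_times, timedelta())
    return total / len(resolution_times)


# The five issues from the example above: 5, 4, 3, 2, and 1 minutes.
resolution_times = [timedelta(minutes=m) for m in (5, 4, 3, 2, 1)]
mttr = mean_time_to_resolve(resolution_times)
print(mttr)  # 0:03:00, i.e. 3 minutes
```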

How do businesses use MTTR?

Mean time to resolve (MTTR) helps businesses establish essential operating and maintenance costs for equipment, parts, services, procedures, and processes.

Different businesses define MTTR differently according to their needs, chiefly in what they treat as the start and end points of incident resolution.

This article focuses on the definition and use of MTTR in Information Technology (IT), although many other industries use MTTR. For example, Coca-Cola might use MTTR to improve the health and maintenance of its cola-making machines.

MTTR helps IT decision makers, like IT Managers, Chief Information Officers (CIOs), and Site Reliability Engineers (SREs), choose how to best handle a service, machine, or process. MTTR can help answer questions like:

  • Should the process or machine be kept in service?
  • Should it be replaced with something faster or more robust?
  • Can the process be automated?
  • Should the service be fortified?
  • Are there better alternatives to what is currently in place?

For example, if an SRE notices a domain name system (DNS) outage via their performance monitoring software, they can quickly switch over to their backup DNS provider. If the primary DNS provider continues to experience outages, the SRE and their IT team might consider switching providers.

Finding problems early means IT Managers, CIOs, DevOps leads, and SREs can act sooner. Acting sooner keeps resolution time down.

The longer a problem persists, the more difficult it will be to solve and the more customers it will affect. System performance monitoring strategies are often designed to catch small issues before they become large problems, which improves user experience and reduces MTTR.

How can you improve MTTR?

Businesses can improve MTTR by implementing system observability strategies. The sooner an organization knows a system has issues, the sooner decision makers can react and triage the problem. Acting quickly generally improves MTTR.

Performance observability and MTTR

Performance observability is the use of software to observe the performance of each piece that makes up an application or website.

Observing your infrastructure is like having a watchdog on duty 24/7. It looks over all of the components that make up an application, along with any automated processes, and provides actionable information when an issue is discovered. Observing all parts of the infrastructure from as many vantage points as possible provides the best coverage.

Alerting and notification

IT teams can reduce the cost of failure by using performance observability software that can notify them of issues as soon as they’re detected. Detected anomalies trigger alerts to be sent to decision-makers so they can act and get the system back online and operating properly.

Alerting a company to potential pitfalls early reduces the amount of time it takes to resolve a problem.
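As a rough illustration of the idea, the sketch below flags response-time samples that exceed a threshold and sends one alert message per anomaly. Everything in it is hypothetical: the function names, the 500 ms threshold, and the message format are assumptions for this example, and real observability software uses far richer detection and notification channels.

```python
def find_anomalies(samples_ms, threshold_ms=500.0):
    """Return (index, value) pairs for samples above the threshold."""
    return [(i, v) for i, v in enumerate(samples_ms) if v > threshold_ms]


def notify(anomalies, send=print):
    """Send one alert per anomaly; `send` stands in for email, chat, or paging."""
    for index, value in anomalies:
        send(f"ALERT: sample {index} took {value} ms")
    return len(anomalies)


# Hypothetical response times in milliseconds.
samples = [120, 480, 950, 130, 700]
alerts_sent = notify(find_anomalies(samples))
```

Passing a different `send` callable is one way to route the same alerts to whichever channel the responsible team actually watches.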

Team communication and MTTR

Another way of improving MTTR is to ensure that everyone on the team knows how to log a bug and alert any teammates responsible for fixing the problem or communicating with third parties. Responsible parties might include:

  • The QA or IT team members who escalate issues.
  • SREs, IT managers, or anyone involved in resolving the issues.
  • The person(s) who need to communicate with any third party experiencing outages.
  • The person(s) who need to alert other departments or employees within the company.

Listening to customers and MTTR

Although it’s best to catch problems before they affect too many customers, customers can play a role in improving MTTR.

Customers can alert a company to an issue on social media or through a support ticket. Readily available documentation, including help articles, how-to guides, tip sheets, and FAQs, is another line of defense for MTTR.

Ideally, businesses create documents with the customer in mind. They should be easily accessible and clearly written. Strong user documentation answers many questions that are not actual system faults and leaves technical support teams available for real issues.

Customers, now more than ever, want to troubleshoot and seek out answers on their own when they know they can easily do so. Most would rather follow a quick reference guide to solve their problem than wait on the phone to talk to tech support.

Most people will automatically reach for help or FAQs before they pick up the phone. If you have a strong library of updated documentation, tech support can free up more time to resolve infrastructure or process issues.

How MTTD affects MTTR

In IT, mean time to detect (MTTD) refers to the amount of time it takes from the start of an issue until the IT team detects the issue.

(sum of all detection times) / (number of detected incidents) = MTTD

Detecting a problem quickly directly affects how quickly an IT team can resolve it. The longer it takes to identify a problem, the longer it takes to get the system back to a usable state. Decreasing time to detection therefore also decreases time to resolution.
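The relationship can be made concrete with a small Python sketch. It assumes, as in the definitions above, that both metrics are measured from the start of the issue, so detection time is part of resolution time. The `Incident` record and its field names are hypothetical, as are the timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the issue began
    detected: datetime  # when the IT team noticed it
    resolved: datetime  # when service was restored


def mean_duration(durations):
    """Average a collection of timedelta values."""
    durations = list(durations)
    if not durations:
        raise ValueError("at least one incident is required")
    return sum(durations, timedelta()) / len(durations)


def mttd(incidents):
    # (sum of all detection times) / (number of detected incidents)
    return mean_duration(i.detected - i.started for i in incidents)


def mttr(incidents):
    # Total start-to-resolution time / number of incidents
    return mean_duration(i.resolved - i.started for i in incidents)


t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=10)),
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=20)),
]
print(mttd(incidents))  # 0:03:00
print(mttr(incidents))  # 0:15:00
```

Because every minute spent undetected is also a minute spent unresolved, shaving time off detection lowers MTTR by the same amount, all else being equal.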

Again, observability plays a big role in reducing MTTD. It’s important to observe each piece of infrastructure to detect issues, pinpoint sources, and resolve them before they affect many customers.

Conclusion

IT Managers, Chief Information Officers (CIOs), Site Reliability Engineers (SREs), and DevOps leads all play a part in deciding how to proceed with a machine, system, process, or service. Quick identification of issues is of prime importance to all of them when it comes to system recovery. The longer a small problem sits unfixed, the greater the chance that it becomes a big issue affecting many users.

Implementing a comprehensive performance observability strategy will help IT teams improve both MTTD and MTTR.