In a world of CI/CD (continuous integration and continuous delivery) and constant testing throughout the software delivery lifecycle, teams can easily overlook the importance of post-incident reviews. Yet the continuous improvement of monitoring and incident response processes requires thorough post-incident analysis of both technology and people.
Integrating experience monitoring tools with on-call incident response and management software gives DevOps and IT teams visibility into how a system behaves over time. Real-user monitoring shows how actual users experience your services, while synthetic monitoring lets you run proactive tests to identify weaknesses. You can then combine these insights with the way your team handles real-time incident response to create a holistic, proactive system for building and maintaining reliable applications and infrastructure.
Effective monitoring and alerting are only the first part of a successful incident management equation. Here, we’ll walk through the anatomy of a well-built on-call process and how to conduct post-incident reviews, so you can protect the time and money invested in service reliability.
Making the most of your monitoring investment
Every tool your team uses should serve a purpose. Most engineering and IT teams use numerous tools for software delivery, monitoring and alerting. Over time, the team will spend countless hours setting up and tweaking network, server, application and experience monitoring tools and thresholds, not to mention budgeting for the associated financial costs. But the costs of downtime far outweigh the costs of an effective monitoring strategy. To make the most of your monitoring investment, you also need an action plan for when an incident inevitably strikes.
Monitoring and alerting only help you detect and identify problems. Without a collaboration plan in place, you won’t be equipped to act on the alert. DevOps and IT teams are constantly talking about ways to improve incident detection and MTTA (mean time to acknowledge). But you can shorten an incident’s lifespan the most by focusing on the gap between MTTA and MTTR (mean time to resolve) and finding ways to get the right information to the right person quickly.
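To make the gap between MTTA and MTTR concrete, here is a minimal Python sketch. The incident records, field names, and timestamps are hypothetical, and it assumes both metrics are measured from the moment the alert fired; your tooling may define them differently.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when the alert fired, when a responder
# acknowledged it, and when service was restored.
incidents = [
    {"triggered": datetime(2019, 4, 1, 9, 0),
     "acknowledged": datetime(2019, 4, 1, 9, 4),
     "resolved": datetime(2019, 4, 1, 10, 0)},
    {"triggered": datetime(2019, 4, 8, 14, 0),
     "acknowledged": datetime(2019, 4, 8, 14, 6),
     "resolved": datetime(2019, 4, 8, 15, 30)},
    {"triggered": datetime(2019, 4, 15, 22, 0),
     "acknowledged": datetime(2019, 4, 15, 22, 5),
     "resolved": datetime(2019, 4, 15, 22, 45)},
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    return mean(
        [(r[end_key] - r[start_key]).total_seconds() / 60 for r in records]
    )


mtta = mean_minutes(incidents, "triggered", "acknowledged")
mttr = mean_minutes(incidents, "triggered", "resolved")

# The time between acknowledgment and resolution is where responders
# are actually working the problem.
response_gap = mttr - mtta

print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min, gap: {response_gap:.1f} min")
```

In this made-up data set, acknowledgment accounts for only a few minutes of each incident, so most of the room for improvement sits in the gap.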
In The State of On-Call Report from VictorOps, you’ll see that, on average, 73% of the incident lifecycle is spent in the response phase. The report also states that only 5% of the lifecycle is spent in alerting, whereas 12% is spent in documentation and analysis. Yet teams spend a disproportionate amount of time improving alerting workflows compared to preparing for future incident response. As you spend more time conducting thorough post-incident reviews, you’ll create better services and drive more reliable customer experiences, getting the most value from your monitoring and alerting tools over time.
Defining the goals of your post-incident review
The goals of your post-incident review should be comprehensive and cover every piece of the incident lifecycle. The common question behind every metric should be, “How does this help us deepen the reliability of our services without sacrificing speed?”
The top objectives of any effective post-incident review process are the following:
- Incident detection and alerting: How can you reduce MTTA and help on-call teams know about an issue faster?
- Incident response and resolution: How do you recover from an incident faster to lower MTTR?
- Analysis and learning: What did you learn about the people, processes, and technology involved in your entire software delivery and incident lifecycles?
- Improvement: Find ways to act on the above insights. Measure how quickly you can come up with a quick fix to appease customers, but also measure the time it takes to reach the full resolution of an incident.
You’ll need visibility into system data and human workflows to track incident management goals and conduct accurate post-incident reviews. Then, you’ll be able to learn how an incident affects both the system and your people—helping you bolster service reliability, add observability to the system’s known unknowns, and implement process and tooling improvements.
Post-incident review KPIs and metrics
- Time to incident acknowledgment
- Time to incident resolution
- How much time was spent in each phase of the incident lifecycle (detection, response, and remediation)?
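The KPIs above can be derived from a handful of timestamps per incident. The following Python sketch is one way to do it. The field names and phase boundaries are assumptions made for illustration: detection runs from the start of the problem to the alert, response from the alert to the start of the fix, and remediation from the start of the fix to restored service.

```python
from datetime import datetime

# Hypothetical timestamps for a single incident.
incident = {
    "started": datetime(2019, 4, 1, 9, 0),        # problem began
    "detected": datetime(2019, 4, 1, 9, 5),       # alert fired
    "acknowledged": datetime(2019, 4, 1, 9, 9),   # responder took ownership
    "fix_started": datetime(2019, 4, 1, 10, 15),  # diagnosis done, fix underway
    "resolved": datetime(2019, 4, 1, 10, 40),     # service restored
}


def minutes_between(start, end):
    """Elapsed minutes between two datetimes."""
    return (end - start).total_seconds() / 60


def review_kpis(inc):
    """Compute acknowledgment time, resolution time, and per-phase durations."""
    phases = {
        "detection": minutes_between(inc["started"], inc["detected"]),
        "response": minutes_between(inc["detected"], inc["fix_started"]),
        "remediation": minutes_between(inc["fix_started"], inc["resolved"]),
    }
    total = sum(phases.values())
    return {
        "time_to_acknowledge": minutes_between(inc["detected"], inc["acknowledged"]),
        "time_to_resolve": minutes_between(inc["detected"], inc["resolved"]),
        "phase_minutes": phases,
        # Share of the whole lifecycle spent in each phase, in percent.
        "phase_share": {
            name: round(100 * spent / total, 1) for name, spent in phases.items()
        },
    }


kpis = review_kpis(incident)
print(kpis)
```

Tracking the phase shares across many incidents shows where the lifecycle is actually being spent, rather than where the team assumes it is.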
Build a detailed incident timeline
- When was the incident first identified? (exact date and time of day)
- When did you restore the service? (exact date and time of day)
- Who was notified of the incident first? How were they notified?
- Who was the first on-call responder to acknowledge the incident and take action?
- What types of escalation did the incident go through? Who else entered the fray in order to solve the issue?
- What tasks or commands were executed by the team? Who executed these commands and when did they do so?
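The timeline questions above map naturally onto a small event log. The sketch below shows one possible shape for it in Python. The class names, event kinds, and sample entries are all hypothetical, not a format used by any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class TimelineEvent:
    at: datetime      # exact date and time of day
    actor: str        # person or system responsible for the event
    kind: str         # e.g. "alert", "ack", "escalate", "command", "resolve"
    detail: str = ""  # free-form note: channel used, command run, and so on


@dataclass
class IncidentTimeline:
    events: List[TimelineEvent] = field(default_factory=list)

    def record(self, at, actor, kind, detail=""):
        """Add an event and keep the timeline in chronological order."""
        self.events.append(TimelineEvent(at, actor, kind, detail))
        self.events.sort(key=lambda event: event.at)

    def first(self, kind) -> Optional[TimelineEvent]:
        """Return the earliest event of the given kind, if any."""
        return next((e for e in self.events if e.kind == kind), None)

    def of_kind(self, kind) -> List[TimelineEvent]:
        """Return every event of the given kind, in order."""
        return [e for e in self.events if e.kind == kind]


# Events can be recorded in any order, for example while reconstructing
# the incident after the fact.
timeline = IncidentTimeline()
timeline.record(datetime(2019, 4, 1, 9, 12), "lee", "escalate", "Database team paged")
timeline.record(datetime(2019, 4, 1, 9, 0), "monitoring", "alert", "Checkout latency above threshold")
timeline.record(datetime(2019, 4, 1, 9, 4), "dana", "ack", "Acknowledged from phone")
timeline.record(datetime(2019, 4, 1, 9, 20), "lee", "command", "Restarted the connection pool")
timeline.record(datetime(2019, 4, 1, 9, 31), "dana", "resolve", "Latency back to normal")

first_ack = timeline.first("ack")
restored = timeline.first("resolve")
print(f"First responder: {first_ack.actor} at {first_ack.at:%H:%M}")
print(f"Service restored at {restored.at:%H:%M}")
```

Keeping the actor and the time on every entry means the "who" and "when" questions can be answered by filtering the log rather than by asking people to remember.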
Maintain a record of communication history throughout the incident timeline
- What kind of information was shared?
- Which channels were people using to communicate?
Analyze all of the data from the above bullet points and draw some conclusions
- Which tasks, communications, or processes made a positive impact on incident resolution?
- Which tasks, communications, or processes made a negative impact on incident resolution?
- What happened that made no impact whatsoever?
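One lightweight way to capture these conclusions is to tag each observation with its impact and group the results. The Python sketch below assumes a simple three-way label ("positive", "negative", "neutral"). The observations themselves are invented examples.

```python
from collections import defaultdict

ALLOWED_IMPACTS = {"positive", "negative", "neutral"}

# Hypothetical notes gathered during a review, each tagged with its impact.
observations = [
    ("Runbook link included in the alert", "positive"),
    ("Paged the wrong team first", "negative"),
    ("Status update posted in two channels", "neutral"),
    ("Rolled back the last deploy", "positive"),
]


def group_by_impact(notes):
    """Group observation descriptions under their impact label."""
    grouped = defaultdict(list)
    for description, impact in notes:
        if impact not in ALLOWED_IMPACTS:
            raise ValueError(f"Unknown impact label: {impact!r}")
        grouped[impact].append(description)
    return dict(grouped)


summary = group_by_impact(observations)
for impact in ("positive", "negative", "neutral"):
    print(impact, "->", summary.get(impact, []))
```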
Centralizing data and communication for holistic incident analysis
Creating an effective post-incident review process relies heavily on access to the right information. By centralizing your critical alert data and communication history in a single-source-of-truth timeline, you’ll be able to find critical information faster. If you’re only looking at your monitoring data during a post-incident review, you’re only looking at half the equation. The way your team responds to the monitoring metrics they’re viewing is equally important to the rapid remediation of outages and incidents.
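As a rough illustration of a single-source-of-truth timeline, the following Python sketch merges a monitoring alert feed and a chat log into one chronological list. It assumes each source is already sorted by time and exposes (timestamp, source, text) entries. The sample data is hypothetical.

```python
import heapq
from datetime import datetime

# Hypothetical, already time-ordered feeds from two separate tools.
alerts = [
    (datetime(2019, 4, 1, 9, 0), "monitoring", "CPU above 95% on web-1"),
    (datetime(2019, 4, 1, 9, 20), "monitoring", "Error rate recovered"),
]
messages = [
    (datetime(2019, 4, 1, 9, 3), "chat", "Acking, looking at web-1"),
    (datetime(2019, 4, 1, 9, 15), "chat", "Restarted worker pool"),
]


def unified_timeline(*sources):
    """Merge sorted (timestamp, source, text) feeds into one ordered list."""
    return list(heapq.merge(*sources, key=lambda entry: entry[0]))


timeline = unified_timeline(alerts, messages)
for at, source, text in timeline:
    print(f"{at:%H:%M} [{source}] {text}")
```

Reading system events and human responses side by side makes it easier to see how long it took the team to react to each signal.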
A powerful monitoring and alerting system relies on collaboration and transparency across your organization. Not only does transparency lend itself to DevOps and IT success, but it also drives value faster for business teams.
Making the most of monitoring with post-incident reviews
Taking the time to conduct post-incident reviews and implement changes to systems and processes based on your learnings will save you time and money. You’ll be able to focus on developing new features and services while responding to fewer incidents in your applications and infrastructure.
As with many DevOps practices, this may seem counterintuitive at first. But spending more time spreading systemic knowledge across teams and sharing ownership of the services you build will make both development and incident management easier.
Gartner estimates the average cost of downtime at $5,600 per minute. Join us for a free joint webinar with VictorOps, Death to Downtime, on April 25th to learn how the right approach to monitoring and incident response can minimize the impact and cost of downtime.
This post was written by Melanie Macari, Product Marketing Manager at VictorOps+Splunk