Metrics

PaulKBaumann edited this page Oct 9, 2024 · 5 revisions

Introduction

This document outlines the procedures for tracking Mean Time to Resolve (MTTR) for incidents where the VRO team is identified as the root cause. It details how MTTR is measured, how we establish a baseline for it, and which tools we use to track it. This document is intended for internal use within our team and for communication with external stakeholders.

Summary

  • Mean Time to Resolve (MTTR) is automatically calculated by PagerDuty using incident creation and resolution times.
  • Timely incident creation, acknowledgement, and resolution are crucial for accurately representing the team's ability to respond to and resolve incidents.
  • At each sprint review, we should log that sprint's MTTR, overall and by severity, so we can present these metrics to the enablement team and build a historical record for analyzing trends in the team's responsiveness to incidents over time.

What is MTTR?

Mean Time to Resolve (MTTR) is the average time taken to fully resolve an incident from the moment it is detected until normal service is restored. MTTR is a critical metric for assessing the efficiency of our incident response process and identifying areas for improvement. More information can be found here.

Mean Time to Respond/Acknowledge is a related metric that measures the average time between incident creation and acknowledgement. VRO does not currently track this metric, but it is available on PagerDuty's Analytics tab.

How MTTR is Measured

MTTR is calculated as the sum of time to resolve each incident divided by the number of incidents. With PagerDuty, the time to resolve each incident is measured from the moment the incident is created to the time it is marked as resolved.
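As a rough illustration of the arithmetic only (PagerDuty performs this calculation for us), the following sketch averages the creation-to-resolution time over a list of incidents. The function name `mean_time_to_resolve` and the sample timestamps are hypothetical and do not come from PagerDuty.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average (resolved - created) over resolved incidents.

    `incidents` is a list of (created, resolved) datetime pairs; an
    unresolved incident has resolved=None and is skipped. Returns None
    when there is nothing to average.
    """
    durations = [
        resolved - created
        for created, resolved in incidents
        if resolved is not None
    ]
    if not durations:
        return None
    # Sum of time to resolve each incident divided by the number of incidents.
    return sum(durations, timedelta()) / len(durations)


# Hypothetical sample data: one 2-hour incident and one 4-hour incident.
incidents = [
    (datetime(2024, 9, 1, 9, 0), datetime(2024, 9, 1, 11, 0)),
    (datetime(2024, 9, 3, 14, 0), datetime(2024, 9, 3, 18, 0)),
]
mttr = mean_time_to_resolve(incidents)
print(mttr)  # 3:00:00
```

Because the clock starts at incident creation, an incident that is opened late or left unresolved after the fix skews this average, which is why timely ticket handling matters.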

Tracking MTTR for VRO

Incident Management in PagerDuty

PagerDuty is used to create and manage incidents for our services, and it also serves as the platform for determining our team's MTTR. Incidents created through Slack's incident report workflow, as well as those created manually on PagerDuty's website, are automatically included in the MTTR calculation. Creating, acknowledging, responding to, and closing incident tickets in a timely manner is necessary to accurately represent VRO's response and resolution capabilities.

  1. Incident Creation:
    • All incidents are created and managed in PagerDuty, as outlined in VRO's on-call documentation.
    • Ensure that incidents where our team is identified as the root cause are correctly labeled.
    • Adjust the Service field for incidents where VRO is not identified as the root cause.
  2. Incident Resolution:
    • Ticket creation and resolution times are used to compute MTTR automatically by PagerDuty.
    • Ensure all relevant details are documented, including steps taken to resolve the incident and any contributing factors. Refer to the on-call documentation for further elaboration on responding to incidents.
  3. MTTR Calculation:
    • PagerDuty automatically calculates MTTR and displays it on the dashboard found under the Analytics tab.
    • The Insights page also contains incident information and displays a more concise view of metrics compared to the Dashboard.
    • Filtering on Priority provides a representation of MTTR by incident severity level and should be used when generating MTTR reports for the Enablement team.
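The effect of filtering on Priority can be sketched as a simple group-and-average. This is only an illustration under assumed inputs, not PagerDuty's implementation: the function name `mttr_by_priority` and the sample durations are hypothetical.

```python
from collections import defaultdict
from datetime import timedelta


def mttr_by_priority(incidents):
    """Group (priority, time_to_resolve) pairs and average each group.

    Returns a dict mapping each priority label that has at least one
    incident to its mean time to resolve.
    """
    buckets = defaultdict(list)
    for priority, duration in incidents:
        buckets[priority].append(duration)
    return {
        priority: sum(durations, timedelta()) / len(durations)
        for priority, durations in sorted(buckets.items())
    }


# Hypothetical sample: two P1 incidents and one P3 incident.
sample = [
    ("P1", timedelta(hours=1)),
    ("P1", timedelta(hours=3)),
    ("P3", timedelta(days=1)),
]
result = mttr_by_priority(sample)
```

Priorities with no incidents are simply absent from the result, which corresponds to an n/a entry in a per-severity report.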

Forming a Baseline for MTTR

MTTR for each sprint should be recorded just prior to sprint review. Periodically, we should analyze the Sprint Metrics table for trends and identify any recurring issues that may affect resolution times. After several sprints of collecting these metrics, VRO should establish initial benchmarks from this data and use them to define acceptable MTTR thresholds for each severity and to track improvements or deviations from the baseline. These benchmarks should be adjusted as necessary based on ongoing performance and incident trends.
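One possible way to turn the recorded values into a baseline is sketched below. It assumes the baseline is the plain average of per-sprint MTTR values and that a sprint counts as a deviation when it exceeds the baseline by a fixed factor. Both the averaging rule and the `tolerance` value of 1.25 are illustrative assumptions, not an agreed VRO policy, and the sample history is hypothetical.

```python
from datetime import timedelta


def baseline_mttr(sprint_mttrs):
    """Average the recorded per-sprint MTTR values.

    Sprints with no incidents are represented as None and skipped.
    Returns None when no sprint has a recorded MTTR.
    """
    recorded = [m for m in sprint_mttrs if m is not None]
    if not recorded:
        return None
    return sum(recorded, timedelta()) / len(recorded)


def exceeds_baseline(mttr, baseline, tolerance=1.25):
    """True when a sprint's MTTR is more than `tolerance` times the baseline.

    The default tolerance is an arbitrary placeholder.
    """
    return mttr > baseline * tolerance


# Hypothetical history: two sprints without incidents, two with.
history = [None, timedelta(hours=4), timedelta(hours=8), None]
baseline = baseline_mttr(history)
flagged = exceeds_baseline(timedelta(hours=8), baseline)
```

Averaging per-sprint values weights every sprint equally regardless of how many incidents it had; averaging over individual incidents instead would be an equally reasonable choice once more data is available.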

Review Cadence

MTTR should be captured by severity (priority) and presented to the enablement team each sprint review. Tracking and reporting these metrics will help us understand our average resolution time and identify areas for process improvement.

Sprint Metrics

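| Sprint | Start Date | End Date | # Incidents | MTTR Overall | P1 MTTR | P2 MTTR | P3 MTTR | P4 MTTR | P5 MTTR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sprint X | 07-30-2024 | 08-13-2024 | 0 | n/a | n/a | n/a | n/a | n/a | n/a |
| Sprint Y | 08-13-2024 | 08-27-2024 | 0 | n/a | n/a | n/a | n/a | n/a | n/a |
| Sprint Z | 08-27-2024 | 09-10-2024 | 1 | 2d 5h 12m | n/a | n/a | n/a | n/a | n/a |
| Sprint 1 | 09-10-2024 | 09-24-2024 | 0 | n/a | n/a | n/a | n/a | n/a | n/a |
| Sprint 2 | 09-24-2024 | 10-08-2024 | 0 | n/a | n/a | n/a | n/a | n/a | n/a |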
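To keep new entries consistent with the days/hours/minutes notation used in the table, a small helper such as the following could format a duration before it is logged. The name `format_mttr` is hypothetical, and truncating leftover seconds is an assumption about the desired precision.

```python
from datetime import timedelta


def format_mttr(duration):
    """Render a timedelta as 'Xd Yh Zm', or 'n/a' when there is no value.

    Leftover seconds are dropped rather than rounded.
    """
    if duration is None:
        return "n/a"
    total_minutes = int(duration.total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days}d {hours}h {minutes}m"


print(format_mttr(timedelta(days=2, hours=5, minutes=12)))  # 2d 5h 12m
print(format_mttr(None))  # n/a
```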