This post will answer a simple question, “What is MTTD?” The answer—or at least the start of it—was already spoiled by the post title. Sure enough, MTTD stands for “Mean time to detect.” It refers to an important KPI (key performance indicator) in DevOps. Is the question answered? Can we call it a day with that definition? Of course not.
In the rest of the article, we’ll go deeper into MTTD. You’ll learn its complete definition first—in other words, the “what.” Then we’ll proceed to explain why you and your organization should care about it. With the “what” and “why” out of the way, we finally get to the “how,” by showing how to calculate MTTD in practice.
Then we wrap up by listing some other metrics that are essential for DevOps and are intimately related to our topic. Let’s get started.
What is MTTD: A Definition
As you’ve already seen, MTTD stands for “Mean Time to Detect.” The next question naturally follows: “But to detect what?” In a nutshell, problems.
MTTD means the average time it takes to detect a problem inside an organization. In other words, this indicator measures the elapsed time between the start of an issue (such as software malfunction or hardware failure) and its detection by the DevOps team.
Calculating MTTD can be simple: all the DevOps team needs is the number of incidents in a given period and the time each one took to be detected. It’s easier said than done, though, and we’ll look into how to perform this calculation in detail later.
Why is MTTD so important? The intuitive answer would probably be something along the lines of “the sooner we discover an issue, the sooner we can fix it.” That famous saying “you can’t improve what you don’t measure” also comes to mind. Putting the two together, we might say that MTTD is important because by measuring the time it takes to detect an issue, we take the first step toward bringing that time down. And since it’s been known for a long time that fixing a problem earlier is cheaper than fixing it later, it stands to reason that we should try our best to bring MTTD down.
Sure, that’s part of the answer. But the complete answer is both more complex and more interesting. Keep reading to learn why.
Why Should You Care About MTTD
We’ve just handled the “what” of MTTD, so now let’s focus on the “why.” What are the reasons that make this metric so crucial for organizations?
As you’ve just seen in the previous section, one of the reasons why you should care about MTTD is that you have to measure it before you can bring it down. But that’s not all.
MTTD also has an additional—and arguably more important—benefit: it serves as a test of your monitoring mechanisms. Here’s the thing—your organization already adopts tools and processes to monitor incidents. If these tools and processes work as intended, it shouldn’t be that hard to keep your organization’s MTTD low. The opposite is also true. If you adopt incident management mechanisms that aren’t up to the task, you and your DevOps team will have a hard time keeping MTTD down, which can result in catastrophic consequences for your organization.
Finally, a third reason for caring about MTTD is that it’s related to other important DevOps KPIs, such as MTTF (mean time to failure) or MTTR (mean time to repair).
Calculating MTTD: A Practical Example
We’ve just covered the “what” and “why,” so now we’re left with the “how.” Let’s learn the formula for MTTD, which is fairly simple. First, we add up all the incident detection times (for a given team, let’s say). Then, we divide that sum by the total number of incidents.
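Written out as a formula, the two steps above look like the following. The symbols are assumptions introduced here for illustration: n is the number of incidents, and the two t values are the start time and the detection time of incident i.

```latex
\mathrm{MTTD} = \frac{1}{n} \sum_{i=1}^{n} \left( t_i^{\mathrm{detect}} - t_i^{\mathrm{start}} \right)
```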
Let’s see an example. Consider the following table:
The table represents five incidents that happened during the period in which the organization tracks its MTTD (let’s say it’s two weeks). Each incident has both a start time and a detection time, which were recovered by analyzing log files. Finally, we have a third column that just displays the elapsed time (detection time – start time) in minutes. To calculate the mean time to detect, we now add all of the elapsed times together:
240 + 46 + 10 + 40 + 39 = 375
Then we divide that by the number of incidents. In our example, 5:
375 / 5 = 75
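If you’d rather let a script do the arithmetic, here’s a minimal Python sketch of the same calculation. The function names `elapsed_minutes` and `mean_time_to_detect` are made up for this post, and the timestamps in the first call are hypothetical. Only the five elapsed times come from the example.

```python
from datetime import datetime


def elapsed_minutes(start: datetime, detected: datetime) -> float:
    """Return the minutes between an incident's start and its detection."""
    return (detected - start).total_seconds() / 60


def mean_time_to_detect(elapsed: list[float]) -> float:
    """Average a list of detection delays (in minutes)."""
    if not elapsed:
        raise ValueError("need at least one incident")
    return sum(elapsed) / len(elapsed)


# Hypothetical timestamps, only to show how one elapsed time is derived.
print(elapsed_minutes(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 13, 0)))  # 240.0

# Elapsed times (in minutes) taken from the example above.
detection_delays = [240, 46, 10, 40, 39]
print(mean_time_to_detect(detection_delays))  # 75.0
```

In practice, the start and detection times would come from your logs or incident tracker rather than being typed in by hand.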
So, the MTTD, in our example, is 75 minutes. That doesn’t have to be the end of the story, though. Further calculations or analyses can be done after you’ve reached the result. For instance, some companies might choose to remove outliers from the list of values, which might make sense depending on the circumstances. Of course, the very definition of outlier will vary from organization to organization, depending on the methodology you choose. In our example, we can easily see that the lowest and highest values (10 and 240) are very distant from the typical values, so let’s remove them and use the formula again:
(46 + 40 + 39)/ 3 = 41.6666666667
By ignoring the outliers, we’ve obtained a different MTTD: approximately 41.67 minutes.
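Here’s a small Python sketch of that outlier step. It assumes the simplest possible rule, which is the one used above: drop the lowest and highest values and average what’s left. The function name `trimmed_mttd` and its `trim` parameter are hypothetical, and your organization may well define outliers differently.

```python
def trimmed_mttd(elapsed, trim=1):
    """Average the delays after dropping the `trim` lowest and `trim` highest values."""
    if len(elapsed) <= 2 * trim:
        raise ValueError("not enough incidents left after trimming")
    kept = sorted(elapsed)[trim:len(elapsed) - trim]
    return sum(kept) / len(kept)


# Elapsed times (in minutes) taken from the example above.
detection_delays = [240, 46, 10, 40, 39]
print(round(trimmed_mttd(detection_delays), 2))  # 41.67
```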
Another way in which organizations can refine their MTTD results is to group incidents by severity. That way, they can conclude, for instance, that even though the overall MTTD has been rising month after month, the MTTD for the most critical issues has been steadily going down. This would be great news, but the organization wouldn’t find out about it just by looking at the general MTTD.
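To make the grouping idea concrete, here’s a rough Python sketch. The severity labels are hypothetical: the example data doesn’t include them, so they’re attached to the five elapsed times purely for illustration. The function name `mttd_by_severity` is also made up for this post.

```python
from collections import defaultdict


def mttd_by_severity(incidents):
    """Map each severity label to the mean detection delay of its incidents."""
    grouped = defaultdict(list)
    for severity, minutes in incidents:
        grouped[severity].append(minutes)
    return {severity: sum(values) / len(values) for severity, values in grouped.items()}


# Hypothetical severity labels paired with the example's elapsed times (minutes).
incidents = [("high", 240), ("low", 46), ("high", 10), ("low", 40), ("low", 39)]
for severity, mttd in mttd_by_severity(incidents).items():
    print(severity, round(mttd, 2))
```

Tracking these per-severity numbers over time is what would let you spot a trend that the overall average hides.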
Other Metrics To Be Aware Of
MTTD is a key indicator to track the efficiency of your monitoring and incident response tools and processes, but it’s not the only one. Some others include:
- MTTF (Mean time to failure): The amount of time a product/device/system is expected to work before it fails.
- MTTR (Mean time to repair): The time it takes to fix an issue after it’s detected.
- MTBF (Mean time between failures): The time the organization goes without a system outage or other issues.
Back to You
In this post, we’ve answered the question, “What is MTTD?” Hopefully you understand more about this metric—and why it matters—than you did before reading this article. In the case of MTTD, it’s all about monitoring.
Monitoring plays a vital role in the DevOps era we all live in. Metrics such as mean time to detect and others are crucial for IT organizations. In the case of MTTD, specifically, its purpose is two-fold. First, the metric is fundamental for the organization’s efforts to bring it down. As the saying goes, “you can’t improve what you don’t measure,” so measuring is the first and perhaps most important step if you want to improve anything.
Second, MTTD can serve as a sanity check for the organization’s incident management strategy. If you use the proper tools and processes for managing the incidents in your organization, you shouldn’t struggle that much to keep MTTD down. On the other hand, if the metric’s results are consistently high for your organization, that’s a red flag.
Now it’s up to you. What are the next steps? For starters, don’t stop studying. Keep reading on this blog and elsewhere, not only about MTTD but other metrics and broader topics as well. Learn about the tools at your disposal that can help you with MTTD. For instance, take a look at XpoLog, which is a fully automated log management tool that can collect, parse, and tag logs from a plethora of different sources. Log management is essential for measuring MTTD because logs are one of the most important sources of information you can use to determine the start times of problems.
That’s it for today. Thanks for reading and I’ll see you next time.
This post was written by Carlos Schults. Carlos is a .NET software developer with experience in both desktop and web development, and he’s now trying his hand at mobile. He has a passion for writing clean and concise code, and he’s interested in practices that help you improve app health, such as code review, automated testing, and continuous build.