Whether you’re slinging code, managing developers, wrangling servers, or filling most other roles in the modern tech firm, you care about keeping your software running while bringing home the bacon. If your website or application is down, you’re not making money. (Or, if you aren’t in this for profit, your message isn’t getting to the people who need it.) Therefore, it’s everyone’s job to keep things running smoothly.
But let’s face it: Some downtime is inevitable. When things break, it’s essential to fix them as quickly as possible.
Luckily, technology has come a long way. All sorts of tools, processes, and improvements can help you not only to write better software faster but also to fix it faster when it breaks.
As part of establishing effective processes, keep track of how often things break and how fast you fix them. You can track these as metrics and work to get the best grades possible—just as if you’re in school again. Probably the most important metric to track is MTTR, which stands for mean time to repair or recovery. Let’s take a closer look at this important number.
A Bold New World
Software engineering has come a long way since the dark days of the eighties and nineties. First, the agile revolution brought levelheadedness to our engineering process. Instead of trying to plan out every detail of the entire application before even writing one line of code, we’ve learned to tackle one small piece, figure out what works, adjust if necessary, and continue. This approach is widely credited with producing better software than the old “waterfall” approach.
Then the rise of DevOps brought similar benefits for sysadmins. It brought dependability to provisioning and maintaining systems. We dumped the pile of ad hoc scripts and arcane knowledge tucked away in somebody’s brain and embraced configuration as code, complete with version control and shared responsibility. Tools like Puppet and Chef allowed companies to standardize their provisioning and system administration. This reduced the risk of a low bus factor, where one wizard was the only person who knew where everything was.
And cloud computing changed the way companies think about hardware and their computing needs. Now you don’t have to buy servers, maintain them, or replace them when they get old. Instead, you just fill out some web forms and push some buttons, and a server magically becomes available to you (more or less). You can spin up a cheap or even free low-powered machine to test something and move to a beefier machine when you need it. You can also set up automatic scaling to bring extra computing power online when necessary.
Agile allowed software efforts to focus on small pieces and course-correct as the software took shape. DevOps allowed teams to treat their infrastructure as another development effort, with version control and repeatability. Cloud computing has allowed you to stop worrying about hardware and has provided on-demand scaling. All these things together mean that modern software teams can move quickly and deliver features faster than ever.
Continuous Integration/Delivery is the icing on the cake. It allows companies to release features many times a day and to respond quickly whenever trouble arises. As soon as your team has identified the issue and created a ticket, you can bang out a fix, get it through QA, and deploy the fix to production—all in the same day. It’s helpful for a team to recognize how quickly it can fix issues. This is where MTTR comes in.
Calculating MTTR is simple. Take the total amount of downtime in a given period, and divide it by the number of incidents in that same period. For example, if you have three downtimes in a month, and the total amount of downtime for the three combined is two hours, that’s an MTTR of 40 minutes. That’s a good MTTR by most measures. But the closer to zero you can get, the better. The lowest score wins this game.
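To make the arithmetic concrete, here is a minimal Python sketch of that calculation. The function name and the three individual outage durations are made up for illustration; the only thing taken from the example above is that they add up to two hours across three incidents.

```python
from datetime import timedelta


def mean_time_to_recovery(downtimes):
    """Return total downtime divided by the number of incidents."""
    if not downtimes:
        # With no incidents there is nothing to average.
        raise ValueError("MTTR is undefined for a period with no incidents")
    return sum(downtimes, timedelta()) / len(downtimes)


# Three hypothetical outages in one month, totaling two hours.
outages = [
    timedelta(minutes=30),
    timedelta(minutes=75),
    timedelta(minutes=15),
]

mttr = mean_time_to_recovery(outages)
print(mttr)  # 0:40:00
```

Note that the mean hides the spread: one 75-minute outage and two short ones produce the same number as three 40-minute outages.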
Origins of MTTR
The website of the Institute for Telecommunication Sciences, the research laboratory of the U.S. government’s National Telecommunications and Information Administration, has a definition of MTTR that dates to 1996. So this is not a new concept by any means. Traditionally, MTTR was short for “mean time to repair,” and it referred to just that: repairing a physical device. For many years, MTTR meant: How quickly can I get this new drive installed and up to date? Or how long will it take to swap out that failed network card?
But in the modern cloud-based world, you’re rarely wrangling hardware anymore unless you happen to work for Amazon Web Services. These days, most companies with a website or a mobile application are deploying to one cloud service or another. This could be an Infrastructure as a Service provider, such as AWS. These services abstract away the hardware, but you’re responsible for pretty much everything above that. Or it could be a Platform as a Service provider, such as Heroku. These services take care of more of the systems work for you and act as more of a simple deploy target.
In either of these cases, you’re not going to have to replace drives or deal with any hardware issues. That doesn’t mean you won’t have issues, however. You can still have bugs in your code, or you can misconfigure something. When you’re cranking along at the speed of modern software development, bugs are inevitable. Since you’re not repairing hardware anymore, these days people tend to think of MTTR as “mean time to recovery.”
Not All Downtimes Are Equal
When you start adding up the minutes and dividing by the number of outages, remember that not all downtimes are equal. For that matter, you’ll probably want to consider more than just downtime. If your team has solid code-review and QA practices in place, you might not see actual downtime very often. But it’s still useful to track how often a bug sneaks through to production and how soon the fix for it goes live. Therefore, it’s important to compare apples to apples and group your MTTR reporting into logical categories.
One axis of measurement is your wallet. A minor inconvenience or display issue is unlikely to have a financial impact, but the site being down for an hour could mean a significant loss of income. You should still fix things that annoy your users because those things can add up and cause a gradual mutiny in your user base. But tracking your response times for those issues should be a separate calculation from keeping the lights on.
Similarly, it makes sense to put areas of responsibility in separate buckets. Keep those design or layout issues as a separate metric from backend validation bugs. Track downtimes due to DevOps misconfiguration issues separately from unresponsive page loads due to missing database indexes. These separate numbers might align with team boundaries or cut across them. But it’s important to bundle similar problems when you’re calculating MTTR.
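One way to keep those buckets separate is to tag each incident with a category and compute a mean per category. The sketch below is only an illustration: the category names, the durations, and the function name are all assumptions, and a real incident log would come from your ticketing or monitoring system.

```python
from collections import defaultdict
from datetime import timedelta

# Hypothetical incident log: (category, time from detection to fix going live).
incidents = [
    ("outage", timedelta(minutes=50)),
    ("outage", timedelta(minutes=30)),
    ("layout", timedelta(hours=20)),
    ("backend-validation", timedelta(hours=3)),
    ("layout", timedelta(hours=28)),
]


def mttr_by_category(incidents):
    """Group recovery times by category and average each group separately."""
    durations = defaultdict(list)
    for category, duration in incidents:
        durations[category].append(duration)
    return {
        category: sum(times, timedelta()) / len(times)
        for category, times in durations.items()
    }


report = mttr_by_category(incidents)
for category, mttr in sorted(report.items()):
    print(f"{category}: {mttr}")
```

Averaging all five incidents together would give a figure of more than ten hours, which says nothing useful about either the outages or the layout bugs.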
Keep Your Headlights On
Probably the most important thing about tracking your mean time to recovery is knowing when there’s something to fix. If your site is down for hours before you find out, your MTTR is going to be terrible. You learn that something is wrong through accurate monitoring and prompt alerting.
Now that many software-as-a-service monitoring tools are available, there’s no excuse for a problem going unnoticed for very long. You can choose from lots of application performance monitoring solutions. Even using free, open-source tools is better than just hoping for the best.
You’ll want to set up monitoring to track a variety of metrics that are essential to your business process. Site availability is an obvious metric. But you might also watch the login rate. If you’re an e-commerce site, tracking the rate of purchases is a good idea. Most monitoring solutions allow you to track errors, but perhaps some errors can just be logged, while others should sound alarm bells for your on-call staff.
And speaking of alarm bells, carefully consider what should page someone and what should simply create a ticket automatically. If your team members receive too many pages for non-serious issues, then they’re likely to suffer alert fatigue and be slower to respond to pages. Don’t drag the whole team online for every issue. Have a first responder triage the issue, and bring in others only if necessary.
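The page-versus-ticket decision can be written down as an explicit rule rather than left to habit. The following sketch assumes a very simple policy (page only for revenue-impacting problems, open a ticket for other user-visible ones, log the rest); the class names, fields, and thresholds are hypothetical and not tied to any particular monitoring product.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    LOG = "log"        # record it, nobody is notified
    TICKET = "ticket"  # create a ticket for working hours
    PAGE = "page"      # wake up the on-call first responder


@dataclass
class Alert:
    name: str
    customer_facing: bool
    revenue_impacting: bool


def route(alert: Alert) -> Action:
    """Apply the assumed policy: page only when money is on the line."""
    if alert.revenue_impacting:
        return Action.PAGE
    if alert.customer_facing:
        return Action.TICKET
    return Action.LOG


alerts = [
    Alert("checkout returning 500s", customer_facing=True, revenue_impacting=True),
    Alert("footer misaligned on mobile", customer_facing=True, revenue_impacting=False),
    Alert("nightly report job retried", customer_facing=False, revenue_impacting=False),
]

for alert in alerts:
    print(f"{alert.name}: {route(alert).value}")
```

Keeping the rule in one small function makes it easy to review and adjust when the team notices it is being paged too often or too rarely.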
One Monitoring Option
One such monitoring solution is XpoLog. This tool offers easy setup by providing dozens of preconfigured apps for popular web servers, application servers, cloud services, and so on. It brings the power of AI and machine intelligence to its log analysis. And it offers notifications as well. It’s definitely worth a look.
Regardless of which tools you use to track bugs and fixes, however, it’s well worth keeping track of your MTTR. It gives your team one number to focus on to indicate how you’re doing. Here’s to a bug-free future!