Log management articles are gathered in a special category on our blog to help our community of readers tackle issues like security, performance, health, compliance, and more.

We also focus on topics such as ROI, extracting value from log management tools, adding functionality to open-source stacks like ELK, and using out-of-the-box tools like XpoLog.

We cover log management use cases for different types of logs.

We want to help you easily understand errors, problems, trends, and patterns; detect risks, exceptions, and anomalies across your entire environment; and extract actionable insights from your log data.

We bring in experts from diverse fields to write articles that help you overcome different challenges and daily requirements.

Our articles are very practical and don’t focus on XpoLog users only.

We aim to deliver value to any IT, DevOps, or security professional.

What Is Log Analytics? Find Important Patterns in Your Logs

If you work in IT, especially in the software field, then you’re almost certainly aware of logs. An application that doesn’t generate any type of logging would be unheard of, to say the least. What about “log analytics?” Do you know this term? Depending on your specific role, your knowledge of the subject could be at any point on the spectrum, from “passing familiarity” to “intimate knowledge.” This post targets the former end of said spectrum—we’ll explain what log analytics is and why your organization should care about it.

We start the post by offering a quick overview of logging and its importance. Feel free to skip this section if you’re well-acquainted with logging.

After that, we get into the log analytics part. We start by defining it and talking about some of its applications. Then, we proceed to explain how log analytics works and how your organization can profit from it. Let’s begin.

The Importance of Logging

Before we get into the nuts and bolts of log analytics, it’s important to take a step back and examine logging first. What is logging, and why should you care about it?

In short, logging consists of recording information about what your application does while it’s running. The idea is to have this documentation about your app’s behavior written to a persistent medium, so you can access it later. Why would anyone want to do that in the first place, though?
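
To make that concrete, here is a minimal sketch using Python's standard logging module to write entries to a file. The file name, logger name, and messages are purely illustrative:

    import logging

    # Send log records to a persistent file rather than only to the console.
    logging.basicConfig(
        filename="app.log",  # illustrative file name
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )

    logger = logging.getLogger("payments")  # hypothetical module name

    def charge_card(order_id, amount):
        logger.info("Charging order %s for $%.2f", order_id, amount)
        try:
            pass  # ... call the payment gateway here ...
        except Exception:
            # The stack trace lands in the log file, ready for post-mortem debugging.
            logger.exception("Charge failed for order %s", order_id)

    charge_card("A-1001", 49.90)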

There are several reasons for that. The primary use for logging is to be able to perform what people sometimes call a “post-mortem debug.” That is, when—not if—your software fails in production, you want to be able to read through your log files in order to understand what went wrong. By carefully analyzing the log files, you should be able to reconstruct the scenario, “replaying” step after step until you hit the point when things went sour.

This is essentially a reactive use of logging, though. Using log entries to reconstruct failure scenarios, while useful, is no more than putting out fires. You’re reacting to something that happened. What if you could be one step ahead and avoid the fires altogether?

Stepping Up Your Game: Log Analytics

As it turns out, there are ways in which you can turn logging into a more proactive approach, so to speak. It’s possible to analyze your log entries, obtaining valuable insights from them. How do your users really use your application? What are the features they use the most? Which ones do they rarely touch? What times of day see more and less activity?

Log analytics can be of great help in answering these and other questions.

Defining Log Analytics

Even though we haven’t formally defined log analytics, it should be somewhat clear by this point what it’s about—analyzing log entries to find insights. It’s a short and precise definition. It’s also boring and unremarkable at that. Let me propose another definition then:

Log analytics means taking an active approach to logging, instead of a passive/reactive one. It means analyzing log entries to better understand our applications, to be one step ahead and solve problems as quickly as possible—or prevent them before they happen.

So, log analytics turns your logging approach into an active process, using the log entries to obtain useful insights for the organization. What are those insights? How can we apply log analytics in practice, and what do we get from it?

Applications of Log Analytics

We keep saying that log analytics can help you get insights from your log entries. But what kind of insights are those? What are the true benefits you can get by adopting log analytics? That’s what you’re going to see in this section—the application of log analytics.

The potential applications for log analytics are many. Determining trends is something that quickly comes to mind, but the technique can benefit other areas. Take security, for instance. Log analytics might help you detect protocol failures and network outages, and it can increase awareness of security as a whole.

But the business as a whole benefits from log analytics and the useful insights it can bring. What follows is a non-exhaustive list of the types of insights you can obtain with log analytics:

  • Determine page popularity. Log analytics can help you determine the most visited pages on your websites, as well as the times of day with the most visits. You can use data like this to track how your audience varies over time (see the sketch after this list).
  • Detect suspicious occurrences. Log analytics could also help you detect unusual or suspicious events. For instance, a user of a payment service making a purchase with a value way above their average.
  • Personalize the user’s experience. You could use data from logs to perform business analysis on a user’s behavior, and then use the results of that analysis to personalize their experience in your app.
  • Improve security. Through a comprehensive analysis of security logs, you can learn about intrusion attempts and other malicious activities.
  • Improve troubleshooting, by helping you locate the root cause of different types of errors.
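
As a rough illustration of the first item in the list above, here is a small sketch that tallies page visits and busy hours from a web server access log. The log path and the common-log-format pattern are assumptions; a real tool supports many more formats out of the box:

    import re
    from collections import Counter

    # Rough pattern for one line in the Apache/Nginx "common log format"
    # (an assumption about how the access log looks).
    LINE = re.compile(
        r'\[[^:]+:(?P<hour>\d{2}):\d{2}:\d{2} [^\]]+\] "(?:GET|POST) (?P<path>\S+)'
    )

    pages = Counter()
    hours = Counter()

    with open("access.log") as log_file:  # illustrative path
        for line in log_file:
            match = LINE.search(line)
            if match:
                pages[match.group("path")] += 1
                hours[match.group("hour")] += 1

    print("Most visited pages:", pages.most_common(5))
    print("Busiest hours of day:", hours.most_common(5))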

How Does Log Analytics Work: A Brief Overview

You’ll now see a quick overview of how log analytics works in practice. Keep in mind that this will be a high-level, general view—different tools can employ different approaches, and the details may vary. After this little disclaimer, we’re ready to go.

So, in order to perform log analytics, you first need to feed the process with log files. The log analytics tool might start by collecting log files from a wide variety of sources and events. Security breaches, installations, user sessions—these are just a fraction of the types of occurrences that usually leave logs behind.

After collecting log files, the tool should start parsing and analyzing them. How it does that will depend on the information included in each log entry. At a bare minimum, this is what a log entry should record for the average application:

  • A timestamp. There’s no use in recording an event without the date and time in which it happened.
  • A logging level. Important to understand the severity of the event, for later filtering and searching.
  • The log message. A description of the event of interest.
  • The identification of the logged-in user.
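
To make those four fields concrete, here is a hedged sketch that writes one structured (JSON) entry per line. The field names and values are illustrative assumptions, not a standard:

    import json
    from datetime import datetime, timezone

    # One illustrative log entry carrying the four fields listed above.
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when the event happened
        "level": "ERROR",                                      # severity, for filtering and searching
        "message": "Payment gateway timed out",                # description of the event of interest
        "user": "user-4821",                                   # hypothetical ID of the logged-in user
    }

    # One JSON object per line keeps entries easy for analytics tools to parse later.
    with open("app.jsonl", "a") as log_file:
        log_file.write(json.dumps(entry) + "\n")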

The log analytics process consists of organizing the data found in the log entries. To do that, it employs techniques like classification and tagging, pattern recognition, artificial ignorance, and correlation analysis.

Classification and tagging are exactly what they sound like—the tool classifies events into categories and assigns keywords to them. Pattern recognition consists of comparing new events to old ones and, that way, determining which events are relevant. Artificial ignorance is a strategy that uses machine learning to decide which parts of a log file are noisy data and should be ignored. In other words, it’s a type of filter that helps eliminate useless data, making relevant events more noticeable.
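
As a toy illustration of classification, tagging, and artificial ignorance, the sketch below matches entries against a few hand-written patterns and drops lines that look like known noise. The categories and patterns are made up for the example; a real tool would learn or ship these rules rather than hard-code them:

    import re

    # Hand-written patterns mapping log lines to categories (illustrative only).
    CATEGORIES = {
        "authentication": re.compile(r"login|logout|password", re.I),
        "payment": re.compile(r"charge|refund|invoice", re.I),
        "infrastructure": re.compile(r"disk|memory|timeout", re.I),
    }

    # "Artificial ignorance": lines matching these are known noise and get dropped.
    NOISE = re.compile(r"health.?check|heartbeat", re.I)

    def classify(line):
        if NOISE.search(line):
            return None  # ignore noisy entries entirely
        for category, pattern in CATEGORIES.items():
            if pattern.search(line):
                return category
        return "uncategorized"  # unknown patterns deserve a closer look

    for raw in ["GET /healthcheck 200", "User alice login failed", "Charge declined for order 42"]:
        print(classify(raw), "->", raw)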

There’s More to Logging Than Putting Out Fires

Log analytics is a technique that enables you to evolve your logging approach, putting it into use in an active (rather than reactive) way. If your organization just collects logs and then does nothing with them, you’re wasting a valuable resource. Failing to extract the insights hidden in your logs is like having a treasure buried on your property and not profiting from it.

Don’t waste your log efforts. Profit from them. Put log analytics to work for you ASAP. In this post, we’ve offered you an introduction, but what are the next steps?

Make an effort to have log entries that are not only parseable by machines, but human-readable as well. Learn about tools at your disposal that can make your life easier. Also, continue checking out this blog, which will continue to feature posts where you can learn a lot about everything log-related.

This post was written by Carlos Schults. Carlos is a .NET software developer with experience in both desktop and web development, and he’s now trying his hand at mobile. He has a passion for writing clean and concise code, and he’s interested in practices that help you improve app health, such as code review, automated testing, and continuous build.

What Is MTTR? Mean Time to Repair, Explained In Detail

Whether you’re slinging code, managing developers, wrangling servers, or filling most other roles in the modern tech firm, you care about keeping your software running while bringing home the bacon. If your website or application is down, you’re not making money. (Or, if you aren’t in this for profit, your message isn’t getting to the people who need it.) Therefore, it’s everyone’s job to keep things running smoothly.

But let’s face it: Some downtime is inevitable. When things break, it’s essential to fix them as quickly as possible.

Luckily, technology has come a long way. All sorts of tools, processes, and improvements can help you not only to write better software faster but also to fix it faster when it breaks.

 

As part of establishing effective processes, keep track of how often things break and how fast you fix them. You can track these as metrics and work to get the best grades possible—just as if you’re in school again. Probably the most important metric to track is MTTR, which stands for mean time to repair or recovery. Let’s take a closer look at this important number.

A Bold New World

Software engineering has come a long way since the dark days of the eighties and nineties. First, the agile revolution brought levelheadedness to our engineering process. Instead of trying to plan out every detail of the entire application before even writing one line of code, we’ve learned to tackle one small piece, figure out what works, adjust if necessary, and continue. There’s ample evidence that this approach makes better software than the old “waterfall” approach.

Then the rise of DevOps brought similar benefits for sysadmins. It brought dependability to provisioning and maintaining systems. We dumped the pile of ad hoc scripts and arcane knowledge tucked away in somebody’s brain and embraced configuration as code, complete with version control and shared responsibility. Tools like Puppet and Chef allowed companies to standardize their provisioning and system administration. This reduced the bus factor of having one wizard who knew where everything was.

And cloud computing changed the way companies think about hardware and their computing needs. Now you don’t have to buy servers, maintain them, or replace them when they get old. Instead, you just fill out some web forms and push some buttons, and a server magically becomes available to you (more or less). You can spin up a cheap or even free low-powered machine to test something and move to a beefier machine when you need it. You can also set up automatic scaling to bring extra computing power online when necessary.

Measuring Success

Agile allowed software efforts to focus on small pieces and course-correct as the software took shape. DevOps allowed teams to treat their infrastructure as another development effort, with version control and repeatability. Cloud computing has allowed you to stop worrying about hardware and has provided on-demand scaling. All these things together mean that modern software teams can move quickly and deliver features faster than ever.

Continuous Integration/Delivery is the icing on the cake. It allows companies to release features many times a day and to respond quickly whenever trouble arises. As soon as your team has identified the issue and created a ticket, you can bang out a fix, get it through QA, and deploy the fix to production—all in the same day. It’s helpful for a team to recognize how quickly it can fix issues. This is where MTTR comes in.

Calculating MTTR is simple. Take the total amount of downtime in a given period, and divide it by the number of incidents in that same period. For example, if you have three downtimes in a month, and the total amount of downtime for the three combined is two hours, that’s an MTTR of 40 minutes. That’s a good MTTR by most measures. But the closer to zero you can get, the better. The lowest score wins this game.
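
In code, that calculation is just a sum and a division. Here is a hedged sketch; the per-incident downtimes are an illustrative split of the two hours from the example above:

    def mttr(downtimes_minutes):
        """Mean time to repair/recovery: total downtime divided by the number of incidents."""
        if not downtimes_minutes:
            return 0.0
        return sum(downtimes_minutes) / len(downtimes_minutes)

    # Three incidents adding up to two hours of downtime, as in the example above.
    print(mttr([55, 40, 25]))  # -> 40.0 minutes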

Origins of MTTR

The website of the Institute for Telecommunication Sciences—the research laboratory of the U.S. government’s National Telecommunications and Information Administration—has a definition of MTTR that dates to 1996. So this is not a new concept by any means. Traditionally, MTTR was short for “mean time to repair,” and it referred to just that: repairing a physical device. For many years, MTTR meant: how quickly can I get this new drive installed and up to date? Or how long will it take to swap out that failed network card?

But in the modern cloud-based world, you’re rarely wrangling hardware anymore unless you happen to work for Amazon Web Services. These days, most companies with a website or a mobile application are deploying to one cloud service or another. This could be an Infrastructure as a Service provider, such as AWS. These services abstract away the hardware, but you’re responsible for pretty much everything above that. Or it could be a Platform as a Service provider, such as Heroku. These services take care of more of the systems work for you and act as more of a simple deploy target.

In either of these cases, you’re not going to have to replace drives or deal with any hardware issues. That doesn’t mean you won’t have issues, however. You can still have bugs in your code, or you can misconfigure something. When you’re cranking along at the speed of modern software development, bugs are inevitable. Since you’re not repairing hardware anymore, these days people tend to think of MTTR as “mean time to recovery.”

Not All Downtimes Are Equal

When you start adding up the minutes and dividing by the number of outages, remember that not all downtimes are equal. For that matter, you’ll probably want to consider more than just downtime. If your team has solid code-review and QA practices in place, you might not see actual downtime very often. But it’s still useful to track how often a bug sneaks through to production and how soon the fix for it goes live. Therefore, it’s important to compare apples to apples and group your MTTR reporting into logical categories.

One axis of measurement is your wallet. A minor inconvenience or display issue is unlikely to have a financial impact, but the site being down for an hour could mean a significant loss of income. You should still fix things that annoy your users because those things can add up and cause a gradual mutiny in your user base. But tracking your response times for those issues should be a separate calculation from keeping the lights on.

Similarly, it makes sense to put areas of responsibility in separate buckets. Keep those design or layout issues as a separate metric from backend validation bugs. Track downtimes due to DevOps misconfiguration issues separately from unresponsive page loads due to missing database indexes. These separate numbers might align with team boundaries or across them. But it’s important to bundle similar problems when you’re calculating MTTR.

Keep Your Headlights On

Probably the most important thing about tracking your mean time to recovery is knowing when there’s something to fix. If you don’t know that your site is down for hours, your MTTR is going to be terrible. The way you know something is wrong is with accurate monitoring and prompt alerting.

Now that many software-as-a-service monitoring tools are available, there’s no excuse for a problem going unnoticed for very long. You can choose from lots of application performance monitoring solutions. Even using free, open-source tools is better than just hoping for the best.

You’ll want to set up monitoring to track a variety of metrics that are essential to your business process. Site availability is an obvious metric. But you might also watch the login rate. If you’re an e-commerce site, tracking the rate of purchases is a good idea. Most monitoring solutions allow you to track errors, but perhaps some errors can just be logged, while others should sound alarm bells for your on-call staff.

And speaking of alarm bells, carefully consider what leads to someone being paged versus simply an automatic ticket creation. If your team members receive too many pages for non-serious issues, then they’re likely to suffer alert fatigue and be slower to respond to pages. Don’t drag the whole team online for every issue. Have a first responder triage the issue, and bring in others only if necessary.

One Monitoring Option

One such monitoring solution is XpoLog. This tool offers easy setup by providing dozens of preconfigured apps for popular web servers, application servers, cloud services, and so on. It brings the power of AI and machine intelligence to its log analysis. And it offers notifications as well. It’s definitely worth a look.

Regardless of which tools you use to track bugs and fixes, however, it’s well worth keeping track of your MTTR. It gives your team one number to focus on to indicate how you’re doing. Here’s to a bug-free future!

This post was written by Ben Munat. Ben started with computers in the eighties but took a long detour through the nineties indie-rock scene. He started programming professionally in 2004, working as a consultant and for startups. Over the years, he’s used Ruby, Elixir, Java, and JavaScript. He’s worked with approximately a zillion libraries, frameworks, and APIs; solved hard problems; kept high-traffic sites running; and stomped countless bugs. He is a big fan of TDD, agile, code reviews, shared ownership, work/life balance, and having fun.

What Is MTTD? Mean Time to Detect, Explained In Detail

This post will answer a simple question, “What is MTTD?” The answer—or at least the start of it—was already spoiled by the post title. Sure enough, MTTD stands for “Mean time to detect.” It refers to an important KPI (key performance indicator) in DevOps. Is the question answered? Can we call it a day with that definition? Of course not.

In the rest of the article, we’ll go deeper into MTTD. You’ll learn its complete definition first—in other words, the “what.” Then we’ll proceed to explain why you and your organization should care about it. With the “what” and “why” out of the way, we finally get to the “how,” by showing how to calculate MTTD in practice.

Then we wrap up by listing some other metrics that are essential for DevOps and are intimately related to our topic. Let’s get started.

What is MTTD: A Definition

As you’ve already seen, MTTD stands for “Mean Time to Detect.” The next question naturally follows: “But to detect what?” In a nutshell, problems.

MTTD means the average time it takes to detect a problem inside an organization. In other words, this indicator measures the elapsed time between the start of an issue (such as software malfunction or hardware failure) and its detection by the DevOps team.

Calculating MTTD can be simple, since all it takes is for the DevOps team to look at the total number of incidents and how long each one took to detect. It’s easier said than done, though, and we’ll look into how to perform this calculation in detail later.

Why is MTTD so important? The intuitive answer would probably be something along the lines of “the sooner we discover an issue, the sooner we can fix it.” That famous saying “you can’t improve what you don’t measure” also comes to mind. Putting the two together, we might say that MTTD is important because by measuring the time it takes to detect an issue, we take the first step toward bringing that time down. And since it’s been known for a long time that fixing a problem earlier is cheaper than fixing it later, it stands to reason that we should try our best to bring MTTD down.

Sure, that’s part of the answer. But the complete answer is both more complex and more interesting. Keep reading to learn why.

Why Should You Care About MTTD

We’ve just handled the “what” of MTTD, so now let’s focus on the “why.” What are the reasons that make this metric so crucial for organizations?

As you’ve just seen in the previous paragraph, one of the reasons why you should care about MTTD is that measuring it is required in order to bring it down. But that’s not all.

MTTD also has an additional—and arguably more important—benefit: it serves as a test of your monitoring mechanisms. Here’s the thing—your organization already adopts tools and processes to monitor incidents. If these tools and processes work as intended, it shouldn’t be that hard to keep your organization’s MTTD low. The opposite is also true. If you adopt incident management mechanisms that aren’t up to the task, you and your DevOps team will have a hard time keeping MTTD down, which can result in catastrophic consequences for your organization.

Finally, a third reason for caring about MTTD is that it’s related to other important DevOps KPIs, such as MTTF (mean time to failure) and MTTR (mean time to repair).

Calculating MTTD: A Practical Example

We’ve just covered the “what” and “why,” so now we’re left with the “how.” Let’s learn the formula for MTTD, which is fairly simple. First, we add up all the incident detection times (for a given team, let’s say). Then, we divide that by the total number of incidents.

Let’s see an example. Consider the following table:

[Table: incident start times, detection times, and elapsed minutes (240, 46, 10, 40, and 39)]

The table represents five incidents that happened during the period in which the organization tracks its MTTD (let’s say it’s two weeks). Each incident has both a start time and a detection time, which were obtained by analyzing log files. Finally, we have a third column that just displays the elapsed time (detection time – start time) in minutes. To calculate the mean time to detection, we now add all of the periods together:

240 + 46 + 10 + 40 + 39 = 375

Then we divide that by the number of incidents. In our example, 5:

375 / 5 = 75

So, the MTTD, in our example, is 75. That doesn’t have to be the end of the story, though. Further calculations or analyses can be done after you’ve reached the result. For instance, some companies might choose to remove outliers from the list of values, which might make sense depending on the circumstances. Sure, the very definition of outlier will vary from organization to organization, depending on the methodology you choose. In our example, we can easily see that the lowest and highest values are very distant from the typical values, so let’s remove them and use the formula again:

(46 + 40 + 39) / 3 ≈ 41.67

By ignoring the outliers, we’ve obtained a different MTTD: 41.67, approximately.
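
If you want to automate this, here is a small sketch that computes MTTD from per-incident detection times (in minutes) and can optionally recompute it with the lowest and highest values dropped, mirroring the outlier treatment above:

    def mttd(detection_minutes, drop_extremes=False):
        """Mean time to detect: the average of per-incident detection times."""
        values = sorted(detection_minutes)
        if drop_extremes and len(values) > 2:
            values = values[1:-1]  # discard the single lowest and highest values
        return sum(values) / len(values)

    incidents = [240, 46, 10, 40, 39]  # elapsed minutes from the example above

    print(round(mttd(incidents), 2))                      # -> 75.0
    print(round(mttd(incidents, drop_extremes=True), 2))  # -> 41.67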

Another way in which organizations can refine their MTTD results consists of grouping incidents by severity. That way, they can conclude, for instance, that even though the overall MTTD has been rising month after month, the MTTD for high-severity issues has been steadily going down. This would be great news, but the organization wouldn’t find out about it just by looking at the general MTTD.

Other Metrics To Be Aware Of

MTTD is a key indicator to track the efficiency of your monitoring and incident response tools and processes, but it’s not the only one. Some others include:

  • MTTF (Mean time to failure): The average amount of time a product, device, or system is expected to work before it fails.
  • MTTR (Mean time to repair): The average time it takes to fix an issue after it’s detected.
  • MTBF (Mean time between failures): The average time the organization goes between system outages or other issues.

Back to You

Hopefully you understand more about this metric—and why it matters—than you did before reading this article. In the case of MTTD, it’s all about monitoring.

Monitoring plays a vital role in the DevOps era we all live in.

Metrics such as mean time to detection and others are crucial for IT organizations.

In the case of MTTD, specifically, its purpose is twofold.

First, the metric is fundamental for the organization’s efforts to bring it down. As the saying goes, “you can’t improve what you don’t measure,” so measuring is the first and perhaps most important step if you want to improve anything.

Second, MTTD may serve as a sanity check for the organization’s incident management strategy.

If you use the proper tools and processes for managing incidents in your organization, you shouldn’t struggle that much to keep MTTD down.

On the other hand, if the metric’s results are consistently high for your organization, that’s a red flag.

Now it’s up to you. What are the next steps? For starters, don’t stop studying. Keep reading on this blog and elsewhere, not only about MTTD but other metrics and broader topics as well.

Learn about the tools at your disposal that can help you with MTTD. For instance, take a look at XpoLog, which is a fully automated log management tool that can collect, parse, and tag logs from a plethora of different sources.

Log management is essential for implementing MTTD because logs are one of the most important sources of information you can use to determine the start times of problems.

You can also download XpoLog and see with your own eyes how it helps you not only with monitoring goals such as MTTD but with other aspects as well.

That’s it for today. Thanks for reading.

What Should You Learn from the Massive Data Breach “Operation Soft Cell”?

Apache Error Log & Apache Access Log: Complete Guide to Analyzing and Troubleshooting Apache 2020