This is the one post I hope you’ll never need. However, should you ever need it, this is your one-stop shop for understanding how to proceed with DevOps incident management. Have you just been attacked? Did the commit go wrong? A CI pipeline went haywire? Don’t worry. I got you.
I know there’s plenty of material out there already, but what makes this post different is that we’ll discuss the how-tos, namely, the steps to follow in DevOps incident management, as well as the best practices. As a DevOps manager, you understand how important it is to maintain a secure infrastructure. Combine a knowledgeable team with state-of-the-art infrastructure and a proper DevOps workflow, and you’re ready to tackle whatever comes your way.
It all begins with understanding what it all means. DevOps is not just a meaningless list of steps to follow. It combines people, actions, and tools to create workflows and responses. Moreover, it’s almost a spirit that drives the development life cycle forward in a safe, controlled, and efficient way. However, incidents do happen, and we’ll go over what to do about them.
So, what’s an incident? Your company may define it differently, but in essence, it’s any unplanned event that affects the data or the infrastructure you manage. The term, like its causes, is deliberately loose. The consequences, however, are concrete: server downtime, security breaches, data corruption.
Having discussed the basics here, let’s jump into the steps to actually fix things.
As soon as there’s an incident, there are five well-known steps to follow. Unfortunately, as smart as I want to seem, I didn’t come up with them. And although they’re easily accessible, I think they’re due for a refresh: these concepts have been around since the late 2000s, and since then the applications and tooling have changed drastically. They’re very useful, but always remember that there’s no one-size-fits-all solution.
So, what are the five steps to DevOps incident management?
First, early detection is critical. Log monitoring plays an important part at this stage, and so do well-configured metrics and alerting systems. You can alert on many different signals: server monitoring alone covers network usage, CPU and/or GPU usage, disk capacity, and more. A culture of constant collaboration will help your organization define what to look for. Expect false positives along the way.
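To make the alerting idea concrete, here is a minimal sketch of a threshold check. The metric names, the threshold values, and the `evaluate` helper are all my own assumptions for illustration; a real setup would pull samples from your monitoring tool and use limits based on your own baseline.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical thresholds (assumptions): tune them to your own baseline.
THRESHOLDS = {
    "cpu_percent": 90.0,
    "disk_percent": 85.0,
    "network_mbps": 800.0,
}


@dataclass
class Alert:
    metric: str
    value: float
    threshold: float


def evaluate(sample: dict[str, float],
             thresholds: dict[str, float] = THRESHOLDS) -> list[Alert]:
    """Return one Alert for every sampled metric that exceeds its threshold."""
    alerts = []
    for metric, limit in thresholds.items():
        value = sample.get(metric)
        # Metrics missing from the sample are skipped, not treated as failures.
        if value is not None and value > limit:
            alerts.append(Alert(metric, value, limit))
    return alerts


# Example sample, as it might arrive from a (hypothetical) collector.
sample = {"cpu_percent": 97.2, "disk_percent": 40.0, "network_mbps": 120.0}
for alert in evaluate(sample):
    print(f"ALERT {alert.metric}: {alert.value} > {alert.threshold}")
```

Static thresholds like these are the simplest option and a likely source of the false positives mentioned above, so treat the numbers as a starting point.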
The crucial part to understand is that as soon as you detect an incident—whether by yourself or using a monitoring tool—a ticket or event should be recorded adequately by the team. Now, you’re ready to respond.
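What "recorded adequately" means will depend on your ticketing system, but a rough sketch of the minimum worth capturing could look like this. The `Incident` class and its field names are hypothetical, not the schema of any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

_next_id = count(1)  # simple in-memory ID source, for illustration only


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Incident:
    summary: str
    detected_by: str  # e.g. "monitoring" or a person's name
    id: int = field(default_factory=lambda: next(_next_id))
    opened_at: datetime = field(default_factory=_now)
    status: str = "open"
    timeline: list = field(default_factory=list)

    def log(self, note: str) -> None:
        """Append a timestamped note so the later analysis has a record."""
        self.timeline.append((_now(), note))

    def close(self) -> None:
        self.log("closed")
        self.status = "closed"


incident = Incident("CI pipeline stuck on deploy job", detected_by="monitoring")
incident.log("on-call engineer acknowledged")
print(incident.id, incident.status, len(incident.timeline))
```

Keeping a timestamped timeline from the very first event pays off later, when you retrace your steps to analyze what happened.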
At this point, the DevOps team starts receiving all of the information coming from the monitoring tools. Normally, an on-call team leads the response. In simpler terms, that means routing the information to the proper channels.
The exact channels depend on your company’s regular day-to-day operations. Some people might use the ever-popular Slack to communicate. Fancy a Microsoft solution? Microsoft Teams will work just as well. And, of course, if you’re lucky enough to all be in the same place, reaching out to the relevant parties and gathering them in a conference room will be the way to go.
But who exactly do you reach out to?
- Customer support
- Product management
Or in more extreme cases (I hope this isn’t you), you might contact these groups:
- Digital forensics
- Security and upper management
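The two lists above can be sketched as a small routing table. The group names come from the lists; the two severity labels, the identifier format, and the choice to include the standard groups in extreme cases are assumptions on my part.

```python
from __future__ import annotations

# Hypothetical routing table. The "standard" and "extreme" labels are
# assumptions, as is notifying the standard groups in extreme cases too.
ROUTES = {
    "standard": ["customer-support", "product-management"],
    "extreme": [
        "customer-support",
        "product-management",
        "digital-forensics",
        "security",
        "upper-management",
    ],
}


def groups_to_notify(severity: str) -> list[str]:
    """Return the groups to contact for a given severity label."""
    if severity not in ROUTES:
        raise ValueError(f"unknown severity: {severity!r}")
    return list(ROUTES[severity])  # copy so callers can't mutate the table


for group in groups_to_notify("extreme"):
    print(f"notify {group}")
```

In practice, each group name would map to a Slack channel, a Teams channel, or a paging rotation, depending on what your company already uses.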
Hopefully, your issue is properly channeled and responsibilities have been assigned. Now, it’s time to do some problem-solving.
Now let’s fix what happened. At this point, you, as incident manager (IM), have all the tools to understand what happened and, more importantly, to provide context. Next, address the situation and limit the damage. Was security involved? That always takes priority. In this stage, systems and data are recovered, and applications return to normal.
Collaboration is very important. Your sysadmins and engineers should work together to respond to tickets and commit patches that remedy the situation. If required, the database should probably be restored from a backup.
Good news though—you’ve made it this far. The next step is to learn from our mistakes and retrace our steps to analyze what happened.
Also called the analysis stage, this is where we commit to learning from what happened. As IM, the job’s on you to gather (even more) information and metrics from the relevant members of the response team—they’ll come in handy.
I can’t stress enough that this stage should be blameless at its core. Pointing fingers never helps anybody; it’s a reactive move, the opposite of the proactive approach you should be taking. More importantly, this is the stage where we assign responsibilities to prevent the situation from happening again. This is also the point at which the tickets are closed.
The solution should also be coordinated and framed within the organization’s security policy. Now you’re ready for the last step, which involves including the fix as part of the continuous integration strategy.
The last step is about readiness: the proactive measures your company takes to prevent the incident from recurring. This is particularly important for your providers and users. Post-incident information is processed and sent to the engineering managers so that they can prioritize permanent fixes in the development pipeline.
Beyond understanding the steps to take, the next most important thing is knowing the best practices. DevOps can be confusing, and everybody seems to be an expert. Here are a few tips.
Don’t take the one-tool approach. This involves getting your team used to tracking and monitoring on different tools. Basically, tracking just servers isn’t enough. Application logs are important too.
As a matter of fact, following a 360-degree approach will serve you even better. Looking for ideas? XPLG offers a great product that speaks to a number of different tools and log formats. Their extensive marketplace allows you to connect with hundreds of stacks out there.
Hand in hand with the above, striking a balance between under- and over-alerting will keep everyone happy. Your team won’t appreciate an irrelevant alert in the middle of the night. They want alerts that matter and that they can do something about. In a word, actionable. Even better, some companies develop their own systems to assess severity as issues arise.
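As a rough illustration of a homegrown severity check, the sketch below only pages someone when an alert is both severe and actionable. The `Signal` fields, the 5% error-rate cutoff, and the rule that "actionable" means "has a runbook" are all assumptions, not an established standard.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    customer_facing: bool
    has_runbook: bool   # assumption: a runbook means someone can act on it
    error_rate: float   # fraction of failing requests, 0.0 to 1.0


ERROR_RATE_CUTOFF = 0.05  # hypothetical cutoff


def severity(signal: Signal) -> str:
    """Classify a signal as 'high', 'medium', or 'low'."""
    noisy = signal.error_rate >= ERROR_RATE_CUTOFF
    if signal.customer_facing and noisy:
        return "high"
    if signal.customer_facing or noisy:
        return "medium"
    return "low"


def should_page(signal: Signal) -> bool:
    """Wake someone up only for high-severity, actionable signals."""
    return signal.has_runbook and severity(signal) == "high"


checkout = Signal("checkout-5xx", customer_facing=True, has_runbook=True, error_rate=0.12)
batch = Signal("nightly-batch-retries", customer_facing=False, has_runbook=True, error_rate=0.08)
print(checkout.name, severity(checkout), should_page(checkout))
print(batch.name, severity(batch), should_page(batch))
```

Anything that doesn't page can still land in a channel or a ticket queue for the morning, which keeps the information without costing anyone sleep.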
Finally, assess your security policies. Having an up-to-date security policy is key: it determines how your company will react to an event and reassures your users that their data is safe. It also helps to tell users how you comply with current security practices.
Responsibility. As a potential IM, you need to know firsthand who’s who. That will make it easier to assign incident management roles. A common set of roles for companies out there includes the following:
- Communications leader: Undeniably, this is the key person who serves as the liaison between the IM and the business and tech parties.
- Tech leader: The person in this position is in charge of assessing the situation from a technical level and applying the proper fixes in a timely manner.
- Engineering manager: This position is normally present beforehand. Typically, this person is in charge of following up with the post-incident reports. If the tech leader is in charge of the investigation, the engineering manager takes care of the long-term remediation.
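One way to make sure nobody is guessing during an incident is to check the role assignments up front. This is a small sketch of that idea; the `Role` enum mirrors the roles listed above plus the IM, while `unfilled_roles` and the example names are hypothetical.

```python
from __future__ import annotations

from enum import Enum


class Role(Enum):
    INCIDENT_MANAGER = "incident manager"
    COMMUNICATIONS_LEADER = "communications leader"
    TECH_LEADER = "tech leader"
    ENGINEERING_MANAGER = "engineering manager"


def unfilled_roles(assignments: dict[Role, str]) -> list[Role]:
    """Return every role that has no person assigned, in definition order."""
    return [role for role in Role if not assignments.get(role)]


# Example roster with made-up names.
assignments = {
    Role.INCIDENT_MANAGER: "Dana",
    Role.TECH_LEADER: "Sam",
}
for role in unfilled_roles(assignments):
    print(f"unassigned: {role.value}")
```

Running a check like this when the on-call rotation changes is cheaper than discovering a gap in the middle of an outage.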
Defining metrics and procedures should be part of every DevOps team’s work. I hope this post becomes the one resource you need to evaluate how your company is doing and how it approaches DevOps incident management. Let’s be clear: incidents can happen. In the era of data science, your data and monitoring should drive your operations.
What are you waiting for to get started?
This post was written by Guillermo Salazar. Guillermo is a solutions architect with over 10 years of experience across a number of different industries. While his experience is based mostly in the web environment, he’s recently started to expand his horizons to data science and cybersecurity.