What Is Fault Management? A Definition & Introductory Guide

If you’re a telecom engineer, you’ve probably already heard of fault management. And if you haven’t, or you need to understand what it’s about, worry no more! I’ll help you. Let’s start by defining the term.

First, What’s Fault Management All About?

ISO (International Organization for Standardization) defined a network management framework of which fault management is a component. That framework is FCAPS (Fault management, Configuration, Accounting, Performance, and Security). In other words, fault management relates to network management. In the same vein, a network management system must include a fault management system. The latter finds network problems and takes action to correct them. It also identifies and prevents potential or known problems that may occur in the future.

Therefore, the prime goal of fault management is to maintain network connectivity at all times. And when the faults in a network are managed, the applications and services that rely on that network remain up and running. Most importantly, those applications and services stay accessible and properly functioning.

However, you know networks fail and go down, don’t you? So, what we all want are applications and services with fault tolerance and downtime minimization mechanisms in place. That’s when fault management systems come into play! Take a look at the next section for an explanation of what fault management systems do exactly.

How Can You Fight Network Faults?

Fault management systems are tools that detect network faults, correct them, and, where possible, prevent them from happening. Fault tolerance is their motto, and minimizing downtime is their main concern. But let’s first understand where those faults come from.

Network faults originate from events in the network that have an impact on service delivery. Those events may merely interfere with service delivery or, worse, degrade or block it entirely. Hardware failure, connectivity loss, and power outages are three examples of network faults.

Upon fault detection, a fault management system notifies the network administrator by triggering alarms. This means that a fault management system embeds an alarm system. Think of an alarm as a notification that can be viewed in the fault management system itself. Additionally, the network administrator can receive an alarm via email or SMS.

The fault management system should monitor fault-prone areas more frequently and more thoroughly. In other words, the intensity of monitoring should match how often an area of the network experiences faults.

Sometimes, fault management systems can solve a fault automatically, sparing the network administrator a manual intervention. They can even use programs or scripts to prevent some faults from occurring! Pretty cool, isn’t it? Besides, not every problem that affects the operation of a network is major or requires special attention. Many problems just need a trivial automatic fix that takes no time. As a result, IT teams can focus on the major problems that are more difficult to fix.

Next, we’ll see how exactly a fault management system works.

More About Fault Management Systems

First and foremost, a fault management system must have a clear picture of the network topology. That topology contains a map of every device and node connected to the network. This allows the fault management system to oversee every point of the network that may cause downtime.

But how does a fault management system work? It frequently queries devices and nodes to evaluate whether the hardware is behaving well or not. Then it collects the information retrieved from those queries and analyzes it. Its goal is to catch any network performance problem that requires a solution. Sometimes, devices and nodes report performance problems to the fault management system on their own initiative.
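
To make the query, collect, and analyze loop a bit more concrete, here’s a minimal Python sketch of a single polling pass. The names `poll_once` and `find_unhealthy` are made up for illustration, and the `probe` function is an assumption: it stands in for whatever query your system actually sends to a device.

```python
from typing import Callable


def poll_once(devices: list[str], probe: Callable[[str], bool]) -> dict[str, bool]:
    """Query every device once and record whether it answered as healthy."""
    status: dict[str, bool] = {}
    for device in devices:
        try:
            status[device] = probe(device)
        except (OSError, KeyError):
            # A probe that errors out is treated as an unhealthy device.
            status[device] = False
    return status


def find_unhealthy(status: dict[str, bool]) -> list[str]:
    """Analyze the collected results and list the devices that need attention."""
    return sorted(device for device, healthy in status.items() if not healthy)
```

A real system would run a pass like this on a schedule and feed the unhealthy devices into its alarm logic.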

Fault management systems keep networks operational with the features that follow.

Threshold Definition

Thresholds defined in fault management systems are based on prior knowledge of conditions that led to faults. Therefore, thresholds are a warning mechanism to prevent potential faults. Let me give you some examples of thresholds:

  • A limit on a node’s processing capacity. If going over that limit led to faults in the past, a threshold shall be defined for the limit.
  • A particular link utilization. A link is a connection between two nodes in a network. Sometimes, traffic over a link is high enough to cut access to the whole network. That happens when a link’s traffic consumes all of the network bandwidth. And if it caused problems in the past, then that amount of traffic shall become a threshold. To calculate link utilization, divide the average traffic over the link by the total link capacity. You can calculate it per millisecond, second, minute, hour, and so on. Some tools use a weighted average, meaning more recent values weigh more than older ones.
  • The network utilization. Modern networks consist of many links. So, the average link utilization in a network may be a threshold.
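
Here’s a minimal Python sketch of the link utilization arithmetic from the list above. The function names, the 0.8 threshold, and the smoothing factor `alpha` are all assumptions for illustration. The weighted average shown is an exponentially weighted one, which is just one common way of letting recent values weigh more.

```python
def link_utilization(avg_traffic_bps: float, capacity_bps: float) -> float:
    """Return the average traffic as a fraction of the total link capacity."""
    if capacity_bps <= 0:
        raise ValueError("capacity_bps must be positive")
    return avg_traffic_bps / capacity_bps


def weighted_average(samples: list[float], alpha: float = 0.5) -> float:
    """Exponentially weighted average of samples ordered oldest first.

    alpha (0 < alpha <= 1) is an assumed smoothing factor: the higher it is,
    the more the most recent samples dominate the result.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    average = samples[0]
    for sample in samples[1:]:
        average = alpha * sample + (1 - alpha) * average
    return average


def exceeds_threshold(utilization: float, threshold: float = 0.8) -> bool:
    """Return True when utilization reaches the (assumed) threshold."""
    return utilization >= threshold
```

For example, 400 Mbit/s of average traffic on a 1 Gbit/s link gives a utilization of 0.4, which stays below the assumed 0.8 threshold.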

To sum up, you need a properly laid out network infrastructure. If you can’t have one, at least determine which areas to improve. Frequently, a single bottleneck in the infrastructure is a major source of faults in your network.

Constant Network Monitoring

A fault management system constantly monitors the status of a network.

Continuous Network Scan for Threats

Threats such as viruses can lead to faults in networks. So, fault management systems must be able to detect them and either act accordingly or sound an alarm.

Alarm Triggering

Network administrators receive fault event notifications sent by the fault management system. The same system may automatically solve those faults, but some faults demand major action.

Fault Location Tracing

A fault management system needs to trace the locations of faults. One of the main reasons for that is to adjust the intensity of monitoring for the most fault-prone areas. By doing so, the fault management system can better prevent faults in those areas.
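
As a rough illustration, here’s a Python sketch that counts faults per location and shortens the polling interval for the areas with more faults. The halving rule, the 300-second base, and the 30-second floor are assumptions picked for the example, not values any particular tool uses.

```python
from collections import Counter


def faults_per_area(fault_locations: list[str]) -> Counter:
    """Count how many recorded faults were traced to each area."""
    return Counter(fault_locations)


def polling_interval(fault_count: int, base_seconds: float = 300.0,
                     min_seconds: float = 30.0) -> float:
    """Halve the polling interval for each recorded fault, down to a floor."""
    if fault_count < 0:
        raise ValueError("fault_count must not be negative")
    # Cap the exponent so very large counts cannot overflow the division.
    halvings = min(fault_count, 20)
    return max(min_seconds, base_seconds / (2 ** halvings))
```

With these assumed values, an area with two traced faults gets polled every 75 seconds instead of every 300.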

Automatic Correction of Fault Conditions

If it doesn’t require much effort, a fault management system can automatically prevent faults. It does so by correcting the conditions that may cause those faults. To achieve that, the system executes programs or scripts to perform minor fixes that are neither complex nor time-consuming. The same programs or scripts also enable the fault management system to automatically solve actual faults.
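
One simple way to wire up such minor fixes is a lookup table from known conditions to fix routines. The Python sketch below assumes exactly that. The condition names and the two fix functions are hypothetical placeholders. Real routines would call your own scripts or device APIs.

```python
from typing import Callable, Optional


# Hypothetical fix routines that only report what they would do.
def restart_service(target: str) -> str:
    return f"restarted service on {target}"


def clear_disk_cache(target: str) -> str:
    return f"cleared cache on {target}"


# Assumed mapping from a detected condition to its minor automatic fix.
REMEDIATIONS: dict[str, Callable[[str], str]] = {
    "service_unresponsive": restart_service,
    "disk_almost_full": clear_disk_cache,
}


def remediate(condition: str, target: str) -> Optional[str]:
    """Run the registered fix, or return None so the fault gets escalated."""
    fix = REMEDIATIONS.get(condition)
    if fix is None:
        return None
    return fix(target)
```

Anything without a registered fix returns `None`, which is the cue to alert a human instead.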

Comprehensive Logging

A fault management system creates detailed logs of system status and of the preventive or reactive actions it took. From the perspective of fault prevention, detailed logging is extremely important.
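
Here’s a small sketch of what such a log entry could look like, using Python’s standard `logging` module. The logger name, the `log_action` helper, and the key=value message layout are assumptions for illustration.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("fault_management")


def log_action(device: str, status: str, kind: str, action: str) -> str:
    """Log the observed status and the action taken, then return the message.

    kind is expected to be "preventive" or "reactive".
    """
    if kind not in ("preventive", "reactive"):
        raise ValueError("kind must be 'preventive' or 'reactive'")
    message = f"device={device} status={status} kind={kind} action={action}"
    logger.info(message)
    return message
```

Keeping every entry in the same key=value shape makes it easier to search the logs later for patterns that preceded a fault.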

Now you know how a fault management system works and what its main features are. The next step is to distinguish between active and passive fault management systems. Let’s take a look …

Active vs. Passive Fault Management

Fault management and fault management systems can be active or passive.

Active fault management systems use strategies such as ping or port status checks to query devices and nodes. That lets them routinely determine the status of those devices and nodes. It’s an active approach to fault management. That is to say, conditions that could lead to future faults are identified and corrected proactively.

On the other hand, passive fault management systems monitor the network for fault events that have already occurred. It’s more of a corrective than a preventive approach. In other words, a passive system may only discover a fault when it’s too late to prevent it.
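
A port status check can be as small as one TCP connection attempt. Here’s a minimal Python sketch using the standard `socket` module. The function name `port_is_open` and the two-second timeout are assumptions. A production check would likely add retries and cover other protocols as well.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # The connection is closed again as soon as the with block ends.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

Calling `port_is_open("192.0.2.10", 22)` on a schedule, for instance, would tell you whether SSH on that (example) address is still reachable.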

Now, focusing on the passive fault management approach, what’s the process of detecting a fault and solving it? Check it out in the next section!

The Fault Management Cycle

The fault management workflow is cyclical and continuous. It starts with fault detection, follows some steps until fault resolution, and ends where it began: fault detection. This is the general fault management cycle, as you may find below in more detail. However, any fault management system may implement a specific process that goes beyond the basic steps below.

Fault Detection

Picture a fault management system monitoring a network. At some point, it discovers that service delivery has been interrupted or that its performance is deficient.

Let the investigation begin! Go to the next step.

Fault Diagnosis and Isolation

The fault management system determines the source of the fault and its location in the network topology.

OK. So, the system already knows where smoke is coming from. But you know a bad thing never comes alone. What if there are a bunch of fault events all related to each other? It’s time for some alarm grouping!

Fault Event Correlation and Aggregation, Plus Alarming

A single fault can set off multiple alarms, and that flood can overwhelm network administrators. That’s why fault management systems combine related fault events and conduct a root cause analysis on them. Only after that do those systems fire an aggregated alarm for network administrators.
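
Here’s one simplified way to picture that grouping in Python. The sketch assumes a tree-shaped topology where each node depends on a single upstream node, and it treats the topmost failed node on each path as the root cause. Real correlation engines use much richer rules, so take this as an illustration only.

```python
def correlate(down: set[str], upstream: dict[str, str]) -> dict[str, list[str]]:
    """Group failed nodes under the topmost failed node they depend on.

    upstream maps each node to the node it depends on (assumed to form a
    tree, with no cycles). Nodes without an entry are roots.
    """
    groups: dict[str, list[str]] = {}
    for node in sorted(down):
        root = node
        parent = upstream.get(root)
        # Walk upstream for as long as the parent has failed too.
        while parent in down:
            root = parent
            parent = upstream.get(root)
        groups.setdefault(root, [])
        if node != root:
            groups[root].append(node)
    return groups
```

If a switch and the two hosts behind it all go down, the result has a single key for the switch, so administrators get one aggregated alarm instead of three.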

The network fault buzzer sounded! Now what?

Restoration of Service

Once the alarm is out to the network administrator, the fault management system automatically performs a quick and simple fix. It executes programs or scripts to get the service up and running again as soon as possible.

Service automatically restored, available, and working? Check. But what if the fault demands a slower and more complex fix?

Fault Resolution

Depending on the complexity of the fault, automatic restoration of service may not be possible. In those cases, the network administrator or a competent technician performs a manual intervention.

In this last step of the workflow, someone manually solves the fault. The resolution may be a correction, a repair, or a replacement.

At this point, you may be wondering what you need to do to put things into practice. Allow me to show you the way in the next section.

Ok, but How Can I Start With Fault Management?

You can either develop your own fault management system or buy one. If you’re going to develop your own, I must say that agile methodologies are appropriate. You can start by working on the most important root causes and observed signs of fault. Or focus on an area of your network. Or even on a type of device or node.

After that, here are the main steps involved in developing a fault management system:

  1. Define diagnostic goals for the system.
  2. Know and involve subject matter experts capable of providing reference knowledge.
  3. List possible root causes and observed signs of fault and prioritize them by impact and frequency.
  4. Specify thresholds.
  5. Test the system in a simulated environment with simulated values.
  6. Test the system in the real environment with live data.
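
Step 3 above can be sketched in a few lines of Python. The scoring rule (impact multiplied by frequency), the 1 to 5 impact scale, and the sample values are assumptions for illustration. Your subject matter experts may well prefer a different weighting.

```python
from dataclasses import dataclass


@dataclass
class RootCause:
    name: str
    impact: int     # assumed scale: 1 (minor) to 5 (service down)
    frequency: int  # assumed unit: occurrences per month


def prioritize(causes: list[RootCause]) -> list[RootCause]:
    """Order root causes from highest to lowest impact * frequency score."""
    return sorted(causes, key=lambda c: c.impact * c.frequency, reverse=True)


causes = [
    RootCause("power outage", impact=5, frequency=1),
    RootCause("link congestion", impact=3, frequency=8),
    RootCause("misconfiguration", impact=4, frequency=3),
]
for cause in prioritize(causes):
    print(cause.name, cause.impact * cause.frequency)
```

The top of the resulting list is a reasonable place to start your first development iteration.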

Now, imagine you’ve got an IoT network. That demands special care, as you’re about to find out.

And What About IoT, Blockchain, and Cloud Computing?

FCAPS is useful to set a straightforward common ground for talking about network management with corporate management. And it still applies today. Nevertheless, IoT solutions didn’t exist when FCAPS was defined. And IoT sensors are likely to generate measurements that can be confused with faults. Fault detection, diagnostics, and isolation are thus vital in IoT networks to ensure accurate data sets.

FCAPS is quite appropriate for centralized single-provider environments. But in a blockchain, how do you know which provider is responsible for a fault? In a blockchain, fault management takes place by sharing the state of each vendor’s network across the entire blockchain. In case of a fault, the whole blockchain receives a data log.

When applications run in a cloud, they move from server to server according to load. As a result, fault detection is tougher with virtualized servers. But there’s more:

  • Different tenants may experience a fault that originated from the same source (an overloaded server or an overloaded link).
  • Also, the high number of devices, nodes, and links contributes to the likelihood of fault occurrence.
  • The constant addition, upgrade, or replacement of devices contributes to configuration errors and, consequently, opportunities for faults.
  • And a change in one device can affect others.

Keep in Mind: Scan, Detect, and Solve

An operational fault management system is one of the most important assets against actual or potential faults in a network. That system can smell the fault and go after its source. It does it nonstop.

Once the fault management system gets to the source of the fault, it studies the fault. As a result, it suggests a solution to those in charge. It may even automatically execute restoration programs or scripts to instantly fix the fault.

Nobody wants to offer service over a network that’s down! So, set up your fault management system to prevent and react to fault events in your network.

This post was written by Sofia Azevedo. Sofia has most recently taught college-level courses in IT, ICT, information systems, and computer engineering. She is fond of software development methods and processes. She started her career at Philips Research Europe and Nokia Siemens Networks as a software engineer. Sofia has also been a product owner, working in the development of software for domains such as telecom, marketing, and logistics.
