Hybrid Cloud Challenges: How to optimize monitoring and analytics for ITOps

Bonus: “Logs You Should Monitor” free list

By: Sadequl Hussain, Product Marketing Associate

Let’s start with a simple picture and a question.

You are involved in the day-to-day running of a complex and hybrid cloud of IT systems and applications.

These systems, technologies, applications, processes and other platforms perform a number of critical business functions.

And as an operations team member – whether you are an IT monitoring specialist, an IT operations manager, a DevOps engineer, a system administrator or a DevSecOps engineer – your job is to ensure these functions run smoothly and your systems are always available, secure and performing at their peak.

But the environment you and your team are responsible for isn’t simple.

It’s a massive mishmash of legacy and modern technologies.

You have both data centers and a cloud footprint, each talking to the other with APIs.

Your network consists of many pieces of infrastructure, operating systems, databases, distributed applications and security systems: some legacy, some very new.

There are large applications running in n-tier server farms side-by-side with Dockerized microservices running in managed Kubernetes clusters.

Public-facing API endpoints don’t make things any simpler.

To make it more interesting, increased sophistication of global cyberattacks means security is also a core business function for your enterprise.

To defend against threats like insider activity, unauthorized user access, DDoS, SQL injection or ransomware, you have to look for attacks and unknown risks constantly.

Applications and access logs need to be looked at more closely for attack patterns or warning messages.

With all that’s going around you, here’s the question: at any time, are you completely aware of how your systems and applications are doing? 

Well, you may already have your performance monitoring tools and after-hours paging systems in place.

Application performance monitoring (APM) platforms, SIEM products and log management solutions could all be doing their jobs.

But most of those solutions are only as good as the business rules and thresholds your team has manually configured in them.

In today’s highly dynamic and hybrid environment, you need to make sure that these products are capable of using AI/ML technologies to continuously analyze events and learn about errors, patterns and problems.

Ask yourself: Can they run anomaly detection, predictions and correlations based on those learnings?

In other words, are these technologies capable of adapting to the dynamic nature of hybrid clouds and modern applications architecture?

Take the example of something breaking after an automatic patch update.

Can the solutions you are using “learn” from the events that led up to the error and warn you about such outages before they happen again?

In this article, we will show you how to get these answers with a next-generation breed of tools specialized for the hybrid cloud.

Challenges of a complex and hybrid cloud environment

Things used to be much simpler before.

Traditional Network Operations Centres (NOCs) would continuously monitor IT infrastructure such as server hardware, network devices, storage, virtual machines, firewalls and physical security, among many other things.

Modern enterprises also often use Application Performance Management (APM) tools that target distributed applications.

These types of solutions can traverse call stack traces between app modules, measure UI performance and capture network traffic to identify bottlenecks.

The problem, however, lies elsewhere.

Most large organizations have an on-premise presence and, increasingly, a cloud presence as well.

Monitoring hybrid networks end-to-end is something only a few solutions can do.

Together with the complexity of the network infrastructure, there’s the ever increasing complexity of applications.

As companies go through their digital transformation, they often have to run on-premise monolithic legacy systems side-by-side with cloud-hosted microservice-based distributed applications.

Monitoring both at the same level of detail isn’t always possible.

Moreover, all these systems generate data: lots and lots of data. Metrics, logs, counters, events, traces, call stacks, sniffed traffic: all of it can be classed under a single term, “machine data”.

But how do you make sense of this machine data? Where do you save it all, and how do you save it? How can you process all this feedback information efficiently and access it just in time?

Sounds like a big data problem? Yes, it is.

The world of ITOA and AIOps

The explosion of feedback information from monitoring systems is directly related to the growing complexity of IT systems.

IT Operations Analytics (ITOA) is a discipline that tries to bring order to this chaos.

A system capable of ITOA can collect, parse, analyze, retrieve and report on the machine data collected from heterogeneous sources.

Just like any big data processing system, it uses data analytics to produce business insight for the system administrator.

So what are these analytics methods?

We won’t go into the details here; however, Gartner classifies the analytics processes into five categories, one of which is log analysis.

With almost every system generating logs in one form or another, logs are a treasure trove of information.

Parsing, analyzing and correlating logs can answer the questions of who, when, how and perhaps even why.
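As a small illustration, a single access-log line in the common “combined” format can be parsed into who/when/how fields with a few lines of Python. The regex and the sample line below are illustrative, not a production-grade parser:

```python
import re

# Illustrative regex for an Apache/Nginx-style access log line.
# Field names and the sample line below are made up for this example.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_access_line(line):
    """Extract who (ip/user), when (timestamp) and how (method/path/status)."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

sample = '203.0.113.7 - alice [10/Oct/2024:13:55:36 +0000] "GET /login HTTP/1.1" 401 532'
event = parse_access_line(sample)
print(event["ip"], event["when"], event["status"])
```

Once log lines are structured records like this, the “who, when, how” questions become simple queries over fields instead of text searches.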

Gartner also defines the following abilities of an ITOA-enabled application:

  • Indexing and searching unstructured text data and developing inference from the search results
  • Running searches on multidimensional databases
  • Running complex operations and event processing on ingested data
  • Discovering statistical patterns from events and recognising them from future event streams

For example, an ITOA-capable system can build the “model” of normal operations of a web server by looking at its access logs over time.

When the current logs from the web server show events that deviate from this model, the system can trigger alerts.

This gives the operations team time to investigate before something bad happens.
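To make the baseline idea concrete, here is a toy Python sketch: it learns a “normal” request rate from historical per-minute counts, then flags any rate more than three standard deviations away. The numbers, function names and threshold are all invented for illustration.

```python
import statistics

# Toy ITOA-style anomaly detection: learn a baseline from history,
# then flag rates that deviate strongly from it.
def build_baseline(history):
    return statistics.mean(history), statistics.stdev(history)

def is_anomalous(rate, baseline, threshold=3.0):
    mean, stdev = baseline
    return abs(rate - mean) > threshold * stdev

history = [120, 115, 130, 125, 118, 122, 127, 119]  # requests/min (made up)
baseline = build_baseline(history)

print(is_anomalous(124, baseline))  # typical traffic
print(is_anomalous(480, baseline))  # sudden spike
```

A real ITOA platform would of course use far richer models than a single mean and standard deviation, but the workflow is the same: model normal behavior from history, then alert on deviation.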

At other times, this same system could correlate access log events with other pieces of information like metrics, unstructured data and security information to create operational insights.
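A minimal Python sketch of this kind of time-based correlation, grouping events from two hypothetical streams (access-log errors and CPU metrics) into one-minute windows:

```python
from collections import defaultdict
from datetime import datetime

# Bucket timestamped events into one-minute windows so that signals
# from different sources firing at the same time can be spotted together.
def minute_bucket(ts):
    return datetime.fromisoformat(ts).replace(second=0, microsecond=0)

def correlate(*streams):
    buckets = defaultdict(list)
    for stream in streams:
        for ts, event in stream:
            buckets[minute_bucket(ts)].append(event)
    # Keep only windows where more than one signal fired together.
    return {t: evs for t, evs in buckets.items() if len(evs) > 1}

# Invented example data: an error burst and a CPU spike in the same minute.
access_errors = [("2024-10-10T13:55:36", "HTTP 500 burst")]
cpu_metrics   = [("2024-10-10T13:55:02", "CPU > 95%"),
                 ("2024-10-10T14:10:00", "CPU > 95%")]

for window, events in correlate(access_errors, cpu_metrics).items():
    print(window, events)
```

Production systems correlate with far more sophistication (sliding windows, topology awareness, causal inference), but fixed time buckets show the basic mechanics.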

The logs, unstructured data and metrics handled by an effective ITOA-enabled system should include the following (though this is not an exhaustive list):

  • Server access logs
  • Server event logs
  • Server patching logs
  • Web server access logs
  • Web server error logs
  • Database access logs
  • Database error logs
  • Database performance metrics
  • Firewall rule change logs
  • Network traffic logs (e.g. VPC flow logs)
  • Network device logs (e.g. routers, hardware VPNs and firewalls)
  • Load balancer logs
  • Endpoint protection system logs
  • Application configuration management logs
  • API call logs
  • Infrastructure modification logs
  • Identity and access management (IAM) system logs
  • DNS change logs
  • CPU, Memory, Disk, Network metrics

The next step in making use of machine data is to create predictive models based on real-time information.

This is in contrast to the historical event data that ITOA handles.

This is the area of AIOps: Artificial Intelligence for IT Operations or Algorithmic IT Operations.

What does AIOps mean?

AIOps applies machine learning models and AI to real-time log data.

It also uses a broader and more diverse set of data sources.

Cloud or on-premise systems, metrics, traces, performance counters, alerts, ITSM CMDBs and business data can all be its data sources.

An AIOps-enabled platform can therefore:

  • Inform the operations team of the exact reason for an incident as it happens, since the platform correlates all the real-time log data and verifies it against one or more pre-built models
  • Look at historical trends of past incidents to find the root cause and the events that led up to each incident. Looking at current events, it can compare them against past occurrences and warn the operations team in advance
  • Identify the IT systems and infrastructure most susceptible to faults and failures by cross-checking against previous incident reports from ITSM applications

IT operations monitoring for today’s enterprise

As time goes on, the breadth and complexity of the enterprise IT landscape will only increase.

Operational monitoring solutions, therefore, need to evolve to meet these challenges.

An effective solution should have a number of attributes such as:  

  • The ability to capture a wide variety of operational data from different systems both in the on-premise network and the cloud. This should be done natively, or with trusted connectors. The collection process should be automatic and with minimal disruption to the target systems

    The source data can include logs (use the list above as a guide), performance metrics, metadata, traces, streaming data, config items from CMDB, change logs, incident reports from service management software etc.
  • The capacity to store, parse and index this massive volume of data in a compressed and efficient way. The data can be structured, unstructured or semi-structured, with varying ingestion rates
  • The power to apply big data analytics on the data for operational intelligence and insights. As an ITOA-enabled application, it should be able to look at historical data to automatically detect anomalies or deviations from normal events, and yet be smart enough to identify edge/outlying conditions.

    With the ITOA capability, the system should also be able to perform complex correlations between events captured from multiple sources for a given timestamp and show their relationship. System administrators should then be able to drill-down and slice-and-dice these results
  • A native capability to apply AI and ML models on live, operational data to predict outages or breaches before they happen and assign risk levels to these predictions. The system should be able to identify new errors and anomalies in the data and automatically assign risk levels to these errors  
  • A collection of pre-built charts, dashboards and widgets should enable the operations team to get started quickly with their most basic questions.


Don’t be fooled by the DIY approach

Organizations often try to build their own operations monitoring solution using open-source log management platforms like the ELK stack (Elasticsearch, Logstash and Kibana).

With this approach, logs are collected by Beats agents from different systems and sent to a central Elasticsearch cluster, which indexes the data.

After that, Kibana provides an easy-to-customize visualization layer for the user.
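To illustrate one piece of that pipeline, the sketch below builds a payload for Elasticsearch’s `_bulk` API, which accepts newline-delimited JSON with an action line followed by a document line for each event. The index name and events are made up; a real shipper such as Filebeat also handles batching, retries and backpressure.

```python
import json

# Minimal sketch of the shipping step in a DIY ELK pipeline:
# turn parsed log events into an Elasticsearch _bulk payload (NDJSON).
def to_bulk_payload(index, events):
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(event))                         # document line
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline

# Hypothetical parsed events.
events = [
    {"ts": "2024-10-10T13:55:36", "status": 500, "path": "/login"},
    {"ts": "2024-10-10T13:55:37", "status": 200, "path": "/"},
]
payload = to_bulk_payload("web-logs", events)
print(payload)
```

This tiny function is the easy part; the roadblocks listed below come from everything around it.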

While the do-it-yourself (DIY) approach can work for small and highly technical development teams, larger enterprises and busy IT operations teams quickly face a number of roadblocks when building, enhancing and maintaining such systems:

  • The solution does not scale well for hybrid networks with both on-premise and cloud footprint. Not all systems can send data through an agent, the data formats can vary, and often the target systems are unreachable due to security constraints
  • As more and more systems are onboarded, the solution often becomes slow as it can’t cope with the massive amount of data. Making it run at optimal speed requires continuous fine-tuning
  • A system properly designed to withstand the data load usually requires a complex deployment model, which can be expensive
  • There is no native support for proactive analytics and AI/ML. Building such capability on top of the solution means more effort and cost

All these problems, and the time, effort and money required to keep such a system going, push up its overall total cost of ownership (TCO).

SaaS for Operations monitoring 

Some organizations prefer Software-as-a-Service (SaaS) solutions for their simplicity and lower maintenance overheads.

While SaaS platforms can offer great features, they typically can’t integrate well with on-premise networks, meaning part of the IT landscape remains out of their reach.

Out-of-the-box AIOps solutions

Internal IT teams can rarely dedicate time to enhancing and improving the systems they have built for monitoring other systems.

An out-of-the-box solution specialized for operational readiness typically fares better here.

This is because the solution’s developers are dedicated teams always trying to improve the platform’s capabilities.

It’s their business, and vendors work continuously to improve their products’ quality.

In the end, the benefits outweigh the costs and time involved.

The next-generation IT ops 

XPLG AIOps™ is a sophisticated, state-of-the-art IT operations solution offering a number of unparalleled features.

Built for both private and public cloud, it offers unlimited collection, parsing and processing of logs from a wide variety of sources in a hybrid network.

As a latest-generation platform of its kind, its log collection process is non-invasive (agentless), which means there is very little system-integration overhead.

XPLG AIOps™ can work with many different datasets including those coming from SaaS or ELK-based solutions.

Metrics sent by Elastic Beats collectors and streaming logs are effectively combined by harnessing machine learning.

This saves operations teams from the long and expensive deployment and management of an Elasticsearch infrastructure.

As an ITOA and AIOps enabled application, XPLG AIOps™ can automatically apply sophisticated machine learning and analytics on its data and show easy-to-understand insights.

It can also warn about possible anomalies and breaches.

To give an example, let’s consider the image below.

Here, XPLG has analyzed a set of logs to identify incidents of possible break-in attempts: 

The next generation of AIOps to handle hybrid cloud challenges

Such features are among the core strengths of the product.

This automation is made possible with XPLG’s unique technology for “deep text inspection” which automatically scans all incoming logs for errors, risks, statistical problems, and predefined rules.

XPLG AIOps™ can perform deep inspection and anomaly detection in auto-detect mode.

The analytics system can then compare events against these criteria and map out any incident.

It also applies advanced AI and ML algorithms to monitor new errors and exceptions in the logs.

With the workflow mechanism in place, operations teams can get daily reports on risk levels and on new errors that can impact operational health.

The XPLG platform also comes with:

Today, XPLG AIOps™ is helping IT operations, DevOps and DevSecOps teams in many companies stay ahead of the growing demands of modern infrastructure and applications that need to run at peak performance and with tight security.