Reduce recovery time with AI-enabled troubleshooting

Machine learning algorithms for anomaly detection can support DevOps engineers in their day-to-day work: generalized ML models are trained and applied to detect hidden patterns and identify suspicious behavior. Machine learning applied to IT operations (AIOps) is beginning to move from research into enterprise production environments.

Florian Schmidt, postdoctoral researcher at Technische Universität Berlin, spoke about AI-based support for troubleshooting log files at DevOpsCon Berlin 2021.

According to Schmidt, companies that hire experts to troubleshoot log files face unaffordable costs: there is a dearth of DevOps engineers and SREs, while applications keep increasing their number of unique hosted components through services and containerized functions.

Schmidt explained how machine learning can be used to reduce troubleshooting time:

I see the major role of machine learning models in helping DevOps/SREs detect anomalies, combined with insightful reporting. This process includes identifying root cause components (for example, a network switch configuration, a memory leak in a running service, or a hardware issue), surfacing the most abnormal log messages, prioritizing incidents when several occur at the same time, and enriching reports with additional information, such as the variable analysis that helps solve the concrete problem.

Machine learning models can detect hidden patterns and identify suspicious behavior in log data, as Schmidt explained:

In more detail, there are two types of anomalies in log data. The first type is called flow anomalies: anomalies that indicate a problem in the frequency and sequence of incoming log messages. ML models learn the frequency, ratio, and sequence of incoming log message patterns to detect missing expected messages, newly derived messages, and changing message counts.

The second type is called cognitive anomalies: problems identified within the log message itself. Since log messages are usually written as unstructured text by developers for developers, these anomalies are expressed in the text's semantics. ML models learn these semantics through NLP techniques to detect groups of words that are generally associated with abnormal behavior, such as exception, timeout, and failure. Additionally, variables inside a message provide valuable information (like HTTP response codes) indicating anomalies. These are also classified as cognitive anomalies, but require additional types of ML models capable of detecting variables in log messages and applying time series analysis.
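The two anomaly types above can be illustrated with a minimal detection sketch. The message templates, keyword list, and ratio threshold below are hypothetical assumptions for illustration, not part of any framework Schmidt mentions:

```python
# Words generally associated with abnormal behavior (illustrative list)
ABNORMAL_KEYWORDS = {"exception", "timeout", "failure"}

def flow_anomalies(window_counts, baseline_counts, ratio_threshold=3.0):
    """Flag log templates whose frequency deviates strongly from a learned baseline.

    window_counts / baseline_counts map a log message template to its count.
    """
    anomalies = []
    for template in set(window_counts) | set(baseline_counts):
        seen = window_counts.get(template, 0)
        expected = baseline_counts.get(template, 0)
        if expected == 0 and seen > 0:
            anomalies.append((template, "new message"))      # newly derived message
        elif seen == 0 and expected > 0:
            anomalies.append((template, "missing message"))  # missing expected message
        elif seen / expected >= ratio_threshold or expected / seen >= ratio_threshold:
            anomalies.append((template, "count change"))     # changed message count
    return anomalies

def cognitive_anomalies(messages):
    """Flag messages containing words associated with abnormal behavior."""
    return [m for m in messages if ABNORMAL_KEYWORDS & set(m.lower().split())]
```

A real system would learn the baseline counts and keyword semantics from data rather than hard-coding them; this sketch only shows the shape of the two checks.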

InfoQ interviewed Florian Schmidt on AI-powered troubleshooting.

InfoQ: What is the state of the practice for troubleshooting complex applications using logs?

Florian Schmidt: Companies leverage log management frameworks like Elastic-Stack to systematically monitor all application components, store log data in a data warehouse, visualize application-specific performance KPIs, as well as to apply configurable alerting capabilities.

Such frameworks make it possible to systematically automate the troubleshooting of applications: DevOps/SREs can add self-defined queries that automatically find suspicious regex patterns in log messages, and attach thresholds to trigger alerts.
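A self-defined query of this kind can be sketched as a regex pattern paired with a sliding-window threshold. The class and parameters below are illustrative assumptions, not the API of Elastic-Stack or any particular log management framework:

```python
import re
from collections import deque

class RegexAlert:
    """Hypothetical self-defined query: a regex pattern plus an alert threshold."""

    def __init__(self, pattern, threshold, window_size=100):
        self.pattern = re.compile(pattern)
        self.threshold = threshold
        # Sliding window of match flags over the most recent log lines
        self.window = deque(maxlen=window_size)

    def observe(self, log_line):
        """Feed one log line; return True once matches in the window reach the threshold."""
        self.window.append(bool(self.pattern.search(log_line)))
        return sum(self.window) >= self.threshold

# e.g. alert when HTTP 5xx responses pile up in the recent window
server_errors = RegexAlert(r"\s5\d\d\b", threshold=3)
```

In a production framework, the pattern, threshold, and window would live in an alerting rule configuration rather than in application code.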

Yet many companies have not yet integrated log management frameworks into their operational processes; instead, they try to achieve a fast mean time to recovery (MTTR) by having many experts manually search log files.

InfoQ: What are the pros and cons of these approaches?

Schmidt: Organizations that already have the key infrastructure components of log management in place can build on this by adding analytics tools. The benefit is definitely automating the alerting process: identifying the root cause components of an application and delivering the most suspicious log messages to the expert 24/7. Such assistance allows the expert to focus on solving the problem rather than wasting valuable time on the identification process.

I believe in the movement towards log management frameworks, as they provide standardized APIs to interact with log data and allow the integration of other plugins capable of applying even more complex analyses thanks to machine learning. This can further help DevOps quickly determine the correct root cause and speed up MTTR.

InfoQ: What role can machine learning models play in troubleshooting?

Schmidt: In my PhD thesis on anomaly detection in cloud computing environments, I showed that generalized ML models for troubleshooting can be trained and applied in production environments.

We conducted a case study in which we were able to show that ML-based anomaly detection can reduce search time by 98% compared to manual search. The key idea of anomaly detection is to capture the “normal” behavior of the monitored service as a high-dimensional distribution over day-to-day operation. This distribution can be learned automatically (with AutoAD4j, a framework for unsupervised anomaly detection) while the service is running, alerting on anomalous/atypical situations: data that does not fit the learned “normal” operation.

The distribution of time series data such as monitoring metrics (CPU, memory, network, etc.) can be captured by reconstruction models such as autoencoders and by forecasting models like ARIMA, while log data is typically modeled by autoencoders via word occurrences in log messages over time. When behavior deviates from the “normal” distribution, a reconstruction error is calculated that indicates the severity of the anomaly. The most severe anomalies are then reported to DevOps, indicating the concrete service and log messages.
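Reconstruction-error scoring can be sketched as follows, using a one-component PCA reconstruction as a simplified stand-in for the autoencoder Schmidt describes. All data, metrics, and the threshold below are simulated assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated "normal" monitoring metrics: memory tracks CPU almost linearly
cpu = rng.normal(0.0, 1.0, size=(200, 1))
X_train = np.hstack([cpu, 2 * cpu + rng.normal(0.0, 0.1, size=(200, 1))])

# Learn the "normal" distribution as its dominant linear subspace
# (a one-component PCA, standing in for an autoencoder's bottleneck)
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
direction = Vt[0]  # first principal direction

def reconstruction_error(X):
    """Per-sample reconstruction error; higher means more anomalous."""
    centered = X - mean
    recon = (centered @ direction)[:, None] * direction
    return np.linalg.norm(centered - recon, axis=1)

# Alert on anything reconstructed worse than all "normal" training samples
threshold = reconstruction_error(X_train).max()
anomaly = np.array([[3.0, -6.0]])  # breaks the learned CPU/memory correlation
```

An autoencoder replaces the linear projection with a learned nonlinear encoder/decoder, but the scoring idea is the same: points far from the learned “normal” manifold get a large reconstruction error.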

InfoQ: What did you learn?

Schmidt: Key lessons learned from implementing machine learning to IT operations are:

  1. Build a team of DevOps and Data Scientists to effectively learn from each other.
  2. Data is a key resource. As in most ML domains, labeled data is the most valuable asset for building applicable models. Ask your DevOps to label concrete log messages, above all those that help identify or fix the underlying problem. Building test environments with chaos engineering techniques can further help capture problematic behaviors more effectively.
  3. In practice, it is important to focus on the simplicity of ML models and their manageability. Models that require long training times or need to be retrained with DevOps feedback are often difficult to maintain. Models that generalize and use AutoML or unsupervised techniques are easier to maintain when operating within the infrastructure.

InfoQ: What do you think the future will bring to AI applied to troubleshooting?

Schmidt: Current research and early applications show the ability to detect flow anomalies and cognitive anomalies in logs with very accurate results. For logs, the future points toward structured logging to standardize the way logs are written, while companies additionally leverage more complex deep learning models to detect anomalies automatically.

In the long term, I expect that at some point there will be a fully automated self-healing pipeline that will not only detect anomalies, but also recover from and mitigate them. It would be an end-to-end solution, which I consider to be an immune system designed for computers.
