What Is AIOps Architecture and How Does It Work?

2024/08/24

The Importance of Automating IT Service Processes and Operations

IT infrastructure and platforms now change far faster than they once did. Organizations face vast amounts of data collected by monitoring systems, growing system complexity, and the need to deliver fast, reliable services to users. Many of these services are provided online, which makes their availability critical. The quality of service delivered to users, often defined by a Service Level Agreement (SLA), matters just as much.

As services and infrastructure grow more complex, meeting SLA requirements depends on a set of supporting operations, such as resource consumption management, system error management, alert and notification management, and automatic detection of adverse events within the infrastructure. Executing these operations manually on complex infrastructure is difficult, so tools and processes that automate the management of complex systems and services are more important than ever. The concept of AIOps was introduced in this context. The rest of this article explores AIOps and its applications in more detail.

Introduction to AIOps

AIOps stands for Artificial Intelligence for IT Operations. This concept was first introduced by Gartner in 2016. According to Gartner's definition, AIOps combines big data and machine learning to automate IT-related processes. AIOps technologies use machine learning, natural language processing, and other advanced AI techniques to automate IT operations such as monitoring, event correlation, anomaly detection, and root cause analysis. In essence, AIOps is a systematic approach to using AI methods to reduce human intervention in IT systems and operations.

A Proposed Framework for AIOps

An AIOps-based platform brings together a range of complex technologies and methods, which can be organized into an AIOps architecture framework. In the proposed framework, the architecture of an AIOps-based platform consists of three main components: Observe (monitoring), Engage (IT Service Management, ITSM), and Act (automation). The diagram below shows these three core components.

Key components of a proposed architecture for AIOps

Section 1 - Observe (Monitoring)

The first component of the AIOps architecture is Observe, the monitoring component. Its goal is to collect data and analyze it to identify adverse events within the system. Monitoring systems are one of the primary data sources for AIOps, and metrics, logs, and traces can all be used in this phase. The data is first preprocessed and then analyzed, typically with machine learning-based methods. Some of the most important analyses in the AIOps approach include:

Event Detection: Identifying undesirable events when metrics cross predefined thresholds.
Anomaly Detection: Detecting abnormal behaviors or patterns within the system.
Event Correlation: Determining relationships and correlations between adverse events.
Root Cause Analysis: Analyzing and identifying the root causes of issues.
Event and Metric Forecasting: Predicting future system events and trends of metrics (increasing, decreasing, constant).
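
The first two analyses in the list above can be illustrated with a minimal Python sketch. It is not taken from any particular AIOps product: the function names, the sliding-window z-score technique, and the sample CPU values are assumptions chosen for illustration, and production systems typically use more sophisticated machine learning models.

```python
from statistics import mean, stdev


def detect_threshold_events(samples, threshold):
    """Event detection: return the indices of samples above a fixed threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


def detect_anomalies(samples, window=5, z_limit=3.0):
    """Anomaly detection: flag samples that deviate strongly from the
    preceding `window` samples, measured as a z-score."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale for a z-score; skip it.
            continue
        if abs(samples[i] - mu) / sigma > z_limit:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilization samples (percent), with one spike at index 6.
cpu = [41.0, 43.0, 42.0, 44.0, 42.0, 43.0, 95.0, 44.0, 43.0]

threshold_events = detect_threshold_events(cpu, threshold=90.0)
anomalies = detect_anomalies(cpu, window=5, z_limit=3.0)
```

With these sample values, both functions return `[6]`, the index of the spike. The threshold check needs a limit chosen in advance, while the z-score check adapts to the recent behavior of the metric, which is the practical difference between event detection and anomaly detection.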

AIOps monitoring process and steps

Section 2 - Engage

The second component of the AIOps architecture is called Engage, and it corresponds to IT Service Management. Once the analyses in the Observe phase have generated incidents, service operators need processes for managing them. The goal of this component is to automate operations and services that were traditionally handled manually by human operators. Some of its key processes include:

Incident Management: Managing and handling system incidents.
Change Management: Managing and overseeing changes in the system.
Configuration Management: Managing system configurations.
Service Level Agreement (SLA) Management: Managing and ensuring adherence to SLAs.
Availability and Capacity Management: Ensuring system availability and managing capacity.

Each of these areas can be elaborated further, and they will be discussed in more detail in future articles.
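
As one illustration of SLA management, the sketch below computes availability from recorded incident downtime and compares it with a target. The `Incident` class, the function names, and the figures are hypothetical; the sketch also assumes that incidents do not overlap in time, so their downtime can simply be summed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    service: str
    downtime_minutes: float


def availability_percent(incidents: List[Incident], period_minutes: float) -> float:
    """Availability over the period, assuming incidents do not overlap."""
    downtime = sum(incident.downtime_minutes for incident in incidents)
    return 100.0 * (period_minutes - downtime) / period_minutes


def sla_met(incidents: List[Incident], period_minutes: float,
            target_percent: float) -> bool:
    """True if measured availability reaches the SLA target."""
    return availability_percent(incidents, period_minutes) >= target_percent


# Hypothetical example: a 30-day month and two outages of one service.
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes
incidents = [
    Incident(service="web-portal", downtime_minutes=30.0),
    Incident(service="web-portal", downtime_minutes=20.0),
]

measured = availability_percent(incidents, MONTH_MINUTES)
within_sla = sla_met(incidents, MONTH_MINUTES, target_percent=99.9)
```

A 99.9% target over a 30-day month allows 43.2 minutes of downtime, so the 50 minutes recorded here give an availability of about 99.88% and the target is missed.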

Section 3 - Act

In the first component of the AIOps architecture, Observe, undesirable events and incidents are identified by monitoring various system parameters with artificial intelligence and machine learning methods. Examples of undesirable events include a memory shortage in SQL Server that degrades performance, or a power supply failure on an HPE server. Once such an event is identified, an incident must be created for it, and actions must be taken to resolve the incident or mitigate its impact.

In the AIOps philosophy, the ideal state is that these operations, known as Remediation Actions, are automatically performed to restore the system to its normal state.

The Observe and Engage phases cover the detection, analysis, and management of undesirable events. The third component, Act, is where an organization gains the most from the AIOps approach, because it can not only detect system errors and find their root causes but also resolve them automatically, without human intervention. One of the most important processes in this component is the automatic remediation of incidents, and various solutions can be implemented to achieve it. One of the simplest is running automated PowerShell or shell scripts that reboot machines or restart services.
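
A minimal sketch of such a remediation step is shown below. It maps incident types to remediation commands and returns the command without running it, so it acts as a dry run. The incident type names and the `REMEDIATIONS` table are hypothetical, and the restart command assumes a Linux host managed by systemd.

```python
from typing import Callable, Dict, List, Optional


def restart_service_command(service: str) -> List[str]:
    """Build the command that restarts a systemd service (assumed platform)."""
    return ["systemctl", "restart", service]


# Hypothetical mapping from incident type to a remediation command builder.
REMEDIATIONS: Dict[str, Callable[[str], List[str]]] = {
    "service_down": restart_service_command,
}


def plan_remediation(incident_type: str, target: str) -> Optional[List[str]]:
    """Return the remediation command for an incident, or None when no
    automated action is registered and a human operator must take over."""
    action = REMEDIATIONS.get(incident_type)
    if action is None:
        return None
    return action(target)


service_plan = plan_remediation("service_down", "nginx")
hardware_plan = plan_remediation("power_supply_failure", "server-01")
```

Here `service_plan` is a command that could be passed to `subprocess.run(service_plan, check=True)`, while `hardware_plan` is `None` because a failed power supply cannot be fixed by a script and has to be escalated. Keeping planning separate from execution makes it easier to review or log an action before it touches a production system.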

Classification of methods used in AIOps

The previous section introduced one proposed framework for AIOps. This section looks at one suggested categorization of the methods used in AIOps. In this categorization, the approaches that have attracted the most attention from AIOps researchers in recent years fall into two main categories: 1) Failure Management and 2) Resource Provisioning.

In the Failure Management category, various methods are discussed related to error prevention, error prediction, error detection, and root cause analysis of errors. In the Resource Provisioning category, various methods related to resource allocation, resource consumption scheduling, and workload forecasting are explored.
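
To make workload forecasting concrete, the sketch below fits a straight line to recent workload samples with ordinary least squares, extrapolates it one step ahead, and labels the trend as increasing, decreasing, or constant. The linear model, the function names, the `tolerance` parameter, and the sample values are assumptions for illustration; real workloads often need seasonal or machine learning-based models.

```python
from typing import List, Tuple


def fit_linear_trend(values: List[float]) -> Tuple[float, float]:
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two samples are required")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def forecast(values: List[float], steps_ahead: int = 1) -> float:
    """Extrapolate the fitted line `steps_ahead` samples past the last one."""
    slope, intercept = fit_linear_trend(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


def classify_trend(values: List[float], tolerance: float = 0.1) -> str:
    """Label the trend by comparing the fitted slope with a tolerance."""
    slope, _ = fit_linear_trend(values)
    if slope > tolerance:
        return "increasing"
    if slope < -tolerance:
        return "decreasing"
    return "constant"


# Hypothetical requests-per-minute samples from five consecutive intervals.
requests = [100.0, 110.0, 120.0, 130.0, 140.0]

next_value = forecast(requests, steps_ahead=1)
trend = classify_trend(requests)
```

For these samples the fitted slope is 10 requests per interval, so the next value is forecast as 150.0 and the trend is "increasing". A resource provisioning component could use such a forecast to allocate capacity before demand arrives.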

Each of these categories utilizes different methods, which will be discussed in more detail in future articles.

Classification of recent studies in the field of AIOps

Conclusion

In this article, the concept of AIOps was introduced, highlighting the necessity of adopting an AIOps-driven approach in various organizations. Additionally, different components of a proposed AIOps framework were presented. The three core elements of the proposed AIOps architecture are Observe, Engage, and Act, each of which serves different goals and covers various processes in the lifecycle of an AIOps-based platform.

Considering the complexity of IT infrastructures and services on one hand, and the increasing use of artificial intelligence-based methods on the other, it is expected that the AIOps mindset will continue to penetrate the field of IT services and monitoring tools. Over time, more organizations and companies are likely to move towards automating their processes with the AIOps approach.
