Kyndryl AIOps

Critical Changes

Published On Aug 30, 2024 - 10:42 PM

This page describes the Critical Changes insight of Integrated AIOps.

The Critical Changes insight shows the risk of critical changes scheduled for the account within the next 72 hours. It highlights changes that require additional due diligence to avoid a business-critical outage. It uses AI/ML to examine the risk rating that the support team assigned when raising the change and performs its own assessment of the risk that the change fails or causes incidents.

Business Value and Benefits

Automatically identifies upcoming changes whose risk is underestimated and alerts the account team to perform additional due diligence before executing the change, to avoid a business-critical outage.

Metrics

Filters: When you navigate from the Landing page, the default filter is a time period of 72 hours. You can change it using the available filters.

KPI/Metric Name	KPI/Metric Description
Total Changes	Total number of changes scheduled for the next 72 hours (3 days).
Critical Changes	Change tickets where the AIOps-predicted risk is higher than the risk assigned by the account team and the AIOps-predicted risk is critical or major.
High Risk Changes %	Displays critical change risk for the account within 72 hours.
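
The Critical Changes rule in the table can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the risk scale `RISK_ORDER`, the `Change` fields, and the ticket IDs are hypothetical, and the real ITSM risk levels may differ.

```python
from dataclasses import dataclass

# Hypothetical risk scale, ordered from lowest to highest (an assumption).
RISK_ORDER = {"low": 0, "minor": 1, "major": 2, "critical": 3}


@dataclass
class Change:
    change_id: str
    assigned_risk: str   # risk set by the account team
    predicted_risk: str  # risk predicted by AIOps


def is_critical_change(change: Change) -> bool:
    """Predicted risk exceeds assigned risk AND is major or critical."""
    predicted = RISK_ORDER[change.predicted_risk]
    assigned = RISK_ORDER[change.assigned_risk]
    return predicted > assigned and change.predicted_risk in ("major", "critical")


def summarize(changes):
    """Return the two count-style metrics from the table."""
    critical = [c.change_id for c in changes if is_critical_change(c)]
    return {"total_changes": len(changes), "critical_changes": critical}


changes = [
    Change("CHG001", "low", "critical"),   # underestimated -> critical change
    Change("CHG002", "major", "major"),    # same rating -> not flagged
    Change("CHG003", "low", "minor"),      # higher, but not major/critical
    Change("CHG004", "minor", "major"),    # underestimated -> critical change
]
summary = summarize(changes)
print(summary)
```

Note that a change whose predicted risk matches the assigned risk is not flagged, even when both are critical, because the rule targets underestimated risk.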

How Critical Changes insight works

When a change is created in the ITSM tool, it is manually assigned a risk rating. This rating is normally based only on the information and properties available at change creation time. It is an indication, but it can prove inaccurate because those properties change over time.

The Integrated AIOps Critical Changes insight uses a method that identifies and analyzes the relationship between incidents and the changes that caused them. The method looks at four predictors for each of the following two dimensions:

Change Failure Risk - what is the risk that the execution of the change fails?

Major Incident/Outage Risk - what is the risk that executing this change causes an incident or outage?

Linking incidents to changes

The linkage between a change request and an incident ticket is identified by two mechanisms. The first is the explicit mention of an incident ticket within a historical change; this is defined as explicit linkage. The second uses term frequency–inverse document frequency (TF-IDF), a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

If a change request and an incident ticket have many tokens in common that are otherwise infrequent, the pair is identified as a text-based linkage. In the example, for incident 6, the candidate related changes are change 2, change 3, change 4, and change 5.
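
The two linkage mechanisms can be illustrated with a small, self-contained sketch. It assumes a plain TF-IDF weighting with cosine similarity and a hypothetical similarity threshold of 0.2; the tokenizer, the threshold, the ticket IDs, and the function names are all illustrative assumptions, not the product's actual method.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {token: tf-idf weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        size = len(toks) or 1  # avoid division by zero for empty text
        vectors.append({
            tok: (cnt / size) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_changes(incident_id, incident_text, changes, threshold=0.2):
    """Map change_id -> 'explicit' or 'text-based' for linked changes.

    `changes` is {change_id: description}; `threshold` is an assumed cutoff.
    """
    ids = list(changes)
    vectors = tfidf_vectors([incident_text] + [changes[i] for i in ids])
    links = {}
    for change_id, vec in zip(ids, vectors[1:]):
        if incident_id.lower() in tokenize(changes[change_id]):
            links[change_id] = "explicit"      # ticket ID mentioned in the change
        elif cosine(vectors[0], vec) >= threshold:
            links[change_id] = "text-based"    # many rare tokens in common
    return links


links = link_changes(
    "INC006",
    "database connection pool exhausted on payroll server",
    {
        "CHG002": "increase database connection pool size on payroll server",
        "CHG003": "rollback patch applied to fix INC006",
        "CHG004": "renew TLS certificate on web gateway",
    },
)
print(links)
```

In this toy data, CHG004 shares only the common word "on" with the incident, so its TF-IDF similarity stays far below the threshold and it is not linked.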

Change Risk – Dimensions

Change Failure Risk Dimension – Potential Risk Predictors

Based on the different pieces of information available at change creation time, and experiments to determine the predictors for change risk, the following set of predictors was identified:

Failure Rate by Category: Historic failure rates for changes belonging to a given category

Failure Rate by Owner: Historic failure rates for changes belonging to a given owner group

Failure Rate for similar changes: Historic failure rates for similar changes (based on change description)

Failure Rate by configuration item(s): Historic failure rates for the involved CIs (e.g., hostname, mainframe cluster)

Failure Rate by Category and Failure Rate for similar changes are evaluated using cross-account change information. Failure Rate by Owner and Failure Rate by configuration item(s) are evaluated using only the data of a single account.
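
A historic failure rate of this kind is the share of past changes in a group that failed. The sketch below computes it per category and per owner group; the record fields (`category`, `owner_group`, `failed`) and the sample data are hypothetical, and the product may weight or smooth these rates differently.

```python
from collections import defaultdict


def failure_rate_by(changes, key):
    """Return {group value: failed changes / total changes} for the given key."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for change in changes:
        group = change[key]
        totals[group] += 1
        if change["failed"]:
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}


# Hypothetical historical change records.
history = [
    {"category": "database", "owner_group": "dba-team", "failed": True},
    {"category": "database", "owner_group": "dba-team", "failed": False},
    {"category": "network", "owner_group": "net-ops", "failed": False},
    {"category": "database", "owner_group": "platform", "failed": False},
]

by_category = failure_rate_by(history, "category")
by_owner = failure_rate_by(history, "owner_group")
print(by_category)
print(by_owner)
```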

Major Incident/Outage Risk Dimension – Potential Risk Predictors

For the Major Incident/Outage Risk dimension, the same information available at change creation time and the same experiments led to the following set of predictors:

Incident Rate by Category: Historic incident rates for changes belonging to a given category

Incident Rate by Owner: Historic incident rates for changes belonging to a given owner group

Incident Rate for similar changes: Historic incident rates for similar changes (based on change description)

Incident Rate by configuration item(s): Historic incident rates for the involved CIs (e.g., hostname, mainframe cluster)

Incident Rate by Category and Incident Rate for similar changes are evaluated using cross-account change information. Incident Rate by Owner and Incident Rate by configuration item(s) are evaluated using only the data of a single account.
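
The difference between cross-account and single-account evaluation can be sketched as an optional account filter on the historical data. The field names (`account`, `caused_incident`), the account names, and the helper `incident_rate` are hypothetical and only illustrate the scoping described above.

```python
def incident_rate(changes, key, value, account=None):
    """Share of matching historical changes that were linked to an incident.

    With account=None the rate is cross-account; otherwise only the
    given account's changes are considered. Returns None if no data.
    """
    matching = [
        c for c in changes
        if c[key] == value and (account is None or c["account"] == account)
    ]
    if not matching:
        return None
    return sum(1 for c in matching if c["caused_incident"]) / len(matching)


# Hypothetical historical changes from two accounts.
history = [
    {"account": "acme", "category": "network", "owner_group": "net-ops", "caused_incident": True},
    {"account": "acme", "category": "network", "owner_group": "net-ops", "caused_incident": False},
    {"account": "globex", "category": "network", "owner_group": "net-ops", "caused_incident": False},
    {"account": "globex", "category": "network", "owner_group": "net-ops", "caused_incident": False},
]

# Category predictor: evaluated across all accounts.
category_rate = incident_rate(history, "category", "network")
# Owner predictor: evaluated for one account only.
owner_rate = incident_rate(history, "owner_group", "net-ops", account="acme")
print(category_rate, owner_rate)
```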

Critical Changes Insight Overview

When you open the Critical Changes insight, it shows the number of changes within the selected period whose risk rating, as predicted by the Integrated AIOps AI/ML, is higher than the human-assigned risk. The details of each change are also shown in the table, including a high-level explanation of the risk predictors for each of the two applicable dimensions.

When you drill down into a particular change, more details are shown for each risk predictor in each of the two applicable dimensions, together with the related changes and tickets.

Best Practice Use

The Integrated AIOps Critical Changes insight can be used within the account management system in the following ways. It helps identify changes in the upcoming days for which the AI/ML identified a higher risk profile than the one currently set on the change by the change management process.

Daily Standup – Upcoming critical changes in the next 3 days

This use case lets the account team review changes during the daily standup. The goal is to identify changes with a higher predicted risk in the next 3 days and perform extra due diligence on them to avoid a business-critical outage.

Weekly CAB – Upcoming critical changes in the next weekly CAB cycle

This use case lets the account team review changes during the weekly CAB meeting. The goal is to identify changes with a higher predicted risk in the CAB cycle and perform extra due diligence on them to avoid a business-critical outage.

Quarterly Improvement – Identify Changes which require less review rigor

This use case lets the account team review changes quarterly as part of the continuous improvement process. The goal is to identify changes where the predicted risk is not higher than the human-assigned risk. This allows the team to relax the rigor for specific types of changes and reduce the time spent on them during the review process.
