Kyndryl AIOps

Home screen insights

Published On Jul 22, 2024 - 10:19 AM

Learn about the insights available on the Home screen of Kyndryl AIOps.

The landing page takes a different approach from other dashboards: it shows only what needs to be done to improve. It does not show how well things are going; it shows only the items that critically need to be addressed, each with a one-click path to the next best action.

Business Value and Benefits

The landing page of Kyndryl AIOps gives the various roles within the account a single pane of glass with observability, actionable insights to address areas of concern or improvement, and IT Health Indicators that show where additional action is needed to avoid business-critical issues. The Home page offers the following benefits:

Cut down resource needs by reducing toil (noise reduction) and automating as much of the manual effort in the account as possible.

Re-skill or re-deploy resources to concentrate on customer needs.

Avoid potential business impacting issues by taking proactive actions.

Reduce outages and thus improve uptime and availability of servers and applications.

Improve and sustain the compliance posture for the account.

Provide the best delivery experience to the customer.

Metrics

The following table provides a description and calculation for each KPI/metric used within the insight.

KPI/Metric Name	KPI/Metric Description	Data Range
Alerts	Shows active ticketed events grouped by business application. When available, other observability information can be displayed as well. (Amber: event severity Warning: 3 / Red: event severity Critical: 4 or Fatal: > 5).	-
Actionable Insights	Shows insights based on AI/ML that cross-reference multiple data types to address the area of concern or improvement and provide the next best action. Only the actionable insights from the insight library that apply to the account are shown.	Current Month to Date (MTD)
Servers having less than 99% uptime	Shows the percentage of servers for the account whose uptime is less than 99%. (Gray 0-5% / Red > 5%).	Rolling 30 days
Business Application MTBF Hours	Shows the Mean Time Between Failures (MTBF) for the 3 applications with the lowest MTBF for the account. The total uptime for each application is calculated from its % uptime and the total working time. The % uptime of an application is calculated as the product of the % uptime of all the servers mapped to the business application, so the application MTBF is based on the underlying servers' uptime %. (Gray 95-100% / Red 0-95%).	Past 90 days
Total Inventory Items	Shows the number of devices within the inventory. The drilldown also shows the mapping to the business applications.	Latest from SESDR
Remediation with corrective closure	The Automated Corrective Closure number is defined as all incident tickets for which a CACF playbook took a corrective action. The Automated Corrective Closure percentage is calculated by dividing the Automated Corrective Closure number by all incident tickets in Kyndryl scope. (Gray > 75% / Amber 50-75% / Red < 50%).	Past 3 months
Incidents per server per month	Calculated by dividing all incident tickets in Kyndryl scope by all servers of the account. Only servers are counted; network devices, storage devices, etc. are excluded. The value is prorated per day. (Gray 0-0.8 / Amber 0.8-1.0 / Red > 1.0).	Default selection of past 30 days. Historical data of 210 days
Best practice deployment	Displays attainment of all policies across the environment. Deviation from best practice alignment is measured by how many findings (policy check failures) are reported by the best practice checks executed by SDE Automation Tools (SAT). (Gray 90-100% / Amber 85-90% / Red 0-85%).	Default selection of past 90 days
SSL certificates expiring (<30 days)	Shows all identified SSL certificates that expire in the next 30 days. Displays the number of SSL certificates expired or expiring in 30 days. (Gray certificates alive / Amber certificates expired / Red certificates expiring within next 30 days).	Default selection of next 30 days.
P1/P2 active Incidents	Shows all active (non-resolved) P1 and P2 incident tickets for the account. (Gray: all active incidents fall within P1 0-1 day and P2 0-2 days / Red: at least one active incident is P1 > 1 day or P2 > 2 days).	Default selection of past 30 days. Historical data of 210 days
P1/P2 active Problem	Shows the active (non-resolved) P1 and P2 problem tickets for the account. Displays the count of problem tickets with P1 > 6 days or P2 > 10 days. (Gray P1 0-6 days and P2 0-10 days / Red P1 > 6 days or P2 > 10 days).	Default selection of past 30 days. Historical data of 210 days
Critical Changes	Highlights changes that require additional due diligence to avoid a business-critical outage. It uses AI/ML to look at the risk level with which the support teams raised the change and performs a risk assessment of the change failing or causing incidents. Displays critical change risk for the account within 72 hours. (Gray 0-1 Critical Change / Red > 1).	Upcoming changes in next 72 hours. Historical data for past 1 year
Unsuccessful Changes	Shows the percentage of changes that failed for the account. (Gray = 0 / Amber 0-5% / Red > 5%).	Default selection of past 30 days. Historical data of 210 days
Patch Overdue %	Shows the devices for which patches have not been applied on time. Displays the % of devices that have patches overdue. (Gray = 0 / Amber 0-2 / Red > 2).	Default selection of past 1 day.
Devices with health check issues	Shows the percentage of devices that either are missing a health check run or have more than 5 health check deviations. (Gray 0-2% / Red > 2%).	-
Failed Backups	Shows the devices for which the backup has failed. Displays the % failed for the account. (Gray 0-2% / Red > 2%).	-
Devices out of capacity	Shows the devices that have less CPU, memory and/or disk capacity than the best practice recommendation. Displays the % of devices that are running out of capacity for the account. (Gray 0-10% / Amber 10-30% / Red > 30%).	Past 3 months
Devices EOS/EOL	Shows the End of Support or End of Life dates that have passed or are due within one year. Displays the % of devices that are at End of Life (EOL) or End of Support (EOS) for the account. (Gray 0-10% / Amber 10-30% / Red > 30%).	Latest from HWSW Inventory
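
As a rough illustration of the calculations and color bands described in the table, the following Python sketch reproduces a few of them. It is not the Kyndryl AIOps implementation: all function names and input shapes are hypothetical, and the handling of boundary values (for example exactly 75%), of empty inputs, and of how "prorated per day" scales to a month are assumptions, because the table does not specify them.

```python
from math import prod


def application_uptime_pct(server_uptime_pcts):
    """% uptime of an application as the product of its servers' % uptime."""
    # Convert each percentage to a fraction, multiply, convert back to percent.
    return 100.0 * prod(p / 100.0 for p in server_uptime_pcts)


def corrective_closure_pct(auto_corrected_tickets, tickets_in_scope):
    """Automated Corrective Closure number divided by all in-scope incident tickets."""
    if tickets_in_scope == 0:
        return 0.0  # assumption: no tickets in scope is reported as 0%
    return 100.0 * auto_corrected_tickets / tickets_in_scope


def incidents_per_server_per_month(incidents, servers, days, days_per_month=30):
    """Incidents per server, prorated per day and scaled to a month.

    Assumption: "prorated per day" means the daily rate is multiplied by a
    fixed month length (30 days here).
    """
    if servers == 0 or days == 0:
        return 0.0
    return incidents / servers / days * days_per_month


def corrective_closure_band(pct):
    """Gray > 75% / Amber 50-75% / Red < 50% (boundary handling is assumed)."""
    if pct > 75:
        return "Gray"
    if pct >= 50:
        return "Amber"
    return "Red"


def active_incident_band(incidents):
    """Color for P1/P2 active incidents.

    `incidents` is an iterable of (priority, age_in_days) pairs.
    Red if any P1 is older than 1 day or any P2 is older than 2 days.
    """
    limits = {"P1": 1, "P2": 2}
    return "Red" if any(age > limits[prio] for prio, age in incidents) else "Gray"
```

For example, an application mapped to two servers with 99% and 98% uptime would get roughly 97.02% uptime under this reading, and an account with 30 automated corrective closures out of 60 in-scope incident tickets would land in the Amber band.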

Metrics – Data Refresh Frequency

The following table provides an overview of the recommended refresh frequency for different types of data.

Not all insights using this data are refreshed at the same rate. Real-time should be read as "near real-time".

Mode of refresh	Data Type	Schedule
Real-time	Events	Every 2 min
Real-time	Incident tickets	Every 2 min
Real-time	Change requests	Every 2 min
Real-time	Problem tickets	Every 5 min
Real-time	Service tickets	Every 5 min
Real-time	Automation tickets	Every 30 min
Daily	Inventory
Daily	Netcool LDS

Data Types for Integrated AIOps Landing Page (and drill down insights*)
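
When consuming these feeds, the schedule above can be used to judge whether a value shown on the Home screen might be out of date. The sketch below encodes the table as a lookup and flags a data type as stale once more than a chosen number of refresh intervals has elapsed. The names `REFRESH_INTERVALS` and `is_stale`, the `grace` parameter, and the reading of "Daily" as a 24-hour interval are assumptions made for illustration only; they are not part of Kyndryl AIOps.

```python
from datetime import datetime, timedelta

# Refresh intervals taken from the table above.
# Assumption: "Daily" is treated as one refresh every 24 hours.
REFRESH_INTERVALS = {
    "Events": timedelta(minutes=2),
    "Incident tickets": timedelta(minutes=2),
    "Change requests": timedelta(minutes=2),
    "Problem tickets": timedelta(minutes=5),
    "Service tickets": timedelta(minutes=5),
    "Automation tickets": timedelta(minutes=30),
    "Inventory": timedelta(days=1),
    "Netcool LDS": timedelta(days=1),
}


def is_stale(data_type, last_refresh, now, grace=2):
    """Return True if more than `grace` refresh intervals have passed.

    `grace` is a hypothetical tolerance so that a single delayed refresh
    does not immediately count as stale.
    """
    return now - last_refresh > grace * REFRESH_INTERVALS[data_type]


if __name__ == "__main__":
    now = datetime(2024, 7, 22, 10, 19)
    # Events refreshed 5 minutes ago exceed 2 x 2 minutes.
    print(is_stale("Events", now - timedelta(minutes=5), now))
    # Inventory refreshed 36 hours ago is still within 2 x 24 hours.
    print(is_stale("Inventory", now - timedelta(hours=36), now))
```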

Use cases

The following table presents the associated use cases.

Metric Name	Purpose	Frequency of Usage	Persona using the Insight
Alerts	Account Business Health	Daily	Delivery Partner, Delivery Manager, SRE and team leaders
Actionable Insights	Address areas of concern and areas of improvement	Weekly	SRE, Delivery Partner, Delivery Manager, T&I and team leaders
Servers having less than 99% uptime	Reduce outages in the system	Weekly	SRE, Delivery Partner, Delivery Manager, T&I and team leaders
Business Application MTBF Hours	Reduce Outages in the system	Weekly	SRE, Delivery Partner, Delivery Manager, T&I and team leaders
SSL Certificate Expiring	Reduce Noise in the system	Weekly	SRE, Delivery Partner, Delivery Manager, T&I and team leaders
Remediation with Corrective Closure	Reduce manual activities	Weekly	SRE, Delivery Manager and T&I
Incidents per server per month	Reduce the noise in the system	Daily or Weekly	SRE, Delivery Partner, Delivery Manager, T&I and team leaders
Best Practice deployment	Apply industry best practices	Weekly	SRE, Delivery Manager and T&I
P1/P2 active Incidents	Account Business Health	Daily	Delivery Partner, Delivery Manager, SRE and team leaders
P1/P2 active Problem	Account Business Health	Daily	Delivery Partner, Delivery Manager, SRE and team leaders
Critical Changes: upcoming in next 3 days	Account Business Health	Daily	Delivery Partner, Delivery Manager, SRE and team leaders
Critical Changes: upcoming in next weekly CAB cycle	Account Business Health	Weekly CAB	Delivery Partner, Delivery Manager, SRE and team leaders
Critical Changes: changes which require less review rigor	Account Business Health	Quarterly Continuous Improvement meetings	Delivery Partner, Customer, Delivery Manager, SRE and team leaders
Unsuccessful Changes	Account Business Health	Weekly	Delivery Partner, Delivery Manager, SRE and team leaders
Patch Overdue %	Account Compliance Posture	Weekly	SRE, ISA (Integrated Security Analyst) and team members
Devices with health check issues	Account Compliance Posture	Weekly	SRE, ISA (Integrated Security Analyst) and team members
Failed Backups	Reduce backup failures	Weekly	SRE, Delivery Partner, Delivery Manager, T&I and team leaders
Devices out of capacity	Capacity management for the account	Weekly	SRE, Delivery Partner, Delivery Manager, T&I and team leaders
Devices EOS/EOL	Obsolescence management	Monthly	SRE, Delivery Partner, Delivery Manager, T&I and team leaders