Home screen insights
Published On Jun 04, 2024 - 1:22 PM

The landing page is the first page a user sees when logging on to Integrated AIOps.
The landing page takes a different approach from other dashboards: it shows only what needs to be done to improve. It does not show what is going well; it shows only the items that are critical to address, each with a one-click link to the next best action.

Business Value and Benefits

The landing page of Kyndryl Bridge - Integrated AIOps gives the various roles within the account a single pane of glass with observability, actionable insights to address areas of concern or improvement, and IT Health Indicators that show where additional action is needed to avoid business-critical issues. The Home page offers the following benefits:
  • Cut down resources by reducing toil (noise reduction) and automating as much of the manual effort in the account as possible.
  • Re-skill or re-deploy resources to concentrate on customer needs.
  • Avoid potential business-impacting issues by taking proactive actions.
  • Reduce outages and thus improve uptime and availability of servers and applications.
  • Improve and sustain the compliance posture for the account.
  • Provide the best delivery experience to the customer.

Metrics

The following table provides a description and calculation for each KPI/metric used within the insight.
| KPI/Metric Name | KPI/Metric Description | Data Range |
|---|---|---|
| Alerts | Shows active ticketed events grouped by business application. When available, other observability information can be displayed as well. (Amber: event severity Warning, 3 / Red: event severity Critical, 4, or Fatal, > 5) | - |
| Actionable Insights | Shows insights based on AI/ML cross-referencing multiple data types to address the area of concern or improvement and provide the next best action. Only the actionable insights from the insight library that apply to the account are shown. | Current Month to Date (MTD) |
| Servers having less than 99% uptime | Shows the percentage of servers for the account that have less than 99% uptime. (Gray 0-5% / Red > 5%) | Rolling 30 days |
| Business Application MTBF Hours | Shows the Mean Time Between Failure (MTBF) for the 3 applications that have the lowest MTBF for the account. The total uptime for each application is calculated from the % uptime and the total working time. The % uptime of an application is calculated as the product of the % uptime of all the servers mapped to the business application, so the application MTBF is based on the uptime % of its underlying servers. (Gray 95-100% / Red 0-95%) | Past 90 days |
| Total Inventory Items | Shows the number of devices within the inventory. The drilldown also shows the mapping to the business applications. | Latest from SESDR |
| Remediation with corrective closure | The Automated Corrective Closure number is defined as all incident tickets for which a CACF playbook took a corrective action. The Automated Corrective Closure percentage is calculated by dividing the Automated Corrective Closure number by all incident tickets in Kyndryl scope. (Gray > 75% / Amber 50-75% / Red < 50%) | Past 3 months |
| Incidents per server per month | Calculated by dividing all incident tickets in Kyndryl scope by all servers of the account. Only servers are counted; network devices, storage devices, etc. are excluded. Incidents per server/month are prorated per day. (Gray 0-0.8 / Amber 0.8-1.0 / Red > 1.0) | Default selection of past 30 days. Historical data of 210 days. |
| Best practice deployment | Displays attainment of all policies across the environment. Deviation from best practice alignment is measured by how many findings (policy check failures) are reported by the best practice checks executed by SDE Automation Tools (SAT). (Gray 90-100% / Amber 85-90% / Red 0-85%) | Default selection of past 90 days |
| SSL certificates expiring (<30 days) | Shows the number of identified SSL certificates that have expired or will expire in the next 30 days. (Gray certificates alive / Amber certificates expired / Red certificates expiring within next 30 days) | Default selection of next 30 days |
| P1/P2 active Incidents | Shows all the active (non-resolved) P1 and P2 incident tickets for the account. Gray: all active incidents are within the ranges P1 0-1 day and P2 0-2 days. Red: at least one incident meets the criteria P1 > 1 day or P2 > 2 days. | Default selection of past 30 days. Historical data of 210 days. |
| P1/P2 active Problem | Shows the active (non-resolved) P1 and P2 problem tickets for the account. Displays the count of problem tickets with P1 > 6 days or P2 > 10 days. (Gray P1 0-6 days and P2 0-10 days / Red P1 > 6 days or P2 > 10 days) | Default selection of 30 days. Historical data of 210 days. |
| Critical Changes | Highlights changes that require additional due diligence to avoid a business-critical outage. It uses AI/ML to look at the risk with which the support teams raised the change and performs a risk assessment against failure of the change or the change causing incidents. Displays critical change risk for the account within 72 hours. (Gray 0-1 Critical Change / Red > 1) | Upcoming changes in next 72 hours. Historical data for past 1 year. |
| Unsuccessful Changes | Shows the percentage of changes that failed for the account. (Gray = 0 / Amber 0-5% / Red > 5%) | Default selection of past 30 days. Historical data of 210 days. |
| Patch Overdue % | Shows the devices for which patches have not been applied on time. Displays the % of devices that have patches overdue. (Gray = 0 / Amber 0-2 / Red > 2) | Default selection of past 1 day |
| Devices with health check issues | Shows the percentage of devices that either are missing a health check run or have more than 5 health check deviations. (Gray 0-2% / Red > 2%) | - |
| Failed Backups | Shows the devices for which the backup has failed. Displays the % failed for the account. (Gray 0-2% / Red > 2%) | - |
| Devices out of capacity | Shows the devices that have less CPU, memory and/or disk capacity than the best practice recommendation. Displays the % of devices that are running out of capacity for the account. (Gray 0-10% / Amber 10-30% / Red > 30%) | Past 3 months |
| Devices EOS/EOL | Shows the End of Support or End of Life dates that have passed or are due within one year. Displays the % of devices that are at End of Life (EOL) or End of Support (EOS) for the account. (Gray 0-10% / Amber 10-30% / Red > 30%) | Latest from HWSW Inventor |
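
As an illustration only, two of the calculations described in the table above can be sketched in Python: the application % uptime as the product of its servers' % uptime, and the Automated Corrective Closure percentage with its Gray/Amber/Red bands. The function names and the handling of empty inputs are assumptions made for this sketch; they are not part of Integrated AIOps or any Kyndryl API.

```python
from math import prod


def application_uptime_pct(server_uptime_pcts):
    """Application % uptime as the product of the % uptime of its servers."""
    if not server_uptime_pcts:
        # Assumption: an application with no mapped servers is an input error.
        raise ValueError("at least one server uptime value is required")
    return 100.0 * prod(p / 100.0 for p in server_uptime_pcts)


def corrective_closure_pct(corrected_tickets, tickets_in_scope):
    """Automated Corrective Closure number divided by all tickets in scope."""
    if tickets_in_scope == 0:
        # Assumption: the source does not say what is shown with no tickets.
        raise ValueError("no incident tickets in scope")
    return 100.0 * corrected_tickets / tickets_in_scope


def corrective_closure_status(pct):
    """Bands from the table: Gray > 75% / Amber 50-75% / Red < 50%."""
    if pct > 75:
        return "Gray"
    if pct >= 50:
        return "Amber"
    return "Red"


if __name__ == "__main__":
    # Two servers at 99% and 98% uptime give roughly 97.02% for the application.
    print(round(application_uptime_pct([99.0, 98.0]), 2))
    pct = corrective_closure_pct(120, 200)
    print(pct, corrective_closure_status(pct))
```

Because the application uptime is a product, it is never higher than the uptime of its weakest server, which is why an application mapped to many servers can fall below the 95% band even when each server looks healthy on its own.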

Metrics – Data Refresh Frequency

The following table provides an overview of the recommended refresh frequency for different types of data.
Not all insights that use this data are refreshed at the same rate. Real-time should be read as "near real-time".
| Mode of refresh | Data Type | Schedule |
|---|---|---|
| Real-time | Events | Every 2 min |
| Real-time | Incident tickets | Every 2 min |
| Real-time | Change requests | Every 2 min |
| Real-time | Problem tickets | Every 5 min |
| Real-time | Service tickets | Every 5 min |
| Real-time | Automation tickets | Every 30 min |
| Daily | Inventory | Daily |
| | Netcool LDS | |

Data Types for Integrated AIOps Landing Page (and drill down insights*)
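
The refresh schedule above can also be written down as a small lookup, for example to check whether a data type is overdue for a refresh. This is a minimal sketch, not product behavior: the names `REFRESH_INTERVAL` and `is_overdue` are hypothetical, "Daily" is assumed to mean 24 hours, and Netcool LDS is left out because no schedule is listed for it.

```python
from datetime import datetime, timedelta

# Intervals copied from the refresh table. "Daily" is taken to mean 24 hours
# (an assumption). Netcool LDS is omitted because no schedule is listed for it.
REFRESH_INTERVAL = {
    "Events": timedelta(minutes=2),
    "Incident tickets": timedelta(minutes=2),
    "Change requests": timedelta(minutes=2),
    "Problem tickets": timedelta(minutes=5),
    "Service tickets": timedelta(minutes=5),
    "Automation tickets": timedelta(minutes=30),
    "Inventory": timedelta(days=1),
}


def is_overdue(data_type: str, last_refresh: datetime, now: datetime) -> bool:
    """Return True when more than one refresh interval has elapsed."""
    return now - last_refresh > REFRESH_INTERVAL[data_type]


if __name__ == "__main__":
    now = datetime(2024, 6, 4, 13, 22)
    three_minutes_ago = now - timedelta(minutes=3)
    print(is_overdue("Events", three_minutes_ago, now))           # True
    print(is_overdue("Problem tickets", three_minutes_ago, now))  # False
```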

Use cases

The following table presents the associated use cases.
| Metric Name | Purpose | Frequency of Usage | Persona using the Insight |
|---|---|---|---|
| Alerts | Account Business Health | Daily | Delivery Partner, Delivery Manager, SRE and team leaders |
| Actionable Insights | Address areas of concern and areas of improvement | Weekly | SRE, Delivery Partner, Delivery Manager, T&I and team leaders |
| Servers having less than 99% uptime | Reduce outages in the system | Weekly | SRE, Delivery Partner, Delivery Manager, T&I and team leaders |
| Business Application MTBF Hours | Reduce outages in the system | Weekly | SRE, Delivery Partner, Delivery Manager, T&I and team leaders |
| SSL Certificate Expiring | Reduce noise in the system | Weekly | SRE, Delivery Partner, Delivery Manager, T&I and team leaders |
| Remediation with Corrective Closure | Reduce manual activities | Weekly | SRE, Delivery Manager and T&I |
| Incidents per server per month | Reduce noise in the system | Daily or Weekly | SRE, Delivery Partner, Delivery Manager, T&I and team leaders |
| Best Practice deployment | Apply industry best practices | Weekly | SRE, Delivery Manager and T&I |
| P1/P2 active Incidents | Account Business Health | Daily | Delivery Partner, Delivery Manager, SRE and team leaders |
| P1/P2 active Problem | Account Business Health | Daily | Delivery Partner, Delivery Manager, SRE and team leaders |
| Critical Changes: upcoming in next 3 days | Account Business Health | Daily | Delivery Partner, Delivery Manager, SRE and team leaders |
| Critical Changes: upcoming in next weekly CAB cycle | Account Business Health | Weekly CAB | Delivery Partner, Delivery Manager, SRE and team leaders |
| Critical Changes: changes which require less review rigor | Account Business Health | Quarterly Continuous Improvement meetings | Delivery Partner, Customer, Delivery Manager, SRE and team leaders |
| Unsuccessful Changes | Account Business Health | Weekly | Delivery Partner, Delivery Manager, SRE and team leaders |
| Patch Overdue % | Account Compliance Posture | Weekly | SRE, ISA (Integrated security analyst) and team members |
| Devices with health check issues | Account Compliance Posture | Weekly | SRE, ISA (Integrated security analyst) and team members |
| Failed Backups | Reduce backup failures | Weekly | SRE, Delivery Partner, Delivery Manager, T&I and team leaders |
| Devices out of capacity | Capacity management for the account | Weekly | SRE, Delivery Partner, Delivery Manager, T&I and team leaders |
| Devices EOS/EOL | Obsolescence management | Monthly | SRE, Delivery Partner, Delivery Manager, T&I and team leaders |