Kyndryl AIOps

Introduction to Kyndryl AIOps

Actionable Insights
Published On Jun 04, 2024 - 1:23 PM

The Actionable Insights feature of Integrated AIOps applies algorithms across the available data sources to identify critical areas that need to be addressed.
Actionable Insights helps to solve complex infrastructure problems using algorithms. Each actionable insight presents observations from the data and provides a one-click path to the recommended next best action. The feature is built on Artificial Intelligence and Machine Learning concepts.
Natural Language Classification, Language Translation, Global Explainability, Local Explainability, and Entity Relationship Discovery are the key components of Actionable Insights. An actionable insight is not the same as a dashboard.
Dashboards visualize the data and let users interact with it to narrow it down manually and so identify areas of interest. In contrast, each actionable insight provides a decisive answer (that is, a recommended action) for a specific area of opportunity or issue to be addressed. Generally, different types of data from multiple sources (as listed in the Metrics table) are correlated to arrive at an actionable insight.
Each actionable insight essentially provides the following information:
  1. Summary:
    Displays summary records for the account by category of the insight (e.g., incident tickets).
  2. Top 10 items:
    Displays the top 10 objects as a donut chart. They represent the key areas impacted by this actionable insight.
  3. Recommended Actions:
    Provides detailed recommendations to address the observation of the actionable insight, based on the available standard operating procedures (SOPs) and automation playbooks.
  4. Observation from Data:
    Lists the observations made from the specified data type(s). This is a further drill-down that serves as evidence and gives additional details of the underlying data.
  5. Ticket Details:
    Provides the complete list of ticket details, which can be searched and filtered for further analysis and investigation.
The observations in each actionable insight are based on Month to Date (MTD) data and are refreshed regularly as new data becomes available.
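The five parts listed above can be pictured as a single record. The following is a minimal sketch in Python, not the product's actual data model: the names `Ticket`, `ActionableInsight`, `month_to_date`, and `build_insight` are hypothetical, and the sketch assumes that the Top 10 items are obtained by counting month-to-date tickets per device.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Ticket:
    # Hypothetical minimal ticket record; real tickets carry many more fields.
    ticket_id: str
    device: str
    opened_on: date


@dataclass
class ActionableInsight:
    summary: str                      # 1. Summary
    top_items: list[tuple[str, int]]  # 2. Top 10 items as (label, count)
    recommended_actions: list[str]    # 3. Recommended Actions
    observations: list[str]           # 4. Observation from Data
    ticket_details: list[Ticket] = field(default_factory=list)  # 5. Ticket Details


def month_to_date(tickets, today):
    """Keep tickets opened from the first day of the current month up to today."""
    start = today.replace(day=1)
    return [t for t in tickets if start <= t.opened_on <= today]


def build_insight(tickets, today, actions):
    """Assemble an insight from MTD tickets (assumed grouping: by device)."""
    mtd = month_to_date(tickets, today)
    top = Counter(t.device for t in mtd).most_common(10)
    return ActionableInsight(
        summary=f"{len(mtd)} incident tickets month to date",
        top_items=top,
        recommended_actions=list(actions),
        observations=[f"{name}: {count} tickets" for name, count in top],
        ticket_details=mtd,
    )
```

In this sketch the donut chart would be drawn from `top_items`, while `ticket_details` keeps the full MTD list for searching and filtering.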

Business Value and Benefits

Actionable Insights are key to reducing the time an account squad needs to identify and address areas of opportunity and issues within the customer environment. As such, they provide key value to our customers and to Kyndryl as a whole. Our aim is to extend the library of actionable insights significantly.

Metrics

The following tables describe each KPI/metric used within the insights. They are grouped by domain area.
Top 5 Actionable Insights

| KPI/Metric Name | KPI/Metric Description | Expected outcome by taking recommended actions |
| --- | --- | --- |
| Device with the most incident tickets | Shows the devices with the most incident tickets for a specific month, along with the previous month's data. It helps to address the servers with the most frequently occurring issues and provides recommendations on the repeated issues. | Incident reduction and thus an improved incident-to-server ratio |
| Top automation playbook opportunity for resolving incident tickets | Shows the top automation playbook opportunities for resolving incident tickets and provides recommendations. | Improved automated corrective closure |
| Changes in the next 7-day window with predicted higher risk | Shows the change tickets in the next 7-day window for which the AIOps-predicted risk is higher than the risk assigned by the account team, and provides recommendations. | Reduced failed changes |
| Device with most expiring SSL certificates | Shows the account which SSL certificates are going to expire in the next 30 days, along with recommendations. | Avoid potential business impact |
| Top device with Must Fix deviations | Shows the top devices with Must Fix health check deviations. | Reduce noise in the system and avoid potential business impact |
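Two of the metrics above are time-window filters, which can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's implementation: the dictionary keys (`scheduled`, `assigned_risk`, `predicted_risk`, `expires`) and the three-level risk scale in `RISK_ORDER` are hypothetical, since the source does not describe the underlying data model or risk scale.

```python
from datetime import date, timedelta

# Hypothetical ordering of risk levels; the real scale is not given in the text.
RISK_ORDER = {"low": 1, "medium": 2, "high": 3}


def risky_changes(changes, today, window_days=7):
    """Return changes scheduled within the window whose predicted risk
    is higher than the risk assigned by the account team."""
    end = today + timedelta(days=window_days)
    return [
        c for c in changes
        if today <= c["scheduled"] <= end
        and RISK_ORDER[c["predicted_risk"]] > RISK_ORDER[c["assigned_risk"]]
    ]


def expiring_certificates(certs, today, window_days=30):
    """Return certificates that expire within the next `window_days` days."""
    end = today + timedelta(days=window_days)
    return [c for c in certs if today <= c["expires"] <= end]
```

Both window boundaries are treated as inclusive here; whether the product counts the boundary days is not stated in the source.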

Use cases

  1. Top devices with the most disk and swap space related incidents:
    The insight shows the devices with the greatest number of disk and swap space incident tickets for a specific month. It helps to reduce the number of disk and swap space tickets.
  2. Top devices with the most CPU spikes:
    The insight shows the devices with the greatest number of CPU performance spikes. It helps to flag devices where CPU plays a critical role, because CPU spikes affect application and system performance.
  3. Top devices with the highest CPU utilization:
    The insight shows the devices with the highest CPU utilization. It helps to identify the servers causing CPU issues, which improves service uptime and reduces noise.
  4. Top devices with the highest memory utilization:
    The insight shows the devices with the highest memory utilization. It helps to identify the servers causing memory issues, which improves service uptime and reduces noise.
  5. Top devices with the most incident tickets:
    The insight shows the devices with the most incident tickets for a specific month, along with the previous month's data. It helps to address the servers with the most frequently occurring issues and provides recommendations on the repeated issues.
  6. Top office locations of devices with the most incidents:
    The insight shows the office locations whose devices have the most incident tickets for a specific month, along with the previous month's data. It helps to address the servers with the most frequently occurring issues and provides recommendations on the repeated issues.
  7. Top devices with monitoring frequently offline:
    The insight shows the devices whose monitoring is most frequently offline. It helps to address node-down issues on a per-server basis and provides recommendations for identifying the servers that need corrective action.
  8. Top devices with the most auto-resolved tickets:
    The insight shows the auto-resolved tickets for a given server. It helps to identify the automated tickets across servers.
  9. Top business applications with the most incident tickets:
    The insight shows the business applications, mapped to their servers, with the most incident tickets for a specific month, along with the previous month's data. It helps to address the issues occurring in the most critical business applications.
  10. Devices with the most frequent process or service down issues:
    The insight shows the devices with the most process and service down issues. It helps to identify the servers causing performance issues, improve service uptime, and reduce noise.
  11. Devices with the most missed or failed backup issues:
    The insight shows the devices with the most backup issues. It helps to improve performance and service uptime.
  12. Devices with the most database issues:
    The insight shows the devices with the most database issues. It helps to improve performance and service uptime.
  13. Devices with the most batch job and job abend issues:
    The insight shows the devices with the most job abend issues. It helps to improve performance and service uptime.
  14. Top devices with the most change tickets:
    The insight shows the devices with the most change tickets, by server. It helps to flag the devices with the most change tickets, address change risks and business needs, and analyze server criticality.
  15. Top devices with the most change-induced incidents:
    The insight shows the devices with the most change-induced incidents, by server. It helps to flag the devices with the most change tickets, address change risks and business needs, and analyze server criticality.
  16. Top change groups with the most change-causing incidents:
    The insight shows the devices with the most change tickets that caused an incident. It helps to recommend monitoring the devices with the most incidents triggered after changes were performed. It also allows users to determine the root cause (RCA) and decide whether to approve new changes to these devices, by type of incident.
  17. Storage devices below the minimum level of firmware and not in compliance:
    The insight shows the storage devices that are below the minimum firmware level and are non-compliant. It helps to reduce the number of non-compliant storage devices and bring them to the minimum required level.
  18. Storage devices not at the target level of firmware and recommended for upgrade:
    The insight shows the storage devices that are not at the target firmware level and are recommended for an upgrade. It helps to recommend upgrades for specific storage devices.
  19. Number of tickets not flowing into the automation engine:
    The insight shows the devices whose tickets are not flowing into the automation engine. It helps to identify manually handled tickets that can be automated to improve the uptime and availability of the infrastructure.
  20. Number of tickets generated by top resolver groups because fully automated ticketing is not deployed on all servers:
    The insight shows the devices on which automation is not installed. It helps to identify the servers with no automation.
  21. Number of tickets resolved by adjusting the automation matcher:
    The insight shows the tickets resolved by adjusting the matcher to select the best available automata. It helps to identify the tickets that are resolved automatically after the automata are adjusted or configured. This action improves the uptime and availability of the infrastructure.
  22. Number of diagnostic tickets that need to be converted to auto-resolved tickets:
    The insight shows the tickets that arrive as diagnostic tickets but can be converted to auto-resolved tickets. It helps to reduce failed diagnostic tickets and improve server uptime.
  23. Number of tickets with automation connection failures that need to be fixed:
    The insight shows the tickets that require automation but need attention on the automation connection. It helps to identify the connection issues so that the appropriate automation can be used.
  24. Number of tickets auto-resolved if Netcool Resolve on Clear (RoC) is enabled:
    The insight shows the tickets that could be auto-resolved if Netcool RoC were enabled. It helps to identify the tickets where Netcool RoC is not enabled and to resolve those tickets using automation. It also helps to improve uptime and availability.
  25. Number of automation tickets with automata failures that need to be fixed:
    The insight shows the tickets for which automation failed. It helps to identify the tickets that automation left unresolved.
  26. Events not automated:
    The insight shows the events that can potentially be automated. It helps to increase automation for the identified event opportunities.
  27. Events ticketed without automation requests:
    The insight shows the ticketed events that can potentially be automated. It helps to increase automation for the identified event opportunities.
  28. Incident ticket reduction from event correlation:
    The insight shows the tickets created from correlated events. It helps to reduce the number of incidents created for repeated events and avoids duplicate incidents.
  29. Number of servers ready for cloud migration:
    The insight shows the devices that are ready to move to the cloud. It helps to recommend actions for, or migrate, the devices that are ready for cloud migration. It also helps clients to optimize infrastructure costs.
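Several of the use cases above (for example, 5, 6, and 9) rank objects by incident count for a specific month and show the previous month's count alongside. A minimal sketch of that ranking in Python follows; the function names and the `(device, opened_on)` input shape are hypothetical, and grouping by office location or business application would only change the key being counted.

```python
from collections import Counter
from datetime import date


def previous_month(year, month):
    """Return the (year, month) pair that precedes the given month."""
    return (year - 1, 12) if month == 1 else (year, month - 1)


def top_devices(incidents, year, month, n=10):
    """Rank devices by incident count in the given month.

    `incidents` is an iterable of (device, opened_on) pairs.
    Returns (device, count_this_month, count_previous_month) tuples.
    """
    prev = previous_month(year, month)
    current = Counter()
    earlier = Counter()
    for device, opened_on in incidents:
        key = (opened_on.year, opened_on.month)
        if key == (year, month):
            current[device] += 1
        elif key == prev:
            earlier[device] += 1
    # Counter returns 0 for devices with no incidents in the previous month.
    return [(device, count, earlier[device]) for device, count in current.most_common(n)]
```

A device that appears only in the previous month is omitted from the result, because the ranking is driven by the selected month.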