Kyndryl AIOps

Mean Time Between Failure for Applications
Published On Jun 04, 2024 - 1:23 PM

This page describes the Mean Time Between Failures (MTBF) Business Application insight of Integrated AIOps.
On the F1 Landing page, there are four IT Health Indicators for MTBF and Uptime.
  1. The first widget provides the Uptime and MTBF for all the servers of an account (average time between system failures).
  2. The next three widgets provide the MTBF and Uptime for the three applications with the lowest MTBF.
In Integrated AIOps, MTBF is the average time between two P1 incidents for a server on the account, measured over a 30-day rolling period (720 hours).
MTBF is a key maintenance metric for measuring performance.
  1. If Uptime is high and MTBF is low, failures are frequent but short. This can lead to business impact if not addressed quickly, so take immediate action to prevent the frequent failures.
  2. If both Uptime and MTBF are high, failures are infrequent and tickets are resolved quickly, which reduces business impact.
If the MTBF is 30 days (720 hours), the drill-down for the metric is disabled.
The general formula is the following:
MTBF = (Total Working Time - Total Breakdown Time) / Number of Failures (P1 incident tickets)
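The general formula can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: the function name `mtbf_hours` and its parameters are hypothetical, and returning the full window when there are no failures is an assumption inferred from the 720-hour case mentioned above.

```python
def mtbf_hours(total_working_hours: float,
               total_breakdown_hours: float,
               failure_count: int) -> float:
    """MTBF = (Total Working Time - Total Breakdown Time) / Number of Failures."""
    if failure_count < 0:
        raise ValueError("failure_count must be non-negative")
    if failure_count == 0:
        # Assumption: with no P1 incidents, report the full window as the MTBF.
        return total_working_hours
    return (total_working_hours - total_breakdown_hours) / failure_count


# Hypothetical server: 0.13 hours of breakdown across 4 P1 incidents
# in a 720-hour window.
print(mtbf_hours(720.0, 0.13, 4))
```

For this input the result is roughly 179.97 hours, that is, (720 - 0.13) / 4.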
The following image shows IT Health indicators:

Business value and Benefits

  1. MTBF helps estimate failure trends for the server or application.
  2. Closely monitoring this metric and taking measures to increase MTBF helps avoid potential business impact for the account.

Metrics

This insight is measured over a 30-day rolling period (720 hours) and takes only P1 incidents into account, to keep the focus on high-priority, high-impact issues.
Hosts must be linked to a business application.
The general formula is the following: MTBF = (Total Working Time - Total Breakdown Time) / Number of Failures (P1 incident tickets).
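The 30-day rolling window and the P1-only filter can be illustrated with a short Python sketch. The record layout (`id`, `priority`, `opened_at`) and the function name are hypothetical, and selecting incidents by their open time is an assumption; the page does not say how tickets that span the window boundary are handled.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=720)  # 30 days * 24 hours


def p1_incidents_in_window(incidents, now):
    """Return P1 incidents opened within the 720 hours before `now`."""
    window_start = now - WINDOW
    return [
        inc for inc in incidents
        if inc["priority"] == "P1" and window_start <= inc["opened_at"] <= now
    ]


# Hypothetical incident records for illustration only.
now = datetime(2024, 6, 4, 13, 23)
incidents = [
    {"id": "INC-1", "priority": "P1", "opened_at": datetime(2024, 5, 20, 8, 0)},
    {"id": "INC-2", "priority": "P2", "opened_at": datetime(2024, 5, 25, 9, 30)},
    {"id": "INC-3", "priority": "P1", "opened_at": datetime(2024, 4, 1, 12, 0)},
]
selected = p1_incidents_in_window(incidents, now)
print([inc["id"] for inc in selected])  # ['INC-1']
```

INC-2 is dropped because it is not a P1 incident, and INC-3 because it was opened before the window started.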
Overall MTBF, Uptime, and Uptime % Calculation (Sample data)

Business Application | Server Name | Server Uptime (hrs) | Total Working Time (720 hrs = 30 days * 24 hrs) | % Uptime for each server (Uptime / Total Working Time) | Number of Failures
-------------------- | ----------- | ------------------- | ----------------------------------------------- | ------------------------------------------------------ | ------------------
App-1                | server-1    | 719.87              | 720                                             | 0.9998 = Uptime1                                       | 4
App-1                | server-2    | 719.63              | 720                                             | 0.9995 = Uptime2                                       | 4
App-1                | server-3    | 719.93              | 720                                             | 0.9999 = Uptime3                                       | 2
App-1                | server-4    | 719.93              | 720                                             | 0.9999 = Uptime4                                       | 2
App-1                | server-5    | 719.93              | 720                                             | 0.9999 = Uptime5                                       | 1
App-1                | server-6    | 365.05              | 720                                             | 0.5070 = Uptime6                                       | 1
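The sample data can be worked through in Python. The sketch below follows the Calculation Method described on this page (application % uptime as the product of the per-server values, total uptime as % uptime times Total Working Time). Taking the application's failure count as the sum of the per-server counts is an assumption made for illustration, and all variable names are hypothetical.

```python
from math import prod

TOTAL_WORKING_HOURS = 720.0  # 30 days * 24 hours

# (server name, uptime in hours, number of P1 failures) from the sample table.
servers = [
    ("server-1", 719.87, 4),
    ("server-2", 719.63, 4),
    ("server-3", 719.93, 2),
    ("server-4", 719.93, 2),
    ("server-5", 719.93, 1),
    ("server-6", 365.05, 1),
]

# Per-server % uptime = Uptime / Total Working Time.
per_server = {name: uptime / TOTAL_WORKING_HOURS for name, uptime, _ in servers}

# Application % uptime = product of the per-server % uptime values.
app_uptime_ratio = prod(per_server.values())

# Application total uptime = % uptime * Total Working Time.
app_total_uptime = app_uptime_ratio * TOTAL_WORKING_HOURS

# Assumption: application failures = sum of the per-server failure counts.
app_failures = sum(failures for _, _, failures in servers)
app_mtbf = app_total_uptime / app_failures

for name, ratio in per_server.items():
    print(f"{name}: {ratio:.4f}")
print(f"App-1 % uptime: {app_uptime_ratio:.4f}")
print(f"App-1 total uptime (hrs): {app_total_uptime:.2f}")
print(f"App-1 MTBF (hrs): {app_mtbf:.2f}")
```

Because the application value is a product, the single low-uptime server (server-6) pulls the application % uptime down to roughly 0.51 even though the other five servers are close to 1.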

Calculation Method

KPI/Metric Name       | KPI/Metric Description
--------------------- | ----------------------
MTBF (in Hours)       | Overall MTBF for the application, based on the total uptime and the total number of P1 failures.
Total Uptime (in hrs) | Total uptime for the application, calculated from the % Uptime and the Total Working Time.
% Uptime              | Product of the % uptime of all the servers mapped to the business application.
Number of Failures    | Total number of P1 incident tickets for the application.
Server MTBF           | MTBF calculated for each server, based on its uptime and number of P1 failures.
Server Uptime         | Uptime for each server, calculated from the Total Working Time and Total Breakdown Time. Note: overlapping tickets are handled without double counting breakdown time.
Total Working Time    | Total server available time for the 30-day rolling window, that is, 720 hours per server.
Total Breakdown Time  | The cumulative time between the open date (DD, HH:MM:SS) and the resolution date (DD, HH:MM:SS) of the P1 incident tickets. Note: overlapping tickets are handled without double counting breakdown time.
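The note about overlapping tickets can be illustrated with a standard interval-merge. This is a sketch of one way to avoid double counting, not necessarily how the product computes it. The function name and the ticket tuples of (open time, resolution time) are hypothetical.

```python
from datetime import datetime


def total_breakdown_hours(tickets):
    """Sum ticket durations in hours, counting overlapping periods once."""
    merged = []
    for opened, resolved in sorted(tickets):
        if merged and opened <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            merged[-1][1] = max(merged[-1][1], resolved)
        else:
            merged.append([opened, resolved])
    return sum((end - start).total_seconds() for start, end in merged) / 3600.0


# Hypothetical P1 tickets for one server.
tickets = [
    (datetime(2024, 5, 10, 10, 0), datetime(2024, 5, 10, 12, 0)),
    (datetime(2024, 5, 10, 11, 0), datetime(2024, 5, 10, 13, 0)),  # overlaps the first
    (datetime(2024, 5, 12, 15, 0), datetime(2024, 5, 12, 15, 30)),
]
breakdown = total_breakdown_hours(tickets)
print(breakdown)  # 3.5
```

The first two tickets overlap between 11:00 and 12:00, so together they contribute 3 hours rather than 4. With a 720-hour window, the server uptime in this example would be 720 - 3.5 = 716.5 hours.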
The Details table lists all the servers that have incidents, along with their uptime and number of failures within the 30-day period. For P1 incident details, click the number of failures shown for each server.

Use Cases

This insight provides the MTBF and Uptime details for the application, based on the servers mapped in the Business Application Mapping.
  1. If Uptime is high and MTBF is low, failures are frequent but short. This can lead to business impact if not addressed quickly, so take immediate action to prevent the frequent failures.
  2. If both Uptime and MTBF are high, failures are infrequent and tickets are resolved quickly, which reduces business impact.