Kyndryl AIOps

Mean Time Between Failure for Applications

Published On Aug 30, 2024 - 10:42 PM

This page describes the Mean Time Between Failures (MTBF) Business Application insight of Integrated AIOps.

On the F1 landing page, there are four IT Health Indicators for MTBF and Uptime.

The first widget provides the Uptime and MTBF (the average time between system failures) for all the servers of an account.

The next three widgets provide the MTBF and Uptime for the three applications with the lowest MTBF.

In Integrated AIOps, MTBF is the average time between P1 incidents for a server on the account, measured over a rolling 30-day period (720 hours).

MTBF is a crucial maintenance metric to measure performance.

If Uptime is high and MTBF is low, failures are frequent but short. This can lead to business impact if not addressed quickly, so take immediate action to prevent the recurring failures.

If Uptime and MTBF are both high, failures are infrequent and tickets are resolved quickly, which reduces business impact.

If the MTBF is at the full 30 days (720 hours), the drill-down for the metric is disabled.

The general formula is the following:

Mean time between failures (MTBF) = (Total Working Time - Total Breakdown Time) / Number of Failures (P1 incident tickets)
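As a minimal sketch of the formula above, the following Python function computes MTBF in hours. The function name `mtbf_hours` and the input values are illustrative, not part of the product. Returning the full window when there are no failures is an assumption made here to avoid dividing by zero.

```python
def mtbf_hours(total_working_hours, total_breakdown_hours, p1_failures):
    """MTBF = (Total Working Time - Total Breakdown Time) / Number of Failures."""
    if p1_failures == 0:
        # Assumption: with no P1 incidents, report the full window.
        return total_working_hours
    return (total_working_hours - total_breakdown_hours) / p1_failures


# Illustrative values: 720-hour window, 0.13 hours of breakdown, 4 P1 incidents.
example_mtbf = mtbf_hours(720.0, 0.13, 4)
print(f"MTBF: {example_mtbf:.2f} hours")
```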

The following image shows IT Health indicators:

Business value and Benefits

MTBF helps in estimating the failure trends of the server or application.

Monitoring this metric closely and taking measures to increase MTBF helps avoid potential business impact for the account.

Metrics

This insight is measured over a rolling 30-day period (720 hours) and takes only P1 incidents into account, to keep the focus on high-priority, high-impact issues.

The hosts must be linked to a business application.

The general formula is the following: Mean time between failures (MTBF) = (Total Working Time - Total Breakdown Time) / Number of Failures (P1 incident tickets).

Overall MTBF, Uptime, and % Uptime Calculation (Sample Data)

Business Application	Server Name	Server Uptime (hrs)	Total Working Time (hrs; 720 = 30 days * 24 hrs)	% Uptime per server (Server Uptime / Total Working Time)	Number of Failures
App-1	server-1	719.87	720	0.9998 = Uptime1	4
App-1	server-2	719.63	720	0.9995 = Uptime2	4
App-1	server-3	719.93	720	0.9999 = Uptime3	2
App-1	server-4	719.93	720	0.9999 = Uptime4	2
App-1	server-5	719.93	720	0.9999 = Uptime5	1
App-1	server-6	365.05	720	0.5070 = Uptime6	1
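The per-server columns in the sample table can be reproduced with a short Python sketch. It assumes that Server MTBF is the server uptime divided by the number of P1 failures, which follows from the general formula because uptime is working time minus breakdown time. The names `SAMPLE` and `server_metrics` are illustrative.

```python
TOTAL_WORKING_HOURS = 720.0  # 30 days * 24 hours

# (server name, server uptime in hours, number of P1 failures) from the sample table
SAMPLE = [
    ("server-1", 719.87, 4),
    ("server-2", 719.63, 4),
    ("server-3", 719.93, 2),
    ("server-4", 719.93, 2),
    ("server-5", 719.93, 1),
    ("server-6", 365.05, 1),
]


def server_metrics(uptime_hours, failures, total=TOTAL_WORKING_HOURS):
    """Return (uptime fraction, server MTBF in hours) for one server."""
    uptime_fraction = uptime_hours / total
    # Assumption: a server with no failures reports the full window as MTBF.
    server_mtbf = uptime_hours / failures if failures else total
    return uptime_fraction, server_mtbf


rows = {name: server_metrics(uptime, failures) for name, uptime, failures in SAMPLE}
for name, (fraction, server_mtbf) in rows.items():
    print(f"{name}: uptime fraction {fraction:.4f}, server MTBF {server_mtbf:.2f} h")
```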

Calculation Method

KPI/Metric Name	KPI/Metric Description
MTBF (in Hours)	Overall MTBF for the application, based on the Total Uptime and the total number of P1 failures
Total Uptime (in hrs)	Total uptime for the application, calculated from the % Up Time and the Total Working Time
% Up Time	Product of the % uptime values of all the servers mapped to the business application
Number of Failures	Total number of P1 incident tickets for the application
Server MTBF	MTBF calculated for each server, based on its uptime and number of P1 failures
Server Uptime	Uptime for each server, calculated from the Total Working Time and the Total Breakdown Time. Note: overlapping tickets are counted once, so breakdown time is not double counted
Total Working Time	Total server available time for the 30-day rolling window, that is, 720 hours per server
Total Breakdown Time	Cumulative time between the open date (DD, HH:MM:SS) and the resolution date (DD, HH:MM:SS) of the P1 incident tickets. Note: overlapping tickets are counted once, so breakdown time is not double counted
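The note about overlapping tickets can be illustrated with a standard interval-merging approach. This is a sketch of one way to avoid double counting, not a description of the product's internal implementation. The function name and ticket timestamps are hypothetical, and the sketch does not clip tickets to the 30-day window or handle unresolved tickets.

```python
from datetime import datetime


def total_breakdown_hours(tickets):
    """Sum breakdown time over (opened, resolved) pairs, merging overlaps."""
    merged_seconds = 0.0
    current_start = current_end = None
    for opened, resolved in sorted(tickets):
        if current_end is None or opened > current_end:
            # Gap before this ticket: close the previous merged outage.
            if current_end is not None:
                merged_seconds += (current_end - current_start).total_seconds()
            current_start, current_end = opened, resolved
        else:
            # Ticket overlaps the current outage: extend it if needed.
            current_end = max(current_end, resolved)
    if current_end is not None:
        merged_seconds += (current_end - current_start).total_seconds()
    return merged_seconds / 3600.0


# Hypothetical P1 tickets for one server; the first two overlap by 30 minutes.
tickets = [
    (datetime(2024, 8, 1, 10, 0), datetime(2024, 8, 1, 11, 0)),
    (datetime(2024, 8, 1, 10, 30), datetime(2024, 8, 1, 11, 30)),
    (datetime(2024, 8, 5, 15, 0), datetime(2024, 8, 5, 15, 15)),
]
breakdown = total_breakdown_hours(tickets)
server_uptime = 720.0 - breakdown
print(f"Breakdown: {breakdown:.2f} h, server uptime: {server_uptime:.2f} h")
```

Adding the three ticket durations without merging would give 2.25 hours. Merging the two overlapping tickets into one 1.5-hour outage gives 1.75 hours.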

The Details table lists all the servers that have incidents, with their uptime and number of failures within the 30-day period. For P1 incident details, click the number of failures shown for a server.

Use Cases

Provides the MTBF and Uptime details for the application based on the servers mapped in the Business Application Mapping.
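The application-level roll-up described in the Calculation Method table can be sketched in Python with the App-1 sample data. It rests on two assumptions. First, the application's number of failures is taken as the sum of the per-server counts, which holds only if no ticket is shared between servers. Second, Total Uptime is taken as % Up Time multiplied by the 720-hour Total Working Time. The name `application_metrics` is illustrative.

```python
from math import prod

TOTAL_WORKING_HOURS = 720.0  # 30-day rolling window


def application_metrics(servers, total=TOTAL_WORKING_HOURS):
    """servers: list of (uptime_hours, p1_failures) for servers mapped to one application."""
    # % Up Time: product of the per-server uptime fractions.
    pct_uptime = prod(uptime / total for uptime, _ in servers)
    # Total Uptime: derived from % Up Time and Total Working Time.
    total_uptime = pct_uptime * total
    # Assumption: application failures are the sum of per-server P1 counts.
    failures = sum(count for _, count in servers)
    # Assumption: with no failures, report the full window as MTBF.
    mtbf = total_uptime / failures if failures else total
    return pct_uptime, total_uptime, failures, mtbf


# App-1 sample data: (server uptime in hours, number of P1 failures)
app1 = [(719.87, 4), (719.63, 4), (719.93, 2), (719.93, 2), (719.93, 1), (365.05, 1)]
pct_uptime, total_uptime, failures, mtbf = application_metrics(app1)
print(f"% Up Time: {pct_uptime:.4f}, Total Uptime: {total_uptime:.2f} h, "
      f"Failures: {failures}, MTBF: {mtbf:.2f} h")
```

Because % Up Time is a product, a single server with low uptime (server-6 in the sample) pulls the application figure down to roughly that server's level.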

If Uptime and MTBF are both high, failures are infrequent and tickets are resolved quickly, which reduces business impact.
