Explore actionable insights that help you understand cluster resource consumption and version maintenance, so that you can optimize your cloud service inventory.
Introduction
As a Site Reliability Engineer (SRE), you can use actionable insights to manage your container resources effectively. These insights identify clusters that may require updates because they run unsupported Kubernetes or OpenShift versions, or versions nearing end of support. You can also pinpoint containers that under-utilize or over-utilize resources by comparing actual consumption against the allocated Memory and CPU.
Enhanced Real-Time Insights
We have increased the frequency of our analysis to provide real-time insights. This improvement means that any adjustments to container resources are immediately reflected, enhancing your decision-making process. You can now utilize live data for more effective management, ensuring optimal resource performance.
Identifying Over and Under-Utilized Resources
For instance, if Actionable Insights indicates 165 containers with Memory under-utilized, the SRE can infer potential over-investment in memory resources and consider reducing this investment.
Calculations for Optimal Resource Utilization
The DevOps team establishes a baseline for Memory and CPU usage. The algorithm considers CPU request and limit, Memory request and limit, along with historical data, to make the following recommendations:
Optimal Memory Limit Recommendations for Containers:
Metrics include memory requests, limits, frequency of Out of Memory Kills, and usage history.
Action: SRE reviews and adjusts values as needed.
Optimal CPU Limit Recommendations for Containers:
Metrics include CPU requests, limits, and consumption history.
Action: SRE adjusts node sizing and plans VM migrations for load balancing.
Utilization Rules
Containers are underutilized if CPU or Memory usage is less than 50% of the requested amount.
Containers are overutilized if experiencing frequent Out of Memory issues or if Memory usage exceeds 90% of the set limit.
Upscaling is considered if CPU usage exceeds 90% of the set limit for significant periods.
Containers lacking CPU Limit or Request settings are flagged accordingly.
A minimum of 7 days of usage data is required for generating accurate insights. Container replicas are grouped under their deployment names for resource usage analysis.
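The utilization rules above can be sketched as a small classifier. This is an illustrative sketch only, not the product's implementation: the `ContainerUsage` record, its field names, and the flag strings are hypothetical, and the threshold of three Out of Memory kills for "frequent" is an assumption borrowed from the Memory over-utilized insight.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds taken from the utilization rules in the text.
MIN_DAYS = 7           # minimum days of usage data
UNDER_RATIO = 0.50     # usage below 50% of the request is underutilized
OVER_RATIO = 0.90      # usage above 90% of the limit is overutilized
FREQUENT_OOM = 3       # assumption: "frequent" means three or more OOM kills


@dataclass
class ContainerUsage:
    """Hypothetical usage record for one deployment (replicas already grouped)."""
    cpu_usage: float
    mem_usage: float
    cpu_request: Optional[float] = None
    cpu_limit: Optional[float] = None
    mem_request: Optional[float] = None
    mem_limit: Optional[float] = None
    oom_kills: int = 0
    days_of_data: int = 0


def classify(c: ContainerUsage) -> List[str]:
    """Return the list of flags that apply to one deployment."""
    if c.days_of_data < MIN_DAYS:
        return ["insufficient-data"]
    flags = []
    if c.cpu_request is None or c.cpu_limit is None:
        flags.append("cpu-request-or-limit-missing")
    if c.cpu_request and c.cpu_usage < UNDER_RATIO * c.cpu_request:
        flags.append("cpu-underutilized")
    if c.mem_request and c.mem_usage < UNDER_RATIO * c.mem_request:
        flags.append("memory-underutilized")
    mem_hot = bool(c.mem_limit) and c.mem_usage > OVER_RATIO * c.mem_limit
    if c.oom_kills >= FREQUENT_OOM or mem_hot:
        flags.append("memory-overutilized")
    if c.cpu_limit and c.cpu_usage > OVER_RATIO * c.cpu_limit:
        flags.append("cpu-upscale-candidate")
    return flags


example = ContainerUsage(cpu_usage=0.2, mem_usage=950, cpu_request=1.0,
                         cpu_limit=2.0, mem_request=1000, mem_limit=1000,
                         days_of_data=10)
print(classify(example))
```

In this example the deployment uses 20% of its CPU request and 95% of its memory limit, so it is flagged both as CPU underutilized and as Memory overutilized.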
Actionable Insights page
Select any tile within the Actionable Insights widget on the CCM dashboard to navigate to the Actionable Insights page. The data displayed on this page depends on which tile you selected from the CCM dashboard. If, for example, you selected Memory Underutilized, the page will display data associated with resources that have been designated as Memory Underutilized. The page contains the following elements:
Breadcrumbs for efficient navigation.
Header to identify your current location in the Kyndryl portal.
A Dashboard view that includes a filter option for Applications, Providers, Connections, Environments, Insights Type, and Insights Category.
The Summary section displays a group of tiles based on available insights.
An insights table that shows data for the Actionable Insights tile selected in the Summary section above.
Supported Insights
Insights for Containers and Clusters running Kubernetes or OpenShift versions
Insights provide the number of Containers that are over-utilized and under-utilized, as well as clusters nearing the end of active support or with only maintenance support. These insights include:
Containers Running out of Memory:
Shows the list of containers that are going to run out of memory in the near future. Inference: Upscale memory prior to the forecast date.
Containers Running out of CPU:
Shows the list of containers that are going to run out of CPU in the near future. Inference: Upscale CPU prior to the forecast date.
Containers with minimal CPU & Memory utilized:
CPU and memory usage is less than 10% of what is requested. Inference: Downscale CPU and memory.
Containers without memory limit:
Memory limit is not set for the corresponding deployment. Inference: Set memory limit.
Containers without CPU request and limit:
CPU request and limit are not set for the corresponding deployment. Inference: Set CPU request and limit.
Containers without memory request and limit:
Memory request and limit are not set. Inference: Set memory request and limit.
Containers with CPU over-utilized:
Shows the list of containers that consume 90% of the set CPU limits for more than 80% of the set time (per week or per month). Inference: Upscale CPU resources.
Containers with CPU under-utilized:
Shows the list of containers with CPU usage less than 50% of the set request. Inference: Downscale CPU resources.
Containers with Memory over-utilized:
Shows the list of containers that have three or more OOM (Out of Memory) errors in one week and memory usage greater than 90% of the set limit. Inference: Upscale memory resources.
Containers with Memory under-utilized:
Shows the list of containers with memory usage less than 50% of the set request. Inference: Downscale memory resources.
Pods with CrashLoopBackOff:
Shows the number of pods in the cluster that are repeatedly crashing after attempting to start. A "CrashLoopBackOff" status indicates that a pod is failing to run successfully and is automatically restarting due to errors. Inference: Investigate failure to start for the corresponding deployment and take corrective action.
Pods restarted:
Shows the list of pods that were restarted in the last 24 hours.
Pods with pending state:
Shows the list of pods which are in pending state.
Pods with Singleton design:
Shows the list of pods which are designed to run as a single instance across the cluster.
Clusters K8s/OCP versions without support:
Shows the clusters running obsolete versions of Kubernetes or OpenShift.
Clusters K8s/OCP versions having only maintenance support:
Shows the clusters running versions of Kubernetes or OpenShift that have maintenance support only.
Clusters K8s/OCP versions nearing end of active support:
Shows the number of clusters in which active support for Kubernetes and OpenShift versions will end by a specified date. Inference: Update license before the specified date.
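The CPU over-utilized and Memory over-utilized insights listed above combine a usage threshold with a time or event condition. The following is a minimal sketch of how such rules might be evaluated, assuming usage is available as a series of evenly spaced samples over the analysis window (a week or a month). The function names and parameters are hypothetical and do not reflect the product's actual implementation.

```python
from typing import Sequence


def cpu_overutilized(samples: Sequence[float], cpu_limit: float,
                     usage_ratio: float = 0.90,
                     time_ratio: float = 0.80) -> bool:
    """True if usage is at or above 90% of the limit for more than 80% of samples.

    Assumes the samples are evenly spaced, so the share of samples
    approximates the share of time in the window.
    """
    if not samples or cpu_limit <= 0:
        return False
    hot = sum(1 for s in samples if s >= usage_ratio * cpu_limit)
    return hot / len(samples) > time_ratio


def memory_overutilized(oom_errors_in_week: int, mem_usage: float,
                        mem_limit: float) -> bool:
    """True if there are three or more OOM errors in a week and usage exceeds 90% of the limit."""
    return oom_errors_in_week >= 3 and mem_usage > 0.90 * mem_limit


# Nine of ten samples are above 90% of a 1-core limit, which is more than 80% of the window.
busy_week = [0.95] * 9 + [0.10]
print(cpu_overutilized(busy_week, cpu_limit=1.0))
print(memory_overutilized(oom_errors_in_week=3, mem_usage=950, mem_limit=1000))
```

Counting samples rather than averaging usage keeps short idle periods from masking a container that is saturated for most of the window.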