Cloud Services

Container Cluster Management

Actionable insights details
Published On Jun 11, 2024 - 8:10 AM

Container Cluster Management provides actionable insights that help you compare your cloud's actual resource consumption with planned consumption, so that you can maintain a balanced cloud service inventory.

Introduction

As a Site Reliability Engineer (SRE), you can use actionable insights to manage your container resources effectively. These insights help you identify clusters that may require updates because they run unsupported Kubernetes or OpenShift versions, or versions nearing the end of support. You can also pinpoint containers that under-utilize or over-utilize resources by comparing actual consumption against the allocated memory and CPU.
Enhanced Real-Time Insights
We have increased the frequency of our analysis to provide real-time insights. This improvement means that any adjustments to container resources are immediately reflected, enhancing your decision-making process. You can now utilize live data for more effective management, ensuring optimal resource performance.
Identifying Over and Under-Utilized Resources
For instance, if Actionable Insights reports 165 containers as Memory Underutilized, the SRE can infer a potential over-investment in memory resources and consider reducing it.
Calculations for Optimal Resource Utilization
The DevOps team establishes a baseline for Memory and CPU usage. The algorithm considers CPU request and limit, Memory request and limit, along with historical data, to make the following recommendations:
  1. Optimal Memory Limit Recommendations for Containers:
    • Metrics include memory requests, limits, frequency of Out of Memory Kills, and usage history.
    • Action: SRE reviews and adjusts values as needed.
  2. Optimal CPU Limit Recommendations for Containers:
    • Metrics include CPU requests, limits, and consumption history.
    • Action: SRE adjusts node sizing and plans VM migrations for load balancing.
Utilization Rules
  • Containers are underutilized if CPU or Memory usage is less than 50% of the requested amount.
  • Containers are overutilized if experiencing frequent Out of Memory issues or if Memory usage exceeds 90% of the set limit.
  • Upscaling is considered if CPU usage exceeds 90% of the set limit for significant periods.
  • Containers lacking CPU Limit or Request settings are flagged accordingly.
A minimum of 7 days of usage data is required for generating accurate insights. Container replicas are grouped under their deployment names for resource usage analysis.
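The utilization rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the product's implementation: the `ContainerUsage` record, its field names, the `classify` function, and the finding labels are hypothetical. Usage figures are assumed to be averages over the observation window, which simplifies the "significant periods" wording, and the rules do not quantify "frequent" Out of Memory issues, so the value 3 is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_DAYS_OF_DATA = 7      # from the text: at least 7 days of usage data
FREQUENT_OOM_KILLS = 3    # assumption: the rules do not quantify "frequent"


@dataclass
class ContainerUsage:
    """Hypothetical usage record for one deployment (replicas grouped)."""
    deployment: str
    days_of_data: int
    cpu_usage: float                      # average cores used
    memory_usage: float                   # average MiB used
    cpu_request: Optional[float] = None
    cpu_limit: Optional[float] = None
    memory_request: Optional[float] = None
    memory_limit: Optional[float] = None
    oom_kills: int = 0


def classify(c: ContainerUsage) -> List[str]:
    """Return the findings that apply to one deployment."""
    findings: List[str] = []

    # Missing CPU settings are flagged even without usage history
    # (assumption: the text does not say whether history is required).
    if c.cpu_limit is None:
        findings.append("cpu-limit-missing")
    if c.cpu_request is None:
        findings.append("cpu-request-missing")

    # Usage-based rules need at least 7 days of data.
    if c.days_of_data < MIN_DAYS_OF_DATA:
        findings.append("insufficient-data")
        return findings

    # Underutilized: usage below 50% of the requested amount.
    if c.cpu_request is not None and c.cpu_usage < 0.5 * c.cpu_request:
        findings.append("cpu-underutilized")
    if c.memory_request is not None and c.memory_usage < 0.5 * c.memory_request:
        findings.append("memory-underutilized")

    # Overutilized: frequent OOM issues, or memory above 90% of the limit.
    memory_over = c.memory_limit is not None and c.memory_usage > 0.9 * c.memory_limit
    if c.oom_kills >= FREQUENT_OOM_KILLS or memory_over:
        findings.append("memory-overutilized")

    # Upscaling candidate: CPU above 90% of the limit.
    if c.cpu_limit is not None and c.cpu_usage > 0.9 * c.cpu_limit:
        findings.append("cpu-upscale-candidate")

    return findings
```

Because replicas are grouped under their deployment name, one record per deployment is passed to `classify`, for example `classify(ContainerUsage("web", 10, cpu_usage=0.1, memory_usage=950, cpu_request=0.5, cpu_limit=1.0, memory_request=1000, memory_limit=1000))`.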
Actionable Insights page
Select any tile within the Actionable Insights widget on the CCM dashboard to navigate to the Actionable Insights page. The data shown on the page depends on which tile you selected. If, for example, you selected Memory Underutilized, the page displays data for resources that have been designated as Memory Underutilized. The page contains the following elements:
  • Breadcrumbs for efficient navigation.
  • Header to identify your current location in the Kyndryl portal.
  • A Dashboard view that includes an Application filter, a Provider and Connection filter, and an insight category filter.
  • The Summary section displays a group of tiles based on available insights.
  • An insights table that provides data generated based on the Actionable insights tile selected from the Summary section above.
Insights for Containers and Clusters running Kubernetes or OpenShift versions
Insights provide the number of containers that are over-utilized and under-utilized, as well as clusters nearing the end of active support or with only maintenance support, as follows:
  • Memory over-utilized Containers:
    Shows the list of containers that had three or more OOM (Out of Memory) errors in one week and whose memory usage is greater than 90% of the set limit. Inference: Upscale memory resources.
  • Memory under-utilized Containers:
    Shows the list of containers with memory usage less than 50% of the set request. Inference: Downscale memory resources.
  • CPU over-utilized Containers:
    Shows the list of containers that consume at least 90% of the set CPU limit for more than 80% of the analysis period (a week or a month). Inference: Upscale CPU resources.
  • CPU under-utilized Containers:
    Shows the list of containers with CPU usage less than 50% of the set Request. Inference: Downscale CPU resources.
  • Containers Running out of CPU:
    Shows the list of containers that are forecast to run out of CPU in the near future. Inference: Upscale CPU prior to the forecast date.
  • Containers Running out of Memory:
    Shows the list of containers that are forecast to run out of memory in the near future. Inference: Upscale memory prior to the forecast date.
  • Containers without CPU limit:
    CPU limit is not set for the corresponding deployment. Inference: Set CPU limit.
  • Containers without memory limit:
    Memory limit is not set for the corresponding deployment. Inference: Set memory limit.
  • Containers without memory request and limit:
    Memory request and limit are not set. Inference: Set memory request and limit.
  • Containers with minimal CPU & Memory utilized:
    CPU and memory usage is less than 10% of what is requested. Inference: Downscale CPU and memory.
  • K8s/OCP versions without support:
    Shows the clusters running obsolete versions of Kubernetes or OpenShift.
  • K8s/OCP versions having only maintenance support:
    Shows the clusters running versions of Kubernetes or OpenShift that have maintenance support only.
  • Clusters K8s/OCP versions nearing end of active support:
    Shows the number of clusters whose Kubernetes or OpenShift version reaches the end of active support by a specified date. Inference: Update license before the specified date.
  • CrashLoopBackOff:
    Shows the number of pods in the cluster that are repeatedly crashing after attempting to start. A "CrashLoopBackOff" status indicates that a pod is failing to run successfully and is automatically restarting due to errors. Inference: Investigate why the corresponding deployment fails to start and take corrective action.
  • Pods restarted:
    Shows the list of Pods that were restarted in the last 24 hours.
  • Pods with pending state:
    Shows the list of Pods that are in the Pending state.
  • Pods with Singleton design:
    Shows the list of Pods that are designed to run as a single instance across the cluster.
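The time-based and combined thresholds in the list above can be illustrated with a short Python sketch. It assumes evenly spaced usage samples over the analysis window (a week or a month) and treats "90% of the set CPU limit" as "at least 90%". The function names and the sample format are hypothetical and are not part of the CCM product or any API.

```python
from typing import Sequence


def is_cpu_overutilized(cpu_samples: Sequence[float], cpu_limit: float,
                        usage_threshold: float = 0.9,
                        time_threshold: float = 0.8) -> bool:
    """True if usage reaches 90% of the CPU limit in more than 80% of the samples."""
    if not cpu_samples or cpu_limit <= 0:
        return False  # no data or no limit set: nothing to evaluate
    hot = sum(1 for s in cpu_samples if s >= usage_threshold * cpu_limit)
    return hot / len(cpu_samples) > time_threshold


def is_memory_overutilized(oom_kills_last_week: int,
                           memory_usage: float, memory_limit: float) -> bool:
    """True if there were three or more OOM errors in one week AND
    memory usage is greater than 90% of the set limit."""
    return oom_kills_last_week >= 3 and memory_usage > 0.9 * memory_limit


def is_minimally_utilized(cpu_usage: float, cpu_request: float,
                          memory_usage: float, memory_request: float) -> bool:
    """True if both CPU and memory usage are below 10% of what is requested."""
    return cpu_usage < 0.1 * cpu_request and memory_usage < 0.1 * memory_request
```

For example, `is_cpu_overutilized([0.95] * 9 + [0.2], cpu_limit=1.0)` evaluates to `True`, because nine of the ten samples reach the 90% threshold, whereas eight of ten would not exceed the 80% time threshold.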