“Why are my pods not being scheduled on the node? Why is my pod always OOMKilled and restarting? How can I tell how much room I still have for a new application without driving up the cluster cost?” Do these questions sound familiar? Don't lose hope, we've been there too.
In this article, we've summarized our experience to help you better understand what is going on and get you started with observability. After reading it, you will understand how to monitor memory usage on Google Kubernetes Engine (GKE) nodes, especially to avoid common issues like pods not scheduling or being OOMKilled (Out Of Memory Killed).
Memory allocation in Google Kubernetes Engine
As an example, we will use Google Cloud Platform (GCP). If we create a Kubernetes node pool with the n2d-standard-2 machine type (a virtual machine, or VM, with a nominal 8 GB of memory that actually shows up as 8.34 GB), the question is: how much of that memory can we use for our pods?
GKE reserves part of the VM's memory for its own components, and you should account for that, but exactly how much it takes is not obvious at first glance.
For memory resources, Google Kubernetes Engine (GKE) reserves the following:
- 255 MiB of memory for machines with less than 1 GB of memory
- 25% of the first 4 GB of memory
- 20% of the next 4 GB of memory (up to 8 GB)
- 10% of the next 8 GB of memory (up to 16 GB)
- 6% of the next 112 GB of memory (up to 128 GB)
- 2% of any memory above 128 GB
Calculating allocatable memory for your pods
If we apply this information to our n2d-standard-2 machine, we’ll have:
Reserved_memory_for_kubelet = 0.25 * 4 (first 4 GB) + 0.2 * 4 (next 4 GB) = 1 + 0.8 = 1.8 GB
So a total of 1.8 GB of memory is reserved for the kubelet.
Continuing with the calculation, we get the amount of memory you can use for your pods:
Allocatable_memory = 8.34 − 1.8 (Reserved_memory_for_kubelet) − 0.1 (memory reserved for OS) − 0.1 (hard eviction threshold) = 6.34 GB
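You don't have to rely only on this back-of-the-envelope math. If your cluster is scraped by the Prometheus stack we set up below and it includes kube-state-metrics (most standard installations do), the allocatable value that the Kubernetes API reports for each node is already exported as a metric (in kube-state-metrics v2 it is kube_node_status_allocatable with a resource label). A quick sanity check:
## Allocatable memory as reported by the Kubernetes API (GB)
kube_node_status_allocatable{resource="memory"} / 10^9
The result should land close to the 6.34 GB we just calculated by hand.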
Managing resource requests and limits
When deploying a pod to a Kubernetes cluster, you should specify resource requests and limits for it. Let’s say we’ve set our pod’s memory request to 0.3 GB. If the request is smaller than the node’s remaining allocatable memory, the kube-scheduler will place the pod on that node and the pod will start. This, of course, reduces the node’s remaining allocatable memory (6.34 − 0.3 = 6.04 GB).
Specifying requests is crucial because the kube-scheduler uses them to decide which node your pod fits on, and the requested memory is then reserved for the pod. Proper configuration leads to a stable cluster; improper configuration does the opposite.
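As a quick illustration of why this matters, here is one way to list containers that have no memory request at all and therefore give the scheduler nothing to work with. This is just a sketch, and it assumes kube-state-metrics is being scraped by the Prometheus setup described below:
## Containers running without a memory request
kube_pod_container_info unless on (namespace, pod, container) kube_pod_container_resource_requests{resource="memory"}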
Possible scenario: after deploying several pods to our k8s cluster, the last pod can’t be scheduled and stays in a Pending state. This happens because the node has less allocatable memory left than the pod requests, so the kube-scheduler can’t place it anywhere.
How can we anticipate and avoid this situation? We need to know how much memory is left for requests, how much memory our apps are actually using, and what to do about it.
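You can spot this symptom directly in Prometheus. A minimal query, again assuming kube-state-metrics is available, lists the pods currently stuck in Pending:
## Pods currently stuck in the Pending state
kube_pod_status_phase{phase="Pending"} == 1
The pod’s events (visible with kubectl describe pod) will then usually tell you whether the reason is insufficient memory on the nodes.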
Using Prometheus and Grafana for memory monitoring
To get a closer look, we will use the Prometheus and Grafana stack. Since we’ve already calculated the node’s allocatable memory (6.34 GB) on paper, let’s express that number in PromQL:
## Allocatable memory for each node (GB)
(sum by (instance) (node_memory_MemTotal_bytes) / 10^9 - 2.01) + on(instance) group_left(nodename) (0 * node_uname_info)
It may look scary, but let's break it down for ease of understanding:
node_memory_MemTotal_bytes – the total amount of memory on the node in bytes; it carries an instance label (the IP address) rather than the node name (for convenience, we convert it to GB).
2.01 GB – the total amount of memory we can't use, as we calculated earlier.
+ on(instance) – joins the two metrics on their shared instance label.
group_left(nodename) (0 * node_uname_info) – copies the nodename label from node_uname_info onto the node_memory_MemTotal_bytes result (multiplying by 0 means the join only adds the label without changing the value).
This way, we have calculated our allocatable memory per node and it looks like this:
{instance="<IP_ADDRESS>", nodename="<NAME>"} -------------- 6.34
Queries for monitoring memory usage
Now we want to know how much memory each pod has requested:
## Total Requested memory for each node by all apps (GiB)
sum (kube_pod_container_resource_requests {resource="memory", node=~".+"} + on(pod) group_left(phase) (kube_pod_status_phase {phase="Running"} == 1)) by (node) / (1024^3)
The same logic applies here as before. Let’s focus on kube_pod_status_phase {phase="Running"} == 1 – this filter keeps only the pods that are actually in the Running state.
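Subtracting the requested memory from the allocatable memory also gives you the absolute headroom left for new pods on each node. A simplified sketch, assuming kube-state-metrics is available (for brevity it skips the Running-phase filter, so requests of completed or pending pods are counted too):
## Remaining memory available for new requests per node (GB)
(sum by (node) (kube_node_status_allocatable{resource="memory"}) - sum by (node) (kube_pod_container_resource_requests{resource="memory"})) / 10^9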
From the allocatable-memory and requested-memory queries above, we can build another one to calculate the percentage of requested memory per node:
The basic formula is: (1 − (allocatable_memory − requested_memory) / allocatable_memory) * 100. With the numbers from our example (0.3 GB requested out of 6.34 GB allocatable), that works out to (1 − (6.34 − 0.3) / 6.34) * 100 ≈ 4.7%.
And here it is in full:
## Requested memory (%)
(1 - (((avg by (instance) (node_memory_MemTotal_bytes - 2.01 * 10^9) + on(instance) group_left(node) (0 * label_replace(node_uname_info, "node", "$1", "nodename", "(.*)"))) - on(node) (sum (kube_pod_container_resource_requests {resource="memory"} + on(pod) group_left(phase) (kube_pod_status_phase {phase="Running"} == 1)) by (node))) / 6331012480)) * 100
"node", "$1", "nodename", "(.*)" – this is how we change the label name from “nodename” to “node” so we can subtract one metric from another.
Now let’s find out the actual memory usage on the nodes. This is much easier:
## Memory utilization per node (GiB)
sum (container_memory_working_set_bytes {container!=""} / (1024^3)) by (node)
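To see which workloads are responsible for that usage, you can break the same metric down per pod. A small sketch (the limit of 10 is an arbitrary choice):
## Top 10 memory-consuming pods (GiB)
topk(10, sum by (namespace, pod) (container_memory_working_set_bytes{container!=""}) / (1024^3))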
And finally, the per-node utilization as a percentage:
## Memory utilization per node (%)
(1 - ((avg by (instance) (node_memory_MemTotal_bytes - 2.01 * 10^9) + on(instance) group_left(node) (0 * label_replace(node_uname_info, "node", "$1", "nodename", "(.*)"))) - on(node) sum (container_memory_working_set_bytes {container!=""}) by (node)) / 6331012480) * 100
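Comparing the two views – requested versus actually used – is also a quick way to find over-requested memory, which is where unnecessary cluster cost usually hides. A rough sketch, assuming both kube-state-metrics and the cAdvisor metrics above are available:
## Requested but unused memory per node (GiB)
(sum by (node) (kube_pod_container_resource_requests{resource="memory"}) - sum by (node) (container_memory_working_set_bytes{container!=""})) / (1024^3)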
Navigating the complexities of Kubernetes memory allocation can feel like a daunting task, but with the right tools and insights, it becomes manageable. By leveraging Prometheus and Grafana, you can monitor memory usage effectively, ensuring your GKE clusters remain stable and cost-efficient. Remember, a well-monitored system not only helps in predicting memory usage but also prevents common issues like pods not scheduling or being OOMKilled.
As you refine your monitoring strategies, you’ll find it easier to allocate resources wisely, avoid unexpected downtimes, and ultimately, enhance the reliability of your applications.
Happy monitoring!