Authors: Dixita Narang (Google)
Kubernetes v1.27, released in April 2023, introduced changes to Memory QoS (alpha) to improve memory management capabilities in Linux nodes.
Support for Memory QoS was initially added in Kubernetes v1.22, and later some limitations around the formula for calculating memory.high were identified. These limitations are addressed in Kubernetes v1.27.
Background
Kubernetes allows you to optionally specify how much of each resource a container needs in the Pod specification. The most common resources to specify are CPU and memory.
For example, a Pod manifest that defines container resource requirements could look like:
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "64Mi"
        cpu: "500m"
In this example:
- spec.containers[].resources.requests specifies the amount of each resource the container requests; the Kubernetes scheduler uses these values when deciding which node to place the Pod on.
- spec.containers[].resources.limits specifies the maximum amount of each resource the container is allowed to use; the kubelet and the container runtime enforce these limits.
When the kubelet starts a container as part of a Pod, the kubelet passes the container's requests and limits for CPU and memory to the container runtime. The container runtime assigns both the CPU request and the CPU limit to the container. Provided the system has free CPU time, containers are guaranteed to be allocated as much CPU as they request. Containers cannot use more CPU than the configured limit; that is, a container's CPU usage will be throttled if it uses more CPU than the specified limit within a given time slice.
Prior to the Memory QoS feature, the container runtime only used the memory limit and discarded the memory request (requests were, and still are, also used to influence scheduling). If a container uses more memory than the configured limit, the Linux Out Of Memory (OOM) killer will be invoked.
Let's compare how the container runtime on Linux typically configures memory request and limit in cgroups, with and without the Memory QoS feature:
- Memory request: without Memory QoS, the request is not mapped to any cgroup interface and only influences scheduling; with Memory QoS, it is additionally mapped to memory.min so that requested memory is protected from reclaim.
- Memory limit: with or without Memory QoS, the limit is enforced through the cgroup memory limit (memory.max on cgroups v2); with Memory QoS, memory.high is additionally set to throttle a container as its usage approaches the limit.
How it works
Cgroups v2 memory controller interfaces & Kubernetes container resources mapping
Memory QoS uses the memory controller of cgroups v2 to guarantee memory resources in Kubernetes. The cgroups v2 interfaces that this feature uses are:
- memory.max
- memory.min
- memory.high
memory.max is mapped to limits.memory specified in the Pod spec. The kubelet and the container runtime configure the limit in the respective cgroup. The kernel enforces the limit to prevent the container from using more than the configured resource limit. If a process in a container tries to consume more than the specified limit, the kernel terminates the process(es) with an Out of Memory (OOM) error.
memory.max maps to limits.memory
memory.min is mapped to requests.memory, which results in reservation of memory resources that should never be reclaimed by the kernel. This is how Memory QoS ensures the availability of memory for Kubernetes pods. If there's no unprotected reclaimable memory available, the OOM killer is invoked to make more memory available.
memory.min maps to requests.memory
For memory protection, in addition to the original way of limiting memory usage, Memory QoS throttles workloads approaching their memory limit, ensuring that the system is not overwhelmed by sporadic increases in memory usage. A new field, memoryThrottlingFactor, is available in the KubeletConfiguration when you enable the MemoryQoS feature. It is set to 0.9 by default. memory.high is mapped to a throttling limit calculated using memoryThrottlingFactor, requests.memory and limits.memory as in the formula below, rounding down the value to the nearest page size:
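Per KEP-2570, that formula is:

$memory.high = floor\Big[\frac{requests.memory + memoryThrottlingFactor \times (limits.memory - requests.memory)}{pageSize}\Big] \times pageSize$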
Note: If a container has no memory limit specified, node allocatable memory is used in place of limits.memory in the formula.
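As a hypothetical worked example (assuming the default memoryThrottlingFactor of 0.9 and a 4 KiB page size): for a container with requests.memory = 64Mi (67108864 bytes) and limits.memory = 128Mi (134217728 bytes), the throttling limit is 67108864 + 0.9 × (134217728 − 67108864) = 127506841.6 bytes, which rounds down to the nearest page to give memory.high = 127504384 bytes (roughly 121.6 MiB).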
Summary:

File | Description |
---|---|
memory.max | memory.max specifies the maximum memory limit a container is allowed to use. If a process within the container tries to consume more memory than the configured limit, the kernel terminates the process with an Out of Memory (OOM) error. It is mapped to the container's memory limit specified in the Pod manifest. |
memory.min | memory.min specifies a minimum amount of memory the cgroup must always retain, i.e. memory that should never be reclaimed by the system. If there's no unprotected reclaimable memory available, an OOM kill is invoked. It is mapped to the container's memory request specified in the Pod manifest. |
memory.high | memory.high specifies the memory usage throttle limit. This is the main mechanism to control a cgroup's memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup's processes are throttled and put under heavy reclaim pressure. Kubernetes uses a formula to calculate memory.high, depending on the container's memory request, memory limit or node allocatable memory (if the container's memory limit is empty) and a throttling factor. Please refer to the KEP for more details on the formula. |
Note: memory.high is set only on container level cgroups while memory.min is set on container, pod, and node level cgroups.
memory.min calculations for cgroups hierarchy
When container memory requests are made, the kubelet passes memory.min to the back-end CRI runtime (such as containerd or CRI-O) via the Unified field in CRI during container creation. The memory.min in container level cgroups will be set to:
$memory.min = pod.spec.containers[i].resources.requests[memory]$
for every ith container in a pod
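As a rough illustration, using the same hypothetical container as above (requests.memory = 64Mi, limits.memory = 128Mi, default throttling factor), the kubelet would hand the runtime values along these lines. The YAML rendering below is only a sketch of the CRI LinuxContainerResources message, not its literal wire format, and the byte values are illustrative assumptions:

linux:
  resources:
    memoryLimitInBytes: 134217728   # limits.memory (128Mi), enforced as memory.max
    unified:
      memory.min: "67108864"        # requests.memory (64Mi), reserved via memory.min
      memory.high: "127504384"      # throttling limit from the formula above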
Since the memory.min interface requires that the ancestor cgroup directories are all set, the pod and node cgroup directories need to be set correctly.
memory.min in pod level cgroup:
$memory.min = \sum_{i=0}^{no. of containers}pod.spec.containers[i].resources.requests[memory]$
for every ith container in the pod
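For example, a hypothetical pod with two containers requesting 64Mi and 128Mi of memory respectively would get memory.min = 67108864 + 134217728 = 201326592 bytes (192Mi) in its pod level cgroup.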
memory.min in node level cgroup:
$memory.min = \sum_{i}^{no. of pods}\sum_{j}^{no. of containers}pod[i].spec.containers[j].resources.requests[memory]$
for every jth container in every ith pod running on the node
Kubelet will manage the cgroups hierarchy of the pod level and node level cgroups directly using the libcontainer library (from the runc project), while container cgroups limits are managed by the container runtime.
Support for Pod QoS classes
Based on user feedback for the Alpha feature in Kubernetes v1.22, some users would like to opt out of Memory QoS on a per-pod basis to ensure there is no early memory throttling. Therefore, in Kubernetes v1.27 Memory QoS also supports setting memory.high according to the Quality of Service (QoS) class of a Pod. Following are the different cases for memory.high as per QoS class:
Guaranteed pods, by their QoS definition, require memory requests to equal memory limits and are not overcommitted. Hence the Memory QoS feature is disabled on those pods by not setting memory.high. This ensures that Guaranteed pods can fully use their memory requests up to their set limit, and not hit any throttling.

Burstable pods, by their QoS definition, require at least one container in the Pod with a CPU or memory request or limit set. For these pods, memory.high is calculated using the formula above; if a container has no memory limit, node allocatable memory is used in place of limits.memory.

BestEffort pods, by their QoS definition, do not require any memory or CPU limits or requests. For this case, Kubernetes sets requests.memory = 0 and uses node allocatable memory in place of limits.memory in the formula:
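Substituting those values into the general formula gives:

$memory.high = floor\Big[\frac{memoryThrottlingFactor \times nodeAllocatableMemory}{pageSize}\Big] \times pageSize$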
Summary: Only Pods in Burstable and BestEffort QoS classes will set memory.high. Guaranteed QoS pods do not set memory.high as their memory is guaranteed.
How do I use it?
The prerequisites for enabling the Memory QoS feature on your Linux node are:
- Verify the requirements related to Kubernetes support for cgroups v2 are met.
- Ensure the CRI runtime supports Memory QoS. At the time of writing, only containerd and CRI-O provide support compatible with Memory QoS (alpha). This was implemented in the following PRs:
Memory QoS remains an alpha feature for Kubernetes v1.27. You can enable the feature by setting MemoryQoS=true in the kubelet configuration file:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
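If you also want to tune the throttling behaviour, the memoryThrottlingFactor field mentioned earlier can be set in the same file. A minimal sketch (0.9 shown below is simply the default value):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
memoryThrottlingFactor: 0.9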
How do I get involved?
Huge thank you to all the contributors who helped with the design, implementation, and review of this feature:
- Dixita Narang (ndixita)
- Tim Xu (xiaoxubeii)
- Paco Xu (pacoxu)
- David Porter (bobbypage)
- Mrunal Patel (mrunalp)
For those interested in getting involved in future discussions on the Memory QoS feature, you can reach out to SIG Node by several means: