On Kubernetes CPU Limits

Kubernetes has a widely used feature that allows you to specify both memory and CPU limits on pods you create. This feature seems simple enough on its surface, but how it works under the hood might surprise you!

Enforcement of memory limits is straightforward: if a container exceeds its memory limit, it is OOM-killed, restarted according to the pod's restart policy, and Kubernetes moves on.

Enforcement of CPU limits ends up being a bit trickier, because Kubernetes does not terminate pods for exceeding them. This makes sense: you can rework your code to use less memory, but there aren't many ways to make it consume CPU less eagerly. Once a process starts a compute-intensive task, it will use every cycle it can get, so killing it for doing so wouldn't accomplish anything. We need a different approach to effectively limit CPU.

Kubernetes solves this with a feature of the Linux kernel's Completely Fair Scheduler (CFS) called bandwidth control, which throttles the CPU usage of a group of processes (a cgroup) rather than killing them.

CFS bandwidth control has two key settings, a quota and a period, configured per cgroup. The quota defines the maximum amount of CPU time the group may consume within each period; once that quota is exhausted, the group is throttled until the next period begins and the quota is replenished. So, for example, if a cgroup has a quota of 100ms and a period of 200ms, its processes can use up to 100ms of CPU time within each 200ms window, then must wait out the rest of the window before using any more CPU cycles.
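To make the arithmetic concrete, here is a rough sketch (ordinary Python, not anything Kubernetes ships) of how long a single-threaded, CPU-bound task takes to finish under a given quota and period. It assumes the task runs alone and starts at the beginning of a period, and it ignores scheduler details:

```python
import math

def wall_clock_ms(cpu_needed_ms: float, quota_ms: float, period_ms: float) -> float:
    """Estimate wall-clock time for a single-threaded, CPU-bound task
    under CFS bandwidth control. Assumes the task runs alone and starts
    at the beginning of a period; ignores scheduler overhead."""
    # Periods in which the task burns its entire quota, before the final
    # partial period in which it finishes.
    full_periods = math.ceil(cpu_needed_ms / quota_ms) - 1
    remainder_ms = cpu_needed_ms - full_periods * quota_ms
    return full_periods * period_ms + remainder_ms

# A task needing 250ms of CPU under a 100ms quota and 200ms period:
# run 100ms, wait 100ms, run 100ms, wait 100ms, run the last 50ms.
print(wall_clock_ms(250, quota_ms=100, period_ms=200))  # 450.0
```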

Kubernetes translates the CPU limit on your pod into a quota by multiplying the limit (measured in CPUs) by the CFS period. So, if the CFS period is one second and your pod has a CPU limit of 250m, the quota for your pod would be 250ms. If the period were 500ms, your pod's quota would be 125ms, and so on. The Kubernetes blog has an excellent writeup of how it works. (The default period is actually 100ms, and it can be overridden as of Kubernetes 1.12.)
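As a minimal sketch of that conversion (plain Python, not the kubelet's actual code), assuming the limit is given in millicores and the period in microseconds, which is what the cgroup interface expects:

```python
def cfs_quota_us(cpu_limit_millicores: int, cfs_period_us: int = 100_000) -> int:
    """Mirror the limit -> quota conversion described above:
    quota = CPU limit (in whole CPUs) * CFS period.
    Illustration only, not the kubelet's actual implementation."""
    return cpu_limit_millicores * cfs_period_us // 1000

print(cfs_quota_us(250))                           # 25000us  = 25ms per 100ms period
print(cfs_quota_us(250, cfs_period_us=1_000_000))  # 250000us = 250ms per 1s period
print(cfs_quota_us(1500))                          # 150000us = 1.5 CPUs' worth per period
```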

Make sense?

Ok, but it's exceedingly likely that you should turn it off.

First, CFS throttling is broken. Processes can be throttled too early, and quotas are sometimes not properly replenished when the period expires. The causes of these issues are unclear, but they show up across many kernel versions and distributions. Edit: this bug appears to have been fixed in kernel version 4.18.18. Thanks to /u/andor44 on Reddit for pointing this out.
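If you want to check whether your own workloads are being throttled, the kernel exposes counters in each cgroup's cpu.stat file. The sketch below reads them; the path is an assumption and will differ depending on cgroup version (v1 vs v2), your container runtime, and whether you run it inside the container or on the host:

```python
from pathlib import Path

# Assumed path: inside a container on cgroup v1 this is commonly
# /sys/fs/cgroup/cpu/cpu.stat; on cgroup v2 it is /sys/fs/cgroup/cpu.stat.
# Adjust for your environment.
CPU_STAT = Path("/sys/fs/cgroup/cpu/cpu.stat")

stats = {}
for line in CPU_STAT.read_text().splitlines():
    key, value = line.split()
    stats[key] = int(value)

# nr_periods: enforcement periods that have elapsed.
# nr_throttled: periods in which the cgroup hit its quota and was throttled.
nr_periods = stats.get("nr_periods", 0)
nr_throttled = stats.get("nr_throttled", 0)
if nr_periods:
    print(f"throttled in {nr_throttled} of {nr_periods} periods "
          f"({100 * nr_throttled / nr_periods:.1f}%)")
```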

Second, unless you have carefully tuned your cluster settings, you may not be fully utilizing the available CPUs. This is especially true for bursty workloads, where the majority of pods on a machine have low demands most of the time, but periodically a single pod will consume a lot of CPU for a short burst. If you cap how much CPU that bursty pod can consume, you leave CPU cycles sitting idle and force the pod to complete its task more slowly.

The fix for this mess is easy: don't set CPU limits, and run your kubelet with the --cpu-cfs-quota=false flag. This disables CPU throttling entirely. It can cause problems of its own, since pods can now compete for CPU without a hard cap, but if you set sensible CPU requests so the scheduler spreads pods across nodes effectively, your cluster will speed up. Anecdotally, when I turned CFS throttling off, my web app's 95th percentile response time dropped by more than 50%. Henning Jacobs from Zalando noted similar improvements.