Reasons why your cluster autoscaler isn't working

Recently, I discovered that our Kubernetes cluster-autoscaler wasn’t scaling down nodes.

After some debugging and testing, we realized we needed to set a few flags for our cluster-autoscaler to scale down our nodes properly:

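  • skip-nodes-with-local-storage: false This flag is true by default, which means nodes running pods with local storage (for example emptyDir or hostPath volumes) are not scaled down. A team started deploying pods with local storage directories, which caused our cluster-autoscaler to stop scaling down effectively. Those directories weren’t expected to persist beyond the lifecycle of their pods, so we set this flag to false.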
  • scale-down-utilization-threshold: 0.75 This flag is set to 0.5, or 50%, by default. It means that once the pods on a node request 50% or more of the node’s allocatable CPU or memory, the node is no longer considered for scale-down. Our team runs workloads with guaranteed QoS (quality of service), meaning each container’s CPU and memory requests equal its limits, so we bumped this number up.
  • Optional: skip-nodes-with-system-pods: false This flag is true by default. It controls whether nodes running non-daemonset pods in the kube-system namespace can be scaled down: when true, those nodes are skipped. It’s generally safer to keep this default.
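
As a rough sketch, this is how the flags above might look on a self-managed cluster-autoscaler Deployment. The container name and the surrounding structure are assumptions for illustration, and all other fields (image, cloud provider flags, and so on) are omitted. If you install the autoscaler through a Helm chart or a managed add-on, the same flags are usually set through that tool's own values instead.

```yaml
# Illustrative fragment only: merge into your existing cluster-autoscaler
# Deployment rather than applying it as-is.
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler   # assumed container name
          command:
            - ./cluster-autoscaler
            # Allow scale-down of nodes whose pods use local storage.
            - --skip-nodes-with-local-storage=false
            # Consider nodes for scale-down until requests reach 75% of allocatable.
            - --scale-down-utilization-threshold=0.75
            # Optional: the default (true) is generally safer, so this stays commented out.
            # - --skip-nodes-with-system-pods=false
```

Changing the command restarts the autoscaler pod, so give it a few scale-down cycles before judging whether the new settings had an effect.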

The Kubernetes cluster-autoscaler keeps conservative defaults, so it’s worth checking every once in a while that your cluster is actually scaling down as expected.

You can find the list of all flags here.

