Which component are you using?: cluster-autoscaler on AWS
/area cluster-autoscaler
What version of the component are you using?: 9.45
Component version: Helm chart 9.45
What k8s version are you using (kubectl version)?:
kubectl version Output
$ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.31.3-eks-56e63d8
What environment is this in?: AWS EKS
What did you expect to happen?: I am trying to figure out why the autoscaler does not honor my --ok-total-unready-count=0. The node that enters the NotReady state seems to be stuck with many terminating pods, and at the same time I see the following error in the autoscaler log:
failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope
When looking at the ClusterRole created by the Helm chart, I do not see this particular resource:
$ k describe clusterrole cluster-autoscaler-aws-cluster-autoscaler
Name: cluster-autoscaler-aws-cluster-autoscaler
Labels: app.kubernetes.io/instance=cluster-autoscaler
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=aws-cluster-autoscaler
helm.sh/chart=cluster-autoscaler-9.45.0
Annotations: meta.helm.sh/release-name: cluster-autoscaler
meta.helm.sh/release-namespace: kube-system
PolicyRule:
Resources Non-Resource URLs Resource Names Verbs
--------- ----------------- -------------- -----
endpoints [] [] [create patch]
events [] [] [create patch]
pods/eviction [] [] [create]
leases.coordination.k8s.io [] [] [create]
jobs.extensions [] [] [get list patch watch]
endpoints [] [cluster-autoscaler] [get update]
leases.coordination.k8s.io [] [cluster-autoscaler] [get update]
configmaps [] [] [list watch get]
pods/status [] [] [update]
nodes [] [] [watch list create delete get update]
jobs.batch [] [] [watch list get patch]
namespaces [] [] [watch list get]
persistentvolumeclaims [] [] [watch list get]
persistentvolumes [] [] [watch list get]
pods [] [] [watch list get]
replicationcontrollers [] [] [watch list get]
services [] [] [watch list get]
daemonsets.apps [] [] [watch list get]
replicasets.apps [] [] [watch list get]
statefulsets.apps [] [] [watch list get]
cronjobs.batch [] [] [watch list get]
daemonsets.extensions [] [] [watch list get]
replicasets.extensions [] [] [watch list get]
csidrivers.storage.k8s.io [] [] [watch list get]
csinodes.storage.k8s.io [] [] [watch list get]
csistoragecapacities.storage.k8s.io [] [] [watch list get]
storageclasses.storage.k8s.io [] [] [watch list get]
poddisruptionbudgets.policy [] [] [watch list]
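Since that rule is missing from the chart-managed ClusterRole, a supplementary ClusterRole/ClusterRoleBinding along these lines should grant the permission as a workaround. This is a minimal, untested sketch: the resource names are my own, and the service account is the one from the error message above.

# Untested workaround sketch: give the cluster-autoscaler service account
# read access to VolumeAttachments without touching the chart-managed role.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler-volumeattachments
rules:
  - apiGroups: ["storage.k8s.io"]
    resources: ["volumeattachments"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler-volumeattachments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler-volumeattachments
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler   # SA named in the "forbidden" error
    namespace: kube-system

If the binding takes effect, kubectl auth can-i list volumeattachments.storage.k8s.io --as=system:serviceaccount:kube-system:cluster-autoscaler should return yes and the reflector error should stop.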
I am not sure, but given --ok-total-unready-count=0, I would expect a node that enters the NotReady state to be replaced fairly quickly by a node that can actually handle the workload.
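For context, the flag is passed to the autoscaler through the chart's extraArgs map, roughly like the values excerpt below (an assumption on my part about the exact layout; flag names go in without the leading dashes):

# Hypothetical values.yaml excerpt for the cluster-autoscaler chart.
extraArgs:
  # Quoted so the zero is not dropped when the flag is templated
  # into --ok-total-unready-count=0.
  ok-total-unready-count: "0"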
What happened instead?:
The NotReady node sticks around for quite some time, with a bunch of pods stuck in the Terminating state. It eventually goes away, but only after maybe 30-45 minutes.
How to reproduce it (as minimally and precisely as possible):
Something is causing my nodes to go NotReady; I suspect too much over-commitment on them, especially on memory (the kubelet then bails out).
I am afraid I can't :-/
Anything else we need to know?:
A log iteration where I see the VolumeAttachment error:
I0106 17:52:52.606768 1 static_autoscaler.go:274] Starting main loop
I0106 17:52:52.609136 1 aws_manager.go:188] Found multiple availability zones for ASG "eks-default_node_group-20241211130258966500000008-7cc9dae8-63f0-63d5-bce1-642871ebd84f"; using eu-central-2b for failure-domain.beta.kubernetes.io/zone label
I0106 17:52:52.758096 1 filter_out_schedulable.go:65] Filtering out schedulables
I0106 17:52:52.758116 1 filter_out_schedulable.go:122] 0 pods marked as unschedulable can be scheduled.
I0106 17:52:52.758125 1 filter_out_schedulable.go:85] No schedulable pods
I0106 17:52:52.758130 1 filter_out_daemon_sets.go:47] Filtered out 0 daemon set pods, 0 unschedulable pods left
I0106 17:52:52.758150 1 static_autoscaler.go:532] No unschedulable pods
I0106 17:52:52.758168 1 static_autoscaler.go:555] Calculating unneeded nodes
I0106 17:52:52.758182 1 pre_filtering_processor.go:67] Skipping ip-10-0-12-37.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758204 1 pre_filtering_processor.go:67] Skipping ip-10-0-28-107.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758209 1 pre_filtering_processor.go:67] Skipping ip-10-0-36-38.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758213 1 pre_filtering_processor.go:67] Skipping ip-10-0-36-82.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758473 1 static_autoscaler.go:598] Scale down status: lastScaleUpTime=2025-01-06 16:16:32.949347114 +0000 UTC m=-3582.400670434 lastScaleDownDeleteTime=2025-01-06 16:16:32.949347114 +0000 UTC m=-3582.400670434 lastScaleDownFailTime=2025-01-06 16:16:32.949347114 +0000 UTC m=-3582.400670434 scaleDownForbidden=false scaleDownInCooldown=true
I0106 17:52:52.759061 1 orchestrator.go:322] ScaleUpToNodeGroupMinSize: NodeGroup eks-default_node_group-20241211130258966500000008-7cc9dae8-63f0-63d5-bce1-642871ebd84f, TargetSize 3, MinSize 3, MaxSize 5
I0106 17:52:52.759135 1 orchestrator.go:366] ScaleUpToNodeGroupMinSize: scale up not needed
I0106 17:52:56.201819 1 reflector.go:349] Listing and watching *v1.VolumeAttachment from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
W0106 17:52:56.206308 1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope
E0106 17:52:56.206341 1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1.VolumeAttachment: failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User \"system:serviceaccount:kube-system:cluster-autoscaler\" cannot list resource \"volumeattachments\" in API group \"storage.k8s.io\" at the cluster scope" logger="UnhandledError"
I0106 17:52:57.975501 1 reflector.go:879] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Watch close - *v1.Node total 29 items received