Which component are you using?: cluster-autoscaler on AWS
/area cluster-autoscaler
What version of the component are you using?: 9.45
Component version: Helm chart 9.45
What k8s version are you using (kubectl version)?:
kubectl version Output
$ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.31.3-eks-56e63d8
What environment is this in?: AWS EKS
What did you expect to happen?: I am trying to figure out why the autoscaler does not honor my --ok-total-unready-count=0. The node that enters the NotReady state seems to be stuck with many terminating pods, and at the same time I see the following error in the autoscaler log:
failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope
When looking at the ClusterRole created by the Helm chart, I do not see this particular resource:
$ k describe clusterrole cluster-autoscaler-aws-cluster-autoscaler
Name: cluster-autoscaler-aws-cluster-autoscaler
Labels: app.kubernetes.io/instance=cluster-autoscaler
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=aws-cluster-autoscaler
helm.sh/chart=cluster-autoscaler-9.45.0
Annotations: meta.helm.sh/release-name: cluster-autoscaler
meta.helm.sh/release-namespace: kube-system
PolicyRule:
Resources Non-Resource URLs Resource Names Verbs
--------- ----------------- -------------- -----
endpoints [] [] [create patch]
events [] [] [create patch]
pods/eviction [] [] [create]
leases.coordination.k8s.io [] [] [create]
jobs.extensions [] [] [get list patch watch]
endpoints [] [cluster-autoscaler] [get update]
leases.coordination.k8s.io [] [cluster-autoscaler] [get update]
configmaps [] [] [list watch get]
pods/status [] [] [update]
nodes [] [] [watch list create delete get update]
jobs.batch [] [] [watch list get patch]
namespaces [] [] [watch list get]
persistentvolumeclaims [] [] [watch list get]
persistentvolumes [] [] [watch list get]
pods [] [] [watch list get]
replicationcontrollers [] [] [watch list get]
services [] [] [watch list get]
daemonsets.apps [] [] [watch list get]
replicasets.apps [] [] [watch list get]
statefulsets.apps [] [] [watch list get]
cronjobs.batch [] [] [watch list get]
daemonsets.extensions [] [] [watch list get]
replicasets.extensions [] [] [watch list get]
csidrivers.storage.k8s.io [] [] [watch list get]
csinodes.storage.k8s.io [] [] [watch list get]
csistoragecapacities.storage.k8s.io [] [] [watch list get]
storageclasses.storage.k8s.io [] [] [watch list get]
poddisruptionbudgets.policy [] [] [watch list]
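Since that rule is missing from the chart-managed ClusterRole, a supplementary ClusterRole/ClusterRoleBinding along these lines should grant the permission as a workaround. This is a minimal, untested sketch: the resource names are my own, and the service account is the one from the error message above.

# Untested workaround sketch: give the cluster-autoscaler service account
# read access to VolumeAttachments without touching the chart-managed role.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler-volumeattachments
rules:
  - apiGroups: ["storage.k8s.io"]
    resources: ["volumeattachments"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler-volumeattachments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler-volumeattachments
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler   # SA named in the "forbidden" error
    namespace: kube-system

If the binding takes effect, kubectl auth can-i list volumeattachments.storage.k8s.io --as=system:serviceaccount:kube-system:cluster-autoscaler should return yes and the reflector error should stop.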
I am not sure, but given --ok-total-unready-count=0, I would expect a node that enters the NotReady state to be replaced fairly quickly by a node that can actually handle the workload.
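For context, the flag is passed to the autoscaler through the chart's extraArgs map, roughly like the values excerpt below (an assumption on my part about the exact layout; flag names go in without the leading dashes):

# Hypothetical values.yaml excerpt for the cluster-autoscaler chart.
extraArgs:
  # Quoted so the zero is not dropped when the flag is templated
  # into --ok-total-unready-count=0.
  ok-total-unready-count: "0"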
What happened instead?:
The NotReady node sticks around for quite some time, with a bunch of pods stuck in the Terminating state. It eventually goes away, but only after maybe 30-45 minutes.
How to reproduce it (as minimally and precisely as possible):
Something is causing my nodes to go NotReady; I suspect too much over-commitment on them, especially on memory (the kubelet then bails out).
I am afraid I can't :-/
Anything else we need to know?:
A log iteration where I see the VolumeAttachment error:
I0106 17:52:52.606768 1 static_autoscaler.go:274] Starting main loop
I0106 17:52:52.609136 1 aws_manager.go:188] Found multiple availability zones for ASG "eks-default_node_group-20241211130258966500000008-7cc9dae8-63f0-63d5-bce1-642871ebd84f"; using eu-central-2b for failure-domain.beta.kubernetes.io/zone label
I0106 17:52:52.758096 1 filter_out_schedulable.go:65] Filtering out schedulables
I0106 17:52:52.758116 1 filter_out_schedulable.go:122] 0 pods marked as unschedulable can be scheduled.
I0106 17:52:52.758125 1 filter_out_schedulable.go:85] No schedulable pods
I0106 17:52:52.758130 1 filter_out_daemon_sets.go:47] Filtered out 0 daemon set pods, 0 unschedulable pods left
I0106 17:52:52.758150 1 static_autoscaler.go:532] No unschedulable pods
I0106 17:52:52.758168 1 static_autoscaler.go:555] Calculating unneeded nodes
I0106 17:52:52.758182 1 pre_filtering_processor.go:67] Skipping ip-10-0-12-37.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758204 1 pre_filtering_processor.go:67] Skipping ip-10-0-28-107.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758209 1 pre_filtering_processor.go:67] Skipping ip-10-0-36-38.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758213 1 pre_filtering_processor.go:67] Skipping ip-10-0-36-82.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758473 1 static_autoscaler.go:598] Scale down status: lastScaleUpTime=2025-01-06 16:16:32.949347114 +0000 UTC m=-3582.400670434 lastScaleDownDeleteTime=2025-01-06 16:16:32.949347114 +0000 UTC m=-3582.400670434 lastScaleDownFailTime=2025-01-06 16:16:32.949347114 +0000 UTC m=-3582.400670434 scaleDownForbidden=false scaleDownInCooldown=true
I0106 17:52:52.759061 1 orchestrator.go:322] ScaleUpToNodeGroupMinSize: NodeGroup eks-default_node_group-20241211130258966500000008-7cc9dae8-63f0-63d5-bce1-642871ebd84f, TargetSize 3, MinSize 3, MaxSize 5
I0106 17:52:52.759135 1 orchestrator.go:366] ScaleUpToNodeGroupMinSize: scale up not needed
I0106 17:52:56.201819 1 reflector.go:349] Listing and watching *v1.VolumeAttachment from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
W0106 17:52:56.206308 1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope
E0106 17:52:56.206341 1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1.VolumeAttachment: failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User \"system:serviceaccount:kube-system:cluster-autoscaler\" cannot list resource \"volumeattachments\" in API group \"storage.k8s.io\" at the cluster scope" logger="UnhandledError"
I0106 17:52:57.975501 1 reflector.go:879] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Watch close - *v1.Node total 29 items received