
CA DRA: support priority-based preempting pods using DRA #7683

Open
towca opened this issue Jan 9, 2025 · 0 comments
Labels: area/cluster-autoscaler, area/core-autoscaler, kind/feature, wg/device-management

towca commented Jan 9, 2025

Which component are you using?:

/area cluster-autoscaler
/area core-autoscaler
/wg device-management

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

If CA sees an unschedulable pod waiting for scheduler preemption (with nominatedNodeName set), it adds the pod to that Node in the snapshot without checking scheduler predicates or removing the pod that is about to be preempted (so the Node can effectively be "overscheduled").

For the DRA autoscaling MVP, we still just force-add such a Pod to the snapshot without modifying its ResourceClaims (see the sketch below). This means that CA doesn't see the Node's ResourceSlices as used, and can schedule another Pod to use them in the simulations. We need to fix this scenario for production use.
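
A rough Go sketch of the current MVP behavior, to make the gap concrete. The snapshot interface and its ForceAddPod method are hypothetical placeholders for CA's in-memory snapshot, not the actual Cluster Autoscaler API:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// hypotheticalSnapshot stands in for CA's in-memory cluster snapshot.
type hypotheticalSnapshot interface {
	// ForceAddPod places the pod on the node without running scheduler
	// predicates and without marking any ResourceSlices as allocated.
	ForceAddPod(pod *v1.Pod, nodeName string) error
}

// addPreemptingPodMVP mirrors the MVP handling described above: a pod waiting
// for scheduler preemption (nominatedNodeName set) is force-added to its
// nominated Node. Its ResourceClaims are left untouched, so the Node's devices
// still look free to later simulations - which is the overscheduling problem.
func addPreemptingPodMVP(snapshot hypotheticalSnapshot, pod *v1.Pod) error {
	if pod.Status.NominatedNodeName == "" {
		return nil // not waiting for preemption, handled elsewhere
	}
	return snapshot.ForceAddPod(pod, pod.Status.NominatedNodeName)
}
```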

Describe the solution you'd like.:

  • To handle ResourceClaims correctly, we'd have to start running scheduler predicates before adding these pods to the snapshot.
  • However, the predicates aren't likely to pass without actually removing the preempted pod from the Node. I'm not sure if there is a way to determine which pod is supposed to be preempted.
  • IMO if the predicates don't pass, we shouldn't add these pods to the snapshot at all, and should just wait for the preemption to happen. If we add them without modifying the claims, CA simply doesn't model the cluster state correctly (it can overschedule a Node), which could lead to wrong decisions.
  • Running additional scheduler predicates will have an impact on performance, but it's necessary for DRA pods. We could run the predicates only if a Pod references any ResourceClaims, keeping the performance for non-DRA pods unchanged (see the sketch after this list).

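A rough Go sketch of this proposed flow (see the last bullet above). The predicateChecker and snapshotWithDRA types and their methods are hypothetical placeholders, not the actual CA or scheduler-framework API:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// predicateChecker stands in for whatever CA uses to run scheduler predicates
// (including the DRA plugin) against a Node in the snapshot.
type predicateChecker interface {
	CheckPredicates(pod *v1.Pod, nodeName string) error
}

// snapshotWithDRA extends the hypothetical snapshot with a DRA-aware add that
// marks the pod's ResourceClaim allocations (and backing ResourceSlices) as used.
type snapshotWithDRA interface {
	ForceAddPod(pod *v1.Pod, nodeName string) error
	AddPodWithClaims(pod *v1.Pod, nodeName string) error
}

func referencesResourceClaims(pod *v1.Pod) bool {
	return len(pod.Spec.ResourceClaims) > 0
}

// addPreemptingPod is the proposed handling: non-DRA pods keep the cheap
// force-add path, DRA pods are added only if predicates pass; otherwise CA
// skips them and waits for the real preemption to happen.
func addPreemptingPod(snapshot snapshotWithDRA, checker predicateChecker, pod *v1.Pod) error {
	nodeName := pod.Status.NominatedNodeName
	if nodeName == "" {
		return nil
	}
	if !referencesResourceClaims(pod) {
		// Unchanged behavior - no extra predicate cost for non-DRA pods.
		return snapshot.ForceAddPod(pod, nodeName)
	}
	if err := checker.CheckPredicates(pod, nodeName); err != nil {
		// Predicates don't pass (e.g. the to-be-preempted pod still holds the
		// devices) - don't add the pod, wait for the preemption to happen.
		return nil
	}
	return snapshot.AddPodWithClaims(pod, nodeName)
}
```
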
Describe any alternative solutions you've considered.:

Keep force-adding the pods to the snapshot. This doesn't sound like a good idea for the reasons explained above.

Additional context.:

This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.
