
🌱 clusterctl: add flag to skip lagging provider check in ApplyCustomPlan #11196

Conversation

w21froster

What this PR does / why we need it:

Clusterctl runs a pre-check to see if any other providers are lagging behind the target contract before creating an upgrade plan. In the current implementation of cluster-api-operator there are multiple controllers, one reconciling each provider type. None of these controllers has knowledge of the other providers, so none of them passes enough information to clusterctl to complete this check successfully. This PR adds a flag and an UpgradeOption that allow us to skip this pre-check and successfully upgrade the provider.
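To illustrate the shape of the change, here is a minimal, self-contained Go sketch of how such an option could gate the pre-check. All identifiers (UpgradeOptions, SkipLaggingProviderCheck, applyCustomPlan, checkProviderContracts) are hypothetical stand-ins and do not mirror the actual clusterctl code:

```go
// Hypothetical sketch; the identifiers below are illustrative and are not the
// real clusterctl API.
package main

import (
	"errors"
	"fmt"
)

// Provider is a minimal stand-in for a managed provider and the API contract
// it currently implements.
type Provider struct {
	Name     string
	Contract string
}

// UpgradeOptions carries flags that tune the upgrade plan.
type UpgradeOptions struct {
	// SkipLaggingProviderCheck bypasses the pre-check that verifies every
	// other provider already satisfies the target contract.
	SkipLaggingProviderCheck bool
}

// checkProviderContracts fails if any provider lags behind the target contract.
func checkProviderContracts(targetContract string, providers []Provider) error {
	for _, p := range providers {
		if p.Contract != targetContract {
			return errors.New(p.Name + " is lagging behind contract " + targetContract)
		}
	}
	return nil
}

// applyCustomPlan runs the pre-check (unless skipped) before applying the upgrade.
func applyCustomPlan(targetContract string, providers []Provider, opts UpgradeOptions) error {
	if !opts.SkipLaggingProviderCheck {
		if err := checkProviderContracts(targetContract, providers); err != nil {
			return err
		}
	}
	// ... the actual upgrade would be applied here ...
	fmt.Println("upgrade plan applied")
	return nil
}

func main() {
	// An operator-driven reconciler typically only knows about its own
	// provider, so it opts out of the cross-provider check.
	providers := []Provider{{Name: "infrastructure-azure", Contract: "v1beta1"}}
	_ = applyCustomPlan("v1beta1", providers, UpgradeOptions{SkipLaggingProviderCheck: true})
}
```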

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
This fixes issue 570 in the cluster-api-operator repo.


linux-foundation-easycla bot commented Sep 18, 2024

CLA Signed


The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chrischdi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-area PR is missing an area label needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 18, 2024
@k8s-ci-robot
Contributor

Welcome @w21froster!

It looks like this is your first PR to kubernetes-sigs/cluster-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @w21froster. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Sep 18, 2024
@w21froster
Author

/area clusterctl

@k8s-ci-robot k8s-ci-robot added area/clusterctl Issues or PRs related to clusterctl and removed do-not-merge/needs-area PR is missing an area label labels Sep 18, 2024
@w21froster
Author

@JoelSpeed @Jont828 Please take a look when you are available 🙏

@w21froster
Author

@JoelSpeed @Jont828 Are you able to take a look? Let me know if you need more context on anything.

@chrischdi
Member

Question: Could adding this and using it in the cluster-api-operator lead to issues?

Could this end up with providers running different contract versions, which could in turn lead to issues?

Upgrading via clusterctl upgrades all providers at the same time instead of each one individually (with a per-provider approach, some could still be running the old version while others are already upgraded).

@w21froster
Author

w21froster commented Oct 11, 2024

> Question: Could adding this and using it in the cluster-api-operator lead to issues?
>
> Could this end up with providers running different contract versions, which could in turn lead to issues?
>
> Upgrading via clusterctl upgrades all providers at the same time instead of each one individually (with a per-provider approach, some could still be running the old version while others are already upgraded).

I don't think this should be an issue. We talked about it in the cluster-api-operator office hours and determined that adding a flag in clusterctl to skip this check was probably the best way forward. We have different CRs for each provider, and when users upgrade their providers they typically move all versions at the same time. There could potentially be a delay between reconciliations for each provider, but we haven't noticed any issues running this as a fork and upgrading the Azure CAPI/CAPBK/KCP providers.

Definitely open to better approaches though! I can stop by the CAPI office hours to discuss this issue we are having in more detail.

@fabriziopandini
Member

I personally have some concerns about disabling this check, considering that the value add of clusterctl is to ensure the health of the management cluster as a whole.

TBH, I think that if someone asks the operator to upgrade a single provider, the operation must be put on hold if it could lead to an invalid cluster (leaning on "when users upgrade their providers they typically move all versions at the same time" seems weak).

The upgrade operation for the providers involved should unblock itself once the user is upgrading enough providers to reach a valid state.

The issue seems to be that "each one of these controllers doesn't have knowledge of the other providers, and doesn't pass in enough information to clusterctl to be able to complete this check successfully", but I think there are ways to work around that, since AFAIK for each provider there is a CR with a desired state/target version.
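For context, a minimal Go sketch of that alternative: an operator-side check that aggregates the target contract declared on every provider CR and only proceeds once all of them converge. ProviderSpec and allProvidersReachContract are hypothetical names, not real cluster-api-operator APIs:

```go
// Hypothetical sketch; ProviderSpec and allProvidersReachContract are
// illustrative and not part of cluster-api-operator.
package main

import "fmt"

// ProviderSpec is a stand-in for the desired state recorded on a provider CR,
// i.e. the contract implied by the version the user has asked for.
type ProviderSpec struct {
	Name           string
	TargetContract string
}

// allProvidersReachContract reports whether every provider's desired state
// targets the given contract, so the combined upgrade ends in a valid
// management cluster.
func allProvidersReachContract(target string, specs []ProviderSpec) bool {
	for _, s := range specs {
		if s.TargetContract != target {
			return false
		}
	}
	return true
}

func main() {
	specs := []ProviderSpec{
		{Name: "cluster-api", TargetContract: "v1beta1"},
		{Name: "infrastructure-azure", TargetContract: "v1beta1"},
	}
	if allProvidersReachContract("v1beta1", specs) {
		fmt.Println("safe to upgrade: all providers target the same contract")
	} else {
		fmt.Println("hold the upgrade until the remaining providers are bumped")
	}
}
```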

@w21froster
Author

Hey @fabriziopandini, sorry for the delayed response, and thank you for providing more context on this check. We don't want users to be able to break their cluster if they have a misconfiguration, so I think a PR should be made in the CAPI operator rather than in CAPI to get this check to pass. I will go ahead and close this PR.

@w21froster w21froster closed this Nov 5, 2024
Development

Successfully merging this pull request may close these issues.

Unable to upgrade providers when using custom fetchconfig