Add docs for the new kops reconcile cluster command #17191

Open
wants to merge 2 commits into
base: master

rifelpet
Member

@rifelpet rifelpet commented Jan 9, 2025

/hold for feedback

A few open questions:

  1. Do we update all the rolling-update cluster docs references to reconcile cluster?
  2. Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes? (with no --instance-group* filtering)
  3. Do we add a new permalink that the error message links to?
  4. Do we mention the new update cluster --reconcile flag? When would a user use it instead of kops reconcile cluster?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rifelpet. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from hakman January 9, 2025 02:39
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 9, 2025
@stl-victor-sudakov

Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes?

Why would this be an error?

@rifelpet
Member Author

rifelpet commented Jan 9, 2025

Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes?

Why would this be an error?

Because updating both the cluster's control plane launch templates (or other cloud provider equivalents) and node launch templates at the same time will cause new nodes to fail to join the cluster until all control plane instances have been upgraded. So if Cluster Autoscaler or Karpenter scale up nodes before or during the control plane rolling-update, they will fail to join and workloads will be stuck in Pending. This is almost certainly not what the user wants and is why we're introducing the new command.

@rifelpet
Member Author

rifelpet commented Jan 9, 2025

We could allow the user to bypass the error if they know what they're doing, for example on clusters that don't use Cluster Autoscaler or Karpenter.
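
The skippable error proposed above could be sketched roughly as follows. This is a hypothetical guard written for illustration, not kOps code: the function name `should_block_update`, its four arguments, and the error text are all assumptions, and whether kOps adds such a check is still an open question in this PR.

```shell
#!/bin/sh
# Hypothetical guard, not part of kOps. Returns success (0) when a plain
# "update cluster --yes" should be refused under the proposal in this thread.
should_block_update() {
  old_minor="$1"   # current Kubernetes minor version, e.g. 30 for 1.30
  new_minor="$2"   # target Kubernetes minor version, e.g. 31 for 1.31
  filtered="$3"    # "yes" if an --instance-group* filter was passed
  bypass="$4"      # "yes" if the user explicitly chose to skip the check
  [ "$old_minor" -le 30 ] && [ "$new_minor" -ge 31 ] \
    && [ "$filtered" = "no" ] && [ "$bypass" = "no" ]
}

# Example: unfiltered 1.30 -> 1.31 update with no bypass is refused.
if should_block_update 30 31 no no; then
  # Placeholder wording; the real message (if any) is not given in this thread.
  echo "error: use 'kops reconcile cluster' for this upgrade" >&2
fi
```

A cluster without Cluster Autoscaler or Karpenter would pass "yes" as the fourth argument to skip the check, which matches the bypass idea described above.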

@stl-victor-sudakov

Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes?

Why would this be an error?

Because updating both the cluster's control plane launch templates (or other cloud provider equivalents) and node launch templates at the same time will cause new nodes to fail to join the cluster until all control plane instances have been upgraded. So if Cluster Autoscaler or Karpenter scale up nodes before or during the control plane rolling-update, they will fail to join and workloads will be stuck in Pending. This is almost certainly not what the user wants and is why we're introducing the new command.

Hold on, hasn't this sequence:

kops upgrade cluster $NAME --yes
kops update cluster $NAME --yes
kops rolling-update cluster $NAME --yes

always been the standard upgrade sequence? These steps are even documented (https://kops.sigs.k8s.io/operations/updates_and_upgrades/#automated-update). And now it is an error?

@rifelpet
Member Author

rifelpet commented Jan 9, 2025

Now it may cause node failures during the k8s 1.31 upgrade, yes. Hence the bold release note being added in this PR and my proposal to prevent users from making this mistake by returning a (skippable) error.

I'll update that docs page to note this change as well.

@stl-victor-sudakov

Now it may cause node failures during the k8s 1.31 upgrade, yes. Hence the bold release note being added in this PR and my proposal to prevent users from making this mistake by returning a (skippable) error.

I'll update that docs page to note this change as well.

Sorry for my persistence, what has changed in k8s 1.31 that the regular kOps upgrade procedure has become dangerous?

@rifelpet
Member Author

rifelpet commented Jan 9, 2025

Sorry for my persistence, what has changed in k8s 1.31 that the regular kOps upgrade procedure has become dangerous?

I updated this PR to link to the k/k issue that goes into more detail: kubernetes/kubernetes#127316

@stl-victor-sudakov

stl-victor-sudakov commented Jan 10, 2025

Sorry for my persistence, what has changed in k8s 1.31 that the regular kOps upgrade procedure has become dangerous?

I updated this PR to link to the k/k issue that goes into more detail: kubernetes/kubernetes#127316

Oh, what a long read! Maybe #16907 would be shorter and more to the point; it is also mentioned within the longer post.

However, I think I understand the innovation now. The new reconcile command does "update --yes && rolling-update --yes" on CP nodes first, and then does the same on worker nodes, thus enforcing that the CP is fully updated first. Is this correct?
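
To make the ordering in this question concrete, here is a minimal dry-run sketch that only prints the commands in the order described. It is not confirmed in this thread that `kops reconcile cluster` behaves exactly this way. The function name `reconcile_plan` and the cluster name are hypothetical, and the sketch assumes that both `update cluster` and `rolling-update cluster` accept `--instance-group-roles` with the values `control-plane` and `apiserver` in the kOps version in use (check `--help` before relying on it).

```shell
#!/bin/sh
# Dry-run sketch: prints the assumed command order, never invokes kops.
reconcile_plan() {
  name="$1"
  # Assumption: these role names select the control-plane instance groups.
  cp_roles="control-plane,apiserver"

  # Phase 1: update and roll the control plane first.
  echo "kops update cluster $name --instance-group-roles=$cp_roles --yes"
  echo "kops rolling-update cluster $name --instance-group-roles=$cp_roles --yes"

  # Phase 2: only then update and roll the remaining instance groups.
  echo "kops update cluster $name --yes"
  echo "kops rolling-update cluster $name --yes"
}

# Hypothetical cluster name, for illustration only.
reconcile_plan "example.k8s.local"
```

Because node launch templates are only touched in phase 2, an autoscaler that adds nodes during phase 1 would still launch them from the old templates, which is the failure mode the earlier comments describe avoiding.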

Labels
area/documentation cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

3 participants