🐛 Perform draining and volume detachment once until completion #11590
base: main
[APPROVALNOTIFIER] This PR is NOT APPROVED
/area bootstrap
df2f637 to 76d0706
76d0706 to f472e1a
@@ -657,14 +657,26 @@ func (r *Reconciler) isNodeDrainAllowed(m *clusterv1.Machine) bool {
		return false
	}

	if conditions.IsTrue(m, clusterv1.DrainingSucceededCondition) {
We should not start relying on this condition being set, because it is going to be deprecated/removed in v1beta2.
This is also probably a change in behavior for all VMs (previously, a subsequent reconciliation could run through drain again).
I unified the check under the PreTerminateDeleteHookSucceededCondition, which allows both methods to run until completion. It is not listed in the proposal as part of the deprecation/removal, AFAICT.
Signed-off-by: Danil-Grigorev <[email protected]>
f472e1a to 1021f8e
Thanks for the review @chrischdi 👍🏼
/lgtm
A backport would be very much appreciated 🙏
LGTM label has been added. Git tree hash: 0bcc510758d20e84eb7b2d56a889fcf26def4356
/area machine
Is this a control plane Node? Wouldn't this scenario make any other upcoming node deletion fail to query through the remote client as well?
Thanks @enxebre, I think in this case the deletion eventually passes. Once the etcd member is drained, the API server healthz on the node fails, which causes its exclusion from the LB. It is important to reach the code that removes the infra machine, so that the deletion completes on the provider side. The Node is removed once API server connectivity is restored, and then the Machine follows. Thanks to @chrischdi's suggestion, the behavior is replicated in our CI by setting 2 annotations (#11591 (comment)), and it works well as a temporary solution. But other providers may hit the same issue.
What this PR does / why we need it:
The draining logic continues to access the cluster even after the hook is removed or completed, causing a deadlock on machine removal in a non-KCP-backed control plane provider implementation.
In a cluster running kubelet in local mode, completed draining and etcd membership removal make the API server inaccessible externally, so the operation keeps erroring out even after it has succeeded. This prevents the underlying machine from being removed.
The solution makes the PreTerminateHook agnostic to the provider, and ensures that completion of the operation does not lead to further attempts to access the cluster.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #11591
/area machine