-
Notifications
You must be signed in to change notification settings - Fork 365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: join node creates new cluster when initial etcd sync config fails #5151
base: main
Are you sure you want to change the base?
fix: join node creates new cluster when initial etcd sync config fails #5151
Conversation
Signed-off-by: Ethan Mosbaugh <[email protected]>
Signed-off-by: Ethan Mosbaugh <[email protected]>
Signed-off-by: Ethan Mosbaugh <[email protected]>
f40047f
to
9706878
Compare
@jnummelin @twz123 I'm a bit stuck at this point with how to proceed with handling this case. Could you take another look at this PR and give me some guidance? Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks simple enough! However, I'd leave out 9706878 for now, since k0s will retry all join errors no matter what caused them. So returning a 503 instead of a 500 is probably not really worth it?
func (c *command) needToJoin(nodeConfig *v1beta1.ClusterConfig) bool { | ||
if nodeConfig.Spec.Storage.Type == v1beta1.EtcdStorageType && !nodeConfig.Spec.Storage.Etcd.IsExternalClusterUsed() { | ||
return !file.Exists(filepath.Join(c.K0sVars.EtcdDataDir, "member", "snap", "db")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth adding a comment about that file, e.g. by including a link to https://etcd.io/docs/v3.5/learning/persistent-storage-files/#bbolt-btree-membersnapdb
@@ -107,6 +107,8 @@ func (e *Etcd) syncEtcdConfig(ctx context.Context, etcdRequest v1beta1.EtcdReque | |||
etcdResponse, err = e.JoinClient.JoinEtcd(ctx, etcdRequest) | |||
return err | |||
}, | |||
retry.Delay(1*time.Second), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why the delay was increased in the commit message, or in a comment? If I'm not mistaken this will now block for ~ 17 minutes, whereas it was blocking only around 100 secs before?
Description
Fixes #5149
If etcd join fails to sync the etcd config and the k0s process exits, the pki ca files exist and etcd creates a new cluster rather than joining the existing one. Rather than check the pki dir for embedded etcd, check the etcd data directory exists as we do here.
I am open to suggestions here if I am checking the wrong thing as I cannot test this and am taking a guess at a solution.
Type of change
How Has This Been Tested?
Checklist: