fix: join node creates new cluster when initial etcd sync config fails #5151

emosbaugh · 2024-10-23T13:56:44Z

Description

If a joining node fails to sync the etcd config and the k0s process exits, the PKI CA files already exist on disk, so after a restart etcd creates a new cluster rather than joining the existing one. For embedded etcd, rather than checking the PKI dir, check whether the etcd data directory exists, as we do here.

I am open to suggestions here if I am checking the wrong thing as I cannot test this and am taking a guess at a solution.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update

How Has This Been Tested?

Manual test
Auto test added

Checklist:

My code follows the style guidelines of this project
My commit messages are signed-off
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules
I have checked my code and corrected any misspellings

Signed-off-by: Ethan Mosbaugh <[email protected]>

emosbaugh · 2024-11-05T18:42:21Z

@jnummelin @twz123 I'm a bit stuck at this point with how to proceed with handling this case. Could you take another look at this PR and give me some guidance? Thank you!

twz123

Looks simple enough! However, I'd leave out 9706878 for now, since k0s will retry all join errors no matter what caused them. So returning a 503 instead of a 500 is probably not really worth it?

twz123 · 2024-11-08T07:14:14Z

cmd/controller/controller.go

 func (c *command) needToJoin(nodeConfig *v1beta1.ClusterConfig) bool {
+	if nodeConfig.Spec.Storage.Type == v1beta1.EtcdStorageType && !nodeConfig.Spec.Storage.Etcd.IsExternalClusterUsed() {
+		return !file.Exists(filepath.Join(c.K0sVars.EtcdDataDir, "member", "snap", "db"))


Might be worth adding a comment about that file, e.g. by including a link to https://etcd.io/docs/v3.5/learning/persistent-storage-files/#bbolt-btree-membersnapdb

twz123 · 2024-11-08T07:21:03Z

pkg/component/controller/etcd.go

@@ -107,6 +107,8 @@ func (e *Etcd) syncEtcdConfig(ctx context.Context, etcdRequest v1beta1.EtcdReque
 			etcdResponse, err = e.JoinClient.JoinEtcd(ctx, etcdRequest)
 			return err
 		},
+		retry.Delay(1*time.Second),


Can you explain why the delay was increased, either in the commit message or in a comment? If I'm not mistaken, this will now block for about 17 minutes, whereas it was blocking for only around 100 seconds before.

twz123 mentioned this pull request Oct 23, 2024

If etcd fails to sync config during initial start sequence and k0s restarts, node creates a new cluster rather than joining existing #5149


emosbaugh added 3 commits November 5, 2024 10:36

fix: join node creates new cluster when initial etcd sync config fails

838e547

Signed-off-by: Ethan Mosbaugh <[email protected]>

increase etcd sync config backoff

93eb6d1

Signed-off-by: Ethan Mosbaugh <[email protected]>

return service unavailable from k0s api when etcd is unhealthy

9706878

Signed-off-by: Ethan Mosbaugh <[email protected]>

emosbaugh force-pushed the issue-5149-etcd-creates-new-cluster-rather-than-join-if-sync-fails branch from f40047f to 9706878 Compare November 5, 2024 18:36

twz123 reviewed Nov 8, 2024
