Improve the clustermesh status command #1834

giorio94 · 2023-07-13T14:50:19Z

This PR includes a set of improvements to the cilium clustermesh status command, to simplify troubleshooting and reduce CI flakiness:

- Always check the readiness of the clustermesh-apiserver deployment.
- Also report a node as not ready when the configuration has not yet been detected (because the secret modification has not propagated yet).
- Report explicit errors about why a node is not ready, based on the additional information introduced in cilium/cilium#26788, "Extend clustermesh status reporting with remote configuration and synchronization information" (e.g., etcd connection, cluster configuration retrieval, data synchronization).
- Remove duplicate checks and fix inaccuracies in output messages.

Please refer to the individual commit messages for additional information.

Fixes: #1773

I'm marking this PR as ready for review so that it can be reviewed in parallel with cilium/cilium#26788, and any modifications can be kept in sync. The commit with the go.mod changes will be dropped before merging.

rolinh

It looks like you forgot to remove the commit aptly named "TO DROP: use forked version of cilium"

giorio94 · 2023-07-14T09:17:05Z

> It looks like you forgot to remove the commit aptly named "TO DROP: use forked version of cilium"

Yeah, the commit is still there because otherwise the CLI doesn't compile. This is the reason for the dont-merge/blocked label being set, as it depends on cilium/cilium#26788. At the same time, I wanted to mark this one ready for review so that they can be reviewed in parallel, and possible changes can be propagated if necessary.

asauber

We're now checking the health of the Deployment, but it may be the case that the Service is separately not able to receive traffic (e.g. an LB Service has not been assigned an IP). Is the idea here that if nodes have synced then the Service must be reachable?

giorio94 · 2023-07-17T06:48:37Z

> We're now checking the health of the Deployment, but it may be the case that the Service is separately not able to receive traffic (e.g. an LB Service has not been assigned an IP). Is the idea here that if nodes have synced then the Service must be reachable?

There's actually a service readiness check, but it is hidden inside the statusAccessInformation function (which in turn calls extractAccessInformation). The inner function returns an error (thus triggering the retry mechanism) if no service IP is found. This is also the reason for the removal of the explicit service readiness check in 218fc9a, as it was already tested earlier.

The temporary commit has been dropped, and there are no more vendoring changes.

Currently, the wait observer prints the first log message only after the warning interval has expired. This can be misleading, because the command seems to hang and it is not clear which operation is being performed. Signed-off-by: Marco Iorio <[email protected]>

This parameter was a no-op, hence let's just remove it. Signed-off-by: Marco Iorio <[email protected]>

YutaroHayakawa

LGTM with one question

YutaroHayakawa · 2023-07-24T07:11:47Z

clustermesh/clustermesh.go

 		ServiceIPs:           []string{},
 		Tunnel:               cm.Data[configNameTunnel],
 	}

 	switch {
 	case len(endpoints) > 0:
+		svc.Spec.Type = "Manual"


What's this "Manual"?

TL;DR: I've dropped it with the last push. The idea was to propagate the type of the service to be printed by the cilium clustermesh status command, and to set it to Manual in case the endpoints were specified manually (there was a typo though, and it should have read ai.ServiceType = "Manual"). Yet, the cilium clustermesh status command does not support specifying the endpoints manually, and I agree that having a custom value might be misleading (I've also now changed that field to be of type corev1.ServiceType to make its content more explicit). Hence, the special case is actually not needed.

The retrieval of the access information already retrieves the clustermesh-apiserver service and verifies that an IP can be inferred. Hence, let's drop the subsequent service check, which is redundant. Signed-off-by: Marco Iorio <[email protected]>

Currently, the status of the clustermesh-apiserver deployment is checked only when the --wait flag is specified. Let's always enable it, and align the wait style with that adopted for the other checks. Signed-off-by: Marco Iorio <[email protected]>

The wait logic implemented by the ClusterMeshConnectivity function was never enabled, as it is already handled by an outer wrapper. Hence, let's just remove it for the sake of simplicity. Signed-off-by: Marco Iorio <[email protected]>

The message reported when waiting for clusters to be connected is currently not correct, as the given number refers to nodes and not clusters. Additionally, the two conditions are equivalent, as an error is always present when a node is not ready. Let's also get rid of a redundant error wrapping. Signed-off-by: Marco Iorio <[email protected]>

Let's sort the cluster names and the possible errors to enforce a consistent output. Additionally, let's make it more explicit when the local cluster is not connected to any remote clusters. Signed-off-by: Marco Iorio <[email protected]>
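A deterministic output order of this kind can be obtained by sorting both the map keys and the per-cluster errors before printing. The following is a minimal sketch under that assumption; `formatClusters` and the message strings are hypothetical and are not the actual output of the command.

```go
package main

import (
	"fmt"
	"sort"
)

// formatClusters renders one line per cluster (or per error) in a stable order.
// The input maps a remote cluster name to the errors reported for it.
func formatClusters(errsByCluster map[string][]string) []string {
	if len(errsByCluster) == 0 {
		// Make the "nothing connected" case explicit instead of printing nothing.
		return []string{"not connected to any remote cluster"}
	}

	// Map iteration order is random in Go, so sort the names first.
	names := make([]string, 0, len(errsByCluster))
	for name := range errsByCluster {
		names = append(names, name)
	}
	sort.Strings(names)

	var out []string
	for _, name := range names {
		// Copy before sorting so that the caller's slice is left untouched.
		errs := append([]string(nil), errsByCluster[name]...)
		sort.Strings(errs)
		if len(errs) == 0 {
			out = append(out, fmt.Sprintf("%s: ready", name))
			continue
		}
		for _, e := range errs {
			out = append(out, fmt.Sprintf("%s: %s", name, e))
		}
	}
	return out
}

func main() {
	for _, line := range formatClusters(map[string][]string{
		"cluster-b": {"timeout", "connection refused"},
		"cluster-a": nil,
	}) {
		fmt.Println(line)
	}
}
```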

Currently, the number of ready clusters is determined from the status reported by the agents only. Yet, given that the configuration is mounted from a secret, it might take some time before it propagates, in case agents are not restarted at the same time (that is, when host aliases don't need to be configured). Hence, let's infer the number of expected clusters from the secret, so that we can actually report the correct status and wait until they are all effectively ready. Signed-off-by: Marco Iorio <[email protected]>
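The difference between the two approaches can be illustrated with a small sketch. It assumes that the list of expected cluster names has already been derived from the secret (how exactly is left out here), and that each agent reports a ready flag per remote cluster; `nodeReady` and its parameters are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeReady reports whether a node sees every expected remote cluster as ready,
// together with the sorted names of the clusters that are missing or not ready.
//
// expected: cluster names inferred from the secret (assumed to be given).
// reported: ready flag per remote cluster, as reported by the node's agent.
func nodeReady(expected []string, reported map[string]bool) (bool, []string) {
	var missing []string
	for _, name := range expected {
		// A cluster absent from the report reads as false, hence not ready.
		if !reported[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return len(missing) == 0, missing
}

func main() {
	expected := []string{"cluster2", "cluster3"}

	// The agent has not picked up the updated secret yet: it reports nothing.
	// Counting only reported clusters would yield "0 of 0 ready".
	ok, missing := nodeReady(expected, map[string]bool{})
	fmt.Println(ok, missing)

	// After propagation, both clusters are reported as ready.
	ok, missing = nodeReady(expected, map[string]bool{"cluster2": true, "cluster3": true})
	fmt.Println(ok, missing)
}
```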

Let's leverage the additional status flags introduced in cilium/cilium#26788 to provide more detailed information concerning why a node cannot connect to a remote cluster, and simplify troubleshooting. In particular, it might be due to an etcd connection problem, to the cluster config being required but not found, or to the synchronization process being still in progress. Signed-off-by: Marco Iorio <[email protected]>
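The three failure causes listed above could be turned into an explicit message along these lines. The `remoteStatus` struct and its field names are hypothetical placeholders for the status flags, not the actual types from cilium/cilium#26788, and the order of the checks is an assumption.

```go
package main

import "fmt"

// remoteStatus is a hypothetical summary of what a node reports for one
// remote cluster; the real status model may be structured differently.
type remoteStatus struct {
	Connected       bool // etcd connection established
	ConfigRequired  bool // the cluster configuration must be present
	ConfigRetrieved bool // the cluster configuration was found
	Synced          bool // initial data synchronization completed
}

// notReadyReason returns a human-readable reason why the remote cluster is
// not ready, or an empty string if it is ready.
func notReadyReason(s remoteStatus) string {
	switch {
	case !s.Connected:
		return "etcd connection not established"
	case s.ConfigRequired && !s.ConfigRetrieved:
		return "cluster configuration required but not found"
	case !s.Synced:
		return "synchronization still in progress"
	default:
		return ""
	}
}

func main() {
	fmt.Println(notReadyReason(remoteStatus{Connected: false}))
	fmt.Println(notReadyReason(remoteStatus{Connected: true, ConfigRequired: true}))
	fmt.Println(notReadyReason(remoteStatus{Connected: true, ConfigRetrieved: true}))
}
```

Checking the connection first keeps the message focused on the earliest failing step, since configuration retrieval and synchronization cannot succeed without it.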

giorio94 added the dont-merge/blocked (Another PR must be merged before this one), kind/enhancement (This would improve or streamline existing functionality), and area/clustermesh labels Jul 13, 2023

giorio94 requested review from a team as code owners July 13, 2023 14:50

giorio94 requested review from rolinh, asauber and YutaroHayakawa July 13, 2023 14:50

giorio94 temporarily deployed to ci July 13, 2023 14:50 — with GitHub Actions Inactive

giorio94 mentioned this pull request Jul 13, 2023

Extend clustermesh status reporting with remote configuration and synchronization information cilium/cilium#26788

Merged

giorio94 marked this pull request as draft July 13, 2023 15:31

giorio94 force-pushed the mio/clustermesh-status branch from 4833f0c to 76c774c Compare July 13, 2023 16:32

giorio94 temporarily deployed to ci July 13, 2023 16:32 — with GitHub Actions Inactive

giorio94 marked this pull request as ready for review July 13, 2023 16:34

giorio94 force-pushed the mio/clustermesh-status branch from 76c774c to c9a5394 Compare July 14, 2023 06:28

giorio94 temporarily deployed to ci July 14, 2023 06:28 — with GitHub Actions Inactive

rolinh previously requested changes Jul 14, 2023

asauber reviewed Jul 14, 2023

asauber self-requested a review July 17, 2023 15:26

asauber approved these changes Jul 18, 2023

giorio94 force-pushed the mio/clustermesh-status branch from c9a5394 to 183375f Compare July 31, 2023 07:54

giorio94 temporarily deployed to ci July 31, 2023 07:54 — with GitHub Actions Inactive

giorio94 removed the dont-merge/blocked (Another PR must be merged before this one) label Jul 31, 2023

giorio94 added 2 commits July 31, 2023 16:51

clustermesh status: drop skip service check parameter

9949fea

This parameter was a no-op, hence let's just remove it. Signed-off-by: Marco Iorio <[email protected]>

giorio94 force-pushed the mio/clustermesh-status branch from 183375f to d9904f1 Compare July 31, 2023 14:51

giorio94 temporarily deployed to ci July 31, 2023 14:51 — with GitHub Actions Inactive

YutaroHayakawa approved these changes Aug 1, 2023

maintainer-s-little-helper bot added the ready-to-merge (This PR has passed all tests and received consensus from code owners to merge) label Aug 1, 2023

giorio94 added the dont-merge/bad-bot (To prevent MLH from marking ready-to-merge) label and removed the ready-to-merge (This PR has passed all tests and received consensus from code owners to merge) label Aug 1, 2023

giorio94 added 7 commits August 1, 2023 14:48

clustermesh status: consistent output order

004d231

Let's sort the cluster names and the possible errors to enforce a consistent output. Additionally, let's make it more explicit when the local cluster is not connected to any remote clusters. Signed-off-by: Marco Iorio <[email protected]>

giorio94 force-pushed the mio/clustermesh-status branch from d9904f1 to e695828 Compare August 1, 2023 12:50

giorio94 temporarily deployed to ci August 1, 2023 12:50 — with GitHub Actions Inactive

giorio94 added the ready-to-merge (This PR has passed all tests and received consensus from code owners to merge) label and removed the dont-merge/bad-bot (To prevent MLH from marking ready-to-merge) label Aug 1, 2023

michi-covalent merged commit 524d834 into cilium:main Aug 1, 2023
19 checks passed

giorio94 mentioned this pull request Aug 2, 2023

CI: ClusterMesh (ci-multicluster) connectivity test failed: timeout reached waiting lookup for localhost from pod cilium-test/client2 cilium/cilium#26513

Closed
