Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the clustermesh status command #1834

Merged
merged 9 commits into from
Aug 1, 2023

Conversation

giorio94
Copy link
Member

@giorio94 giorio94 commented Jul 13, 2023

This PR includes a set of improvements for the cilium clustermesh status command, to simplify troubleshooting and reduce CI flakiness:

  • Always check the readiness of the clustermesh-apiserver deployment.
  • Report a node as not ready also when the configuration has not yet been detected (as the secret modification has not propagated yet).
  • Report explicit errors about why a node is not ready, based on the additional information introduced in Extend clustermesh status reporting with remote configuration and synchronization information  cilium#26788 (e.g., etcd connection, cluster configuration retrieval, data synchronization).
  • Remove duplicate checks and fix inaccuracies in output messages.

Please refer to the individual commit messages for additional information.

Fixes: #1773

I'm currently marking this PR as ready for review as it makes sense that the review occurs in parallel with that of cilium/cilium#26788, so that possible modifications can be synchronized. The commit with the go.mod changes will be dropped before merging.

@giorio94 giorio94 added dont-merge/blocked Another PR must be merged before this one. kind/enhancement This would improve or streamline existing functionality. area/clustermesh labels Jul 13, 2023
@giorio94 giorio94 requested review from a team as code owners July 13, 2023 14:50
@giorio94 giorio94 temporarily deployed to ci July 13, 2023 14:50 — with GitHub Actions Inactive
@giorio94 giorio94 marked this pull request as draft July 13, 2023 15:31
@giorio94 giorio94 temporarily deployed to ci July 13, 2023 16:32 — with GitHub Actions Inactive
@giorio94 giorio94 marked this pull request as ready for review July 13, 2023 16:34
@giorio94 giorio94 temporarily deployed to ci July 14, 2023 06:28 — with GitHub Actions Inactive
rolinh
rolinh previously requested changes Jul 14, 2023
Copy link
Member

@rolinh rolinh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you forgot to remove the commit aptly named "TO DROP: use forked version of cilium"

@giorio94
Copy link
Member Author

It looks like you forgot to remove the commit aptly named "TO DROP: use forked version of cilium"

Yeah, the commit is still there because otherwise the CLI doesn't compile. This is the reason for the dont-merge/blocked label being set, as it depends on cilium/cilium#26788. At the same time, I wanted to mark this one ready for review so that they can be reviewed in parallel, and possible changes can be propagated if necessary.

Copy link
Member

@asauber asauber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're now checking the health of the Deployment, but it may be the case that the Service is separately not able to receive traffic (e.g. an LB Service has not been assigned an IP). Is the idea here that if nodes have synced then the Service must be reachable?

@giorio94
Copy link
Member Author

We're now checking the health of the Deployment, but it may be the case that the Service is separately not able to receive traffic (e.g. an LB Service has not been assigned an IP). Is the idea here that if nodes have synced then the Service must be reachable?

There's actually a service readiness check, but it is hidden inside the statusAccessInformation function (which in turn calls extractAccessInformation). The inner function returns an error (thus triggering the retry mechanism) if no service IP is found. This is also the reason for the removal of the explicit service readiness check in 218fc9a, as it was already tested earlier.

@asauber asauber self-requested a review July 17, 2023 15:26
@giorio94 giorio94 temporarily deployed to ci July 31, 2023 07:54 — with GitHub Actions Inactive
@giorio94 giorio94 removed the dont-merge/blocked Another PR must be merged before this one. label Jul 31, 2023
@giorio94 giorio94 dismissed rolinh’s stale review July 31, 2023 07:56

The temporary commit has been dropped, and there are no more vendoring changes.

Currently, the wait observer prints the first log message only after
that the warning interval has expired. Yet, this can be misleading,
because the command seems to hang, and it is not clear which operation
is being performed.

Signed-off-by: Marco Iorio <[email protected]>
This parameter was a no-op, hence let's just remove it.

Signed-off-by: Marco Iorio <[email protected]>
@giorio94 giorio94 temporarily deployed to ci July 31, 2023 14:51 — with GitHub Actions Inactive
Copy link
Member

@YutaroHayakawa YutaroHayakawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one question

ServiceIPs: []string{},
Tunnel: cm.Data[configNameTunnel],
}

switch {
case len(endpoints) > 0:
svc.Spec.Type = "Manual"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this "Manual"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR; I've dropped it with the last push. The idea was to propagate the type of the service to be printed by the cilium clustermesh status command, and to set it to Manual in case the endpoints were specified manually (there was a typo though, and it should have read ai.ServiceType = "Manual"). Yet, the cilium clustermesh status command does not support specifying the endpoints manually, and I agree that having a custom value might be misleading (I've also now changed that field to be of type corev1.ServiceType to make its content more explicit). Hence, the special case is actually not needed.

@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Aug 1, 2023
@giorio94 giorio94 added dont-merge/bad-bot To prevent MLH from marking ready-to-merge. and removed ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Aug 1, 2023
The retrieval of the access information already retrieves the
clustermesh-apiserver service and verifies that an IP can be
inferred. Hence, let's drop the subsequent service check, which
is redundant.

Signed-off-by: Marco Iorio <[email protected]>
Currently, the status of the clustermesh-apiserver deployment is
checked only when the --wait flag is specified. Let's always enable
it, and uniform the wait style to that adopted for the other checks.

Signed-off-by: Marco Iorio <[email protected]>
The wait logic implemented by the ClusterMeshConnectivity function was
never enabled, as it is already handled by an outer wrapper. Hence,
let's just remove it for the sake of simplicity.

Signed-off-by: Marco Iorio <[email protected]>
The message reported when waiting for clusters to be connected is
currently not correct, as the given number refers to nodes and not
clusters. Additionally, the two conditions are equivalent, as an
error is always present when a node is not ready. Let's also get
rid of a redundant error wrapping.

Signed-off-by: Marco Iorio <[email protected]>
Let's sort the cluster names and the possible errors to
enforce a consistent output. Additionally, let's make it
more explicit when the local cluster is not connected to
any remote clusters.

Signed-off-by: Marco Iorio <[email protected]>
Currently, the number of ready clusters is determined from the status
reported by the agents only. Yet, given that the configuration is
mounted from a secret, it might take some time before it propagates,
in case agents are not restarted at the same time (that is when host
aliases don't need to be configured). Hence, let's infer the number
of expected clusters from the secret, so that we can actually report
the correct status and wait until they are all effectively ready.

Signed-off-by: Marco Iorio <[email protected]>
Let's leverage the additional status flags introduced in
cilium/cilium#26788 to provide more detailed information
concerning why a node cannot connect to a remote cluster,
and simplify troubleshooting. In particular, it might be
due to an etcd connection problem, to the cluster config
being required but not found, or to the synchronization
process being still in progress.

Signed-off-by: Marco Iorio <[email protected]>
@giorio94 giorio94 temporarily deployed to ci August 1, 2023 12:50 — with GitHub Actions Inactive
@giorio94 giorio94 added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. and removed dont-merge/bad-bot To prevent MLH from marking ready-to-merge. labels Aug 1, 2023
@michi-covalent michi-covalent merged commit 524d834 into cilium:main Aug 1, 2023
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/clustermesh kind/enhancement This would improve or streamline existing functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cilium clustermesh status --wait does not wait for all clusters to be connected
5 participants