
dataplane proxies do not get mTLS identity cert and other xDS resources updated #12885

Open
jijiechen opened this issue Feb 19, 2025 · 2 comments · May be fixed by #12886

Kuma Version

2.9.3

Describe the bug

Some of the dataplanes in the mesh were found to be using expired mTLS certificates, so connections to/from these dataplane proxies could not be established.

In an mTLS-enabled mesh, the errors seen in the logs of other DPPs include:

[2025-02-12 00:31:57.150][35][debug][connection] [source/common/tls/ssl_socket.cc:280] [Tags: "ConnectionId":"78007"] remote address:172.23.2.43:45846,TLS_error:|268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED:TLS_error_end

To Reproduce

  1. Install Kuma service mesh onto a Kubernetes cluster, ideally version 1.30 or above
  2. Enable mTLS for the default mesh
  3. Install the kuma-demo application (and optionally other apps) with sidecar injection enabled in those app namespaces
  4. Keep triggering traffic among the applications
  5. Wait for a few hours/days and check sidecar logs of the DPs

Expected behavior

Traffic establishes successfully

Additional context (optional)

No response

@jijiechen jijiechen added kind/bug A bug triage/pending This issue will be looked at on the next triage meeting labels Feb 19, 2025

jijiechen commented Feb 19, 2025

From the logs provided by the user, I can see that at 2025-02-09T02:59:31.898Z the CP was sending stale envoy.extensions.transport_sockets.tls.v3.Secret resources (version 565c5411-1595-4b1f-911e-ab75cf750d03) to the DP after it reconnected.

The secret resources should have been cleaned up once the DP disconnected, but they weren't.

  • DP disconnected at 2025-02-09T02:59:29.095Z
  • It reconnected at 2025-02-09T02:59:31.796Z

The expected order is:

  • DP disconnect --> CP cleanup (watchers removed) --> DP reconnect --> watchers created --> CP regenerates secrets

While the actual order can be:

  • DP disconnect --> DP reconnect --> watchers created --> CP cleanup (watchers removed) --> CP regenerates secrets

With the second ordering, the DP never receives the updated identity cert (or any other xDS updates).

This can happen when the OnStop invocation is delayed for some reason: it runs in a goroutine, so it can be scheduled later than expected.

ctx, cancel := context.WithCancel(context.Background())
t.watchdogs[dpKey] = func() {
	dataplaneSyncTrackerLog.V(1).Info("stopping Watchdog for a Dataplane", "dpKey", dpKey, "streamID", streamID)
	cancel()
}
dataplaneSyncTrackerLog.V(1).Info("starting Watchdog for a Dataplane", "dpKey", dpKey, "streamID", streamID)
//nolint:contextcheck // it's not clear how the parent go-control-plane context lives
go t.newDataplaneWatchdog(dpKey).Start(ctx)

@jijiechen

I've seen a 30ms latency in the cleanup of the dataplane connection.

Here are the callback events:

OnStreamReq A
OnProxyConnected B
OnStreamClosed C
OnProxyDisconnected D

A and C are mutually exclusive, as are B and D. Execution of A and C conditions execution of B and D, so the assumption is that B runs directly after A, and D directly after C.

What actually happens is:

C
A --> (bad: D should run directly after C, not after A!)
B
D --> no more reconciliation running

