
Potential race condition in connect with --proxy-via parameter #3781

Open
kbatalin opened this issue Jan 28, 2025 · 5 comments · Fixed by #3786
@kbatalin

Describe the bug
A potential race condition occurs when connecting to a cluster with the --proxy-via flag. The issue is reproducible on clusters with more than one workload.

Symptoms:
When running the command telepresence connect --proxy-via <cidr>=<workload>, the following error appears:

telepresence connect: error: connector.Connect: failed to connect to root daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Logs from the root daemon:

2025-01-28 11:29:43.1597 info    daemon/session : Configuration reloaded (from pkg/client/config_util.go:22)
2025-01-28 11:29:43.1600 debug   daemon/session : Returning session from new session session_id:"1a5262df-41e3-44e8-bccf-d61938e34697" cluster_id:"57b93813-707d-421b-ba73-2f760de37dcb" install_id:"f1a7bc35-edab-440f-99db-2c5b99b616e3" (from pkg/client/rootd/service.go:367)
2025-01-28 11:29:43.2070 debug   daemon/session : ProxyVIA using subnet 246.246.0.0/16 (from pkg/client/rootd/session.go:1162)
2025-01-28 11:29:43.2073 debug   daemon/session/agentPods : WatchAgentPods starting (from pkg/client/agentpf/clients.go:315)
2025-01-28 11:29:43.2111 debug   daemon/session : Local translation subnets: [172.xxx.xxx.xxx/xx] (from pkg/client/rootd/session.go:1227)
2025-01-28 11:29:43.2112 debug   daemon/session : Ensuring proxy-via agent in workload1 (from pkg/client/rootd/session.go:1170)
2025-01-28 11:29:43.2405 debug   daemon/session/agentPods : Using WebSocket based port-forward to pod workload2 (from pkg/client/portforward/streamconn.go:131)
2025-01-28 11:29:43.5740 info    daemon/session/agentPods : Connected to OSS Traffic Agent v2.21.2 (from pkg/client/k8sclient/connect.go:84)
...
2025-01-28 11:29:44.2944 debug   daemon/session : Waiting for proxy-via agent in workload1 (from pkg/client/rootd/session.go:1248)
2025-01-28 11:29:44.2951 info    daemon/session/dns/Server : ... unrelated logs from dns.Server ...
...
2025-01-28 11:30:44.2961 info    daemon/session : -- Session ended (from pkg/client/rootd/session.go:1027)
2025-01-28 11:30:44.2969 debug   daemon/session/agentPods : WatchAgentPods ending with 1 clients still active (from pkg/client/agentpf/clients.go:317)
2025-01-28 11:30:44.2971 error   daemon/session/agentPods : goroutine "/daemon/session/agentPods" exited with error: rpc error: code = Canceled desc = context canceled
...
2025-01-28 11:30:44.4500 error   daemon/session : proxy-via agent in workload1 failed: context deadline exceeded (from pkg/client/rootd/service.go:389)
2025-01-28 11:30:44.4505 info    daemon/session : Configuration reloaded (from pkg/client/config_util.go:22)

Debugging locally, I found that the issue originates in the agentPods goroutine, which depends on the ProxyVia value from the --proxy-via flag:

if ai.Intercepted || s.isProxyVIA(ai) || s.hasWaiterFor(ai) {
	addClient(k, ai)
}

If this value is not set, the goroutine skips adding the required workload to the clients list and instead falls back to adding an arbitrary workload at the end:

// Ensure that we have at least one client (if at least one agent exists)
if s.clients.Size() == 0 {
	for _, ai := range aim {
		k := ai.PodName + "." + ai.Namespace
		addClient(k, ai)
		break
	}
}
return nil

Meanwhile, the ProxyVia value is assigned in the separate vif goroutine:

s.agentClients.SetProxyVia(wl)

There is no synchronization between these two goroutines. In practice, this causes:

  1. agentPods loads the list of pods, but since ProxyVia is not yet set, the goroutine skips adding the correct workload to clients.
  2. The vif goroutine then searches for the required workload in the clients list (in the WaitForWorkload function) but exits immediately because the workload is not present.
  3. The CLI hangs and eventually times out with the above error.

To Reproduce

  1. Have a cluster with more than one workload.
  2. Run the command telepresence connect --proxy-via cidr=workload.
  3. Observe the error:
telepresence connect: error: connector.Connect: failed to connect to root daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded

You may need to try multiple times before the error occurs.

Expected behavior
The connection should succeed, with the correct workload being added to the clients list and processed as expected.

Versions (please complete the following information):

  • Output of telepresence version (preferably while telepresence is connected):
OSS Client         : v2.21.2
OSS Root Daemon    : v2.21.2
OSS User Daemon    : v2.21.2
OSS Traffic Manager: v2.21.2
Traffic Agent      : ghcr.io/telepresenceio/tel2:2.21.2
  • Operating system of workstation running telepresence commands
Apple M1 Pro
macOS Sequoia 15.2

Additional context
Proposed solutions:

  1. Modify the WaitForWorkload function to wait for the required workload to appear in the clients list instead of exiting immediately.
  2. Add synchronization between the agentPods and vif goroutines to ensure the ProxyVia value is set before agentPods runs.
@thallgren thallgren self-assigned this Jan 29, 2025
@thallgren
Member

@kbatalin As far as I can tell, WaitForWorkload will indeed wait for the required workload to appear.

  1. It performs a Compute call on the wlWaiters map to retrieve a channel that it can use for the wait.
  2. The Compute function checks if a client exists. If so, no channel is stored (the function returns true); otherwise, a new channel is created and stored (the function returns false).
  3. If a channel already existed, or if a new one was created (the second return value from Compute is true), waitWithTimeout is called.

I'm not saying that there isn't a race. Just pointing out that I think your conclusions are a bit off.

@kbatalin
Author

kbatalin commented Feb 1, 2025

@thallgren

Thanks for the detailed explanation. Indeed, I was mistaken in thinking that WaitForWorkload exits immediately without waiting for the workload. I added more logs locally and troubleshot the issue again. Here's what I found:

Context:

  • All workloads in the cluster already have agent sidecars installed.
  • The command used: telepresence connect --proxy-via 10.123.45.0/20=proxy-via-workload

Observations:

  1. agentPods loads the list of pods, but since ProxyVia is not yet set, the goroutine adds a random workload to the list of clients. Let's call it random-workload. At this point, wlWaiters is empty.

  2. vif checks clients, which now contains only random-workload. Since it can't find the expected workload (proxy-via-workload), it adds a waiting channel to wlWaiters for proxy-via-workload.

State at this moment:

clients: ['random-workload']
wlWaiters: ['proxy-via-workload']
  3. vif calls waitWithTimeout, which invokes notifyWaiters before waiting on waitOn. However, notifyWaiters does not close the waiting channel because clients only contains random-workload. As a result, vif waits on waitOn.

  4. I might be wrong about this step, so please correct me if needed:
    updateClients is only triggered when the workload list is updated, which happens when the agentPods goroutine receives a message from rmc.WatchAgentPods(ctx, s.session). However, since no updates are sent (as the workloads remain unchanged in the cluster), vif will wait until a timeout occurs and then shut down.

I think the connection may succeed in certain scenarios, for example:

  1. vif is scheduled before agentPods and manages to set the ProxyVia property.
  2. agentPods happens to select the correct workload in updateClients, making keys(clients) == keys(wlWaiters).

Sorry for the confusion in my original message.
Does this make sense?

@thallgren
Member

@kbatalin yes, this makes sense. I'll look into improving the logic.

@thallgren
Member

thallgren commented Feb 2, 2025

@kbatalin please try this preview of the coming 2.21.3 release.

@kbatalin
Author

kbatalin commented Feb 4, 2025

@thallgren

I tried to connect multiple times with different workloads, and it looks good. All attempts were successful, and in the logs I can see:

debug   daemon/session : Waiting for proxy-via agent in XXX
debug   daemon/session : Wait succeeded for proxy-via agent in XXX

Thanks!
