
Potential race condition in connect with --proxy-via parameter #3781

Open
kbatalin opened this issue Jan 28, 2025 · 5 comments · Fixed by #3786
@kbatalin

Describe the bug
A potential race condition occurs when connecting to a cluster with the --proxy-via flag. The issue is reproducible on clusters with more than one workload.

Symptoms:
When running the command telepresence connect --proxy-via <cidr>=<workload>, the following error appears:

telepresence connect: error: connector.Connect: failed to connect to root daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Logs from the root daemon:

2025-01-28 11:29:43.1597 info    daemon/session : Configuration reloaded (from pkg/client/config_util.go:22)
2025-01-28 11:29:43.1600 debug   daemon/session : Returning session from new session session_id:"1a5262df-41e3-44e8-bccf-d61938e34697" cluster_id:"57b93813-707d-421b-ba73-2f760de37dcb" install_id:"f1a7bc35-edab-440f-99db-2c5b99b616e3" (from pkg/client/rootd/service.go:367)
2025-01-28 11:29:43.2070 debug   daemon/session : ProxyVIA using subnet 246.246.0.0/16 (from pkg/client/rootd/session.go:1162)
2025-01-28 11:29:43.2073 debug   daemon/session/agentPods : WatchAgentPods starting (from pkg/client/agentpf/clients.go:315)
2025-01-28 11:29:43.2111 debug   daemon/session : Local translation subnets: [172.xxx.xxx.xxx/xx] (from pkg/client/rootd/session.go:1227)
2025-01-28 11:29:43.2112 debug   daemon/session : Ensuring proxy-via agent in workload1 (from pkg/client/rootd/session.go:1170)
2025-01-28 11:29:43.2405 debug   daemon/session/agentPods : Using WebSocket based port-forward to pod workload2 (from pkg/client/portforward/streamconn.go:131)
2025-01-28 11:29:43.5740 info    daemon/session/agentPods : Connected to OSS Traffic Agent v2.21.2 (from pkg/client/k8sclient/connect.go:84)
...
2025-01-28 11:29:44.2944 debug   daemon/session : Waiting for proxy-via agent in workload1 (from pkg/client/rootd/session.go:1248)
2025-01-28 11:29:44.2951 info    daemon/session/dns/Server : ... unrelated logs from dns.Server ...
...
2025-01-28 11:30:44.2961 info    daemon/session : -- Session ended (from pkg/client/rootd/session.go:1027)
2025-01-28 11:30:44.2969 debug   daemon/session/agentPods : WatchAgentPods ending with 1 clients still active (from pkg/client/agentpf/clients.go:317)
2025-01-28 11:30:44.2971 error   daemon/session/agentPods : goroutine "/daemon/session/agentPods" exited with error: rpc error: code = Canceled desc = context canceled
...
2025-01-28 11:30:44.4500 error   daemon/session : proxy-via agent in workload1 failed: context deadline exceeded (from pkg/client/rootd/service.go:389)
2025-01-28 11:30:44.4505 info    daemon/session : Configuration reloaded (from pkg/client/config_util.go:22)

Debugging locally, I found that the issue originates in the agentPods goroutine, which depends on the ProxyVia value from the --proxy-via flag:

if ai.Intercepted || s.isProxyVIA(ai) || s.hasWaiterFor(ai) {
	addClient(k, ai)
}

If this value is not set, the goroutine skips adding the required workload to the clients list and instead falls back to adding an arbitrary workload at the end:

// Ensure that we have at least one client (if at least one agent exists)
if s.clients.Size() == 0 {
	for _, ai := range aim {
		k := ai.PodName + "." + ai.Namespace
		addClient(k, ai)
		break
	}
}
return nil

Meanwhile, the ProxyVia value is assigned in the separate vif goroutine:

s.agentClients.SetProxyVia(wl)

There is no synchronization between these two goroutines. In practice, this causes:

  1. agentPods loads the list of pods, but since ProxyVia is not yet set, the goroutine skips adding the correct workload to clients.
  2. The vif goroutine then searches for the required workload in the clients list (in the WaitForWorkload function) but exits immediately because the workload is not present.
  3. The CLI hangs and eventually times out with the above error.

To Reproduce

  1. Have a cluster with more than one workload.
  2. Run the command telepresence connect --proxy-via cidr=workload.
  3. Observe the error:
telepresence connect: error: connector.Connect: failed to connect to root daemon: rpc error: code = DeadlineExceeded desc = context deadline exceeded

You may need to try multiple times before the error occurs.

Expected behavior
The connection should succeed, with the correct workload being added to the clients list and processed as expected.

Versions (please complete the following information):

  • Output of telepresence version (preferably while telepresence is connected):
OSS Client         : v2.21.2
OSS Root Daemon    : v2.21.2
OSS User Daemon    : v2.21.2
OSS Traffic Manager: v2.21.2
Traffic Agent      : ghcr.io/telepresenceio/tel2:2.21.2
  • Operating system of workstation running telepresence commands
Apple M1 Pro
macOS Sequoia 15.2

Additional context
Proposed solutions:

  1. Modify the WaitForWorkload function to wait for the required workload to appear in the clients list instead of exiting immediately.
  2. Add synchronization between the agentPods and vif goroutines to ensure the ProxyVia value is set before agentPods runs.
@thallgren thallgren self-assigned this Jan 29, 2025
@thallgren
Member

@kbatalin As far as I can tell, WaitForWorkload will indeed wait for the required workload to appear.

  1. It performs a Compute call on the wlWaiters map to retrieve a channel that it can use for the wait.
  2. The Compute function checks if a client exists. If so, no channel is stored (the function returns true); otherwise, a new channel is created and stored (the function returns false).
  3. If a channel already existed, or if a new one was created (the second return value from Compute is true), waitWithTimeout is called.

I'm not saying that there isn't a race. Just pointing out that I think your conclusions are a bit off.

@kbatalin
Author

kbatalin commented Feb 1, 2025

@thallgren

Thanks for the detailed explanation. Indeed, I was mistaken in thinking that WaitForWorkload exits immediately without waiting for the workload. I added more logs locally and troubleshot the issue again. Here's what I found:

Context:

  • All workloads in the cluster already have agent sidecars installed.
  • The command used: telepresence connect --proxy-via 10.123.45.0/20=proxy-via-workload

Observations:

  1. agentPods loads the list of pods, but since ProxyVia is not yet set, the goroutine adds a random workload to the list of clients. Let's call it random-workload. At this point, wlWaiters is empty.

  2. vif checks clients, which now contains only random-workload. Since it can't find the expected workload (proxy-via-workload), it adds a waiting channel to wlWaiters for proxy-via-workload.

State at this moment:

clients: ['random-workload']
wlWaiters: ['proxy-via-workload']
  3. vif calls waitWithTimeout, which invokes notifyWaiters before waiting on waitOn. However, notifyWaiters does not close the waiting channel because clients only contains random-workload. As a result, vif waits on waitOn.

  4. I might be wrong about this step, so please correct me if needed:
    updateClients is only triggered when the workload list is updated, which happens when the agentPods goroutine receives a message from rmc.WatchAgentPods(ctx, s.session). However, since no updates are sent (as the workloads remain unchanged in the cluster), vif will wait until a timeout occurs and then shut down.

I think the connection may succeed in certain scenarios, for example:

  1. vif is scheduled before agentPods and manages to set the ProxyVia property.
  2. agentPods happens to select the correct workload in updateClients, making keys(clients) == keys(wlWaiters).

Sorry for the confusion in my original message.
Does this make sense?

@thallgren
Member

@kbatalin yes, this makes sense. I'll look into improving the logic.

@thallgren
Member

thallgren commented Feb 2, 2025

@kbatalin please try this preview of the coming 2.21.3 release.

@kbatalin
Author

kbatalin commented Feb 4, 2025

@thallgren

I tried to connect multiple times with different workloads, and it looks good. All attempts were successful, and in the logs I can see:

debug   daemon/session : Waiting for proxy-via agent in XXX
debug   daemon/session : Wait succeeded for proxy-via agent in XXX

Thanks!
