
bug: cf worker shouldn't disconnect peers #15

Open
ThaUnknown opened this issue Aug 13, 2022 · 9 comments

ThaUnknown (Contributor) commented Aug 13, 2022

The CF worker really shouldn't disconnect peers; this should happen between the peers themselves.

The peer list often takes time to propagate through Cloudflare, or sometimes doesn't propagate at all, leading to peers randomly disconnecting from each other.

gfodor (Owner) commented Aug 14, 2022

Hmm - the peer list is stored in R2, which has global strong consistency (read-after-write) except when there is a network partition. At least, that's what they say, so there shouldn't be any latency. If a peer that was once in the list is no longer there, then it timed out or gracefully disconnected. If the timeout is too short, you can adjust it in the constructor.
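
A minimal sketch of adjusting that timeout from the constructor; the `peerTimeoutMs` option name and value here are assumptions for illustration, not necessarily the actual option p2pcf exposes:

```js
// Hedged sketch: raise how long the signalling layer waits before treating an
// absent peer as timed out. `peerTimeoutMs` is an assumed option name - check
// the p2pcf constructor for the real one.
import P2PCF from 'p2pcf'

const p2pcf = new P2PCF('my-client-id', 'my-room', {
  peerTimeoutMs: 30000 // assumed option: 30s before an absent peer is timed out
})

p2pcf.start()
```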

ThaUnknown (Contributor, Author) commented

I can assure you it's not consistent. Anyway, this should probably work like a BitTorrent tracker, which announces peers but doesn't dictate them.

gfodor (Owner) commented Aug 14, 2022

Can you describe exactly what is happening? The removal case only happens when a peer is in the list and then is no longer in the list; it's not going to do that if there was latency in the peer arriving in the list.

gfodor (Owner) commented Aug 14, 2022

Are you seeing behavior inconsistent with what Cloudflare writes here?

[screenshot of Cloudflare's documentation describing R2's strong read-after-write consistency]

I'm not opposed to changing this, but I don't want to change it for the wrong reason.

gfodor (Owner) commented Aug 14, 2022

I think a better change is to keep the peer if it is connected. The scenario this is trying to cover is where you've joined a room and there are stale peers left over from previous sessions that never disconnected gracefully (so they aren't actually going to work, but their RTCPeerConnections are created anyway), and you want to remove those unneeded RTCPeerConnections after the signalling server times them out.
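
A rough sketch of that behavior in illustrative code (not the actual p2pcf internals): when a peer drops out of the worker's list, only tear down its RTCPeerConnection if it never actually connected.

```js
// Illustrative sketch only, not p2pcf's real internals: clean up stale
// RTCPeerConnections for peers the signalling server timed out, but leave
// live connections alone so the peers themselves decide when to disconnect.
function handlePeerRemovedFromList (peerId, peerConnections /* Map<string, RTCPeerConnection> */) {
  const pc = peerConnections.get(peerId)
  if (!pc) return

  const isConnected =
    pc.connectionState === 'connected' || pc.iceConnectionState === 'connected'

  if (!isConnected) {
    pc.close() // never completed signalling (e.g. timed out) - safe to drop
    peerConnections.delete(peerId)
  }
}
```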

ThaUnknown (Contributor, Author) commented

> Can you describe exactly what is happening? The removal case only happens when a peer is in the list and then is no longer in the list; it's not going to do that if there was latency in the peer arriving in the list.

The worker returns an inconsistent set of peers compared to those expected to be in the lobby. This leads to peers connecting to each other and then disconnecting on the very next refresh, or even before they can finish connecting. The worker really shouldn't dictate which peers you disconnect from.

gfodor (Owner) commented Aug 16, 2022

Yep - if you update this PR to only remove peers that are not currently connected, I'll merge it. I want to keep this case for the other scenario I mentioned: you join a room where there are still peers registered who didn't gracefully disconnect their sessions, and you need to clean up the peer connections you created for them.

gfodor (Owner) commented Aug 16, 2022

That said, if you are seeing peers disappear from the worker's list after having been in it, this sounds like a much deeper problem that needs to be understood on its own: either R2 is not living up to its guarantees (unlikely) or there are race conditions I need to deal with (more likely).

gfodor (Owner) commented Aug 16, 2022

Oh, now that you mention it, I'm re-reading the code, and if a bunch of peers join rapidly there can be some shuffling (despite R2's guarantees), since they compete for slots in the list. The idea was that the state would eventually converge on subsequent polls. I never actually saw this happen in testing, but what you're describing sounds possible with the current implementation, so I think I can cross off the concern that I don't understand what's going on.
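
A minimal sketch of the read-modify-write race described above; the storage helpers are stand-ins, not the worker's actual API:

```js
// Stand-in for the room state kept in R2 (read-after-write consistent).
let storedList = ['existing-peer']
const readList = () => [...storedList]
const writeList = (list) => { storedList = list }

// Peers A and B both read the list before either has written back:
const listSeenByA = readList()
const listSeenByB = readList()

// Each appends only itself and writes the whole list back:
writeList([...listSeenByA, 'A'])
writeList([...listSeenByB, 'B']) // last write wins - 'A' is lost

console.log(storedList) // ['existing-peer', 'B']
// A client that saw 'A' on an earlier poll now sees it missing and, with the
// current behavior, disconnects from 'A' even though 'A' is still alive and
// will reappear on a subsequent poll.
```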

In any case, update your PR and it will fix this problem.
