bug: lightPush is not able to keep node connections #1966
Comments
I did some investigating of this issue. I replicated it locally and noticed some patterns that lead to connections being closed or peers being deleted from the peer store:
This one is the most common:
Eventually I reached a state where there was a single peer left that I can send a message to using light push. The peer is NOT in the The connection gets reset and the peer is removed....
but the same peer quickly reconnects:
and I can keep sending messages. It's been a little while and this last peer still hasn't been lost. I'm curious if the peer being absent from I will leave the light-js example running overnight and check on it again. Let me know if any of these logs are helpful @danisharora099 @weboko
I modified connection manager such that it doesn't delete the dial error after a peer is removed. I waited to reach the state where there were no more peers to send a light push, and found that every single dial error was the same:
This seems odd. I understand it should be the case for some of the peers, but for 133 of them to result in this error seems excessive.
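A minimal sketch of the kind of tally described above: retain every dial failure and count peers per distinct error message. The `DialFailure` shape, the function name, and the sample messages are hypothetical and are not taken from the js-waku connection manager.

```typescript
// Hypothetical record: one entry per failed dial, kept even after the peer is removed.
interface DialFailure {
  peerId: string;
  message: string;
}

// Count how many peers failed with each distinct error message.
function tallyDialErrors(failures: DialFailure[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const failure of failures) {
    counts[failure.message] = (counts[failure.message] ?? 0) + 1;
  }
  return counts;
}

// Made-up peers and messages, for illustration only.
const exampleFailures: DialFailure[] = [
  { peerId: "peer-a", message: "example error 1" },
  { peerId: "peer-b", message: "example error 1" },
  { peerId: "peer-c", message: "example error 2" },
];
const exampleTally = tallyDialErrors(exampleFailures);
```

If a single key in the tally accounts for nearly all failures, as with the 133 peers above, the cause is more likely systematic than peer-specific.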
It seems that most peers are not running wss, in which case this is expected. I have a PR open for this but I want to do some additional testing before merging: #1983
Something else I discovered is that for some reason the peer store "loses" the metadata for a peer, specifically the shard info. Without this, the function called to get peers for light push filters out all of the peers because it thinks they don't support autosharding.
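To illustrate why lost shard info can empty the light push peer list, here is a rough sketch of a shard-aware peer filter. The `PeerRecord` type, its field names, and `selectLightPushPeers` are assumptions made for this example and do not reproduce js-waku's actual peer selection code.

```typescript
// Hypothetical peer-store record; field names are illustrative, not js-waku's types.
interface PeerRecord {
  id: string;
  protocols: string[];
  // Shard info kept in peer metadata; undefined once the metadata has been lost.
  shardInfo?: { clusterId: number; shards: number[] };
}

// Protocol ID string used here for illustration.
const LIGHT_PUSH_CODEC = "/vac/waku/lightpush/2.0.0-beta1";

function selectLightPushPeers(
  peers: PeerRecord[],
  clusterId: number,
  shard: number
): PeerRecord[] {
  return peers.filter((peer) => {
    if (peer.protocols.indexOf(LIGHT_PUSH_CODEC) === -1) return false;
    // A peer without shard info cannot be matched to a shard, so it is dropped.
    if (peer.shardInfo === undefined) return false;
    return (
      peer.shardInfo.clusterId === clusterId &&
      peer.shardInfo.shards.indexOf(shard) !== -1
    );
  });
}

// Two connected light push peers; the second has lost its metadata.
const examplePeers: PeerRecord[] = [
  { id: "peer-a", protocols: [LIGHT_PUSH_CODEC], shardInfo: { clusterId: 1, shards: [0, 1] } },
  { id: "peer-b", protocols: [LIGHT_PUSH_CODEC] },
];
const usablePeers = selectLightPushPeers(examplePeers, 1, 0);
```

In this sketch `peer-b` is still connected and still speaks light push, yet it is excluded purely because its shard info is missing. If every peer ends up in that state, the filter returns an empty list.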
I believe I have found the culprit, but like the PR above it probably needs more testing: #1984
I believe I identified the prime suspect causing this issue (and dropping connections in general):
It seems what happens is:
A mitigation for this that seems to work is increasing the maximum number of incoming connections for the ping protocol to 10 from the default of 2. I tried this with the light-js example and it's been running for 40 minutes without dropping below 9 peers, sending a light push message every 10 seconds. I'm not sure what the performance implications of increasing this number are, but it seems preferable to the node losing peer connections and not recovering them.
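The effect of raising the limit from 2 to 10 can be pictured with a toy model of a per-protocol cap on concurrent inbound streams. This is only a sketch: `InboundStreamLimiter` and `countRejected` are hypothetical names, and the assumption that streams beyond the cap are simply rejected is a simplification, not a description of libp2p's internals.

```typescript
// Toy model of a per-protocol cap on concurrent inbound streams (not libp2p code).
class InboundStreamLimiter {
  private openStreams = 0;
  private readonly maxInboundStreams: number;

  constructor(maxInboundStreams: number) {
    this.maxInboundStreams = maxInboundStreams;
  }

  // Returns true if the stream is accepted, false if the cap is already reached.
  tryOpen(): boolean {
    if (this.openStreams >= this.maxInboundStreams) return false;
    this.openStreams += 1;
    return true;
  }

  // Releases one slot when a stream finishes.
  close(): void {
    if (this.openStreams > 0) this.openStreams -= 1;
  }
}

// Count how many of `concurrentPings` simultaneous inbound pings are rejected.
function countRejected(maxInboundStreams: number, concurrentPings: number): number {
  const limiter = new InboundStreamLimiter(maxInboundStreams);
  let rejected = 0;
  for (let i = 0; i < concurrentPings; i++) {
    if (!limiter.tryOpen()) rejected += 1;
  }
  return rejected;
}

// Nine peers pinging at the same moment, first with a cap of 2, then with a cap of 10.
const rejectedWithDefault = countRejected(2, 9);
const rejectedWithRaised = countRejected(10, 9);
```

Under this model, nine simultaneous pings against a cap of 2 leave seven rejected, while a cap of 10 accepts all of them. The trade-off is that a higher cap lets each node hold more concurrent ping streams open.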
The most important fixes for this have been included in the latest release of js-waku. The remaining PR for filtering wss connections is not as relevant. @weboko can this be considered resolved?
Agree that this task is complete. If we observe any further issues, let's address them in the scope of another bug or task.
bug report
Problem
Similarly to my issues with Filter (#1923), when I keep a light-js example running for ~15 minutes with clean waku:peers localStorage, LightPush "breaks": I get an error message when attempting to send.
In debug logs:
After reload it took less than 5 minutes for the light node to lose all the lightpush peers. It started with 3.
3 minutes later there was only 1 peer
And a minute later there were no peers.
It does not seem to recover from this state
Proposed Solutions
Notes