Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: lightPush is not able to keep node connections #1966

Closed
vpavlin opened this issue Apr 19, 2024 · 8 comments
Closed

bug: lightPush is not able to keep node connections #1966

vpavlin opened this issue Apr 19, 2024 · 8 comments
Assignees

Comments

@vpavlin
Copy link
Member

vpavlin commented Apr 19, 2024

bug report

Problem

Similarly to my issues with Filter (#1923), when I keep a light-js example running for ~15 minutes with clean waku:peers localStorage, LightPush "breaks" - I get an error message when attempting to send in debug logs:

waku:warn:light-push No peers found. Ensure you have a connection to the network. +0ms

After reload it took less then 5 minutes to the lightnode to lose all the lightpush peers - started with 3

Screenshot from 2024-04-19 10-52-15

3 minutes later there was only 1 peer
Screenshot from 2024-04-19 10-54-21

And a minute later there were no peers.

It does not seem to recover from this state

Proposed Solutions

  • Ensure that js-waku attempts to reconnect to a peer after losing a connection

Notes

  • js-waku version: 0.0.25-fd60cc2.0
@fryorcraken fryorcraken added this to Waku Apr 19, 2024
@weboko weboko moved this to In Progress in Waku Apr 22, 2024
@weboko weboko moved this from In Progress to Priority in Waku Apr 24, 2024
@weboko weboko changed the title LightPush is not able to keep node connections bug: lightPush is not able to keep node connections Apr 24, 2024
@weboko weboko added the bug Something isn't working label Apr 24, 2024
@weboko weboko removed their assignment Apr 24, 2024
@adklempner adklempner self-assigned this Apr 24, 2024
@adklempner
Copy link
Member

Did some investigating of this issue. Replicated locally and noticed some patterns that lead to connections closed or peers deleted from store:

libp2p:connection:outbound:4zjz071714024970277:error could not create new stream for protocols /ipfs/ping/1.0.0 +2s Error: unexpected end of input

libp2p:ping:error error while pinging 16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE +34s Error: unexpected end of input

waku:error:keep-alive Ping failed for peer (16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE).                Next ping will be attempted in 300 seconds.               +20m

libp2p:gossipsub connection ended 16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE +924ms
common.js:113

libp2p:gossipsub delete peer 16Uiu2HAmQSMNExfUYUqfuXWkD5DaNZnMYnigRxFKbk3tcEFQeQeE +0ms

This one is the most common:

waku:error:connection-manager Error dialing peer 16Uiu2HAmKAeC9VYRWWHwZxpjtPdf3bGeFg6aN2L1YYQmyc3FJRUU - The dial request has no valid addresses +1ms

connection_manager.js:201 waku:info:connection-manager Deleting undialable peer 16Uiu2HAmKAeC9VYRWWHwZxpjtPdf3bGeFg6aN2L1YYQmyc3FJRUU from peer store. Reason: The dial request has no valid addresses +1ms
libp2p:connection:outbound:6qf7pg1714026544431:error error handling incoming stream id r7 +0ms Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2 ERR_TOO_MANY_INBOUND_PROTOCOL_STREAMS CodeError: Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2
    at http://localhost:3000/dist/bundle.js:55349:41
    
libp2p:connection:outbound:6qf7pg1714026544431 closing connection to /dns4/node-01.ac-cn-hongkong-c.waku.test.statusim.net/tcp/8000/wss/p2p/16Uiu2HAkzHaTP5JsUwfR9NR8Rj9HC24puS6ocaU8wze4QrXr9iXp +588ms

Eventually I reached a state where there was a single peer left that I can send a message to using light push. The peer is NOT in the keepAliveManager.

The connection gets reset and the peer is removed....

libp2p:gossipsub:error CodeError: stream reset    at MplexStream.reset (http://localhost:3000/dist/bundle.js:39511:21)    at MplexStreamMuxer._handleIncoming (http://localhost:3000/dist/bundle.js:39919:28)    at MplexStreamMuxer.sink (http://localhost:3000/dist/bundle.js:39826:36)    at async Promise.all (index 0) +10m
common.js:113 libp2p:gossipsub connection ended 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +10m
common.js:113 libp2p:gossipsub delete peer 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +0ms

but the same peer quickly reconnects:

libp2p:mplex initiator stream with id 31 and protocol /vac/waku/relay/2.0.0 ended +2ms
common.js:113 libp2p:mplex new receiver stream 8 +188ms
common.js:113 libp2p:connection:outbound:8k5axg1714026544411 incoming stream opened on /vac/waku/relay/2.0.0 +3m
common.js:113 libp2p:gossipsub new peer 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +191ms
common.js:113 libp2p:gossipsub Added peer has no IP in current address 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV /dns4/node-01.ac-cn-hongkong-c.waku.sandbox.status.im/tcp/8000/wss/p2p/16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +0ms
common.js:113 libp2p:gossipsub create inbound stream 16Uiu2HAmSJvSJphxRdbnigUV5bjRRZFBhTtWFTSyiKaQByCjwmpV +0ms
common.js:113 libp2p:connection:outbound:8k5axg1714026544411 starting new stream for protocols /vac/waku/relay/2.0.0 +1ms

and I can keep sending messages. It's been a little while and this last peer still hasn't been lost. I'm curious if the peer being absent from waku.connectionManager.keepAliveManager has something to do with it.

I while leave the light-js example running overnight and check on it again. Let me know if any of these logs are helpful @danisharora099 @weboko

@adklempner
Copy link
Member

I modified connection manager such that it doesn't delete the dial error after a peer is removed. I waited to reach the state where there were no more peers to send a light push, and found that every single dial error was the same:

[
    "The dial request has no valid addresses",
    "The dial request has no valid addresses",
    ...
    "The dial request has no valid addresses",
    "The dial request has no valid addresses"
]

This seems odd. I understand it should be the case for some of the peers, but for 133 of them to result in this error seems excessive.

@adklempner
Copy link
Member

I modified connection manager such that it doesn't delete the dial error after a peer is removed. I waited to reach the state where there were no more peers to send a light push, and found that every single dial error was the same:

[
    "The dial request has no valid addresses",
    "The dial request has no valid addresses",
    ...
    "The dial request has no valid addresses",
    "The dial request has no valid addresses"
]

This seems odd. I understand it should be the case for some of the peers, but for 133 of them to result in this error seems excessive.

It seems that most peers are not running wss, in which case this is expected. I have a PR open for this but I want to do some additional testing before merging: #1983

Something else I discovered is that for some reason the peer store "loses" the metadata for a peer, specifically the shard info. Without this, the function called to getPeers for light push filters out all of the peers because it thinks they don't support autosharding

@adklempner
Copy link
Member

Something else I discovered is that for some reason the peer store "loses" the metadata for a peer, specifically the shard info. Without this, the function called to getPeers for light push filters out all of the peers because it thinks they don't support autosharding

I believe I have found the culprit, but like the PR above it probably needs more testing #1984

@adklempner adklempner moved this from Priority to In Progress in Waku Apr 26, 2024
@adklempner
Copy link
Member

I believe I identified the prime suspect causing this issue (and dropping connections in general):

libp2p:connection:outbound:6qf7pg1714026544431:error error handling incoming stream id r7 +0ms Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2 ERR_TOO_MANY_INBOUND_PROTOCOL_STREAMS CodeError: Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2
    at http://localhost:3000/dist/bundle.js:55349:41
    
libp2p:connection:outbound:6qf7pg1714026544431 closing connection to /dns4/node-01.ac-cn-hongkong-c.waku.test.statusim.net/tcp/8000/wss/p2p/16Uiu2HAkzHaTP5JsUwfR9NR8Rj9HC24puS6ocaU8wze4QrXr9iXp +588ms

It seems what happens is:

  1. peer tries to ping js-waku
  2. there are already maximum amount of inbound streams opened from the peer to the node for the ping protocol
  3. libp2p closes the entire connection?

@adklempner
Copy link
Member

I believe I identified the prime suspect causing this issue (and dropping connections in general):

libp2p:connection:outbound:6qf7pg1714026544431:error error handling incoming stream id r7 +0ms Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2 ERR_TOO_MANY_INBOUND_PROTOCOL_STREAMS CodeError: Too many inbound protocol streams for protocol "/ipfs/ping/1.0.0" - limit 2
    at http://localhost:3000/dist/bundle.js:55349:41
    
libp2p:connection:outbound:6qf7pg1714026544431 closing connection to /dns4/node-01.ac-cn-hongkong-c.waku.test.statusim.net/tcp/8000/wss/p2p/16Uiu2HAkzHaTP5JsUwfR9NR8Rj9HC24puS6ocaU8wze4QrXr9iXp +588ms

It seems what happens is:

  1. peer tries to ping js-waku
  2. there are already maximum amount of inbound streams opened from the peer to the node for the ping protocol
  3. libp2p closes the entire connection?

A mitigation for this that seems to work is increasing the maximum number of incoming connections for the ping protocol to 10 from the default of 2. I tried this with the light-js example and it's been running for 40 minutes without dropping below 9 peers, sending a light push message every 10 seconds. I'm not sure what are the performance implications of increasing this number but it seems preferable to the node losing peer connections and not recovering them.

@adklempner
Copy link
Member

The most important fixes for this have been included in the latest release of js-waku. The remaining PR for filtering wss connections is not as relevant. @weboko can this be considered resolved?

@weboko
Copy link
Collaborator

weboko commented May 14, 2024

Agree that this task is complete. If we observe any issues - let's action in the scope of some other bug / task.
Thank you for documenting investigation well!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

No branches or pull requests

4 participants