Lotus sync issue: libp2p 0.31.1 to 0.33.2 regression #2764
My first guess, given (2), is libp2p/specs#573 (comment). This is unconfirmed, but high on my list.
My second guess is #2650. This wouldn't be the fault of libp2p, but TLS may be more impacted by the GFW? That seems unlikely...
My third guess is something related to QUIC changes.
Have you been able to repro 2 or 3 locally?
I can't repro this at the moment, unfortunately (not at home, node down). But I'll do some more digging later this week.
Ok, I got one confirmation that disabling reuseport seems to fix the issue, and one report that it makes no difference.
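For anyone who wants to try this experiment themselves, here is a minimal sketch of building a host with port reuse disabled on the TCP transport. This assumes a recent go-libp2p; the option names come from the `tcp` transport package and should be double-checked against the version you are running.

```go
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p"
	tcp "github.com/libp2p/go-libp2p/p2p/transport/tcp"
)

func main() {
	// Extra arguments after the transport constructor are forwarded to it,
	// so tcp.DisableReuseport() turns off SO_REUSEPORT for this host's
	// TCP transport only.
	h, err := libp2p.New(
		libp2p.Transport(tcp.NewTCPTransport, tcp.DisableReuseport()),
	)
	if err != nil {
		panic(err)
	}
	defer h.Close()
	fmt.Println("host up:", h.ID())
}
```

If rebuilding isn't practical, go-libp2p's TCP transport has historically also honored the `LIBP2P_TCP_REUSEPORT=false` environment variable, which may be easier to hand to users running production nodes.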
Ok, that confirmation appeared to be a fluke. This doesn't appear to have been the issue.
From eyeballing the commits, I can see that the major changes apart from WebRTC are
Can we test this with a QUIC-only node and a TCP-only node to see whether it's a problem with QUIC or TCP?
I'll try. Unfortunately, the issue is hard to reproduce and tends to happen in production (hard to get people to run random patches). Right now we're waiting on goroutine dumps hoping to get a bit of an idea about what might be stuck (e.g., may not be libp2p).
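For reference, a transport-isolated test node could be sketched like this. The listen addresses are illustrative placeholders; the transport constructors come from go-libp2p's `quic` and `tcp` packages and should be verified against the release under test.

```go
package main

import (
	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	quic "github.com/libp2p/go-libp2p/p2p/transport/quic"
	tcp "github.com/libp2p/go-libp2p/p2p/transport/tcp"
)

// newQUICOnlyHost builds a host that can only dial and listen over QUIC.
func newQUICOnlyHost() (host.Host, error) {
	return libp2p.New(
		libp2p.Transport(quic.NewTransport),
		libp2p.ListenAddrStrings("/ip4/0.0.0.0/udp/4001/quic-v1"),
	)
}

// newTCPOnlyHost builds a host that can only dial and listen over TCP.
func newTCPOnlyHost() (host.Host, error) {
	return libp2p.New(
		libp2p.Transport(tcp.NewTCPTransport),
		libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/4001"),
	)
}
```

Running one of each against the same bootstrap set, and seeing which one loses peers, would localize the regression to one transport (or rule both out).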
It might be the silently broken PX -- see libp2p/go-libp2p-pubsub#555
I am almost certain this is the culprit, as the bootstrap really relies on it.
Ah, that would definitely explain it.
I thought that could be it as well, but I was thrown off by the premise that this wasn't an issue in v0.31.1. PX broke after this change: #2325, which was included in the v0.28.0 release. So v0.31.1 should have the same PX issue.
I can't imagine what else it could be.
Are these low peer counts the number of peers in your gossipsub mesh, or the number of peers we are actually connected to?
Do we know if these nodes are running both QUIC and TCP? If yes, it's unlikely that the problem is with either transport; it's probably at a layer above the go-libp2p transports.
Just chiming in here from the Lotus side: it's the number of peers we are connected to. After upgrading to 0.33.2, the count is around:
On the previous version (0.33.1), it was stable around the 200 range.
I think these are the number of peers in your gossipsub topic mesh, a subset of the peers you are actually connected to. Could you find the number of peers you are connected to, and compare that between versions?
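The two counts come from different APIs, which is easy to conflate. A sketch of logging both side by side (the topic name is a placeholder, not the real Lotus topic; `ListPeers` reports pubsub's view of peers on a topic, which is close to but not exactly the gossipsub mesh):

```go
package main

import (
	"fmt"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/host"
)

// logPeerCounts prints both peer counts for comparison. It assumes h and ps
// were constructed elsewhere; topic is a placeholder name.
func logPeerCounts(h host.Host, ps *pubsub.PubSub, topic string) {
	connected := h.Network().Peers()  // peers with live connections
	topicPeers := ps.ListPeers(topic) // pubsub's view of peers on this topic
	fmt.Printf("connected=%d topicPeers=%d\n", len(connected), len(topicPeers))
}
```

If `connected` is healthy but `topicPeers` is low, the problem is likely in pubsub (e.g., PX); if both are low, it points at the connection layer.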
Did the situation improve after gossipsub v0.11 and go-libp2p v0.34?
We'll likely need to wait for the network to upgrade (~August) to see results. |
I have a user with a large ipfs-cluster (>1000 peers) complaining of issues that are consistent with pubsub propagation failures, and the issue happens both with go-libp2p v0.33.2 + go-libp2p-pubsub v0.10.0 and with go-libp2p v0.35.1 + go-libp2p-pubsub v0.11.0. I cannot say with 100% certainty that it is the same issue as Lotus, but "low peer counts" is a symptom and it is still happening, apparently. How confident are we that it was fixed?
We're confident that we fixed an issue, but there may be others. My initial thought was #2764 (comment), but if that cluster uses QUIC it shouldn't be affected by that.
So it has improved since upgrading to these versions, and the peer count is now hovering more stably around 300 peers with the same machine:
As Steven notes, the real test will be the network upgrade in August, as that is when most of these issues get surfaced, when people are upgrading and reconnecting to the network.
Good news: it seems that the issue I described was a user configuration error in the end (very low limits in the connection manager).
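For anyone who hits the same symptom, a sketch of setting the connection-manager watermarks explicitly when building a host. The numbers below are illustrative, not recommendations; too-low watermarks make the connection manager silently trim peers, which looks exactly like a sync regression.

```go
package main

import (
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/p2p/net/connmgr"
)

func main() {
	// Low/high watermarks bound how many connections the host keeps.
	// Once the high watermark is crossed, connections are trimmed back
	// toward the low watermark (after the grace period).
	cm, err := connmgr.NewConnManager(
		100, // low watermark (illustrative)
		400, // high watermark (illustrative)
		connmgr.WithGracePeriod(time.Minute),
	)
	if err != nil {
		panic(err)
	}
	h, err := libp2p.New(libp2p.ConnectionManager(cm))
	if err != nil {
		panic(err)
	}
	defer h.Close()
}
```

Checking these limits is a cheap first step before suspecting a transport or pubsub regression.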
Now that the Filecoin Mainnet has upgraded to NV23, and with that a very large percentage of nodes have probably updated to the go-libp2p v0.35.4 release, I'm seeing a significantly larger number of peers that I'm connected to. It is 5x higher than the number of peers I was connected to with the same machine in May.
I think we can close this issue now, and re-open more narrowly scoped issues if we encounter other problems.
We've seen reports of a chain-sync regression between lotus 1.25 and 1.26. Notably:
We're not entirely sure what's going on, but I'm starting an issue here so we can track things.