Node crashed, and stop syncing with EOF #12897
Comments
Thanks @emmanuelm41 - acknowledged on the bug report. Have you been able to try the advice of turning on your node with indexing disabled? From @rvagg:
More info on using the backfill tool at https://github.com/filecoin-project/lotus/blob/master/documentation/en/chain-indexer-overview-for-operators.md#backfill
I have not been able to test exactly that yet, but I can say it happened once, fixed itself after the upgrade, and later happened again. That makes me think it will happen again, even if this workaround works.
Ack, thanks @emmanuelm41. And is always failing on message @rvagg or @aarshkshah1992: is there a way to get more state dumped when it crashes so we can debug?
@emmanuelm41 it would be good to check that you can even read this particular message, maybe there's genuine corruption happening here. This is what I get for
@emmanuelm41: thanks. I think @rvagg's suggestion is to start the node with Eth RPC / ChainIndexer disabled and then attempt the ChainGetMessage call.
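For anyone wanting to script that check, here is a minimal sketch of the `ChainGetMessage` call over Lotus's JSON-RPC API. The endpoint URL, the optional token handling, and the helper names are assumptions for illustration and are not taken from this thread; substitute the CID of the message your node fails on.

```python
import json
import urllib.request
from typing import Optional

# Assumption: a local Lotus node listening on the default API port.
LOTUS_RPC_URL = "http://127.0.0.1:1234/rpc/v1"


def build_chain_get_message_request(message_cid: str, request_id: int = 1) -> dict:
    """Build the JSON-RPC 2.0 body for Filecoin.ChainGetMessage."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "Filecoin.ChainGetMessage",
        # Lotus encodes CIDs in JSON as {"/": "<cid string>"}.
        "params": [{"/": message_cid}],
    }


def chain_get_message(
    message_cid: str,
    url: str = LOTUS_RPC_URL,
    token: Optional[str] = None,
) -> dict:
    """POST the request to the node and return the decoded JSON response."""
    body = json.dumps(build_chain_get_message_request(message_cid)).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed if the node requires an API token for this call.
        headers["Authorization"] = f"Bearer {token}"
    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A response with an `error` field instead of a `result` would suggest the message really cannot be read from the store, independent of the indexer.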
After a few restarts, it actually fixed itself. The answer to the curl is this one
I guess the bug is not unrecoverable, since it resolved by itself, but it is critical for us: it could leave our nodes down for an uncertain amount of time, until either manual intervention or an automatic fix.
We stopped the deployment of this new version on other full nodes as we don't want to find ourselves with all nodes down later.
Any updates on this? Something to test or try? Nodes are running now, but we stopped the deployment of version 1.31.
@emmanuelm41 not yet, looking into it, but it does seem to me like you might have experienced a one-off and possibly won't encounter this again with other nodes 🤞 I know that's not a great answer. I am wondering now if it's related to this issue that's been swirling around, suggested to be splitstore related but may not be. Your error came from a situation just like this: a BLS message not being found. #12907 (comment)
What makes me a bit uncomfortable about this is the fact that we never saw this issue on any previous Lotus version, and we have now seen it three times: twice on one node, and once on another (both v1.31.1). So far it has not happened again though.
I think there are two things going on here. The first is the underlying message-lookup problem, which is probably not related to chainindexer and probably not even limited to 1.31. The second is the lack of resilience in chainindexer, which turns that lookup failure into a panic. It's not clear to me yet how to handle that case, because chainindexer really should know about messages and not finding one is a problem, but we probably shouldn't be panicking. I've asked @aarshkshah1992 to have a look at this specifically.
I can raise a PR to warn instead of panicking during backfilling, as not having a message in the statestore can be a valid scenario if the state is corrupted.
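To illustrate the idea only: a language-agnostic sketch of "warn and continue" for a missing message, written in Python for brevity. Lotus itself is Go, and none of these names (`MessageNotFound`, `index_messages`, `load_message`, `index_one`) come from the Lotus codebase; they are hypothetical stand-ins, not the actual change.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("backfill-sketch")


class MessageNotFound(Exception):
    """Hypothetical error raised when a message CID cannot be loaded."""


def index_messages(
    cids: Iterable[str],
    load_message: Callable[[str], object],
    index_one: Callable[[str, object], None],
) -> List[str]:
    """Index every message that can be loaded; return the CIDs that were missing."""
    missing: List[str] = []
    for cid in cids:
        try:
            msg = load_message(cid)
        except MessageNotFound:
            # Warn and move on instead of aborting the whole backfill.
            logger.warning("message %s not found in store; skipping", cid)
            missing.append(cid)
            continue
        index_one(cid, msg)
    return missing
```

Returning the list of missing CIDs keeps the failure visible to the operator, so a lookup problem is reported rather than silently swallowed.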
Describe the Bug
Once the node starts, it crashes at a seemingly random block. So far this has happened twice. The first occurrence was fixed by upgrading the node from v1.31.0 to v1.31.1. The second is still preventing our node from starting. This is a full archival node.