Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node crashed, and stop syncing with EOF #12897

Open
5 of 11 tasks
emmanuelm41 opened this issue Feb 13, 2025 · 13 comments
Open
5 of 11 tasks

Node crashed, and stop syncing with EOF #12897

emmanuelm41 opened this issue Feb 13, 2025 · 13 comments
Labels

Comments

@emmanuelm41
Copy link

Checklist

  • This is not a security-related bug/issue. If it is, please follow please follow the security policy.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the Latest release, the most recent RC(release canadiate) for the upcoming release or the dev branch(master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus fvm/fevm - Lotus FVM and FEVM interactions
  • lotus miner/worker - sealing
  • lotus miner - proving(WindowPoSt/WinningPoSt)
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

Daemon: v1.31.1

Repro Steps

  1. Run the node with ChainIndexer running
  2. Wait until some random block fails, and crashes the node entirely.

Describe the Bug

Once the node starts, it crashes for some reason in some random block. So far this happened twice. The first one was fix by upgrading the node from v1.31.0 to v1.31.1. The second one is still preventing our node to start. This is a full archival node.

Logging Information

2025-02-13T16:23:32.379Z	INFO	badgerbs	[email protected]/value.go:1158	Replaying file id: 42814 at offset: 723470378
2025-02-13T16:23:32.379Z	INFO	badgerbs	[email protected]/value.go:1178	Replay took: 6.132µs
2025-02-13T16:23:32.436Z	INFO	p2pnode	lp2p/addrs.go:101	Swarm listening at: [/ip6/::1/udp/52224/quic-v1/webtransport/certhash/uEiC94LiJnAp23RuZQI5Gh46hbZeByx22U5CP6PC8k0R8Sw/certhash/uEiBJ3qy7DeJtTOr9w4L3o9tqHPoeKzmW3kbRu06FhaB2Qg /ip4/127.0.0.1/udp/41958/quic-v1 /ip4/10.42.10.28/udp/41958/quic-v1 /ip6/::1/udp/57241/quic-v1 /ip4/127.0.0.1/tcp/44895 /ip4/10.42.10.28/tcp/44895 /ip6/::1/tcp/42707 /ip4/127.0.0.1/udp/39093/quic-v1/webtransport/certhash/uEiC94LiJnAp23RuZQI5Gh46hbZeByx22U5CP6PC8k0R8Sw/certhash/uEiBJ3qy7DeJtTOr9w4L3o9tqHPoeKzmW3kbRu06FhaB2Qg /ip4/10.42.10.28/udp/39093/quic-v1/webtransport/certhash/uEiC94LiJnAp23RuZQI5Gh46hbZeByx22U5CP6PC8k0R8Sw/certhash/uEiBJ3qy7DeJtTOr9w4L3o9tqHPoeKzmW3kbRu06FhaB2Qg]
2025-02-13T16:23:32.583Z	INFO	modules	modules/services.go:152	subscribing to pubsub topic /fil/blocks/testnetnet
2025-02-13T16:23:32.583Z	INFO	messagepool	messagepool/messagepool.go:447	mpool ready
2025-02-13T16:23:32.585Z	INFO	f3/manifest-provider	manifest/dynamic_provider.go:136	starting a dynamic manifest provider	{"manifestServerID": "12D3KooWENMwUF9YxvQxar7uBWJtZkA6amvK4xWmKXfSiHUo2Qq7"}
2025-02-13T16:23:32.585Z	INFO	modules	modules/eth.go:48	Start prefilling GetTipsetByHeight cache
2025-02-13T16:23:32.593Z	INFO	f3	[email protected]/f3.go:302	resuming F3 internals
2025-02-13T16:23:32.593Z	INFO	f3	[email protected]/f3.go:190	F3 is starting	{"initialDelay": 0, "hasPendingManifest": false, "NetworkName": "filecoin/48", "BootstrapEpoch": 4534590, "Finality": 900, "InitialPowerTable": "b", "CommitteeLookback": 10}
2025-02-13T16:23:32.595Z	INFO	chainindex	index/reconcile.go:66	starting chain reconciliation from head height 4702087; reconciliation epoch is 1
2025-02-13T16:23:32.599Z	INFO	chainindex	index/reconcile.go:91	found matching tipset at height 4702086, setting reconciliation epoch to 4702087
2025-02-13T16:23:32.599Z	INFO	chainindex	index/reconcile.go:112	Marking tipsets as reverted from height 4702087
2025-02-13T16:23:32.602Z	INFO	chainindex	index/reconcile.go:127	marked 77 tipsets as reverted from height 4702087
2025-02-13T16:23:32.602Z	INFO	chainindex	index/reconcile.go:136	indexing missing tipsets backwards from head height 4702087 to reconciliation epoch 4702087
2025-02-13T16:23:32.602Z	INFO	chainindex	index/reconcile.go:199	backfilling chain index backwards starting from head height 4702087
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:61	listenHeadChanges quit
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:61	listenHeadChanges quit
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:61	listenHeadChanges quit
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:66	not restarting listenHeadChanges: context error: context canceled
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:66	not restarting listenHeadChanges: context error: context canceled
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:66	not restarting listenHeadChanges: context error: context canceled
2025-02-13T16:23:32.603Z	WARN	sub	sub/incoming.go:54	quitting HandleIncomingBlocks loop
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:61	listenHeadChanges quit
2025-02-13T16:23:32.603Z	WARN	events	events/observer.go:66	not restarting listenHeadChanges: context error: context canceled
2025-02-13T16:23:32.603Z	WARN	peermgr	peermgr/peermgr.go:154	closing peermgr done
2025-02-13T16:23:32.603Z	WARN	peermgr	peermgr/peermgr.go:172	exiting peermgr run
2025-02-13T16:23:32.604Z	INFO	pubsub	[email protected]/pubsub.go:671	pubsub processloop shutting down
2025-02-13T16:23:32.604Z	WARN	chainstore	store/store.go:670	reorgWorker quit
2025-02-13T16:23:32.604Z	WARN	modules	modules/eth.go:51	error when prefilling GetTipsetByHeight cache: %!w(*xerrors.wrapError=&{failed to fill cache 0xe2e698ec00 {[7038764 28404881 28481413]}})
2025-02-13T16:23:32.604Z	INFO	modules	modules/eth.go:53	Prefilling GetTipsetByHeight done in 19.467942ms
2025-02-13T16:23:32.874Z	INFO	badgerbs	[email protected]/db.go:1027	Storing value log head: {Fid:42814 Len:682268 Offset:723470378}
2025-02-13T16:23:32.875Z	INFO	badgerbs	[email protected]/levels.go:1000	[Compactor: 173] Running compaction: {level:0 score:1.73 dropPrefixes:[]} for level: 0
2025-02-13T16:23:32.877Z	INFO	badgerbs	[email protected]/levels.go:962	LOG Compact 0->1, del 2 tables, add 1 tables, took 1.274169ms
2025-02-13T16:23:32.877Z	INFO	badgerbs	[email protected]/levels.go:1010	[Compactor: 173] Compaction for level: 0 DONE
2025-02-13T16:23:32.877Z	INFO	badgerbs	[email protected]/db.go:550	Force compaction on level 0 done
ERROR: initializing node: starting node: error while reconciling chain index with chain state: transaction failed: failed to backfill index: failed to apply tipset at height 4702087: failed to index tipset with parent events: failed to index tipset: failed to get messages for tipset: failed to get messages for block: loading bls messages for block: failed to get message: (bafy2bzacece4ltmbyynkyqtzdamnxgv6mjudw4fhkweg6ueocxubo4ar7h3wa):29: EOF
@emmanuelm41 emmanuelm41 added the kind/bug Kind: Bug label Feb 13, 2025
@github-project-automation github-project-automation bot moved this to 📌 Triage in FilOz Feb 13, 2025
@BigLep
Copy link
Member

BigLep commented Feb 13, 2025

Thanks @emmanuelm41 - acknowledged on the bug report.

Have you been able to try the advice of turning on your node with indexing disabled?

From @rvagg:

you can turn it off, setting EnableIndexer to false, but it’ll want you to turn off some other things, EnableEthRPC needs to be set to false and EnableActorEventsAPI needs to be set to false. Then it’ll start without the indexer and just proceed with syncing and operating normally, just without those APIs enabled and without keeping track in the chain indexer.
I’m assuming here that the node isn’t syncing at the moment? that you’re stuck at some epoch because the chainindexer is panicking, so we’ll get you caught up and see how we go.
Once you’re caught up, shut it down again, restore those options to how they were and start it up again but it’ll also attempt to reconcile its missing data. By default, MaxReconcileTipsets in the config is set to 3 days worth of epochs, so if you have a gap of 3 days then it should attempt to reconcile the gap by itself. If that’s not enough, the lotus index validate-backfill exists to help you deal with gaps - you can even run that over ranges of epochs you’re unsure about, supply it the --backfill flag and it’ll attempt to fill in any holes it finds. --log-good will show you which epochs it’s happy with if you want to see that. It’s a good tool to give you confidence that it’s got the holes it needs.
My only concern here is that there’s a corrupt message block, or something else about that message block that’s unexpected, and when you re-enable it and it tries to work on that epoch again it’s going to panic.

More info on using the backfill tool at https://github.com/filecoin-project/lotus/blob/master/documentation/en/chain-indexer-overview-for-operators.md#backfill

@emmanuelm41
Copy link
Author

I have not been able to test that exactly yet, but I can say it happened once, fix by its own after the upgrade, and later happened again. That makes me think that it will happen again, even if this workaround works.

@BigLep
Copy link
Member

BigLep commented Feb 14, 2025

Ack, thanks @emmanuelm41 . And is always failing on message bafy2bzacece4ltmbyynkyqtzdamnxgv6mjudw4fhkweg6ueocxubo4ar7h3wa in both instances?

@rvagg or @aarshkshah1992 : is there a way to get more state dumped when it crashes so we can debug?

@rvagg
Copy link
Member

rvagg commented Feb 14, 2025

@emmanuelm41 it would be good to check that you can even read this particular message, maybe there's genuine corruption happening here. This is what I get for bafy2bzacece4ltmbyynkyqtzdamnxgv6mjudw4fhkweg6ueocxubo4ar7h3wa:

$ curl -s http://localhost:1234/rpc/v1 -X POST -H "Content-Type": "appl
ication/json" --data '{"method":"Filecoin.ChainGetMessage", "params":[{"/":"bafy2bzacece4ltmbyynkyqtzdamnxgv6mjudw4fhkweg6ueocxubo4ar7h3wa"}], "id":1,"jsonrpc":"2.0"}' 
{"id":1,"jsonrpc":"2.0","result":{"Version":0,"To":"f03275720","From":"f3w73knlgas5qfykkq3fbxykqzttekb5gt6nn2gqnymgjex4qe3aihsooqecbxcfd7yuunqnka7hddbqd6ghra","Nonce":137038,"Value":"0","GasLimit":61031853,"GasFeeCap":"1198140","GasPremium":"1197086","Method":28,"Params":"gYGHDRoAAZ7I2CpYKQABguIDgegCIFStAy/Db1FxTawzKnql3hyVfN70ZuGQX4H0uTIV/WI9GgBHuZCAGgBRA2rYKlgoAAGB4gOSICAGmgNddduJTfMw7z4VcibMMkgIPIcEvenOLignceQZAg==","CID":{"/":"bafy2bzacece4ltmbyynkyqtzdamnxgv6mjudw4fhkweg6ueocxubo4ar7h3wa"}}}

@emmanuelm41
Copy link
Author

emmanuelm41 commented Feb 14, 2025

@rvagg I could try, but the thing here is that the node is down, completely down. So it never gets to start.

@BigLep no, not really. It failed on a different block before. This one is a different one.

@BigLep
Copy link
Member

BigLep commented Feb 14, 2025

@emmanuelm41 : thanks. I think @rvagg's suggestion is to start the node with Eth RPC / ChainIndexer disabled and then attempt the ChainGetMessage call.

@emmanuelm41
Copy link
Author

After a few restarts, It actually fixed itself. The answer to the curl is this one

{"id":1,"jsonrpc":"2.0","result":{"Version":0,"To":"f03275720","From":"f3w73knlgas5qfykkq3fbxykqzttekb5gt6nn2gqnymgjex4qe3aihsooqecbxcfd7yuunqnka7hddbqd6ghra","Nonce":137038,"Value":"0","GasLimit":61031853,"GasFeeCap":"1198140","GasPremium":"1197086","Method":28,"Params":"gYGHDRoAAZ7I2CpYKQABguIDgegCIFStAy/Db1FxTawzKnql3hyVfN70ZuGQX4H0uTIV/WI9GgBHuZCAGgBRA2rYKlgoAAGB4gOSICAGmgNddduJTfMw7z4VcibMMkgIPIcEvenOLignceQZAg==","CID":{"/":"bafy2bzacece4ltmbyynkyqtzdamnxgv6mjudw4fhkweg6ueocxubo4ar7h3wa"}}}

I guess the bug is not unrecoverable, as it is resolved by itself, but it is critical for us, as it could leave our nodes down for some time (uncertain) until manual intervention, or auto fix on some unknown amount of time.

@emmanuelm41
Copy link
Author

We stopped the deployment of this new version on other full nodes as we don't want to find ourselves with all nodes down later.

@emmanuelm41
Copy link
Author

Any updates on this? Something to test or try? Nodes are running now, but we stopped the deployment of the 1.31 version

@rvagg
Copy link
Member

rvagg commented Feb 20, 2025

@emmanuelm41 not yet, looking into it, but it does seem to me like you might have experienced a one-off and possibly won't encounter this again with other nodes 🤞 I know that's not a great answer. I am wondering now if it's related to this that's been swirling around, suggested to be splitstore related but may not be. Your error came from a situation just like this—BLS message not being found. #12907 (comment)

@emmanuelm41
Copy link
Author

What makes me a bit uncomfortable about this is the fact that we never saw this issue before on any previous lotus version, and we saw it three times. Two in one node, and one in another (both v1.31.1). So far it has not happened again though.

@rvagg
Copy link
Member

rvagg commented Feb 21, 2025

I think there's two things going on here - the underlying message-lookup problem which is probably not related to chainindexer and probably not even limited to 1.31, and the second problem is the lack of resilience of chainindexer in causing this panic. It's not clear to me yet how to handle that case because chainindexer really should know about messages and not finding one is a problem, but we probably shouldn't be panicking. I've asked @aarshkshah1992 to have a look at this specifically.

@aarshkshah1992
Copy link
Contributor

I can raise a PR to warn instead of panicking during backfilling as not having a message in the statestore can be a valid scenario if the state is corrupted.

@rjan90 rjan90 moved this from 📌 Triage to ⌨️ In Progress in FilOz Feb 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: ⌨️ In Progress
Development

No branches or pull requests

4 participants