
[Bug]: Consensus failure lottery #23683

Open · 1 task done
fmorency opened this issue Feb 12, 2025 · 4 comments

@fmorency (Contributor)
Is there an existing issue for this?

  • I have searched the existing issues

What happened?

We added CosmWasm support to our chain and tried to upgrade our devnet. However, we stumbled on:

panic: error loading last version: failed to load latest version: version of store wasm mismatch root store's version; expected 571999 got 0; new stores should be added using StoreUpgrades

We wrote interchaintest/chain_upgrade_test.go in liftedinit/manifest-ledger#118; sometimes it fails with the error above, and sometimes everything is fine and the upgrade/test succeeds.

I'm not sure what's going on.

The upgrade handler can be found in app/upgrades/next/upgrades.go.
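For reference, the panic message itself points at the standard remedy (whether or not it turns out to be the root cause here): register the new store with StoreUpgrades before the root store loads. A minimal sketch of that pattern for SDK 0.50, assuming wasmd's store key; the upgrade name "next" and the helper name are illustrative, not taken from the actual handler:

```go
package app

import (
	storetypes "cosmossdk.io/store/types"
	upgradetypes "cosmossdk.io/x/upgrade/types"
	wasmtypes "github.com/CosmWasm/wasmd/x/wasm/types"
)

// setupUpgradeStoreLoaders (hypothetical helper name) registers the new
// "wasm" store so it is created at the upgrade height instead of being
// loaded as a pre-existing store, which is what triggers the panic above.
func (app *App) setupUpgradeStoreLoaders() {
	upgradeInfo, err := app.UpgradeKeeper.ReadUpgradeInfoFromDisk()
	if err != nil {
		panic(err)
	}

	if app.UpgradeKeeper.IsSkipHeight(upgradeInfo.Height) {
		return
	}

	if upgradeInfo.Name == "next" { // assumed upgrade name
		storeUpgrades := storetypes.StoreUpgrades{
			Added: []string{wasmtypes.StoreKey}, // the "wasm" store from the panic
		}
		app.SetStoreLoader(upgradetypes.UpgradeStoreLoader(upgradeInfo.Height, &storeUpgrades))
	}
}
```

Note this has to run during app construction, before LoadLatestVersion, or the store loader never takes effect and the version-mismatch panic can still fire.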

Cosmos SDK Version

0.50.11

How to reproduce?

1. Check out the branch related to liftedinit/manifest-ledger#118.
2. Run make local-image and make ictest-chain-upgrade.
3. The test might pass. Run it again until it fails.

@fmorency (Contributor, Author)

@Reecepbcups wrote a workaround in Reecepbcups@72e651e

It seems this was resolved in v0.50, but it recently resurfaced on many networks around the same time.

@Reecepbcups (Member) commented Feb 12, 2025

Also, ref from the Hub: cosmos/gaia#2313 (uses my above patch as a workaround).

@fmorency fmorency changed the title [Bug]: Failed upgrade [Bug]: Consensus failure lottery Feb 12, 2025
@fmorency (Contributor, Author)

Update

Upon reaching the upgrade height, nodes automatically shut down (and restart) to apply the new upgrade. In smaller blockchain networks, this simultaneous shutdown of multiple nodes can lead to a loss of consensus if a sufficient number of nodes become temporarily inactive. The Cosmos SDK does not wait for a majority of nodes to reach the upgrade height before initiating the shutdown process. Consequently, if consensus is lost during this period, all nodes may shut down, resulting in the observed error.

To maintain consensus during upgrades, we should monitor which nodes have been updated and which have not.

After bumping the number of validators from 2 to 16 in our chain-upgrade test, the error did not reoccur. However, without a proper fix, the issue may still occur through sheer bad luck.
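For what it's worth, here is roughly what that validator bump looks like in an interchaintest chain spec (a sketch assuming strangelove-ventures/interchaintest v8; the chain name and version are placeholders, not the actual test code):

```go
package upgrade_test

import (
	"testing"

	"github.com/strangelove-ventures/interchaintest/v8"
	"go.uber.org/zap/zaptest"
)

// chainFactory builds the test chain with enough validators that a
// simultaneous shutdown at the upgrade height cannot stall consensus.
func chainFactory(t *testing.T) *interchaintest.BuiltinChainFactory {
	numVals, numFull := 16, 1 // bumped from 2; too few validators lose consensus on restart

	return interchaintest.NewBuiltinChainFactory(zaptest.NewLogger(t), []*interchaintest.ChainSpec{
		{
			Name:          "manifest", // placeholder chain name
			Version:       "local",    // placeholder docker tag
			NumValidators: &numVals,
			NumFullNodes:  &numFull,
		},
	})
}
```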

@fmorency (Contributor, Author) commented Feb 13, 2025

Update

The chain-upgrade test consistently passes with only 4 nodes. However, the same upgrade fails on our devnet. The main difference between the test and our real devnet is Cosmovisor.

I'm starting to suspect Cosmovisor.

Here's the interesting bit:

No Cosmovisor: The application exits itself

7:02PM INF caught signal module=server signal=terminated

Cosmovisor: The application receives an interrupt from Cosmovisor

cosmovisor[3613069]: 2:52PM INF daemon shutting down in an attempt to restart module=cosmovisor
cosmovisor[3613069]: 2:52PM INF sent interrupt to app, waiting for exit module=cosmovisor
cosmovisor[3613092]: 2:52PM INF caught signal module=server signal=interrupt
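To make the difference concrete: the SDK server traps both signals and logs whichever one arrives, so a plain os/signal sketch (illustrative only, not SDK code) reproduces the two log lines above:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Trap both signals, as the SDK server does. Without Cosmovisor the
	// process gets SIGTERM (logged as signal=terminated); under Cosmovisor
	// it gets the SIGINT (signal=interrupt) that Cosmovisor sends before
	// it restarts the binary for the upgrade.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

	sig := <-sigs
	fmt.Printf("caught signal: %v\n", sig)
	// graceful shutdown would follow here
}
```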

Status: 📋 Backlog
2 participants