Signers not Signing #201

Closed
con5cience opened this issue Sep 20, 2023 · 4 comments


@con5cience

Previous issue: #159

Now I'm on 3.1.0.

I discovered that the processes only seeming to connect to one Sentry was due to a configuration error, which I have now fixed.

With the latest version of Horcrux, and the configuration set properly, Raft itself appears to be working fine:

# horcrux --home /.horcrux state show rhye-1
Private Validator State:
  Height:    2281346
  Round:     0
  Step:      2
Share Sign State:
  Height:    2281346
  Round:     0
  Step:      2
  
# horcrux --home /.horcrux leader
Request address: horcrux-quicksilver-testnet-signer-0.quicksilver.svc.cluster.local:2222
Current leader: horcrux-quicksilver-testnet-signer-1.quicksilver.svc.cluster.local:2222

I can also see the Raft leader connecting to all of the Sentries, but the Sentries themselves still don't log anything after the connection has been established:

quicksilver-testnet-sentry-1 1:34AM INF SignerListener: Listening for new connection module=privval
quicksilver-testnet-sentry-1 1:34AM INF SignerListener: Connected module=privval

Blocks aren't being signed. The Sentries don't seem to be doing anything after the initial connection is made.

Back at a dead end. Any help would be appreciated.

@agouin
Member

agouin commented Sep 20, 2023

It appears that the pubkey request is failing. Do you have the key shards loaded properly? There should be a rhye-1_shard.json on each cosigner.

@con5cience
Author

> It appears that the pubkey request is failing. Do you have the key shards loaded properly? rhye-1_shard.json on each cosigner.

Yes, each key shard is in /.horcrux/rhye-1_shard.json, and each cosigner has a properly formed shard file with a unique id.
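
For anyone else double-checking the same thing, here is a rough sketch of the kind of sanity check described above: every shard file exists, parses as JSON, and carries an id that no other shard shares. The `check_shards` helper is hypothetical, and the `id` field name is an assumption taken from the wording above rather than from the Horcrux docs, so adjust it to whatever your shard files actually contain.

```python
import json
from pathlib import Path


def check_shards(paths):
    """Return a list of problems found; an empty list means the basic checks passed.

    Assumption: each shard file is a JSON object with a top-level "id" field.
    """
    problems = []
    seen_ids = {}  # maps a shard id (as text) to the first file that used it
    for path in map(Path, paths):
        if not path.is_file():
            problems.append(f"{path}: file not found")
            continue
        try:
            shard = json.loads(path.read_text())
        except json.JSONDecodeError as exc:
            problems.append(f"{path}: invalid JSON ({exc})")
            continue
        shard_id = shard.get("id") if isinstance(shard, dict) else None
        if shard_id is None:
            problems.append(f"{path}: no 'id' field")
            continue
        key = str(shard_id)
        if key in seen_ids:
            problems.append(f"{path}: id {key} duplicates {seen_ids[key]}")
        else:
            seen_ids[key] = path
    return problems
```

Copy the shard from each cosigner to one place (or run it per cosigner with a single path) and call `check_shards([...])` with the file paths. This only checks the file's shape, not whether the key material itself is valid.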

@PFC-developer

Hi, I was wondering if you had any theories on this (and mitigations).
I still get this occasionally.

@con5cience
Author

This happened because of a chicken/egg situation with Kubernetes. I had readiness checks configured for the nodes that determined readiness based on the RPC API reporting sync state, number of peers, etc.

The nodes never became ready because they were waiting for the external signer to connect before they started syncing blocks and initializing the RPC API.

The external signer never connected because there were no pods in the service, because they weren't ready.

♾️

The short-term solution was to disable the readiness checks.
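
To make the cycle concrete, here is a toy model of it. This is only a sketch of the dependency loop described above, not anything Kubernetes or Horcrux actually runs: the `simulate` function and its variable names are made up for illustration.

```python
def simulate(readiness_checks_enabled: bool, max_steps: int = 10) -> dict:
    """Toy model of the node/signer startup loop (hypothetical names throughout)."""
    signer_connected = False  # has the external signer reached the node?
    node_syncing = False      # the node only starts syncing once a signer is connected
    pod_ready = False         # result of the readiness check
    in_service = False        # the Service only lists pods that are ready

    for _ in range(max_steps):
        # The readiness check looks at sync state; with checks disabled
        # the pod is treated as ready as soon as it is running.
        pod_ready = node_syncing if readiness_checks_enabled else True
        # Only ready pods are published behind the Service.
        in_service = pod_ready
        # The signer dials the Service, so it can only reach published pods.
        signer_connected = in_service
        # The node waits for the signer before it starts syncing.
        node_syncing = signer_connected

    return {
        "pod_ready": pod_ready,
        "in_service": in_service,
        "signer_connected": signer_connected,
        "node_syncing": node_syncing,
    }
```

With `readiness_checks_enabled=True` every value stays `False` no matter how many steps run, which is the deadlock. With the checks disabled, the pod is published immediately, the signer connects, and the node starts syncing.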

@agouin dropped the Cosmos Operator on me in conversation: https://github.com/strangelove-ventures/cosmos-operator

Long-term, it sounds like external signer connections are going to happen over gRPC in the Tendermint SDK at some point in the future, which should hopefully work around the issue.

Thanks to @agouin for your help!
