horcrux is frozen? #203

Closed
PFC-developer opened this issue Sep 25, 2023 · 6 comments

@PFC-developer

All 3 sentries are logging the following:

│ node 1:24PM INF service start connection=snapshot impl=localClient module=abci-client msg={}                                                          │
│ node 1:24PM INF service start connection=mempool impl=localClient module=abci-client msg={}                                                           │
│ node 1:24PM INF service start connection=consensus impl=localClient module=abci-client msg={}                                                         │
│ node 1:24PM INF service start impl=EventBus module=events msg={}                                                                                      │
│ node 1:24PM INF service start impl=PubSub module=pubsub msg={}                                                                                        │
│ node 1:24PM INF service start impl=IndexerService module=txindex msg={}                                                                               │
│ node 1:24PM INF service start impl=SignerListenerEndpoint module=privval msg={}                                                                       │
│ node 1:24PM INF SignerListener: Listening for new connection module=privval                                                                           │
│ node 1:24PM INF SignerListener: Blocking for connection module=privval                                                                                │
│ node 1:24PM INF SignerListener: Blocking for connection module=privval                                                                                │
│ node 1:24PM INF SignerListener: Listening for new connection module=privval                                                                           │
│ node 1:24PM INF SignerListener: Connected module=privval                                                                                              │
│ node 1:25PM ERR SignerListener: Ping timeout module=privval                                                                                           │
│ node 1:25PM INF SignerListener: Listening for new connection module=privval                                                                           │
│ node 1:25PM INF SignerListener: Connected module=privval                                                                                              │
│ Stream closed EOF for obi/whitewhale-0 (clean-init)                                                                                                   │
│ Stream closed EOF for obi/whitewhale-0 (config-merge)                                                                                                 │
│ Stream closed EOF for obi/whitewhale-0 (snapshot-restore)                                                                                             │
│ Stream closed EOF for obi/whitewhale-0 (genesis-init)

All 3 horcrux clients are hanging.
horcrux-0:

 horcrux-container migaloo-1_shard.json                                                                                                                │
│ horcrux-container state                                                                                                                               │
│ horcrux-container signMode: threshold                                                                                                                 │
│ horcrux-container thresholdMode:                                                                                                                      │
│ horcrux-container   threshold: 2                                                                                                                      │
│ horcrux-container   cosigners:                                                                                                                        │
│ horcrux-container   - shardID: 1                                                                                                                      │
│ horcrux-container     p2pAddr: tcp://whitewhale-horcrux-0.whitewhale-horcrux:2222                                                                     │
│ horcrux-container   - shardID: 2                                                                                                                      │
│ horcrux-container     p2pAddr: tcp://whitewhale-horcrux-1.whitewhale-horcrux:2222                                                                     │
│ horcrux-container   - shardID: 3                                                                                                                      │
│ horcrux-container     p2pAddr: tcp://whitewhale-horcrux-2.whitewhale-horcrux:2222                                                                     │
│ horcrux-container   grpcTimeout: 1000ms                                                                                                               │
│ horcrux-container   raftTimeout: 1000ms                                                                                                               │
│ horcrux-container chainNodes:                                                                                                                         │
│ horcrux-container - privValAddr: tcp://whitewhale-0.whitewhale:1234                                                                                   │
│ horcrux-container debugAddr: 0.0.0.0:6001                                                                                                             │
│ horcrux-container I[2023-09-25|13:25:23.189] Horcrux Validator                            module=validator mode=threshold priv-state-dir=/root/.horcr │
│ horcrux-container I[2023-09-25|13:25:23.189] service start                                module=validator msg="Starting CosignerRaftStore service" i │
│ horcrux-container I[2023-09-25|13:25:23.189] Local Raft Listening                         module=validator port=2222                                  │
│ horcrux-container I[2023-09-25|13:25:23.189] Debug Server Listening                       module=debugserver address=0.0.0.0:6001                     │
│ horcrux-container I[2023-09-25|13:25:23.189] Prometheus Metrics Listening                 module=metrics address=0.0.0.0:6001 path=/metrics           │
│ horcrux-container I[2023-09-25|13:25:23.189] service start                                module=validator msg="Starting RemoteSigner service" impl=R │
│ horcrux-container I[2023-09-25|13:25:23.221] Connected to Sentry                          module=validator address=tcp://whitewhale-0.whitewhale:1234 │
│ Stream closed EOF for obi/whitewhale-horcrux-0 (init)                                                                                                 │
│

horcrux-1 is logging:

│ horcrux-container I[2023-09-25|13:24:12.723] Connected to Sentry                          module=validator address=tcp://whitewhale-1.whitewhale:1234 │
│ horcrux-container 2023-09-25T13:25:42.013Z [ERROR] raft: failed to make requestVote RPC: target="{Voter 1 whitewhale-horcrux-0.whitewhale-horcrux:222 │
│ Stream closed EOF for obi/whitewhale-horcrux-1 (init)

horcrux-2

│ horcrux-container 2023-09-25T13:25:50.976Z [ERROR] raft: failed to heartbeat to: peer=whitewhale-horcrux-0.whitewhale-horcrux:2222 backoff time=500ms │
│ horcrux-container 2023-09-25T13:25:51.640Z [ERROR] raft: failed to heartbeat to: peer=whitewhale-horcrux-0.whitewhale-horcrux:2222 backoff time=500ms │
│ horcrux-container 2023-09-25T13:25:52.262Z [ERROR] raft: failed to heartbeat to: peer=whitewhale-horcrux-0.whitewhale-horcrux:2222 backoff time=500ms │
│ horcrux-container 2023-09-25T13:25:52.871Z [ERROR] raft: failed to heartbeat to: peer=whitewhale-horcrux-0.whitewhale-horcrux:2222 backoff time=500ms │
│ horcrux-container 2023-09-25T13:25:54.751Z [ERROR] raft: peer has newer term, stopping replication: peer="{Voter 1 whitewhale-horcrux-0.whitewhale-ho │
│ Stream closed EOF for obi/whitewhale-horcrux-2 (init)                                                                                                 │
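
The raft errors on horcrux-1 and horcrux-2 both name whitewhale-horcrux-0.whitewhale-horcrux:2222 as the peer they cannot reach. A minimal sketch for checking whether each cosigner's `p2pAddr` accepts TCP connections, assuming Python is available in a pod in the same namespace. The function names are hypothetical and not part of horcrux; the addresses are copied from the config printed above. A successful connect only shows that the port is open, not that raft itself is healthy.

```python
import socket
from urllib.parse import urlsplit

# p2pAddr values copied from the cosigner config shown in the horcrux-0 log.
COSIGNERS = [
    "tcp://whitewhale-horcrux-0.whitewhale-horcrux:2222",
    "tcp://whitewhale-horcrux-1.whitewhale-horcrux:2222",
    "tcp://whitewhale-horcrux-2.whitewhale-horcrux:2222",
]


def parse_p2p_addr(addr):
    """Split 'tcp://host:port' into (host, port)."""
    parts = urlsplit(addr)
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"expected tcp://host:port, got {addr!r}")
    return parts.hostname, parts.port


def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_cosigners(addrs, timeout=1.0):
    """Map each p2pAddr to whether its port accepted a connection."""
    return {
        addr: is_reachable(*parse_p2p_addr(addr), timeout=timeout)
        for addr in addrs
    }


if __name__ == "__main__":
    for addr, ok in check_cosigners(COSIGNERS).items():
        print(f"{addr}: {'reachable' if ok else 'NOT reachable'}")
```

Running this from each horcrux pod in turn would show whether the failure is one-directional (for example, only connections towards horcrux-0 failing) or affects every pair.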
@PFC-developer
Author

PFC-developer commented Sep 25, 2023

From what I can surmise, this happens when all 3 nodes and sentries are brought up at the same time (i.e. when switching from FullNodes to Sentries).

On an identical install (same chain, hosted validator), one fullnode was switched to a sentry, and horcrux "caught it" and started signing blocks.
When I switched the 2nd, it also started working.

It has a single node 'stuck' on this chain, though. I'll leave it like this for a bit in case you need to debug (it's signing blocks via the other 2) (the thing that is active is Prometheus).

@PFC-developer
Author

Tried again (switching from FullNode to sentry) and it appeared to work.
The only real difference this time is that I deleted the PVC on one of the nodes (so it did not come up immediately), and the other nodes then "just worked".

@PFC-developer
Author

Spoke too soon: one sentry is still stuck.

  • Tried wiping the disk and restoring from snapshot.
  • Tried wiping the disk of the horcrux instance (just that one, not ALL of them).

@PFC-developer
Author

So I left it for a few days, then tried restarting the node, with no joy.
I ended up re-generating the signer and shard keys, and after the 2nd try it picked up all 3 of them.
(I no longer have a 'stuck' node to debug.)

@akc2267
Contributor

akc2267 commented Oct 10, 2023

Closing as a duplicate of #201.

akc2267 closed this as completed Oct 10, 2023
@PFC-developer
Author

Apparently #201 was a different issue (readiness checks causing it not to connect). Not sure I can re-open this.
