
Inconsistent Signer Behavior #159

Closed
con5cience opened this issue Jun 9, 2023 · 12 comments
con5cience commented Jun 9, 2023

Hi there,

I'm trying to implement a Horcrux POC that mimics my production architecture with Quicksilver Testnet + Cosmovisor. I have three Sentry nodes up and synced, and I'm on Horcrux v3. Everything runs in plain Kubernetes with no RBAC and no network policies; TCP connectivity has been tested with telnet and the processes can all talk to each other just fine.

Scenario 1

When I plug each individual Signer into a dedicated Sentry, nothing happens.

Signer config (chainNodes.privValAddr changes depending on the node):

signMode: threshold
thresholdMode:
  threshold: 2
  cosigners:
    - shardID: 1
      p2pAddr: tcp://quicksilver-testnet-signer-0.quicksilver.svc.cluster.local:2222
    - shardID: 2
      p2pAddr: tcp://quicksilver-testnet-signer-1.quicksilver.svc.cluster.local:2222
    - shardID: 3
      p2pAddr: tcp://quicksilver-testnet-signer-2.quicksilver.svc.cluster.local:2222
  grpcTimeout: 1000ms
  raftTimeout: 1000ms
chainNodes:
  - privValAddr: tcp://quicksilver-testnet-sentry-[0,1,2].quicksilver.svc.cluster.local:1234
debugAddr: ''

Sentry logs:

quicksilver-testnet-sentry-0 2:26AM INF running app args=["start","--home","/.quicksilverd","--moniker","FooBar-0"] module=cosmovisor path=/.quicksilverd/cosmovisor/genesis/bin/quicksilverd
quicksilver-testnet-sentry-0 2:27AM INF starting node with ABCI Tendermint in-process
quicksilver-testnet-sentry-0 2:27AM INF service start impl=multiAppConn module=proxy msg={}
quicksilver-testnet-sentry-0 2:27AM INF service start connection=query impl=localClient module=abci-client msg={}
quicksilver-testnet-sentry-0 2:27AM INF service start connection=snapshot impl=localClient module=abci-client msg={}
quicksilver-testnet-sentry-0 2:27AM INF service start connection=mempool impl=localClient module=abci-client msg={}
quicksilver-testnet-sentry-0 2:27AM INF service start connection=consensus impl=localClient module=abci-client msg={}
quicksilver-testnet-sentry-0 2:27AM INF service start impl=EventBus module=events msg={}
quicksilver-testnet-sentry-0 2:27AM INF service start impl=PubSub module=pubsub msg={}
quicksilver-testnet-sentry-0 2:27AM INF service start impl=IndexerService module=txindex msg={}
quicksilver-testnet-sentry-0 2:27AM INF service start impl=SignerListenerEndpoint module=privval msg={}
quicksilver-testnet-sentry-0 2:27AM INF SignerListener: Listening for new connection module=privval
quicksilver-testnet-sentry-0 2:27AM INF SignerListener: Blocking for connection module=privval
quicksilver-testnet-sentry-0 2:27AM INF SignerListener: Connected module=privval

It looks like none of the signing requests are going through to the signers. Based on what I've managed to sift out of Tendermint issues like this one, the behavior appears to be what's expected when signing isn't happening at all, but ultimately I'm not sure.

Signer logs all look the same; the target address of course changes depending on which Signer is pointed at which Sentry:

horcrux-quicksilver-testnet-signer-2 I[2023-06-09|15:12:01.188] Horcrux Validator                            module=validator mode=threshold priv-state-dir=/.horcrux/state                                                                          
horcrux-quicksilver-testnet-signer-2 I[2023-06-09|15:12:01.196] service start                                module=validator msg="Starting CosignerRaftStore service" impl=CosignerRaftStore                                                        
horcrux-quicksilver-testnet-signer-2 I[2023-06-09|15:12:01.196] Local Raft Listening                         module=validator port=2222                                                                                                              
horcrux-quicksilver-testnet-signer-2 I[2023-06-09|15:12:01.199] debug-addr not defined; debug server disabled module=debugserver                                                                                                                     
horcrux-quicksilver-testnet-signer-2 I[2023-06-09|15:12:01.200] service start                                module=validator msg="Starting RemoteSigner service" impl=RemoteSigner                                                                  
horcrux-quicksilver-testnet-signer-2 I[2023-06-09|15:12:01.219] Connected to Sentry                          module=validator address=tcp://quicksilver-testnet-sentry-2.quicksilver.svc.cluster.local:1234

Scenario 2

If I hand off this config to all the Signers, with all of the Sentries included in the chainNodes list:

signMode: threshold
thresholdMode:
  threshold: 2
  cosigners:
    - shardID: 1
      p2pAddr: tcp://quicksilver-testnet-signer-0.quicksilver.svc.cluster.local:2222
    - shardID: 2
      p2pAddr: tcp://quicksilver-testnet-signer-1.quicksilver.svc.cluster.local:2222
    - shardID: 3
      p2pAddr: tcp://quicksilver-testnet-signer-2.quicksilver.svc.cluster.local:2222
  grpcTimeout: 1000ms
  raftTimeout: 1000ms
chainNodes:
  - privValAddr: tcp://quicksilver-testnet-sentry-0.quicksilver.svc.cluster.local:1234
  - privValAddr: tcp://quicksilver-testnet-sentry-1.quicksilver.svc.cluster.local:1234
  - privValAddr: tcp://quicksilver-testnet-sentry-2.quicksilver.svc.cluster.local:1234
debugAddr: ''
1. The Signer cluster connects to one Sentry. The Sentry pulls blocks and the Signer cluster signs blocks. Everything appears to work, except the other Sentries sit there waiting for remote validators to connect and don't sync at all, which is not ideal.
2. The Signer cluster always picks the last Sentry in the list, even when I kill all the Signer nodes at the same time, multiple times in a row. I would expect the cluster to pick a different Sentry on restart, and/or I would expect the individual Signers to each pick a different Sentry (a potentially unfounded expectation informed by years of working with cluster-y software).
3. Killing the Sentry causes the Signer cluster to stop signing until the Sentry comes back. The Signer cluster does not try to connect to a different Sentry when the connection is lost.

What am I doing wrong? Any help would be super appreciated. If I've missed a glaringly obvious thing in the docs about how this is all supposed to work, feel free to just drop the link on my face. Ideally I'd like to have a full mesh or 1:1 arrangement between Sentries and Signer processes for maximum fault tolerance.

Thanks in advance.

agouin commented Jun 13, 2023

Are you seeing the This node is a validator log when your sentry nodes start up? Are the sentry nodes in sync with the chain? It looks like horcrux is properly configured, but it is not receiving any sign requests from the sentries.
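If it helps, something along these lines should show whether that log line ever appears (pod names and namespace are just taken from your configs above, so adjust as needed):

kubectl logs -n quicksilver quicksilver-testnet-sentry-0 | grep -i "this node is a"

IIRC Tendermint logs either This node is a validator or This node is not a validator once at startup, so an empty result means the node never got that far or the logs have rotated.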

agouin commented Jun 13, 2023

The Signer cluster always picks the last Sentry in the list, even when I kill all the Signer nodes at the same time, multiple times in a row.

The signers do not pick a sentry. The cosigners connect to all of the configured sentries. The sentries pick a remote signer. This is standard Tendermint behavior; a sentry will only send sign requests to one of the connected remote signers.
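For context, the only sentry-side piece is the privval listen address in each sentry's config.toml; the sentry listens on that socket and sends sign requests to whichever remote signer is currently connected. A sketch, assuming the default Tendermint config layout:

# config.toml on each sentry
# The port must match the privValAddr that horcrux dials (1234 in your config)
priv_validator_laddr = "tcp://0.0.0.0:1234"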

con5cience commented Jun 13, 2023

Thanks so much for your response!

Are you seeing the This node is a validator log when your sentry nodes start up? Are the sentry nodes in sync with the chain? It looks like horcrux is properly configured, but it is not receiving any sign requests from the sentries.

I am not finding the phrase This node is a validator in the logs on any of the sentries.

The sentries are fully synced and reporting catching_up: false via the /status RPC endpoint before I try to plug in the signers, and their block heights match what testnet explorers are reporting as current.
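For reference, this is roughly how I'm checking that (assuming the default RPC port 26657):

curl -s http://quicksilver-testnet-sentry-0.quicksilver.svc.cluster.local:26657/status | jq '.result.sync_info.catching_up'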

Your assertion about not receiving signing requests matches the expectations I'd formed about these symptoms from other issues I've unearthed in the Tendermint GitHub repo. Unfortunately the trail of breadcrumbs goes dry from there and I'm not sure how to triage further. I'd definitely appreciate any counsel you might have to offer.

The signers do not pick a sentry. The cosigners connect to all of the configured sentries. The sentries pick a remote signer. This is standard Tendermint behavior; a sentry will only send sign requests to one of the connected remote signers.

OK, so all of my sentries are at log_level=debug at the moment. Right now I have 2/3 sentries whose most recent privval log line is SignerListener: Connected module=privval, from when I first plugged in the signer cluster a few days ago, and that's it. The RPC port on these sentries is not open/listening, so I can't query the node status in any fashion; I'm guessing there's a failed handshake here that gatekeeps whether the RPC port is opened. Meanwhile, the last Sentry is logging aggressively, is clearly syncing, blocks are being signed by the signer cluster, and its RPC port is open. Not sure how to interpret this behavior, and also not sure how to triage from here.

When I restart one of the sentries that isn't spewing logs, the Horcrux cluster does gripe until the sentry comes back into service. Horcrux says it's connected, but the sentry continues to loop a connection timeout:

quicksilver-testnet-sentry-0 2:36PM INF SignerListener: Listening for new connection module=privval                                                                                                                                                  
quicksilver-testnet-sentry-0 2:36PM INF SignerListener: Blocking for connection module=privval                                                                                                                                                       
quicksilver-testnet-sentry-0 2:36PM ERR SignerListener: Ping timeout module=privval

This does not appear to disrupt signing activity. The sentry does not log SignerListener: Connected until I restart the Horcrux cluster. It does look like it eventually reconnects, but it took a good 20+ minutes.
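In case it matters, this is roughly how I'm re-verifying connectivity from a signer pod to a sentry's privval port while it's in that state (assuming nc is available in the image; pod and service names are taken from the configs above):

kubectl exec -n quicksilver quicksilver-testnet-signer-0 -- nc -zv quicksilver-testnet-sentry-0.quicksilver.svc.cluster.local 1234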

jackzampolin commented:

Is your key registered as a validator on the chain?
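Something like this should confirm it (assuming quicksilverd exposes the standard Cosmos SDK staking CLI; the status should be BOND_STATUS_BONDED and the consensus_pubkey should match what horcrux is serving):

quicksilverd query staking validator <your-quickvaloper-address> --output json | jq '{status, consensus_pubkey}'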

con5cience commented Jun 13, 2023

Is your key registered as a validator on the chain?

Yes. I originally spun up a completely separate testnet sentry node and used its keys to register as a validator, then migrated the keys to Horcrux (not one of the testnet sentries currently in play, in case I did a poor job of communicating that). I still have the old infrastructure lying around; the node there has just been stopped since I migrated it. On the testnet explorer I can see that the validator is registered, active, and signing.

https://testnet.quicksilver.explorers.guru/validator/quickvaloper16r00sgkv6uqaf358nvawexul7cuwmx52ca86rd

agouin commented Jun 13, 2023

Can you confirm that your key shards have the correct id for all cosigners? Check both rsa_keys.json and regen-redwood-1_chain.json on each.
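Something like this on each cosigner should print them (assuming the /.horcrux home dir visible in your signer logs; substitute your own chain's shard file name):

jq '.id' /.horcrux/rsa_keys.json
jq '.id' /.horcrux/<chain-id>_shard.json

Cosigner 1 should show 1 in both files, cosigner 2 should show 2, and so on.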

con5cience commented Jun 13, 2023

I have isolated Secret objects populated for each of the signers, and the id values on both rsa_keys.json and rhye-1_shard.json are consistent across all nodes. Signer 0 has id: 1, signer 1 has id: 2, signer 2 has id: 3.

agouin commented Jun 14, 2023

What is the RTT (ping) between the cosigners to each other, and also the cosigners to the sentries?

con5cience commented:

~50ms when traversing Availability Zones, ~5ms when inside the same AZ.

agouin commented Jun 17, 2023

Did you change the cosigner configuration in config.yaml on any of the cosigners? Deleting the raft dir on all cosigners is necessary when doing this.
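Roughly, with all cosigners stopped first (the exact path is an assumption on my part; it's the raft data directory under the horcrux home dir, /.horcrux in your setup):

rm -rf /.horcrux/raft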

con5cience commented Jun 17, 2023

They were all initialized with the proper config.yaml at roughly the same time. In scenario 2 they all get the same config; in scenario 1 the config differs per signer; either way they all get fresh disks and the processes are started within a few seconds of each other.

The keys are mounted on the filesystem, the volumes themselves are reprovisioned in between attempts/reconfigurations/etc, so there's no disk-based file cache in play as far as I'm aware.

NB: The cosigner configuration has never changed in any incarnation; the chainNodes are the only thing that changes.

agouin commented Aug 7, 2023

Closing this as stale. Please open a new issue if you notice this behavior again on v3.1.0+.
