modify is_stable to be indexer progressing and height caught up #887
base: develop
Conversation
Looks good for the most part, but we really should not be changing the field names for messages
Phuong's comments are spot on. Overall, it looks good. However, we should be mindful that a single node could potentially disrupt the system by reporting a block height that is higher than the actual value.
Force-pushed from 63ae026 to d88e08a
@ChaoticTempest @volovyks I changed the logic:

I do want to add a timeout to all the near rpc calls we have in our code. Is there an easy way to do it for near_fetch::Client? @ChaoticTempest
@@ -53,6 +54,14 @@ pub struct Options {
    /// The threshold in seconds to check if the indexer needs to be restarted due to it stalling.
    #[clap(long, env("MPC_INDEXER_RUNNING_THRESHOLD"), default_value = "300")]
    pub running_threshold: u64,

    /// The threshold in block height lag to check if the indexer has caught up.
Let's discuss a strategy for configuration at our next meeting. Do we want to put everything in the contract?
yeah, we should move all these indexer configurations into the contract to make it easier to configure
Yeah, I was thinking the same when adding timeout options today.
You have to use either
chain-signatures/node/src/indexer.rs
Outdated
    #[clap(
        long,
        env("MPC_INDEXER_BLOCK_HEIGHT_LAG_THRESHOLD"),
        default_value = "50"
hmm, not sure if 50 is the best. Might be too little to say it's behind. Have you tested a longer lag threshold, like 100 or 500?
I was thinking about a lower number.
50 blocks is ~50 seconds. Are we often hitting that threshold?
If the node is 50 seconds behind, it fails all the assigned requests. Yes, it can still participate in other node protocols, but I would not consider it as "stable".
I actually only looked at our dev environment: when it's generating signatures fine, the heights are within 50 of one another.
I thought about it again. I think we should use a bigger value than 50. This lag is measured against the latest block fetched from near rpc, and if lake has any delays, all nodes will be identified as unstable. 50s is probably too strict.
I'm thinking 200, as it is a bit smaller than the longest a signature request can wait in yield/resume, which means if nodes are <200 blocks behind, they'll still be able to answer the request in time.
yup, lake can be delayed a fair bit, so yeah let's do something like 200. Lines up very well with yield/resume
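A minimal sketch of the caught-up check the thread converges on, assuming 200 as the lag threshold; the function and parameter names here are illustrative, not the PR's actual API:

```rust
// Illustrative sketch, not the PR's actual code. The threshold of 200 blocks
// (~200s) is chosen to sit just under the yield/resume window, per the thread.
const BLOCK_HEIGHT_LAG_THRESHOLD: u64 = 200;

fn is_caught_up(latest_rpc_height: u64, local_indexer_height: u64) -> bool {
    // saturating_sub avoids underflow when the local indexer is momentarily
    // ahead of the RPC node's view of the chain
    latest_rpc_height.saturating_sub(local_indexer_height) <= BLOCK_HEIGHT_LAG_THRESHOLD
}

fn main() {
    assert!(is_caught_up(1_000, 900)); // 100 blocks behind: within threshold
    assert!(!is_caught_up(1_000, 700)); // 300 blocks behind: lagging
    assert!(is_caught_up(1_000, 1_005)); // locally ahead of RPC view: fine
    println!("ok");
}
```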
Force-pushed from f42701d to 9ccf67b
When doing this PR, I actually realized we only need the proposer to be stable. This is because 1) the proposer is the one starting the signature protocol; other nodes just join, and they don't check whether they have that request locally; 2) the proposer is deterministic for each sign request, so there's no risk of unstable nodes later re-starting the protocol for the same signature request, because it will always be that same proposer.
I also realized that with our current implementation, when a node catches up on heights, it is likely to start protocols to generate signatures for an old sign request, because the stable participant set changes, and if you look at the logic here:
The subset and proposer for the signature can be different from last time, and thus the node that just caught up could end up being the proposer and respond() again. Of course, the range of affected sign requests will be limited to whatever lag threshold we allow in this PR.
wait, we don't need to keep track of the block height from rpc vs our current block height from the indexer; we can just use the block timestamp to check how far behind we are compared with the current time. That way we don't need this extra fetch of the block from RPC, which can itself be delayed by a couple of seconds.
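The timestamp-based alternative could be sketched as follows; NEAR block headers carry a nanosecond timestamp, but the names and the 200-second threshold here are assumptions for illustration:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Illustrative sketch of the timestamp-based staleness check suggested above:
// compare the latest processed block's timestamp against wall-clock time,
// with no extra RPC round trip. Names and threshold are hypothetical.
const BEHIND_THRESHOLD_SECS: u64 = 200;

fn is_behind(block_timestamp_ns: u64, now_ns: u64) -> bool {
    let behind_ns = now_ns.saturating_sub(block_timestamp_ns);
    behind_ns / 1_000_000_000 > BEHIND_THRESHOLD_SECS
}

fn main() {
    let now_ns = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_nanos() as u64;
    // A block stamped 10 seconds ago is fresh; one from 500 seconds ago is stale.
    assert!(!is_behind(now_ns - 10 * 1_000_000_000, now_ns));
    assert!(is_behind(now_ns - 500 * 1_000_000_000, now_ns));
    println!("ok");
}
```

The upside noted in the thread is that this removes a dependency on a second data source (the RPC head height) that can itself lag.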
Even less strict -- it doesn't have to be the proposer, it can be any stable participant because it doesn't matter who responds with the signature.
Yeah, we can just reject a signature request ourselves based on our threshold timing with the block timestamp
That's cool! How can I do that? @ChaoticTempest
@ppca in
indexer progressing = the local indexer's block height was last updated within the threshold
indexer caught up = my block height >= latest height from the near rpc endpoint - 50
This will fix the case where some nodes are still catching up and not yet up to date, but get involved in signature generation anyway
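Putting the two conditions together, a hedged sketch of the combined stability check might look like this; the struct, field names, and the 200-block threshold (per the later discussion, rather than the 50 in the summary above) are illustrative assumptions, not the PR's actual types:

```rust
// Illustrative sketch combining the two conditions above into one
// `is_stable` check. Field and constant names are hypothetical.
const RUNNING_THRESHOLD_SECS: u64 = 300; // matches MPC_INDEXER_RUNNING_THRESHOLD default
const BLOCK_HEIGHT_LAG_THRESHOLD: u64 = 200; // per the thread's discussion

struct IndexerState {
    last_updated_secs_ago: u64, // seconds since the local indexer last advanced
    local_height: u64,          // latest block height our indexer processed
    rpc_height: u64,            // latest block height reported by the RPC node
}

fn is_stable(s: &IndexerState) -> bool {
    // progressing: the indexer advanced recently enough
    let progressing = s.last_updated_secs_ago <= RUNNING_THRESHOLD_SECS;
    // caught up: our height is within the lag threshold of the RPC head
    let caught_up =
        s.rpc_height.saturating_sub(s.local_height) <= BLOCK_HEIGHT_LAG_THRESHOLD;
    progressing && caught_up
}

fn main() {
    let healthy = IndexerState { last_updated_secs_ago: 5, local_height: 1_000, rpc_height: 1_050 };
    let lagging = IndexerState { last_updated_secs_ago: 5, local_height: 1_000, rpc_height: 1_500 };
    let stalled = IndexerState { last_updated_secs_ago: 600, local_height: 1_000, rpc_height: 1_010 };
    assert!(is_stable(&healthy));
    assert!(!is_stable(&lagging)); // caught-up fails even though it is progressing
    assert!(!is_stable(&stalled)); // progressing fails even though the height is close
    println!("ok");
}
```

Both conditions must hold: a node that stalls fails the progressing check, and a node that is replaying old blocks fails the caught-up check, which is exactly the case this PR is meant to exclude from signature generation.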