modify is_stable to be indexer progressing and height caught up #887
base: develop
Conversation
Looks good for the most part, but we really should not be changing the field names for messages
Phuong's comments are spot on. Overall, it looks good. However, we should be mindful that a single node could potentially disrupt the system by reporting a block height that is higher than the actual value.
Force-pushed from 63ae026 to d88e08a
@ChaoticTempest @volovyks I changed the logic:

I do want to add a timeout to all the near rpc calls we have in our code. Is there an easy way to do it for near_fetch::Client? @ChaoticTempest
@@ -53,6 +54,14 @@ pub struct Options {
    /// The threshold in seconds to check if the indexer needs to be restarted due to it stalling.
    #[clap(long, env("MPC_INDEXER_RUNNING_THRESHOLD"), default_value = "300")]
    pub running_threshold: u64,

    /// The threshold in block height lag to check if the indexer has caught up.
Let's discuss a strategy for configuration at our next meeting. Do we want to put everything in the contract?
yeah, we should move all these indexer configurations into the contract to make it easier to configure
Yeah, I was thinking the same when adding timeout options today.
You have to use either
chain-signatures/node/src/indexer.rs
Outdated
    #[clap(
        long,
        env("MPC_INDEXER_BLOCK_HEIGHT_LAG_THRESHOLD"),
        default_value = "50"
hmm, not sure if 50 is the best. Might be too little to say it's behind. Have you tested a longer lag threshold, like 100 or 500?
I was thinking about a lower number.
50 blocks is ~50 seconds. Are we often hitting that threshold?
If the node is 50 seconds behind, it fails all the assigned requests. Yes, it can still participate in other node protocols, but I would not consider it as "stable".
I actually only looked at our dev environment: when it's generating signatures fine, the heights are within 50 of one another.
I thought about it again. I think we should use a bigger value than 50. This lag is measured against the latest block fetched from near rpc, and if lake has any delays, all nodes will be identified as unstable. 50s is probably too strict.
I'm thinking 200, as it is a bit smaller than the longest a signature request can wait in yield/resume, which means if nodes are <200 blocks behind, they'll still be able to answer the request in time.
yup, lake can be delayed a fair bit, so yeah let's do something like 200. Lines up very well with yield/resume
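A minimal sketch of the caught-up check the thread converges on, assuming 200 as the lag threshold; the function and parameter names here are illustrative, not the PR's actual API:

```rust
// Illustrative sketch, not the PR's actual code. The threshold of 200 blocks
// (~200s) is chosen to sit just under the yield/resume window, per the thread.
const BLOCK_HEIGHT_LAG_THRESHOLD: u64 = 200;

fn is_caught_up(latest_rpc_height: u64, local_indexer_height: u64) -> bool {
    // saturating_sub avoids underflow when the local indexer is momentarily
    // ahead of the RPC node's view of the chain
    latest_rpc_height.saturating_sub(local_indexer_height) <= BLOCK_HEIGHT_LAG_THRESHOLD
}

fn main() {
    assert!(is_caught_up(1_000, 900)); // 100 blocks behind: within threshold
    assert!(!is_caught_up(1_000, 700)); // 300 blocks behind: lagging
    assert!(is_caught_up(1_000, 1_005)); // locally ahead of RPC view: fine
    println!("ok");
}
```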
Force-pushed from f42701d to 9ccf67b
When doing this PR, I actually realized we only need the proposer to be stable. This is because 1) the proposer is the one starting the signature protocol; other nodes just join, and they don't check whether they have that request locally; 2) the proposer is deterministic for each sign request, so there's no risk of unstable nodes later re-starting the protocol for the same signature request, because it will always be that same proposer.
I also realized that with our current implementation, when a node catches up on heights, it is likely to start protocols to generate signatures for an old sign request, because the stable participant set changes, and if you look at the logic here:
The subset and proposer for the signature can be different from last time, and thus the node that just caught up could end up being the proposer and respond() again. Of course, the range of affected sign requests will be limited to whatever lag threshold we allow in this PR.
wait, we don't need to keep track of the block height from rpc vs our current block height from the indexer; we can just use the block timestamp to check how far behind we are compared with the current time. That way we don't need this extra fetch of the block from RPC, which can itself be delayed by a couple of seconds.
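The timestamp-based alternative could be sketched as follows; NEAR block headers carry a nanosecond timestamp, but the names and the 200-second threshold here are assumptions for illustration:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Illustrative sketch of the timestamp-based staleness check suggested above:
// compare the latest processed block's timestamp against wall-clock time,
// with no extra RPC round trip. Names and threshold are hypothetical.
const BEHIND_THRESHOLD_SECS: u64 = 200;

fn is_behind(block_timestamp_ns: u64, now_ns: u64) -> bool {
    let behind_ns = now_ns.saturating_sub(block_timestamp_ns);
    behind_ns / 1_000_000_000 > BEHIND_THRESHOLD_SECS
}

fn main() {
    let now_ns = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_nanos() as u64;
    // A block stamped 10 seconds ago is fresh; one from 500 seconds ago is stale.
    assert!(!is_behind(now_ns - 10 * 1_000_000_000, now_ns));
    assert!(is_behind(now_ns - 500 * 1_000_000_000, now_ns));
    println!("ok");
}
```

The upside noted in the thread is that this removes a dependency on a second data source (the RPC head height) that can itself lag.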
Even less strict -- it doesn't have to be the proposer, it can be any stable participant because it doesn't matter who responds with the signature.
Yeah, we can just reject a signature request ourselves based on our threshold timing with the block timestamp
That's cool! How can I do that? @ChaoticTempest
@ppca in
indexer progressing = the local indexer's block height was last updated within the threshold
indexer caught up = my block height >= latest height from the near rpc endpoint - 50
This will fix the case where some nodes are still catching up and not yet up to date, but get involved in signature generation anyway
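Putting the two conditions together, a hedged sketch of the combined stability check might look like this; the struct, field names, and the 200-block threshold (per the later discussion, rather than the 50 in the summary above) are illustrative assumptions, not the PR's actual types:

```rust
// Illustrative sketch combining the two conditions above into one
// `is_stable` check. Field and constant names are hypothetical.
const RUNNING_THRESHOLD_SECS: u64 = 300; // matches MPC_INDEXER_RUNNING_THRESHOLD default
const BLOCK_HEIGHT_LAG_THRESHOLD: u64 = 200; // per the thread's discussion

struct IndexerState {
    last_updated_secs_ago: u64, // seconds since the local indexer last advanced
    local_height: u64,          // latest block height our indexer processed
    rpc_height: u64,            // latest block height reported by the RPC node
}

fn is_stable(s: &IndexerState) -> bool {
    // progressing: the indexer advanced recently enough
    let progressing = s.last_updated_secs_ago <= RUNNING_THRESHOLD_SECS;
    // caught up: our height is within the lag threshold of the RPC head
    let caught_up =
        s.rpc_height.saturating_sub(s.local_height) <= BLOCK_HEIGHT_LAG_THRESHOLD;
    progressing && caught_up
}

fn main() {
    let healthy = IndexerState { last_updated_secs_ago: 5, local_height: 1_000, rpc_height: 1_050 };
    let lagging = IndexerState { last_updated_secs_ago: 5, local_height: 1_000, rpc_height: 1_500 };
    let stalled = IndexerState { last_updated_secs_ago: 600, local_height: 1_000, rpc_height: 1_010 };
    assert!(is_stable(&healthy));
    assert!(!is_stable(&lagging)); // caught-up fails even though it is progressing
    assert!(!is_stable(&stalled)); // progressing fails even though the height is close
    println!("ok");
}
```

Both conditions must hold: a node that stalls fails the progressing check, and a node that is replaying old blocks fails the caught-up check, which is exactly the case this PR is meant to exclude from signature generation.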