Why this should be merged and how this was tested
This commit enables error driven snowflake.
I deployed two nodes with the same spec (8 CPUs, 32 GB RAM) on the Fuji network and collected metrics for 12 hours.
The measurement of the average time to finalize a block on the latest version of Avalanche is as follows:
In contrast, with this commit, the time to finalize a block is cut by ~35%:
When 5% of the stake is unreachable, the current snowflake finalization time increases:
The error driven snowflake finalization time increases by more, but it still appears to be slightly faster than the current snowflake:
The block finalization times were collected via the metric
avalanche_snowman_blks_accepted_sum
which measures the duration from the time a block is first seen and enters consensus to the time it is finalized.
How this works
In the classical snowflake protocol, a node issues a sequence of polls, and every poll yields a certain confidence score which is based on the number of nodes that responded and the content of the response.
If the poll contains enough responses to raise the confidence score above a certain threshold, the poll is considered a success, and the criterion for finalizing a block is collecting enough consecutive successful polls.
In the error driven snowflake, which is described in Section 4.1 of the Frosty paper, a poll can succeed to varying degrees: the higher the confidence score the poll concludes with, the more successful the poll is considered. In contrast to the classical snowflake protocol, the number of successful polls required to finalize a block is now determined by the confidence score of the polls: the higher the confidence score of successive polls, the fewer of them are required to finalize, and vice versa.
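The finalization rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the type and function names (`terminationCondition`, `errorDrivenSnowflake`, `recordPoll`, `pollsToFinalize`) and the threshold values are hypothetical, and the sketch tracks a single choice only, ignoring preference changes. Each condition pairs a confidence threshold with the number of consecutive polls that must reach it; the classical snowflake is the special case with exactly one condition.

```go
package main

import "fmt"

// terminationCondition is a hypothetical pairing of a confidence threshold
// with the number of consecutive polls that must reach it.
type terminationCondition struct {
	alphaConfidence int // minimum votes for a poll to count toward this condition
	beta            int // consecutive qualifying polls required to finalize
}

// errorDrivenSnowflake keeps one consecutive-success counter per condition.
type errorDrivenSnowflake struct {
	conditions []terminationCondition
	confidence []int
	finalized  bool
}

func newErrorDrivenSnowflake(conditions []terminationCondition) *errorDrivenSnowflake {
	return &errorDrivenSnowflake{
		conditions: conditions,
		confidence: make([]int, len(conditions)),
	}
}

// recordPoll records how many votes the current choice received in one poll.
// A counter grows when its threshold is met and resets to zero otherwise.
func (s *errorDrivenSnowflake) recordPoll(votes int) {
	if s.finalized {
		return
	}
	for i, c := range s.conditions {
		if votes < c.alphaConfidence {
			s.confidence[i] = 0
			continue
		}
		s.confidence[i]++
		if s.confidence[i] >= c.beta {
			s.finalized = true
		}
	}
}

// pollsToFinalize counts the polls needed when every poll returns the same
// number of votes. It assumes votes meets at least one threshold.
func pollsToFinalize(conditions []terminationCondition, votes int) int {
	sf := newErrorDrivenSnowflake(conditions)
	polls := 0
	for !sf.finalized {
		sf.recordPoll(votes)
		polls++
	}
	return polls
}

func main() {
	// Illustrative values only: higher thresholds require fewer polls.
	conditions := []terminationCondition{
		{alphaConfidence: 15, beta: 20},
		{alphaConfidence: 18, beta: 8},
		{alphaConfidence: 20, beta: 4},
	}
	fmt.Println(pollsToFinalize(conditions, 20)) // 4
	fmt.Println(pollsToFinalize(conditions, 18)) // 8
	fmt.Println(pollsToFinalize(conditions, 15)) // 20
}
```

With these illustrative values, polls that keep returning 20 votes finalize after 4 polls, while polls that only reach 15 votes need 20 consecutive successes.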
Each poll consists of sending queries to a number of nodes, and collecting responses. Since nodes may be offline, slow or malicious, they might not return responses in a timely manner. In order for the polls to be efficient, there exists logic which terminates a poll early once it has reached a required level of confidence, or if enough nodes timed out such that it is evident that the required confidence level cannot be reached by waiting for further nodes.
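As a rough sketch of the early termination idea with several confidence thresholds, the check below stops a poll once no unreached threshold can still be reached by the outstanding responses. The function name `shouldTerminate` and its parameters are hypothetical, the thresholds are illustrative, and the sketch only considers votes for a single leading choice; the actual logic in this commit may differ.

```go
package main

import "fmt"

// shouldTerminate reports whether a poll can stop before all responses arrive.
// thresholds holds the confidence thresholds, votes is the number of responses
// received so far for the leading choice, and outstanding is the number of
// sampled nodes that have neither responded nor timed out.
func shouldTerminate(thresholds []int, votes, outstanding int) bool {
	if outstanding == 0 {
		return true // nothing left to wait for
	}
	maxPossible := votes + outstanding
	for _, t := range thresholds {
		// A threshold that is not yet reached but still reachable means
		// waiting could improve the outcome, so the poll stays open.
		if votes < t && maxPossible >= t {
			return false
		}
	}
	// Every threshold is either already reached or out of reach.
	return true
}

func main() {
	thresholds := []int{15, 18, 20} // illustrative values for a sample of 20 nodes

	fmt.Println(shouldTerminate(thresholds, 15, 5))  // false: 18 and 20 are still reachable
	fmt.Println(shouldTerminate(thresholds, 15, 2))  // true: at most 17 votes, 18 is out of reach
	fmt.Println(shouldTerminate(thresholds, 10, 4))  // true: at most 14 votes, no threshold reachable
	fmt.Println(shouldTerminate([]int{15}, 15, 5))   // true: the single classical threshold is met
}
```

With a single threshold the check reduces to the classical behaviour: stop as soon as the threshold is met or can no longer be met.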
The current snowflake code already supports the error driven variant. However, the logic that terminates the polls early currently only supports the classical snowflake with a single confidence score.
In addition, the configuration cannot express the thresholds for the error driven snowflake, as it has only a single confidence setting.
This commit introduces new configuration flags to the avalanche node which express the various confidence criteria of the error driven snowflake, and also changes the early termination logic to accommodate the error driven snowflake.
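To illustrate what a multi-threshold configuration has to satisfy, here is a hedged sketch of a validation routine. The struct, its field names, and the function `validate` are hypothetical and do not correspond to the flags added by this commit. The checks encode two assumptions: each threshold is a strict majority of the sample size k, and higher thresholds come with fewer required polls, as described above.

```go
package main

import (
	"errors"
	"fmt"
)

// terminationCondition is a hypothetical configuration entry.
type terminationCondition struct {
	AlphaConfidence int // votes a poll needs to count toward this condition
	Beta            int // consecutive qualifying polls required to finalize
}

// validate checks a list of conditions ordered by increasing threshold
// against the sample size k.
func validate(k int, conds []terminationCondition) error {
	if len(conds) == 0 {
		return errors.New("at least one termination condition is required")
	}
	for i, c := range conds {
		// Assumption: a threshold must be a strict majority of the sample.
		if c.AlphaConfidence <= k/2 || c.AlphaConfidence > k {
			return fmt.Errorf("condition %d: alpha %d must be in (%d, %d]", i, c.AlphaConfidence, k/2, k)
		}
		if c.Beta < 1 {
			return fmt.Errorf("condition %d: beta %d must be at least 1", i, c.Beta)
		}
		if i == 0 {
			continue
		}
		// Assumption: a higher threshold must require strictly fewer polls.
		prev := conds[i-1]
		if c.AlphaConfidence <= prev.AlphaConfidence || c.Beta >= prev.Beta {
			return fmt.Errorf("condition %d: alpha must increase and beta must decrease", i)
		}
	}
	return nil
}

func main() {
	good := []terminationCondition{{15, 20}, {18, 8}, {20, 4}}
	bad := []terminationCondition{{15, 20}, {18, 25}}

	fmt.Println(validate(20, good) == nil) // true
	fmt.Println(validate(20, bad) == nil)  // false
}
```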