Activate Error Driven Snowflake #3265

yacovm · 2024-08-02T22:06:57Z

Why this should be merged and how was this tested

This commit enables error driven snowflake.

I deployed two nodes with the same spec (8 CPUs, 32 GB RAM) on the Fuji network and collected metrics for 12 hours.

The average time to finalize a block on the latest version of Avalanche is as follows:

In contrast, with this commit, the time to finalize a block is cut by ~35%:

When 5% of the stake is unreachable, the current snowflake finalization time increases:

The finalization time of the error driven snowflake increases more, but it still appears to be slightly faster than the current snowflake:

The block finalization times were collected via the metric avalanche_snowman_blks_accepted_sum, which measures the duration from the time a block is first seen and enters consensus to the time it is finalized.

How this works

In the classical snowflake protocol, a node issues a sequence of polls, and every poll yields a confidence score based on the number of nodes that responded and the content of their responses.

If the poll contains enough responses to raise the confidence score above a certain threshold, the poll is considered a success, and the criterion for finalizing a block is collecting enough consecutive successful polls.

In the error driven snowflake, which is described in Section 4.1 of the Frosty paper, a poll can succeed to varying degrees: the higher the confidence score the poll concludes with, the more successful the poll is considered. In contrast to the classical snowflake protocol, the number of successful polls required to finalize a block is now determined by the confidence score of the polls: consecutive polls with a higher confidence score need fewer polls to finalize, and vice versa.
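The finalization rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `threshold` and `tracker` types, the `pollsToFinalize` helper, and the numeric values (sample size 20, levels 15/18/20) are all assumptions chosen for the example. Each confidence level keeps its own streak of consecutive polls, and higher levels require shorter streaks.

```go
package main

import "fmt"

// threshold is a hypothetical pairing of a confidence level with the number
// of consecutive polls that must reach that level before finalizing.
type threshold struct {
	alphaConfidence int // votes a poll needs to count at this level
	beta            int // consecutive polls required at this level
}

// tracker keeps one streak counter per threshold.
// thresholds must be sorted by ascending alphaConfidence.
type tracker struct {
	thresholds []threshold
	streaks    []int
	finalized  bool
}

func newTracker(ts []threshold) *tracker {
	return &tracker{thresholds: ts, streaks: make([]int, len(ts))}
}

// recordPoll registers how many votes the preferred block received in a poll.
func (t *tracker) recordPoll(votes int) {
	if t.finalized {
		return
	}
	for i, th := range t.thresholds {
		if votes < th.alphaConfidence {
			// This level and every higher one was missed: reset their streaks.
			for j := i; j < len(t.streaks); j++ {
				t.streaks[j] = 0
			}
			return
		}
		t.streaks[i]++
		if t.streaks[i] >= th.beta {
			t.finalized = true
			return
		}
	}
}

// pollsToFinalize feeds identical polls and returns how many were needed,
// or -1 if the block is not finalized within maxPolls.
func pollsToFinalize(ts []threshold, votes, maxPolls int) int {
	t := newTracker(ts)
	for n := 1; n <= maxPolls; n++ {
		t.recordPoll(votes)
		if t.finalized {
			return n
		}
	}
	return -1
}

func main() {
	// Illustrative values only, for an assumed sample size of 20.
	ts := []threshold{{15, 20}, {18, 8}, {20, 4}}
	for _, votes := range []int{20, 18, 15, 14} {
		fmt.Printf("votes=%d polls=%d\n", votes, pollsToFinalize(ts, votes, 100))
	}
}
```

With these assumed values, unanimous polls finalize after 4 rounds, polls with 18 votes after 8, and polls with 15 votes only after 20, while polls below the lowest level never finalize.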

Each poll consists of sending queries to a number of nodes and collecting their responses. Since nodes may be offline, slow, or malicious, they might not respond in a timely manner. To keep polls efficient, existing logic terminates a poll early once it has reached the required level of confidence, or once enough nodes have timed out that the required confidence level evidently cannot be reached by waiting for further nodes.
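One way to picture early termination with several confidence levels is sketched below. It is a simplified assumption and not necessarily the logic this PR implements: `levelsReached` and `canTerminate` are hypothetical names, the levels are illustrative, and every level is assumed to exceed half the sample size so that at most one choice can reach it. The poll stops as soon as the outstanding responses can no longer lift the leading choice into a higher level than it has already reached.

```go
package main

import "fmt"

// levelsReached counts how many confidence levels are satisfied by votes.
// levels must be sorted in ascending order.
func levelsReached(levels []int, votes int) int {
	n := 0
	for _, l := range levels {
		if votes < l {
			break
		}
		n++
	}
	return n
}

// canTerminate reports whether waiting for more responses is pointless.
// leading is the vote count of the most voted choice so far; outstanding is
// the number of queried nodes that have neither responded nor timed out.
func canTerminate(levels []int, leading, outstanding int) bool {
	if outstanding == 0 {
		return true // nothing left to wait for
	}
	// No choice can end up with more than leading+outstanding votes, so if
	// that best case lands in the same level, the outcome is already fixed.
	return levelsReached(levels, leading) == levelsReached(levels, leading+outstanding)
}

func main() {
	levels := []int{15, 18, 20} // illustrative levels for a sample size of 20

	fmt.Println(canTerminate(levels, 16, 1)) // 17 at best: still only level 15
	fmt.Println(canTerminate(levels, 16, 3)) // 19 at best: level 18 is reachable
	fmt.Println(canTerminate(levels, 10, 4)) // 14 at best: no level reachable
}
```

A timeout reduces `outstanding` without raising `leading`, which is how unreachable nodes let the poll finish without waiting for every response.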

The current snowflake code already supports the error driven variant. However, the logic that terminates polls early currently supports only the classical snowflake with a single confidence score.
In addition, the configuration has no way to express the thresholds for the error driven snowflake, as it contains only a single confidence setting.

This commit introduces new configuration flags to the avalanche node that express the various confidence criteria of the error driven snowflake, and changes the early termination logic to accommodate the error driven snowflake.

This commit enables error driven snowflake and reduces the average poll time by 35% on the Fuji network.

Signed-off-by: Yacov Manevich <[email protected]>

yacovm requested a review from StephenButtolph as a code owner August 2, 2024 22:06

yacovm marked this pull request as draft August 2, 2024 22:07

yacovm force-pushed the errDrivenSF branch 17 times, most recently from 91bd651 to 3eece44 Compare August 8, 2024 17:25

yacovm changed the title ~~[WIP] Activate Error Driven Snowflake~~ Activate Error Driven Snowflake Aug 9, 2024

yacovm force-pushed the errDrivenSF branch from 3eece44 to d883966 Compare August 9, 2024 16:50

yacovm marked this pull request as ready for review August 9, 2024 16:50

yacovm closed this Aug 11, 2024
