Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: header downloading get stuck #4459

Merged
merged 2 commits into from
Jul 1, 2024
Merged

fix: header downloading get stuck #4459

merged 2 commits into from
Jul 1, 2024

Conversation

hanabi1224
Copy link
Contributor

@hanabi1224 hanabi1224 commented Jul 1, 2024

Summary of changes

Root cause: when the local head is followed by a null Tipset, sync_headers_in_reverse runs into a dead loop
e.g. When the local head is at epoch 1742466 which is followed by a null tipset at epoch 1742467, the last queried tipsets in network_tipsets are at epoch [1742466, ..] which are all filtered out, and lead the loop into a dead one.

        for tipset in network_tipsets
            .into_iter()
            .take_while(|ts| ts.epoch() >= until_epoch)
        {
                ...
        }

This is a regression introduced in #4291
In lotus, it breaks the outer loop directly with the label

loop:
	for blockSet[len(blockSet)-1].Height() > untilHeight {
            ...
            for _, b := range blks {
			if b.Height() < untilHeight {
				break loop
			}
            ...
            }
        }

Changes introduced in this pull request:

  • break the loop when the tipset is a null one at the until_epoch

Reference issue to close (if applicable)

Closes #4441

Other information and links

Change checklist

  • I have performed a self-review of my own code,
  • I have made corresponding changes to the documentation,
  • I have added tests that prove my fix is effective or that my feature works (if possible),
  • I have made sure the CHANGELOG is up-to-date. All user-facing changes should be reflected in this document.

@hanabi1224 hanabi1224 marked this pull request as ready for review July 1, 2024 07:08
@hanabi1224 hanabi1224 requested a review from a team as a code owner July 1, 2024 07:08
@hanabi1224 hanabi1224 requested review from ruseinov and lemmih and removed request for a team July 1, 2024 07:08
Copy link
Member

@LesnyRumcajs LesnyRumcajs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic, thank you!

@@ -822,7 +822,7 @@ async fn sync_headers_in_reverse<DB: Blockstore + Sync + Send + 'static>(

#[allow(deprecated)] // Tracking issue: https://github.com/ChainSafe/forest/issues/3157
let wp = WithProgressRaw::new("Downloading headers", total_size as u64);
while pending_tipsets.last().epoch() > until_epoch {
'outer: while pending_tipsets.last().epoch() > until_epoch {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the until_epoch is null, it relies on break 'outer; to break this loop

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we avoid the inner/outer loop complexity by refactoring this to smaller functions that could potentially be unit tested? At least the block for tipset in network_tipsets {

Copy link
Contributor Author

@hanabi1224 hanabi1224 Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid the 'outer label, how about this

        // breaks the loop when `until_epoch` has been added to `pending_tipsets`
        // or when `until_epoch` is null. Note that in the latter case, the outer
        // while condition is always true, and it relies on this flag to break
        // the loop
        let should_terminate = 'append_tipsets: {
            for tipset in network_tipsets {
                // This could happen when the `until_epoch` is a null epoch
                if tipset.epoch() < until_epoch {
                    break 'append_tipsets true;
                }
                validate_tipset_against_cache(bad_block_cache, tipset.key(), &accepted_blocks)?;
                accepted_blocks.extend(tipset.cids());
                tracker.write().set_epoch(tipset.epoch());
                pending_tipsets.push(tipset);
            }
            false
        };
        if should_terminate {
            break;
        }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's up to you, but I'd still recommend extracting the loop to a separate function. This makes it possible to test and is subjectively easier for the reader of the parent function to grasp.

Copy link
Contributor Author

@hanabi1224 hanabi1224 Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this then? It still highly relies on the comments to explain this piece of code though

// tipsets is sorted by epoch in descending order
// returns true when `until_epoch_inclusive` is overreached
fn for_each_tipset_until_epoch_overreached(
    tipsets: impl IntoIterator<Item = Arc<Tipset>>,
    until_epoch_inclusive: ChainEpoch,
    mut callback: impl FnMut(Arc<Tipset>) -> Result<(), TipsetRangeSyncerError>,
) -> Result<bool, TipsetRangeSyncerError> {
    for tipset in tipsets {
        if tipset.epoch() < until_epoch_inclusive {
            return Ok(true);
        }
        callback(tipset)?;
    }
    Ok(false)
}

        let callback = |tipset: Arc<Tipset>| {
            validate_tipset_against_cache(bad_block_cache, tipset.key(), &accepted_blocks)?;
            accepted_blocks.extend(tipset.cids());
            tracker.write().set_epoch(tipset.epoch());
            pending_tipsets.push(tipset);
            Ok(())
        };
        // Breaks the loop when `until_epoch` is overreached, which happens
        // when there are null tipsets in the queried range.
        // Note that when the `until_epoch` is null, the outer while condition
        // is always true, and it relies on this flag to break the loop
        if for_each_tipset_until_epoch_overreached(network_tipsets, until_epoch, callback)? {
            // Breaks when the `until_epoch` is overreached.
            break;
        }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this has been queued, I will make this refactor in a separate PR and let's discuss there

Copy link
Contributor

@elmattic elmattic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@elmattic elmattic added this pull request to the merge queue Jul 1, 2024
Merged via the queue into main with commit a9e55e7 Jul 1, 2024
27 checks passed
@elmattic elmattic deleted the hm/fix-header-download branch July 1, 2024 10:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Forest stuck on downloading headers
3 participants