
fix(availability): prevent parallel availability calls #3883

Merged · 13 commits · Oct 31, 2024

Conversation

walldiss (Member)

Parallel availability calls could simultaneously select different sets of unique coordinates for block sampling. Each selection would then be overwritten by the last call, resulting in more samples in the store than needed. This would make a light node do more sampling work than necessary and would also create problems for future pruning of the extra samples.

The provided unit test explains the problem best: with this change removed, it results in 176 samples in the store instead of 16.
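To make the race concrete, here is a minimal, self-contained sketch of the failure mode. The names (coord, selection, fetched) and the 11-caller count are illustrative, chosen only to mirror the 176-vs-16 numbers above; this is not the actual celestia-node code:

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

type coord struct{ row, col int }

// pickCoords picks n unique random coordinates out of a size x size square,
// the way each availability call picks its own fresh sample set.
func pickCoords(n, size int) []coord {
	seen := make(map[coord]bool, n)
	out := make([]coord, 0, n)
	for len(out) < n {
		c := coord{rand.Intn(size), rand.Intn(size)}
		if !seen[c] {
			seen[c] = true
			out = append(out, c)
		}
	}
	return out
}

func main() {
	var (
		mu        sync.Mutex
		selection []coord                // the persisted selection; last writer wins
		fetched   = make(map[coord]bool) // union of samples actually downloaded and stored
		wg        sync.WaitGroup
	)

	// 11 parallel availability calls for the same height, each with its own picks.
	for i := 0; i < 11; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			picks := pickCoords(16, 32)
			mu.Lock()
			selection = picks // overwrites whatever a parallel call selected
			for _, c := range picks {
				fetched[c] = true // yet every pick was already sampled and stored
			}
			mu.Unlock()
		}()
	}
	wg.Wait()

	// The recorded selection says 16 samples, but the store holds far more.
	fmt.Printf("selection size: %d, samples in store: %d\n", len(selection), len(fetched))
}
```

Running this prints a selection of 16 but a store holding close to 176 samples (minus random collisions), which is roughly the discrepancy described above.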

@walldiss walldiss added the kind:fix Attached to bug-fixing PRs label Oct 24, 2024
@walldiss walldiss self-assigned this Oct 24, 2024
codecov-commenter commented Oct 24, 2024

Codecov Report

Attention: Patch coverage is 85.18519% with 4 lines in your changes missing coverage. Please review.

Project coverage is 45.33%. Comparing base (2469e7a) to head (6e45e05).
Report is 370 commits behind head on main.

Files with missing lines                    Patch %   Lines
share/availability/light/availability.go    70.00%    2 Missing and 1 partial ⚠️
libs/utils/sessions.go                      94.11%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3883      +/-   ##
==========================================
+ Coverage   44.83%   45.33%   +0.49%     
==========================================
  Files         265      309      +44     
  Lines       14620    21789    +7169     
==========================================
+ Hits         6555     9877    +3322     
- Misses       7313    10854    +3541     
- Partials      752     1058     +306     

☔ View full report in Codecov by Sentry.

@Wondertan (Member) left a comment

  • When is this happening in practice?
    • Besides someone else calling ShareAvailable over RPC, shouldn't DASer ensure there is only one availability per height at a time?
  • This seems like a broader problem for all availabilities and reminds me of the SyncGetter (:triggered:) we are still missing. Shouldn't we do that for all availabilities instead of just light?

@walldiss (Member, Author)

Sync summary:

When is this happening in practice?
Besides someone else calling ShareAvailable over RPC, shouldn't DASer ensure there is only one availability per height at a time?

You are right, ShareAvailable over RPC can cause inconsistent sample states in the store. While DASer isn't supposed to call availability for the same height in parallel, it can happen. DASer is built so that multiple ShareAvailable calls are possible and doesn't strictly enforce a single call per height.

This seems like a broader problem for all availabilities and reminds me of the SyncGetter (:triggered:) we are still missing. Shouldn't we do that for all availabilities instead of just light?

There are only two types of availability:

  • Full Availability: Doesn't have this problem because its state is tied to the stored data.
  • Light Availability: Stores samples and sampling results separately without transactional guarantees, leading to potential issues.

This PR adds transaction guarantees for light availability only, which is why the fix is specific to it.
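For illustration, the per-height guard can be pictured as a simple mutex-per-key session. The Sessions type and method names below are hypothetical, not the actual libs/utils/sessions.go API introduced by this PR; the point is only that a second caller for the same height waits instead of writing a competing selection:

```go
package utils

import "sync"

// Sessions serializes work per key: a caller for a given key blocks until any
// earlier caller for the same key has released its session.
// Sketch only: it never cleans up finished keys and is not context-aware.
type Sessions struct {
	mu    sync.Mutex
	locks map[uint64]*sync.Mutex
}

func NewSessions() *Sessions {
	return &Sessions{locks: make(map[uint64]*sync.Mutex)}
}

// StartSession acquires the per-key lock and returns a function that releases it.
func (s *Sessions) StartSession(key uint64) (release func()) {
	s.mu.Lock()
	l, ok := s.locks[key]
	if !ok {
		l = &sync.Mutex{}
		s.locks[key] = l
	}
	s.mu.Unlock()

	l.Lock()
	return l.Unlock
}
```

With a guard like this wrapping the whole check, the availability call would acquire the session for the header height, re-read the persisted sampling result, and only select and fetch new coordinates if none exists, so at most one selection per height ever reaches the store.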

renaynay previously approved these changes Oct 29, 2024
@renaynay (Member) left a comment

I'm fine with this change. I do agree with hlib's comment re: extending single-flight protection to both calls (for full availability too), even though this circumstance only really happens from an external RPC call.

One thing about it, though: we already have an instance of this pattern in go-header, where we delegate the flight protection to the caller (the syncer) rather than to the component itself, but here it is delegated to the component. Just a thought, but we should probably try to settle on one pattern.

@walldiss (Member, Author)

I'm fine with this change. I do agree with hlib's comment re: extending single-flight protection to both calls (for full availability too), even though this circumstance only really happens from an external RPC call.

Just to clarify, full availability doesn't need this since the issue can't happen there. In full availability, the state is kept in one place and updated all at once. But in light availability, the state is split between the datastore and the blockstore and updated separately.

One thing about it, though: we already have an instance of this pattern in go-header, where we delegate the flight protection to the caller (the syncer) rather than to the component itself, but here it is delegated to the component. Just a thought, but we should probably try to settle on one pattern.

We can’t delegate to the caller here because there are two possible parallel callers from different components: DASer and RPC. Maybe for go-header, if there are potentially multiple callers, it should also handle state access internally instead of relying on the caller.
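For comparison, component-level flight protection can also be done by deduplicating concurrent calls rather than serializing them, e.g. with golang.org/x/sync/singleflight. The wrapper below is only a sketch with illustrative names and signatures; it is not the approach taken in this PR:

```go
package light

import (
	"context"
	"strconv"

	"golang.org/x/sync/singleflight"
)

// dedupedAvailability shares a single in-flight availability check per height,
// so a DASer call and an RPC call for the same height do the work only once.
type dedupedAvailability struct {
	check  func(ctx context.Context, height uint64) error
	flight singleflight.Group
}

func (d *dedupedAvailability) SharesAvailable(ctx context.Context, height uint64) error {
	key := strconv.FormatUint(height, 10)
	// Do runs check once per key; concurrent callers wait for and share its result.
	_, err, _ := d.flight.Do(key, func() (interface{}, error) {
		return nil, d.check(ctx, height)
	})
	return err
}
```

One caveat of singleflight here is that the first caller's context governs the shared execution, so cancellation by that caller affects everyone waiting on the result; a per-height session guard that serializes callers, like the one sketched earlier, avoids that coupling.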

renaynay previously approved these changes Oct 30, 2024
# Conflicts:
#	share/availability/light/availability.go
#	share/availability/light/availability_test.go
@renaynay (Member) left a comment

utack

@walldiss walldiss enabled auto-merge (squash) October 31, 2024 15:22
@walldiss walldiss merged commit 105ab1b into celestiaorg:main Oct 31, 2024
30 checks passed
Labels
kind:fix Attached to bug-fixing PRs
4 participants