Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inherents race condition #1974

Merged
merged 8 commits into from
Oct 26, 2023
Merged

Fix inherents race condition #1974

merged 8 commits into from
Oct 26, 2023

Conversation

nazar-pc
Copy link
Member

Fixes #871, see #871 (comment) for the reason why this was happening.

Note that previously notification about importing block was technically happening AFTER block was imported (but before next block in import queue is imported, which wasn't the only way to get blocks imported), now it is happening BEFORE block is actually imported into Substrate.

So happy to finally understand what is going on there 😌

Code contributor checklist:

Copy link
Member

@NingLin-P NingLin-P left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nazar-pc nazar-pc disabled auto-merge September 18, 2023 13:06
@nazar-pc nazar-pc requested a review from NingLin-P September 18, 2023 13:06
@nazar-pc
Copy link
Member Author

@NingLin-P looks like that broke some execution tests, could you look into it? I think there were some implicit assumptions about which blocks are already imported there.

@NingLin-P
Copy link
Member

@NingLin-P looks like that broke some execution tests, could you look into it? I think there were some implicit assumptions about which blocks are already imported there.

Yeah, the domain tests broke due to some event ordering assumptions are broke. Before the changes, we made solid assumptions about the ordering of the following events and ensured the previous event must finish before next event (even though they are running is different workers concurrently) :

  1. Produce slot, the operator construct/submit bundle.
  2. Produce consensus block that includes the above produced bundle
  3. Process the consensus block and derive a domain block
  4. Produce the next slot...

With the change, event 3 can't guaranteed to be finished before the next slot, thus the operator may skip to produce bundle and further skip to produce domain block because there is no bundle in the consensus block.

For now, I don't have a better idea than to revert the change in the domain test 🙁

@nazar-pc
Copy link
Member Author

  1. Produce consensus block that includes the above produced bundle

This is not guaranteed to be the very next block.

  1. Process the consensus block and derive a domain block

One of the next blocks will have that bundle, but there is no guarantee which one.

I think test was relying on fragine logic it should not really do that, it is not how things work in actual chain anyway.

@NingLin-P
Copy link
Member

It is not fragile, the ordering is guaranteed in main, we have spent quite a lot of effort to make the assumption reliable and the test deterministic even under the slow github-host CI machine.

it is not how things work in actual chain anyway.

It is how the actual chain work in most case, where each step have seconds of interval to finish. But there are certainly a lot of scenarios not covered by the test because each domain test is focused on and dedicated to a specific case, while testing how the operator handles consensus chain fork we do not care about what happens if there is an empty consensus block in the new fork because empty consensus block is already tested in another test.

@nazar-pc
Copy link
Member Author

There are several possible solutions here:

  • make tests not rely on when bundle gets included (which means potentially waiting for extra blocks)
  • tighter control over block production in test environment such that you do not produce blocks until necessary
  • introduce another notification just for testing purposes at the old place in the workflow

I would prefer the first option for sure. While yes, it works in many cases on real chain the way you describe it, it doesn't always work like that.

@NingLin-P
Copy link
Member

tighter control over block production in test environment such that you do not produce blocks until necessary

This is the exact current approach, the first option will make the test more verbose and complicated because in some tests we have assert about consensus/domain block number.

It is impossible to cover all the scenarios and the combination of them in the test (the number is exponential), so we just chose a relatively efficient way.

@nazar-pc
Copy link
Member Author

This is the exact current approach, the first option will make the test more verbose and complicated because in some tests we have assert about consensus/domain block number.

It is impossible to cover all the scenarios and the combination of them in the test (the number is exponential), so we just chose a relatively efficient way.

Well, I'll be relying on you do to something about it I think.

@NingLin-P NingLin-P mentioned this pull request Oct 11, 2023
1 task
@NingLin-P
Copy link
Member

I have tested locally that the domain test is fixed with #2087, plz resolve the conflict.

…ndition

# Conflicts:
#	crates/sc-consensus-subspace/src/slot_worker.rs
@nazar-pc nazar-pc requested review from NingLin-P and rahulksnv and removed request for ParthDesai and NingLin-P October 13, 2023 08:31
@nazar-pc nazar-pc enabled auto-merge October 13, 2023 08:31
vedhavyas
vedhavyas previously approved these changes Oct 13, 2023
NingLin-P
NingLin-P previously approved these changes Oct 13, 2023
@nazar-pc
Copy link
Member Author

@NingLin-P one of the domains tests hanged apparently on Windows and macOS

@NingLin-P
Copy link
Member

@NingLin-P one of the domains tests hanged apparently on Windows and macOS

I can't reproduce it on my macOS, I have re-run the CI job to see how it goes.

@nazar-pc
Copy link
Member Author

Are you using nextest as well?

@NingLin-P
Copy link
Member

Are you using nextest as well?

Yeah, created NingLin-P#8 to debug and see if it can be reproduced in github-host CI runner.

@nazar-pc
Copy link
Member Author

@NingLin-P same test hanged on main here: https://github.com/subspace/subspace/actions/runs/6534587384/job/17742123483

@nazar-pc nazar-pc disabled auto-merge October 16, 2023 21:15
…ndition

# Conflicts:
#	crates/sc-consensus-subspace/src/lib.rs
@nazar-pc nazar-pc dismissed stale reviews from NingLin-P and vedhavyas via cd295a9 October 25, 2023 12:47
@nazar-pc
Copy link
Member Author

@NingLin-P I have pulled latest main into here, yet it still fails in CI. I think this is your PR now to finish, it would be difficult for me to do something about it.

…t in test

This commit sending one more notification to the operator to ensure it have
received the previous consensus block import notification and ensure it have
finish processing it before return. This commit also clean up some workaround
introduced previously that no more necessary or is fixed in upstream.

Signed-off-by: linning <[email protected]>
@NingLin-P NingLin-P force-pushed the fix-inherents-race-condition branch from dce257b to 56c07b0 Compare October 25, 2023 15:11
@NingLin-P
Copy link
Member

Sorry for the annoying force push, it is used to remove some commits that I brought to this PR accidentally, the last commit is the only change I pushed which should fix the CI ideally, PTAL.

@nazar-pc
Copy link
Member Author

nazar-pc commented Oct 25, 2023

According to force-push you just updated some logs (removing a bunch of them from default levels actually). Not sure how it is supposed to fix anything.

UPD: Ah, there were other commit before that, okay...

Copy link
Member Author

@nazar-pc nazar-pc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits, let's see if CI passes

test/subspace-test-service/src/lib.rs Outdated Show resolved Hide resolved
test/subspace-test-service/src/lib.rs Outdated Show resolved Hide resolved
@nazar-pc nazar-pc enabled auto-merge October 25, 2023 22:20
@nazar-pc nazar-pc added this pull request to the merge queue Oct 26, 2023
Merged via the queue into main with commit 3568a44 Oct 26, 2023
@nazar-pc nazar-pc deleted the fix-inherents-race-condition branch October 26, 2023 13:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Import failed: Checking inherents failed: MissingSegmentHeadersList
4 participants