Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merged by Bors] - Retry registration timeout fix #6136

Closed
wants to merge 31 commits into from

Conversation

ConvallariaMaj
Copy link
Contributor

@ConvallariaMaj ConvallariaMaj commented Jul 15, 2024

Motivation

Fix an issue #6035

Description

Closes #6035 `

Fix first half of issue: established separate parameter for submit challenge timeout. If timeout is not set, submitting challenge will continue till poet round starts.
Added tests and comments to config parameters.

Second half of issue is fixed within #6116

Test Plan

Tested with unit tests

TODO

  • Explain motivation or link existing issue(s)
  • Test changes and document test plan
  • Update documentation as needed
  • Update changelog as needed

activation/activation.go Outdated Show resolved Hide resolved
@ConvallariaMaj ConvallariaMaj marked this pull request as ready for review July 15, 2024 14:01
activation/activation.go Outdated Show resolved Hide resolved
activation/activation.go Outdated Show resolved Hide resolved
activation/activation_test.go Outdated Show resolved Hide resolved
cmd/root.go Outdated Show resolved Hide resolved
activation/poet.go Outdated Show resolved Hide resolved
Copy link
Member

@fasmat fasmat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand what the purpose of "Default Timeout" is, when it is only used for a single type of request and for others individual timeouts have to be specified.

activation/activation_test.go Outdated Show resolved Hide resolved
activation/activation_test.go Outdated Show resolved Hide resolved
cmd/root.go Outdated Show resolved Hide resolved
cmd/root.go Outdated Show resolved Hide resolved
cmd/root.go Outdated Show resolved Hide resolved
config/presets/fastnet.go Outdated Show resolved Hide resolved
config/presets/testnet.go Outdated Show resolved Hide resolved
activation/nipost_test.go Outdated Show resolved Hide resolved
activation/nipost_test.go Outdated Show resolved Hide resolved
@fasmat
Copy link
Member

fasmat commented Jul 16, 2024

The changes seem to also have broken other tests that now hang when executed.

activation/poet.go Outdated Show resolved Hide resolved
config/mainnet.go Outdated Show resolved Hide resolved
activation/poet.go Outdated Show resolved Hide resolved
config/presets/standalone.go Outdated Show resolved Hide resolved
@spacemeshos spacemeshos deleted a comment from fasmat Jul 17, 2024
Copy link
Contributor

@poszu poszu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the timeout is not enough. It will now retry to submit MaxRequestRetries times (handled by the go-retryablehttp library), but no more. We need to manually retry submitting till the end of the cycle gap (probably the best place to do this is in the PoetService::Submit() method.

Please, also add a test checking that it retries submitting up until the cycle gap.

@fasmat
Copy link
Member

fasmat commented Jul 18, 2024

Removing the timeout is not enough. It will now retry to submit MaxRequestRetries times (handled by the go-retryablehttp library), but no more. We need to manually retry submitting till the end of the cycle gap (probably the best place to do this is in the PoetService::Submit() method.

Please, also add a test checking that it retries submitting up until the cycle gap.

I agree that this might need to be changed but I don't agree with thee suggested approach of retrying the failed retryablehttp.Client request. Instead the Submit shouldn't have an "unlimited" number (math.MaxInt) of retries instead.

The HTTPPoetClient can have 2 retryablehttp.Clients inside - one with the limited number for proof fetches and info request and one with "unlimited" retries for submit.

Wdyt?

@poszu
Copy link
Contributor

poszu commented Jul 18, 2024

I agree that this might need to be changed but I don't agree with thee suggested approach of retrying the failed retryablehttp.Client request. Instead the Submit shouldn't have an "unlimited" number (math.MaxInt) of retries instead.

The HTTPPoetClient can have 2 retryablehttp.Clients inside - one with the limited number for proof fetches and info request and one with "unlimited" retries for submit.

Wdyt?

Retrying via the mechanism built in the retryablehttp.Client has the benefit (I think?) of giving up after non-retryable failures so it makes sense. Perhaps it's possible to adjust retry mechanism on per-request basis instead of keeping 2 clients around?

@fasmat
Copy link
Member

fasmat commented Jul 18, 2024

Perhaps it's possible to adjust retry mechanism on per-request basis instead of keeping 2 clients around?

Maybe, I didn't check the full documentation but even if not, an http client isn't very expensive to keep around - from the users perspective its still the PoetClient with a Submit and Proof method, but internally it uses 2 http clients; I don't see an issue with this 🙂

activation/poet.go Outdated Show resolved Hide resolved
activation/poet.go Outdated Show resolved Hide resolved
activation/poet_client_test.go Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
activation/poet.go Outdated Show resolved Hide resolved
activation/poet.go Outdated Show resolved Hide resolved
activation/poet_client_test.go Outdated Show resolved Hide resolved
# Conflicts:
#	activation/activation.go
#	config/mainnet.go
#	config/presets/testnet.go
activation/activation.go Outdated Show resolved Hide resolved
activation/activation.go Outdated Show resolved Hide resolved
activation/poet.go Outdated Show resolved Hide resolved
activation/poet_client_test.go Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
activation/activation.go Outdated Show resolved Hide resolved
activation/activation.go Outdated Show resolved Hide resolved
activation/poet.go Outdated Show resolved Hide resolved
activation/poet.go Outdated Show resolved Hide resolved
@fasmat
Copy link
Member

fasmat commented Aug 14, 2024

I think we are using wrong backoff functions. For the client I think we should keep using retryablehttp.LinearJitterBackoff. It uses a linear backoff that increases the time between retries by numAttempts * rand(min, max). Given that we have 10 seconds as min and 20 seconds as max this will take at least 550 and at most 1100 seconds (which is our timeout) with increasing delays between requests (to avoid putting to much load on an already struggling PoET).

For submit we also want some jitter to not have all nodes query a poet at exactly the same time, but we do not want to increase the time between requests by too much. I believe a backoff function like this can be used:

// customJitterBackoff always waits between `min` and `max` before retrying
func customJitterBackoff(min, max time.Duration, attemptNum int, resp *http.Response) time.Duration {
	if max <= min {
		return min
	}

	return min + rand.N(max-min)
}

@ConvallariaMaj
Copy link
Contributor Author

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Aug 15, 2024
## Motivation
Fix an issue #6035



Co-authored-by: ConvallariaMaj <[email protected]>
@spacemesh-bors
Copy link

Pull request successfully merged into develop.

Build succeeded:

@spacemesh-bors spacemesh-bors bot changed the title Retry registration timeout fix [Merged by Bors] - Retry registration timeout fix Aug 15, 2024
@spacemesh-bors spacemesh-bors bot closed this Aug 15, 2024
@spacemesh-bors spacemesh-bors bot deleted the retry-registration-timeout-fix branch August 15, 2024 12:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants