Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(baseapp): signal then panic at halt-height #338

Merged
merged 4 commits into from
Aug 3, 2024

Conversation

michaelfig
Copy link
Collaborator

@michaelfig michaelfig commented Nov 9, 2023

Description

Refs: #337, #305
Closes: Agoric/agoric-sdk#8326

Implements both newer Cosmos panic and Agoric-compatible signal behaviour when --halt-height is reached. First, try to SIGINT self, then SIGTERM self, followed by the CONSENSUS FAILURE!!! panic. This will stop the state machine in all cases and should exit the process as Agoric expects.


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • included the correct type prefix in the PR title
  • added ! to the type prefix if API or client breaking change
  • targeted the correct branch (see PR Targeting)
  • provided a link to the relevant issue or specification
  • followed the guidelines for building modules
  • included the necessary unit and integration tests
  • added a changelog entry to CHANGELOG.md
  • included comments for documenting Go code
  • updated the relevant documentation or specification
  • reviewed "Files changed" and left comments if necessary
  • run make lint and make test
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed ! in the type prefix if API or client breaking change
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic
  • reviewed API design and naming
  • reviewed documentation is accurate
  • reviewed tests and test coverage
  • manually tested (if applicable)

@michaelfig michaelfig self-assigned this Nov 9, 2023
Base automatically changed from 8223-revert-halt to Agoric November 9, 2023 07:26
@JimLarson
Copy link

You can compare to my attempt at a fix: https://github.com/agoric-labs/cosmos-sdk/tree/8326-exit-on-halt

Also note that it will fix Agoric/agoric-sdk#8326

@michaelfig
Copy link
Collaborator Author

You can compare to my attempt at a fix: https://github.com/agoric-labs/cosmos-sdk/tree/8326-exit-on-halt

Here is my annotated version of the code I see there today:

// halt attempts to gracefully shutdown the node via SIGINT and SIGTERM falling
// back on os.Exit if both fail.
func (app *BaseApp) halt() {
	app.logger.Info("halting node per configuration", "height", app.haltHeight, "time", app.haltTime)

	p, err := os.FindProcess(os.Getpid())
	if err == nil {
		// attempt cascading signals in case SIGINT fails (os dependent)
		sigIntErr := p.Signal(syscall.SIGINT)
		sigTermErr := p.Signal(syscall.SIGTERM)

		if sigIntErr == nil || sigTermErr == nil {
			// mfig: sigIntErr and sigTermErr are always nil if the process
			// was found and the signal was sent.  It has nothing to do with
			// how the process handles the signal, thus just returning here
			// does not prevent new blocks from being produced.  A panic
			// is the most appropriate way to prevent further progress.
			return
		}
	}

	// Resort to exiting immediately if the process could not be found or killed
	// via SIGINT/SIGTERM signals.
	app.logger.Info("failed to send SIGINT/SIGTERM; exiting...")
	os.Exit(0) // mfig: this should not be a 0 (success) exit code
}

@michaelfig michaelfig changed the base branch from Agoric to mfig-v0.46.16-alpha.agoric.2.4 July 5, 2024 00:23
@michaelfig michaelfig force-pushed the mfig-signal-then-halt branch 2 times, most recently from 811864c to 00164a3 Compare July 5, 2024 00:27
baseapp/abci.go Fixed Show fixed Hide fixed
baseapp/abci.go Dismissed Show dismissed Hide dismissed
@michaelfig michaelfig marked this pull request as ready for review July 5, 2024 00:41
@michaelfig michaelfig force-pushed the mfig-signal-then-halt branch 2 times, most recently from 27b8f1a to 8ebafcd Compare July 22, 2024 03:41
@github-actions github-actions bot removed the Type: CI label Jul 22, 2024
Copy link

@gibson042 gibson042 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted, I'm uncomfortable with using something other than block time to establish "now". Is there a way to get a block context into the function?

baseapp/abci.go Outdated
@@ -312,6 +314,37 @@ func (app *BaseApp) deliverTxWithoutEventHistory(req abci.RequestDeliverTx) (res
}
}

// checkHalt checkes if height or time exceeds halt-height or halt-time respectively.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// checkHalt checkes if height or time exceeds halt-height or halt-time respectively.
// checkHalt forces a process exit if height or time exceeds halt-height or halt-time respectively.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've rephrased.

baseapp/abci.go Outdated
case app.haltHeight > 0 && uint64(height) > app.haltHeight:
halt = true

case app.haltTime > 0 && time.Unix() > int64(app.haltTime):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this invalid? time.Unix requires POSIX epoch timestamp input in the form of sec and nsec arguments, and returns a time.Time which is incomparable with int64.

Suggested change
case app.haltTime > 0 && time.Unix() > int64(app.haltTime):
case app.haltTime > 0 && time.Now().Unix() > int64(app.haltTime):

(but I'm uncomfortable with using local time rather than block time for this)

Copy link
Collaborator Author

@michaelfig michaelfig Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. You were misled by the fact the call wasn't using a static function of the time package, because that package was shadowed by checkHalt's time argument. So, it was actually using the argument's instance method, Time.Unix. To avoid this confusion, I've renamed the argument to tm instead.

baseapp/abci.go Outdated
Comment on lines 319 to 328
var halt bool
switch {
case app.haltHeight > 0 && uint64(height) > app.haltHeight:
halt = true

case app.haltTime > 0 && time.Unix() > int64(app.haltTime):
halt = true
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switch seems too verbose here.

Suggested change
var halt bool
switch {
case app.haltHeight > 0 && uint64(height) > app.haltHeight:
halt = true
case app.haltTime > 0 && time.Unix() > int64(app.haltTime):
halt = true
}
halt := (app.haltHeight > 0 && uint64(height) > app.haltHeight) ||
(app.haltTime > 0 && time.Unix() > int64(app.haltTime))

or

Suggested change
var halt bool
switch {
case app.haltHeight > 0 && uint64(height) > app.haltHeight:
halt = true
case app.haltTime > 0 && time.Unix() > int64(app.haltTime):
halt = true
}
var halt bool
if app.haltHeight > 0 && uint64(height) > app.haltHeight {
halt = true
} else if app.haltTime > 0 && time.Unix() > int64(app.haltTime) {
halt = true
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@michaelfig michaelfig force-pushed the mfig-signal-then-halt branch 2 times, most recently from 34babbe to ad59089 Compare August 1, 2024 18:38
@michaelfig
Copy link
Collaborator Author

michaelfig commented Aug 1, 2024

I'm uncomfortable with using something other than block time to establish "now". Is there a way to get a block context into the function?

The CometBFT block header as supplied to app.BeginBlock is exactly what calls checkHalt with in-consensus arguments:

 	app.checkHalt(req.Header.Height, req.Header.Time)

baseapp/abci.go Dismissed Show dismissed Hide dismissed
baseapp/abci.go Outdated
Comment on lines 317 to 319
// checkHalt forces a state machine halt and attempts to kill the current
// process if height or tm exceeds halt-height or halt-time respectively.
func (app *BaseApp) checkHalt(height int64, tm time.Time) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestions for further clarification:

Suggested change
// checkHalt forces a state machine halt and attempts to kill the current
// process if height or tm exceeds halt-height or halt-time respectively.
func (app *BaseApp) checkHalt(height int64, tm time.Time) {
// checkHalt forces a state machine halt and attempts to kill the current
// process if block height or timestamp exceeds halt-height or halt-time respectively.
func (app *BaseApp) checkHalt(blockHeight int64, blockTime time.Time) {

baseapp/abci.go Outdated
return
}

app.logger.Info("halt per configuration", "height", app.haltHeight, "time", app.haltTime)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
app.logger.Info("halt per configuration", "height", app.haltHeight, "time", app.haltTime)
app.logger.Info("halt per configuration", "haltHheight", app.haltHeight, "haltTime", app.haltTime, "blockHeight", height, "blockTime", tm)

@michaelfig michaelfig merged commit 3768f9c into mfig-v0.46.16-alpha.agoric.2.4 Aug 3, 2024
28 of 29 checks passed
@michaelfig michaelfig deleted the mfig-signal-then-halt branch August 3, 2024 00:36
@mhofman
Copy link

mhofman commented Aug 3, 2024

FYI, the checkHalt from #305 (restored here) was an unmodified port of cosmos#16639. That said newer versions of cosmos-sdk take a slightly different approach, so I'm totally ok deviating from the upstream patch as we'll have to re-integrate it anyway.

Overall LGTM, sorry for dropping the ball on this review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cosmos halt config might still advance state
4 participants