DLPX-89392 Race between reboot and grub install #747
Merged
Problem
With the help of @grwilson, we think the issue is a race between the upgrade script’s `execute` script calling `systemctl reboot`, and the virtualization’s upgrade logic calling `rootfs-container set-bootfs`, which modifies/updates GRUB.

Specifically, the virtualization logic on 16.0 and earlier runs the `execute` script synchronously, and then updates GRUB synchronously after that.

This was OK up until the changes we made in 17.0 via DLPX-85893. In those changes, we modified the `execute` script to call `exec systemctl reboot`, when previously it would not.

Thus, I believe the problem is as follows:
1. Virtualization runs `execute`.
2. `execute` calls `exec systemctl reboot`.
3. `systemctl reboot` returns success, while the reboot hasn’t happened yet.
4. `execute` returns control back to virtualization.
5. Virtualization calls `trySetBootFs`, which runs `rootfs-container set-bootfs` to update GRUB.

Since step 5 couldn’t run to completion before the reboot took effect, GRUB is left in a “corrupted” state, and the system isn’t able to boot up successfully.
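To make the timing concrete, here is a toy simulation of the sequence described above. It is only a sketch: `simulate_upgrade` is a hypothetical name, the "GRUB state" is a temp file, and `timeout` stands in for the pending reboot that takes the system down a fixed delay after being requested. Nothing here touches systemd or GRUB.

```shell
#!/bin/bash
# Toy model of the race (all names are hypothetical).
# Usage: simulate_upgrade <seconds until reboot lands> <seconds the update needs>
simulate_upgrade() {
    local reboot_delay=$1 update_duration=$2 grub_state
    grub_state=$(mktemp)

    # The "GRUB update" is a multi-step write. timeout kills it after
    # reboot_delay seconds, the way a reboot would kill the real update.
    timeout "$reboot_delay" bash -c '
        echo "update started" > "$1"
        sleep "$2"
        echo "update finished" >> "$1"
    ' _ "$grub_state" "$update_duration" || true

    # If the final step never ran, the state is only half-written.
    if grep -q "update finished" "$grub_state"; then
        echo "consistent"
    else
        echo "corrupted"
    fi
    rm -f "$grub_state"
}
```

Calling `simulate_upgrade 1 3` models the failing case (the reboot lands while the update is still in progress), while `simulate_upgrade 3 1` models an update that finishes before the reboot.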
Affected Versions
Since we didn’t add the `systemctl reboot` to the `execute` script until 17.0, I think this issue is limited to upgrades TO 17.0 and greater, FROM 16.0 or less.

I don’t believe upgrades FROM 17.0 are affected (e.g. 17.0 to 18.0), because the virtualization upgrade code was modified to run the `execute` script asynchronously. Thus, while the virtualization logic will still run `trySetBootFs` to update GRUB, it’s unlikely those modifications to GRUB will conflict/race with the reboot triggered by the `execute` script. The `execute` script can take a long time to run, so in practice I’d expect the virtualization logic to finish its modifications to GRUB prior to `execute` initiating the reboot.

I also don’t think upgrades FROM 18.0 and greater are affected (e.g. 18.0 to 19.0), because the `execute` script will not initiate a reboot when upgrading from these versions, due to changes made in DLPX-88573. The reboot will only occur on upgrades FROM 17.0 or less.
Workaround
The workaround for any affected version is to perform a DEFERRED upgrade (perhaps immediately followed by a FINISH DEFERRED upgrade) instead of a FULL upgrade.
Solution
I think the proper solution for this is to modify the `execute` script such that it never returns, and instead waits for the reboot; this way, the virtualization logic is unable to call `trySetBootFs`.
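A minimal sketch of that idea, assuming the script is bash: request the reboot, then block until the system goes down, so the caller never regains control. The function name `reboot_and_wait` and its optional command argument are hypothetical (the argument exists only so the behavior can be exercised without rebooting); this is not necessarily the change made in this PR.

```shell
#!/bin/bash
# Hypothetical helper, for illustration only.
# Usage: reboot_and_wait [reboot command...]   (defaults to "systemctl reboot")
reboot_and_wait() {
    if [ "$#" -eq 0 ]; then
        set -- systemctl reboot
    fi

    # Request the reboot. This may return before the system actually goes down.
    "$@" || return 1

    # Block until the reboot terminates this process, so control is never
    # handed back to the caller in the meantime.
    while true; do
        sleep 5
    done
}
```

Note that this differs from `exec systemctl reboot`: with `exec`, the script's process ends as soon as `systemctl` returns, which is what hands control back to the caller.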
Related Work
Testing
`git-ab-pre-push` is here