Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DLPX-89392 Race between reboot and grub install #747

Conversation

prakashsurya
Copy link
Contributor

@prakashsurya prakashsurya commented Jan 5, 2024

Problem

With the help of @grwilson, we think the issue is a race between the upgrade script’s execute script calling systemctl reboot, and the virtualization’s upgrade logic calling rootfs-container set-bootfs, which modifies/updates GRUB.

Specifically, the virtualization logic on 16.0 and earlier, looks like this:

    try {
      ExecuteUtils.executeSudo(script, "-p", UpgradeManagerImpl.getPlatform());
    } catch (NonZeroExitException e) {
      throw new VersionSwitchFailed(e, e.getMessage());
    }
    this.trySetBootFs(UpgradeUtility.getMountedRootFSContainerDataset().lastElement());

i.e. run the execute script synchronously, and then update GRUB synchronously after that.

This was OK, up until the changes we made in 17.0 via DLPX-85893 .. in those changes, we modified the execute script to call exec systemctl reboot, when previously it would not.

Thus, I believe the problem is as follows:

  1. virtualization calls execute
  2. execute calls exec systemctl reboot
  3. systemctl reboot returns success, while reboot hasn’t happened yet
  4. execute returns control back to virtualization
  5. virtualization then calls trySetBootFs which runs rootfs-container set-bootfs container to update GRUB
  6. while grub is in the middle of being modified (from 5), the reboot occurs (from 3)

Since step 5 couldn’t run to completion, GRUB is left in a “corrupted” state, and the reboot isn’t able to boot up successfully.

Affected Versions

Since we didn’t add the systemctl reboot to the execute until 17.0, I think this issue is limited to upgrades TO 17.0 and greater, FROM 16.0 or less.

I don’t believe upgrades FROM 17.0 are affected (e.g. 17.0 to 18.0), because the virtualization upgrade code was modified to run the execute script asynchronously. Thus, while the virtualization logic will still run trySetBootFs to update GRUB, it’s unlikely those modifications to GRUB will conflict/race with the reboot triggered by the execute script. The execute script can take a long time to run, so in practice I’d expect the virtualization logic to finish its modifications to GRUB, prior to execute initiating the reboot.

I also don’t think upgrade FROM 18.0 and greater are affected (e.g. 18.0 to 19.0), because the execute script will not initiate a reboot when upgrading from these versions, due to changes made in DLPX-88573. The reboot will only occur on versions FROM 17.0 or less.

Workaround

The workaround for any affected versions, would be to perform a DEFERRED upgrade (perhaps immediately followed by a FINISH DEFERRED upgrade), and not use FULL upgrades.

Solution

I think the proper solution for this, would is to modify the execute script, such that it never returns, and instead waits for the reboot; this way, the virtualization logic is unable to call trySetBootFs.

Related Work

Testing

  • git-ab-pre-push is here

@prakashsurya prakashsurya force-pushed the dlpx/pr/prakashsurya/6364b322-848d-416f-b295-26b287a258af branch from 5566216 to fe60d4d Compare January 5, 2024 21:16
@prakashsurya prakashsurya marked this pull request as ready for review January 5, 2024 21:31
@prakashsurya prakashsurya merged commit a4ef28e into develop Jan 8, 2024
7 checks passed
@prakashsurya prakashsurya deleted the dlpx/pr/prakashsurya/6364b322-848d-416f-b295-26b287a258af branch January 8, 2024 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants