Implement OS Version reconcile #86

Merged on Sep 26, 2024 (7 commits)

anmazzotti (Collaborator) commented Aug 30, 2024

Closes rancher/elemental#1103

This is actually the third take on the OS Version reconcile implementation.
Refactored from #42

Depends on rancher/elemental-toolkit#2168

anmazzotti (Collaborator, Author) commented Aug 30, 2024

Looks pretty good so far.
The lifecycle of a not-bootstrapped ElementalHost looks like this:

NAME                                     CLUSTER   MACHINE   ELEMENTALMACHINE   PHASE   READY   AGE
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                                          0s
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                                          0s
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                          Registering           0s
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                          Finalizing Registration           0s
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                          Finalizing Registration   True    0s
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                          Finalizing Registration   False   0s
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                          Installing                False   1s
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                          Installing                True    38s
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                          Installing                True    38s
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                          Running                   True    71s
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                          Running                   True    71s
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                          Running                   True    100s
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                          Reconciling OS Version    True    112s
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                          Reconciling OS Version    False   112s
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                          Running                   False   2m21s
m-43030c0a-ac5d-4bc9-b01a-1fce926182c7                                          Running                   True    2m21s

Hosts which undergo OS reconciliation will have a new OSVersionReady condition:

  status:
    conditions:
      - lastTransitionTime: "2024-08-30T12:54:04Z"
        severity: Info
        status: "True"
        type: OSVersionReady

The Elemental OS Plugin takes the hash of the ElementalHost.spec.osVersionManagement field and uses it as a correlation_id to mark and later validate upgrades:

spec:
  osVersionManagement:
    imageUri: oci://192.168.122.10:30000/elemental-os:dev-next

m-43030c0a-ac5d-4bc9-b01a-1fce926182c7:~ # cat /run/elemental/efi/grub_oem_env
# GRUB Environment Block
# WARNING: Do not edit this file by tools other than grub2-editenv!!!
state_label=COS_STATE
recovery_label=COS_RECOVERY
oem_label=COS_OEM
persistent_label=COS_PERSISTENT
default_menu_entry=Elemental
snapshotter=btrfs
default_fallback=0 1 2
passive_snaps=1
correlation_id=cbcba4bfcd825eb2c117b604fca1ef11976501c3d00fc3f9739abdc39c673750
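
As a rough illustration of the correlation idea, here is a minimal Python sketch that derives a stable ID from the osVersionManagement field. Both the canonical JSON serialization and the choice of SHA-256 are assumptions made for this example (the IDs shown in this PR are 64 hex characters, which is merely consistent with SHA-256). The function name `correlation_id` is hypothetical, and the output is not expected to match the value in the grub environment above.

```python
import hashlib
import json


def correlation_id(os_version_management: dict) -> str:
    """Return a hex SHA-256 digest of the given spec fragment.

    Assumption: the fragment is serialized as canonical JSON (sorted keys,
    compact separators). The real plugin may serialize the field differently.
    """
    canonical = json.dumps(
        os_version_management, sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


spec = {"imageUri": "oci://192.168.122.10:30000/elemental-os:dev-next"}
cid = correlation_id(spec)
print(cid)  # 64 hex characters; changes whenever the spec fragment changes
```

The useful property is that any change to the field yields a different ID, so the host can tell whether the running system was produced by the currently requested spec.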

However, this implementation still does not handle the "upgrade failed" scenario. That probably requires an additional change in elemental-toolkit for when a snapshotter transaction is closed on error. We could record some extra information (as an additional grub env variable) to flag that the upgrade with a given correlation_id failed (during elemental upgrade or during boot assessment), so that the Elemental plugin does not get stuck in endless upgrade attempts.
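
To make the proposed failure marker concrete, here is a hypothetical Python sketch of the decision the plugin could take. Nothing here exists in the toolkit or the plugin: the function name `should_attempt_upgrade` and the idea of a single recorded `failed_id` are assumptions used only to illustrate how a failure marker would break the retry loop.

```python
from typing import Optional


def should_attempt_upgrade(
    desired_id: str,
    applied_id: Optional[str],
    failed_id: Optional[str],
) -> bool:
    """Decide whether to start an upgrade for the desired correlation ID.

    applied_id: ID recorded for the currently running system, if any.
    failed_id:  ID recorded for the last failed upgrade, if any (hypothetical).
    """
    if applied_id == desired_id:
        # The running system already matches the requested spec.
        return False
    if failed_id == desired_id:
        # This exact request already failed; retrying would loop forever.
        return False
    return True


print(should_attempt_upgrade("abc", None, None))   # True
print(should_attempt_upgrade("abc", "abc", None))  # False
print(should_attempt_upgrade("abc", "old", "abc")) # False
```

Because the ID is derived from the spec, changing the spec produces a new ID and a fresh attempt becomes possible again.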

Signed-off-by: Andrea Mazzotti <[email protected]>
@@ -29,6 +29,8 @@ CAPI_VERSION?=$(shell grep "sigs.k8s.io/cluster-api" go.mod | awk '{print $$NF}'
KUBEADM_READY_OS ?= ""
ELEMENTAL_TOOLKIT_IMAGE ?= ghcr.io/rancher/elemental-toolkit/elemental-cli:nightly
ELEMENTAL_AGENT_IMAGE ?= ghcr.io/rancher-sandbox/cluster-api-provider-elemental/agent:latest
ELEMENTAL_OS_IMAGE?=docker.io/local/elemental-capi-os:dev
ELEMENTAL_ISO_IMAGE?=docker.io/local/elemental-capi-iso:dev

What about systems that don't boot from ISO, like Raspberry Pi?

anmazzotti (Collaborator, Author) replied:

There is no support for disk images yet. The process of building CAPI images is also manual. I think the best we can do for the moment is to add manual support for invoking elemental build-disk as an alternative to build-iso, but we also need to test this out, because the disk image has to install the system at boot.

tl;dr Not yet. :P

Signed-off-by: Andrea Mazzotti <[email protected]>
anmazzotti (Collaborator, Author) commented Sep 23, 2024

Everything is done in this PR, so I'll open it for review.
However, the merge should be blocked until a new elemental toolkit version is available, since nightly does not include the required elemental state feature.

Edit: switching to elemental-cli v2.2.0 is now also part of this PR. All good.

@anmazzotti anmazzotti marked this pull request as ready for review September 23, 2024 12:49
@anmazzotti anmazzotti requested a review from a team as a code owner September 23, 2024 12:49
anmazzotti (Collaborator, Author) commented

Also, for reference, here is the lifecycle of an upgraded node:

management:~ # kubectl get elementalhosts -w
NAME                                     CLUSTER   MACHINE   ELEMENTALMACHINE   PHASE   READY   AGE
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3                                                          0s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3                                                          0s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3                                          Registering           0s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3                                          Finalizing Registration           0s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3                                          Finalizing Registration   True    0s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3                                          Finalizing Registration   False   0s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3                                          Installing                False   0s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3                                          Installing                True    37s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3                                          Installing                True    37s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3                                          Running                   True    64s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3                                          Running                   True    64s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Running                   True    77s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Bootstrapping             True    84s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Bootstrapping             False   84s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Running                   False   111s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Bootstrapping             False   111s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Bootstrapping             True    111s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Bootstrapping             True    111s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Running                   True    2m1s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Running                   False   2m4s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Running                   False   2m5s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Running                   False   2m27s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Reconciling OS Version    False   2m41s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Reconciling OS Version    False   2m41s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Running                   False   2m59s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Running                   True    2m59s
m-67e232e7-9cbb-43cd-abbd-e840300fc2b3   elemental-cluster-k3s   elemental-cluster-k3s-control-plane-m6rl8   elemental-cluster-k3s-control-plane-dn85z   Running                   True    2m59s

The newest implementation uses elemental state and the elemental upgrade --snapshot-labels functionality instead of exploiting grub variables. The same correlationID logic still applies, but now it is applied to the toolkit install state:

m-67e232e7-9cbb-43cd-abbd-e840300fc2b3:~ # elemental state
date: "2024-09-13T13:45:22Z"
snapshotter:
    type: btrfs
    max-snaps: 4
    config: {}
efi:
    label: COS_GRUB
oem:
    label: COS_OEM
persistent:
    label: COS_PERSISTENT
recovery:
    label: COS_RECOVERY
    recovery:
        source: dir:///run/rootfsbase
        fs: squashfs
        date: "2024-09-13T13:43:17Z"
        fromAction: install
state:
    label: COS_STATE
    snapshots:
        1:
            source: dir:///run/rootfsbase
            date: "2024-09-13T13:43:17Z"
            fromAction: install
        2:
            source: oci://192.168.122.10:30000/elemental-capi-os:v1.2.3
            digest: sha256:317b1b62b8ef9a6f24ca47a268a9d94c6b12e2892d2111d6b48797c0e9698c38
            active: true
            labels:
                correlationID: 8d138c258216ce8b6eb749d2d107174dbebd56e0cb273bcad8eea31bf1f6476f
            date: "2024-09-13T13:45:22Z"
            fromAction: upgrade
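
A small Python sketch of how the install state above could be checked: find the active snapshot and compare its correlationID label with the desired one. The state is written as a plain dict mirroring the relevant part of the output (so no YAML library is needed), and the helper names `active_correlation_id` and `is_reconciled` are hypothetical, not part of the toolkit or the plugin.

```python
from typing import Optional

# Relevant part of the `elemental state` output above, as a plain dict.
install_state = {
    "state": {
        "label": "COS_STATE",
        "snapshots": {
            1: {"source": "dir:///run/rootfsbase", "fromAction": "install"},
            2: {
                "source": "oci://192.168.122.10:30000/elemental-capi-os:v1.2.3",
                "active": True,
                "labels": {
                    "correlationID": "8d138c258216ce8b6eb749d2d107174dbebd56e0cb273bcad8eea31bf1f6476f"
                },
                "fromAction": "upgrade",
            },
        },
    }
}


def active_correlation_id(state: dict) -> Optional[str]:
    """Return the correlationID label of the active snapshot, if present."""
    snapshots = state.get("state", {}).get("snapshots", {})
    for snapshot in snapshots.values():
        if snapshot.get("active"):
            return snapshot.get("labels", {}).get("correlationID")
    return None


def is_reconciled(state: dict, desired_id: str) -> bool:
    """True when the active snapshot was created for the desired ID."""
    return active_correlation_id(state) == desired_id


print(active_correlation_id(install_state))
print(is_reconciled(install_state, "some-other-id"))  # False
```

A snapshot created by the initial install carries no label, so a freshly installed host would report no correlation ID and would not be considered reconciled against any desired ID.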

Signed-off-by: Andrea Mazzotti <[email protected]>
@anmazzotti anmazzotti self-assigned this Sep 24, 2024
davidcassany (Collaborator) left a comment

Looks good to me, nice 👍

I am curious about the patchHelper object used to store and apply resource changes. Do we know whether it behaves similarly to what we use in the operator, client.MergeFromWithOptions with client.MergeFromWithOptimisticLock? I'm asking because at one point this was a detail needed to support rancher backup and restore. If this is unknown, we should probably figure it out before we find ourselves digging into hard-to-reproduce race conditions.

anmazzotti (Collaborator, Author) commented

Good point.
So the helper seems to use client.MergeFromWithOptimisticLock only when patching status conditions (to resolve conflicts): https://github.com/kubernetes-sigs/cluster-api/blob/main/util/patch/patch.go#L336

I am not sure whether this is good enough. It's unknown to me for now; I can open an issue to investigate and, if needed, align with the elemental-operator.

@anmazzotti anmazzotti merged commit b3c1be9 into rancher-sandbox:main Sep 26, 2024
6 checks passed

Successfully merging this pull request may close these issues.

Elemental CAPI OS Upgrades