Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ocp postprocess hostname rename #82

Merged

Conversation

mresvanis
Copy link
Member

@mresvanis mresvanis commented Jan 26, 2024

This PR adds the changes required to the ocp_postprocess module, in order to skip the unnecessary and time-consuming OCP control-plane component rollouts. The latter get triggered by the node's hostname change, which occurs when applying an IBU seed image onto the target SNO.

The changes are:

  • add a new recert configuration argument of type string, i.e. --hostname with which the new hostname is provided
    • we might need to also replace the NODE_IP in the etcd related configmaps, in addition to the hostname, which is currently done in the filesystem by the LCA post-pivot. It's probably a good idea to do it here, so that we can change it both in etcd and the filesystem. In that sense, this argument could be refactored to --hostname-rename <hostname>:<IP> (?)
  • add a hostname_rename mod, which implements the following changes:
    • replace all occurrences of the previous (seed node) hostname in etcd
      • secrets/openshift-etcd/etcd-all-certs and its revisions
      • operator.openshift.io/kubeapiservers/cluster
      • operator.openshift.io/kubecontrollermanagers/cluster
      • operator.openshift.io/kubeschedulers/cluster
      • secrets/openshift-etcd/etcd-peer and its revisions
      • secrets/openshift-etcd/etcd-serving
      • secrets/openshift-etcd/etcd-serving-metrics
      • configmaps/openshift-etcd/etcd-pod and its revisions
      • configmaps/openshift-etcd/restore-etcd-pod
      • configmaps/openshift-etcd/etcd-scripts
    • replace all occurrences of the previous (seed node) hostname in the filesystem
      • /etc/kubernetes/static-pod-resources/etcd-* env var names and values of env vars, i.e. s/NODE_<hostname>/NODE_<new-hostname>/g; s/<hostname/<new-hostname>/g
      • /etc/kubernetes/manifests/etcd-* env var names and env var values, i.e. s/NODE_<hostname>/NODE_<new-hostname>/g; s/<hostname>/<new-hostname>/g
      • /etc/kubernetes/static-pod-resources/kube-apiserver-pod-*, i.e. s/--node-name=<hostname>/--node-name=<new-hostname>/g
      • /etc/kubernetes/static-pod-resources/etcd-certs/secrets/etcd-all-certs/*.{crt,key}
      • /etc/kubernetes/static-pod-resources/etcd-member/etcd-all-certs/*.{crt,key}
      • /etc/kubernetes/static-pod-resources/etcd-pod-*/secrets/etcd-all-certs/*.{crt,key}
  • replace hostname in etcd keys and values, that are irrelevant to skipping the control-plane component rollouts but still useful in order to not leave any leftovers when moving to the seed image to the target SNO

@mresvanis mresvanis requested a review from omertuc January 26, 2024 14:42
Copy link

openshift-ci bot commented Jan 26, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mresvanis

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mresvanis mresvanis force-pushed the mgmt-16426-hostname-ocp-postprocess branch 2 times, most recently from c29dca2 to 54e02b1 Compare January 26, 2024 14:46
@mresvanis
Copy link
Member Author

/test e2e-aws-ovn-single-node-recert-parallel e2e-aws-ovn-single-node-recert-serial baremetalds-sno-recert-cluster-rename

@mresvanis
Copy link
Member Author

/test e2e-aws-ovn-single-node-recert-parallel

@mresvanis mresvanis force-pushed the mgmt-16426-hostname-ocp-postprocess branch from f6a617c to 8dbd414 Compare January 29, 2024 16:37
.replace(original_hostname, hostname);
tokio::spawn(async move {
async move {
fs::rename(original_path, new_path)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should have a function similar to commit_file so it can be avoided during a dry run

Copy link
Member Author

@mresvanis mresvanis Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, I completely neglected the dry run option. I'll try to update this accordingly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't add a function like commit_rename because we only need it here, but I can do that if you think it makes sense for the long term. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer a separate function just because I don't want to concern rename code with the concept of dry run

Copy link
Member

@omertuc omertuc Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I'll just do it separately later as part of a refactor of all the rename code, too much repetition. I'm afraid to touch it right now because we have no IBI/IBU 0-rollout tests, but once we'll do I'll be able to make changes with confidence

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, this makes sense, I'm working on adding the recert hostname change 0-rollout CI job (not IBI/IBU at this point, just recert, but I think until we have an IBI/IBU e2e job, we can already increase our confidence starting with this recert specific job).

@mresvanis mresvanis force-pushed the mgmt-16426-hostname-ocp-postprocess branch 2 times, most recently from cbd5714 to 8dac639 Compare January 29, 2024 19:41
This change ensures that unnecessary and time-consuming OCP
control-plane component rollouts are skipped. The latter are
triggered by the node's hostname change, which occurs when
applying an IBU seed image onto the target SNO.

Specifically, the changes are:

- add a new recert configuration argument of type string, i.e.
  --hostname with which the new hostname is provided
- add a hostname_rename mod, which implements the following changes:
   - replace all occurrences of the previous (seed node) hostname in
     etcd
   - replace all occurrences of the previous (seed node) hostname in the
     filesystem

Signed-off-by: Michail Resvanis <[email protected]>
@mresvanis mresvanis force-pushed the mgmt-16426-hostname-ocp-postprocess branch from 8dac639 to 792ed73 Compare January 30, 2024 10:05
@omertuc
Copy link
Member

omertuc commented Jan 30, 2024

/test ?

Copy link

openshift-ci bot commented Jan 30, 2024

@omertuc: The following commands are available to trigger required jobs:

  • /test 4.14-images
  • /test 4.15-images
  • /test 4.16-images
  • /test baremetalds-sno-recert-cluster-rename
  • /test cargo-check
  • /test cargo-clippy
  • /test e2e-aws-ovn-single-node-recert-openshift-e2e-test-qe
  • /test e2e-aws-ovn-single-node-recert-parallel
  • /test e2e-aws-ovn-single-node-recert-serial
  • /test images

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-rh-ecosystem-edge-recert-main-4.14-images
  • pull-ci-rh-ecosystem-edge-recert-main-4.15-images
  • pull-ci-rh-ecosystem-edge-recert-main-4.16-images
  • pull-ci-rh-ecosystem-edge-recert-main-cargo-check
  • pull-ci-rh-ecosystem-edge-recert-main-cargo-clippy
  • pull-ci-rh-ecosystem-edge-recert-main-images

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@omertuc
Copy link
Member

omertuc commented Jan 30, 2024

/test baremetalds-sno-recert-cluster-rename
/test e2e-aws-ovn-single-node-recert-parallel
/test e2e-aws-ovn-single-node-recert-serial

Copy link

openshift-ci bot commented Jan 30, 2024

@mresvanis: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node-recert-parallel 792ed73 link true /test e2e-aws-ovn-single-node-recert-parallel

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@omertuc
Copy link
Member

omertuc commented Jan 30, 2024

e2e-aws-ovn-single-node-recert-parallel

This failure is typical, ignore

@omertuc omertuc merged commit 4232294 into rh-ecosystem-edge:main Jan 30, 2024
9 of 11 checks passed
@mresvanis mresvanis deleted the mgmt-16426-hostname-ocp-postprocess branch January 31, 2024 11:28
@omertuc omertuc mentioned this pull request Jan 31, 2024
mresvanis added a commit to mresvanis/lifecycle-agent that referenced this pull request Feb 5, 2024
This change leverages recert's latest OCP post-process hostname
feature, which makes OCP's control-plane cluster operators (i.e.
etcd, kube-apiserver, kube-controller-manager, kube-scheduler)
happy, so that the latter won't trigger additional revisions. Thus
reducing the time OCP needs to stabilize after recert.

For more details please check here:
- rh-ecosystem-edge/recert#82

Signed-off-by: Michail Resvanis <[email protected]>
tsorya pushed a commit to tsorya/lifecycle-agent that referenced this pull request Feb 6, 2024
This change leverages recert's latest OCP post-process hostname
feature, which makes OCP's control-plane cluster operators (i.e.
etcd, kube-apiserver, kube-controller-manager, kube-scheduler)
happy, so that the latter won't trigger additional revisions. Thus
reducing the time OCP needs to stabilize after recert.

For more details please check here:
- rh-ecosystem-edge/recert#82

Signed-off-by: Michail Resvanis <[email protected]>
jc-rh pushed a commit to jc-rh/lifecycle-agent that referenced this pull request Feb 14, 2024
This change leverages recert's latest OCP post-process hostname
feature, which makes OCP's control-plane cluster operators (i.e.
etcd, kube-apiserver, kube-controller-manager, kube-scheduler)
happy, so that the latter won't trigger additional revisions. Thus
reducing the time OCP needs to stabilize after recert.

For more details please check here:
- rh-ecosystem-edge/recert#82

Signed-off-by: Michail Resvanis <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants