
Syncing latest changes from upstream main for ramen #157

Merged
merged 46 commits into main on Dec 12, 2023

Conversation

df-build-team

PR containing the latest commits from upstream main branch

raghavendra-talur and others added 17 commits November 9, 2023 10:54
Instead of using enable-all which changes the list dynamically.

Signed-off-by: Raghavendra Talur <[email protected]>
This may have been needed in the past, but now it breaks us since it pulls
incompatible changes from the ceph-csi main branch instead of using the
version tested with the released rook.

Signed-off-by: Nir Soffer <[email protected]>
Maybe it was needed in the past, but it does not make sense to use an image
that does not match the rook default.

Signed-off-by: Nir Soffer <[email protected]>
…leanup

On hub recovery cases when the ManifestWork does not exist on the hub
cluster, it is recreated using createVolSyncDestManifestWork.

The function creates the VRG for the volsync destination on clusters that need
to be Secondary, and has to be passed the primary cluster.

This fix addresses the above during cleanup where the cluster passed was
not the primary.

This was introduced as part of commit:
ea0c1e4

Signed-off-by: Shyamsundar Ranganathan <[email protected]>
Pulling busybox and nginx images fails now with:

    429 Too Many Requests - Server message: toomanyrequests: You have
    reached your pull rate limit

We don't have the same image versions in quay.io, so I pushed the images
we use to my own repo:

- quay.io/nirsof/busybox:stable
- quay.io/nirsof/nginx:stable-alpine-slim

We probably want to keep these images in the ramendr repo.
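
For reference, mirroring one of these images to quay.io with podman looks
roughly like this; the source tag is assumed to match the quay.io tag above:

    podman pull docker.io/library/busybox:stable
    podman tag docker.io/library/busybox:stable quay.io/nirsof/busybox:stable
    podman push quay.io/nirsof/busybox:stable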

Signed-off-by: Nir Soffer <[email protected]>
We don't depend directly on velero images, but `velero install` uses
unqualified image references, and the images exist only on docker.io, so we
fail again with the same error (too many requests).

Use a copy of the images in quay.io from my private repo. We need to move
them later to the ramendr repo.

Signed-off-by: Nir Soffer <[email protected]>
do not depend on s3 secrets propagation
for initializing DRPolicy metrics

Signed-off-by: rakeshgm <[email protected]>
This makes it possible to have one enabled list of linters. Linters that
are in the enabled list but commented out are the disabled ones; we need
to enable them after fixing the issues.
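
To check which linters are actually enabled with the explicit list, the
standard golangci-lint `linters` subcommand can be used:

    golangci-lint linters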

Signed-off-by: Raghavendra Talur <[email protected]>
These were passing for the previous version of golangci-lint. We will
enable them in the subsequent commits.

Signed-off-by: Raghavendra Talur <[email protected]>
Discouraging dot-imports is a good rule. We can enable it again now with
an exception for the Ginkgo and Gomega packages. Importing those two
packages with dot-import makes test files more readable.

Signed-off-by: Raghavendra Talur <[email protected]>
@df-build-team requested a review from a team November 23, 2023 08:04

openshift-ci bot commented Nov 23, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: df-build-team

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

nirs and others added 11 commits November 29, 2023 16:17
It seems that the recent change for supporting multiple namespaces has
broken failover with volsync when using multiple volsync apps. Add the
pvc namespace to the protected pvc, similar to the volrep code.

I did not modify the test to validate this change since I don't fully
understand the test. It seems that there is a missing `It` section for
testing the vrg.

Signed-off-by: Nir Soffer <[email protected]>
Recently the recipe CRD[1] became required if kube object protection is
enabled, so we must have it in the test environment unless we disable
the feature. Disabling the feature is not a good idea since we will want
to be able to test imperative apps or declarative apps that need recipes
(e.g. kubevirt with DataVolumeTemplate).

[1] https://github.com/RamenDR/recipe/blob/main/config/crd/bases/ramendr.openshift.io_recipes.yaml
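
For illustration, the CRD can be applied to a test cluster directly from the
repository; the raw URL below is derived from [1] and is an assumption:

    kubectl apply -f https://raw.githubusercontent.com/RamenDR/recipe/main/config/crd/bases/ramendr.openshift.io_recipes.yaml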

Signed-off-by: Nir Soffer <[email protected]>
This function looks wrong and confusing, and we need to amend it to also
handle the PVC namespace. Make it clearer before we change the behavior:

- Add function documentation explaining the purpose of the function
- Remove an unneeded temporary variable for clearer code
- Improve the comment when a mismatch is found

Signed-off-by: Nir Soffer <[email protected]>
When looking for mismatch between the source VRG and destination VRG
RDSpecs, consider both the name and namespace. If a PVC on the
destination is missing a namespace (e.g. created by older ramen) the
destination VRG will be updated.

Signed-off-by: Nir Soffer <[email protected]>
Upon hub recovery, it is essential to regenerate the VRG ManifestWorks. For CephFS workloads,
two VRG ManifestWorks are necessary—one for the primary cluster and another for the secondary
cluster. However, Ramen failed to regenerate the secondary VRG ManifestWork due to a bug. The
bug originated from a condition that determined whether to regenerate the ManifestWork based
solely on the number of VRGs found on the managed cluster. Consequently, if the count met the
required number, the regeneration was bypassed without considering the actual count of
ManifestWorks. This oversight led to ACM evicting the VRG from the secondary cluster after
the default one-hour eviction time.

Signed-off-by: Benamar Mekhissi <[email protected]>
Signed-off-by: Benamar Mekhissi <[email protected]>
To avoid idempotency issues in `clusteradm accept`[1] enable the
ManagedClusterAutoApproval feature gate, so `clusteradm accept` is not
needed.

Another way to solve this is to add the `--skip-approve-check` option to
`clusteradm accept`, but the approval step is not needed in the context of a
testing environment.

[1] open-cluster-management-io/clusteradm#395
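
As a sketch, enabling the feature gate when initializing the hub might look
like the following; the exact clusteradm flag syntax is an assumption and may
differ from what the drenv ocm addon does:

    clusteradm init --feature-gates="ManagedClusterAutoApproval=true"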

Thanks: Mike Ng <[email protected]>
Signed-off-by: Nir Soffer <[email protected]>
In OpenShift we deploy 2 s3 secrets (one per s3 store) on the hub, and
the secrets are propagated to the managed clusters using the policy
framework.

In ramenctl we deploy the secrets directly to the managed clusters. This
is much simpler and more reliable, but it bypasses the ramen code we want
to test, hiding issues in the real code path.

Change ramenctl to deploy the secrets in the same way as in OpenShift:
- Use 2 secrets, one per cluster s3 store
- Deploy the secrets only on the hub
- Wait until the secrets are propagated to the managed clusters by
  ramen.

With this, issues in ramen-related code or OCM will break ramenctl early.
Hopefully this will help to detect regressions before they reach QE or are
released in OpenShift.

Example run:

    $ ramenctl config $env
    2023-11-22 22:05:19,546 INFO    [ramenctl] Starting config
    2023-11-22 22:05:19,812 INFO    [ramenctl] Waiting until ramen-hub-operator is rolled out
    2023-11-22 22:05:19,889 INFO    [ramenctl] Creating ramen s3 secrets in cluster 'hub'
    2023-11-22 22:05:20,428 INFO    [ramenctl] Updating ramen config map in cluster 'hub'
    2023-11-22 22:05:20,716 INFO    [ramenctl] Creating dr-clusters for regional-dr
    2023-11-22 22:05:20,988 INFO    [ramenctl] Creating dr-policy for regional-dr
    2023-11-22 22:05:21,220 INFO    [ramenctl] Waiting until s3 secrets are propagated to managed clusters
    2023-11-22 22:05:22,800 INFO    [ramenctl] Waiting until DRClusters report phase
    2023-11-22 22:05:22,941 INFO    [ramenctl] Waiting until DRClusters phase is available
    2023-11-22 22:05:23,206 INFO    [ramenctl] Waiting until DRPolicy is validated
    2023-11-22 22:05:23,361 INFO    [ramenctl] Finished config in 3.82 seconds

Signed-off-by: Nir Soffer <[email protected]>
Version 0.7.0 was broken so we had to include detailed instructions for
installing 0.6.0. The issue was fixed in 0.7.1 by [1]. Update the
instructions to require the new version and remove the now unneeded
instructions.

Tested with ocm.yaml.

[1] open-cluster-management-io/clusteradm#388

Signed-off-by: Nir Soffer <[email protected]>
raghavendra-talur and others added 18 commits December 11, 2023 16:49
The mdl_version_test function was inserted in a place where it would run
before checking whether mdl was installed.

Now we have one function for each of the tools that checks whether the tool
exists and is of the right version, and then runs it.

Signed-off-by: Raghavendra Talur <[email protected]>
This is required for a future change where we will set a funlen limit
which is stricter for the main code. Test functions can be allowed to be
longer, especially because they have a tabular structure which is usually
long.

Signed-off-by: Raghavendra Talur <[email protected]>
So we can use "bin" directory in the source tree. For consistency ignore
also "testbin" at the root of the project.

Signed-off-by: Nir Soffer <[email protected]>
Like the busybox pod, but run as an init.d service that can be installed
in a cirros VM.

The service includes:
- /usr/bin/ramen - the logging "daemon" script
- /etc/init.d/ramen - the service script
- install - install script running inside the vm

The service can be installed by copying the `ramen` directory into the
VM and running the install script.
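
A minimal sketch of what the logging "daemon" in /usr/bin/ramen might look
like; the real script may differ (the 10-second interval is inferred from the
logs below):

    #!/bin/sh
    # Append a START marker, then an UPDATE marker every 10 seconds, and a
    # STOP marker on graceful shutdown (interval and shutdown handling are
    # assumptions).
    log=/var/log/ramen.log
    echo "$(date -u) START" >> $log
    trap 'echo "$(date -u) STOP" >> $log; exit 0' INT TERM
    while true; do
        sleep 10
        echo "$(date -u) UPDATE" >> $log
    done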

The log can be inspected by running:

    $ virtctl ssh cirros@sample-vm -n kubevirt-sample --context dr1 --command "tail -f /var/log/ramen.log"
    Wed Oct 25 23:43:12 UTC 2023 UPDATE
    Wed Oct 25 23:43:22 UTC 2023 UPDATE
    Wed Oct 25 23:43:32 UTC 2023 UPDATE

Example logs after failover:

    Sun Oct 29 20:23:40 UTC 2023 UPDATE
    Sun Oct 29 20:23:50 UTC 2023 UPDATE
    Sun Oct 29 20:26:24 UTC 2023 START
    Sun Oct 29 20:26:34 UTC 2023 UPDATE
    Sun Oct 29 20:26:44 UTC 2023 UPDATE

In failover we lost data written since the last snapshot (20:23:50)
until the VM was started on the failover cluster (20:26:24).

Example log after relocate:

    Sun Oct 29 20:28:14 UTC 2023 UPDATE
    Sun Oct 29 20:28:24 UTC 2023 UPDATE
    Sun Oct 29 20:28:33 UTC 2023 STOP
    Sun Oct 29 20:31:20 UTC 2023 START
    Sun Oct 29 20:31:31 UTC 2023 UPDATE
    Sun Oct 29 20:31:41 UTC 2023 UPDATE

During relocate the application was terminated gracefully at 20:28:33
and started on the other cluster at 20:31:20 without losing any data.

The STOP message depends on the cirros VM shutting down all processes
gracefully, so it may not be emitted in some cases. To ensure graceful
shutdown the user must stop the application gracefully.

Example graceful shutdown:

    virtctl ssh cirros@sample-vm -n kubevirt-sample --context dr1 --command 'sudo /etc/init.d/ramen stop'

Signed-off-by: Nir Soffer <[email protected]>
The cirros image is tiny (~30m) and starts to a login prompt in 4 seconds. It
is basically a bare-bones VM with busybox and some cloud-init support.

The cirros image is not compatible with virt-customize (used to inject
stuff into the image) since it packs the root file system inside the
initrd file system. We fix this by booting the image once with qemu-kvm.
During the first boot the cirros init system unpacks the root file system
into the root partition (/dev/sda1). From this point, the VM image is
compatible with virt-customize.
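
A hedged sketch of these two steps; the image and ISO names, memory size, and
the install path are assumptions (the nocloud ISO is described below):

    # First boot: let the cirros init system unpack the root file system
    # into /dev/sda1, then power the VM off from the console.
    qemu-kvm -m 256 -nographic \
        -drive file=cirros.qcow2,format=qcow2,if=virtio \
        -cdrom nocloud.iso

    # After the first boot the image can be customized as usual.
    virt-customize -a cirros.qcow2 \
        --copy-in ramen:/root \
        --run-command '/root/ramen/install'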

To start the VM quickly, avoiding a few minutes' delay looking up the
cloud-init metadata server, we provide a temporary instance-id via a
cloud-init nocloud ISO image.

We customize the VM with the ramen logger, to make the VM ready for DR
testing out of the box.

The VM image is packed inside a container (called a container disk in
kubevirt). Kubevirt can start VMs or import images from these container
disks.

The image is stored in my private quay.io repo for now, and will be moved
later to the ramendr repo.

Signed-off-by: Nir Soffer <[email protected]>
This verifies both ssh access and the ramen service inside the vm, and is a
good example of accessing the ramen log.

Signed-off-by: Nir Soffer <[email protected]>
We can have many releases using the same upstream version. Let's make this
easier to consume by adding a release number.

Signed-off-by: Nir Soffer <[email protected]>
This helps to understand downtime during failover and relocate. It can
also help to diagnose boot time.

Example log:

    Tue Oct 31 20:29:15 UTC 2023 START uptime=3.98

Boot time with different nested vm levels:

machine    nested     time
--------------------------
laptop          0     3.99s
laptop          1     6.11s
server          2    32.21s

Signed-off-by: Nir Soffer <[email protected]>
Starting the ramen logger first to minimize downtime when inspecting
failover or relocate flows.

This change decreases the ramen service boot time by 2.5 seconds with a
local vm and 1.4 seconds in a nested vm, but does not improve a
double-nested vm much.

Boot time with different nested vm levels:

machine    nested     before    after
-------------------------------------
laptop          0     3.99s     1.55s
laptop          1     6.11s     4.73s
server          2    33.21s    32.79s

Signed-off-by: Nir Soffer <[email protected]>
Adding the quiet[1] option disables most kernel log messages during boot.
These messages are not normally needed and are available in dmesg if needed
later.

With this change the ramen service boot time is up to 1.47 times faster.

machine    nested     before    after
-------------------------------------
laptop          0     1.55s     1.39s
laptop          1     4.73s     3.28s
server          2    32.79s    22.33s

[1] https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html

Thanks: Peter Lauterbach <[email protected]>
Signed-off-by: Nir Soffer <[email protected]>
We can generate the secret during kustomization, but it is not possible
to refer to `~/.ssh/id_rsa.pub`. We cannot use a static symlink, and if
we create a symlink we need to use

    kustomize build --load-restrictor=LoadRestrictionsNone

which makes it harder to apply the kustomization manually for testing.

To keep things simple, we copy the user's default public key to the vm
kustomization directory during the test. From this point the
kustomization can be applied manually.
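
For illustration, the copy amounts to something like this, where the
destination directory is a hypothetical placeholder:

    # Copy the user's default public key next to the vm kustomization
    # ($VM_KUSTOMIZATION_DIR is a hypothetical placeholder).
    cp ~/.ssh/id_rsa.pub "$VM_KUSTOMIZATION_DIR/id_rsa.pub"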

Signed-off-by: Nir Soffer <[email protected]>
If `minikube profile list` is called before `minikube start` it fails
because `$MINIKUBE_HOME/.minikube/profiles` does not exist[1]. I posted
a fix[2] on Jan 5 2023 but it was not accepted. Talur posted another
fix[3] on July 18 2023 but it is still waiting for review.

Fixing it in drenv until minikube maintainers find time to handle this.

[1] kubernetes/minikube#15593
[2] kubernetes/minikube#15594
[3] kubernetes/minikube#16900
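
The workaround amounts to ensuring the profiles directory exists before
listing profiles, roughly like this (how drenv implements it may differ):

    # minikube uses $MINIKUBE_HOME when set, otherwise $HOME
    mkdir -p "${MINIKUBE_HOME:-$HOME}/.minikube/profiles"
    minikube profile list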

Signed-off-by: Nir Soffer <[email protected]>
We used the $container driver (defaults to docker) to be able to run the
tests on github. This creates several issues:

- tests/scripts/drenv-selftest uses docker, so it will not reveal issues
  with the $vm driver, which is the one we really care about
- drenv tests use docker instead of vm
- developers need to install both docker (for the tests) and podman (for
  building images and ramenctl)
- users cannot use drenv-selftest (since we don't ask to install docker
  in the user guide)

Fix the issues by adding a new `vm.yaml` for testing with the $vm driver
and renaming `test.yaml` to `container.yaml`.

`test/scripts/drenv-selftest` now uses `vm.yaml`, so it tests what we
care about and can be used by users to validate the setup.

`conftest.py` now uses the vm driver by default. To use the container driver
you can run the tests with:

    DRIVER=container make test

The github workflow uses DRIVER=container to override the driver because
we don't have KVM hardware acceleration in github.

Signed-off-by: Nir Soffer <[email protected]>
We tested both $vm and $container, but we really care only about the $vm
driver. Simplify the configuration so we don't require docker.

Signed-off-by: Nir Soffer <[email protected]>
Environment was misspelled in many files.

Signed-off-by: Nir Soffer <[email protected]>
It was used only for running drenv tests since podman is broken in
github. We now use the $vm driver to run the drenv tests.

Signed-off-by: Nir Soffer <[email protected]>
Instead of starting and deleting the test cluster for each run, use an
existing cluster started by the developer (or CI). With this change
`make test` runs 6 times faster (10s vs 60s).

To start the test cluster before the test use:

    make cluster

To delete the test cluster use:

    make clean

Signed-off-by: Nir Soffer <[email protected]>
Using minikube directly is a good way to debug minikube, but for validating
drenv we have a selftest script ensuring that drenv works with minikube
on any supported platform.

Signed-off-by: Nir Soffer <[email protected]>
@raghavendra-talur merged commit 169e764 into main Dec 12, 2023
24 checks passed