
Syncing latest changes from upstream main for ramen #157

Merged
merged 46 commits into main on Dec 12, 2023

Conversation

df-build-team

PR containing the latest commits from upstream main branch

raghavendra-talur and others added 17 commits November 9, 2023 10:54
Instead of using enable-all which changes the list dynamically.

Signed-off-by: Raghavendra Talur <[email protected]>
This may have been needed in the past, but now it breaks us since it pulls
incompatible changes from the ceph-csi main branch instead of using the
version tested with the released rook.

Signed-off-by: Nir Soffer <[email protected]>
Maybe it was needed in the past, but it does not make sense to use an image
that does not match the rook default.

Signed-off-by: Nir Soffer <[email protected]>
…leanup

On hub recovery cases when the ManifestWork does not exist on the hub
cluster, it is recreated using createVolSyncDestManifestWork.

The function creates the VRG for the volsync destination on clusters that need
to be Secondary, and has to be passed the primary cluster.

This fix addresses the above during cleanup where the cluster passed was
not the primary.

This was introduced as part of commit:
ea0c1e4

Signed-off-by: Shyamsundar Ranganathan <[email protected]>
Pulling busybox and nginx images fails now with:

    429 Too Many Requests - Server message: toomanyrequests: You have
    reached your pull rate limit

We don't have the same image versions in quay.io, so I pushed the images
we use to my own repo:

- quay.io/nirsof/busybox:stable
- quay.io/nirsof/nginx:stable-alpine-slim

We probably want to keep these images in the ramendr repo.
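
For reference, mirroring one of these images to quay.io with podman looks
roughly like this; the source tag is assumed to match the quay.io tag above:

    podman pull docker.io/library/busybox:stable
    podman tag docker.io/library/busybox:stable quay.io/nirsof/busybox:stable
    podman push quay.io/nirsof/busybox:stable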

Signed-off-by: Nir Soffer <[email protected]>
We don't depend directly on velero images, but `velero install` uses
unqualified image references, and the images exist only on docker.io, so we
fail again with the same error (too many requests).

Use a copy of the images in quay.io from my private repo. We need to move
them later to the ramendr repo.

Signed-off-by: Nir Soffer <[email protected]>
do not depend on s3 secrets propagation
for initializing DRPolicy metrics

Signed-off-by: rakeshgm <[email protected]>
This makes it possible to have one enabled list of linters. Linters that
are in the enabled list but commented out are the disabled ones; we need
to enable them after fixing the issues.
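
To check which linters are actually enabled with the explicit list, the
standard golangci-lint `linters` subcommand can be used:

    golangci-lint linters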

Signed-off-by: Raghavendra Talur <[email protected]>
These were passing for the previous version of golangci-lint. We will
enable them in the subsequent commits.

Signed-off-by: Raghavendra Talur <[email protected]>
Discouraging dot-imports is a good rule. We can enable it again now with
an exception for the Ginkgo and Gomega packages. Importing those two
packages with dot-import makes test files more readable.

Signed-off-by: Raghavendra Talur <[email protected]>
@df-build-team requested a review from a team November 23, 2023 08:04

openshift-ci bot commented Nov 23, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: df-build-team

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

nirs and others added 11 commits November 29, 2023 16:17
It seems that the recent change for supporting multiple namespaces has
broken failover with volsync when using multiple volsync apps. Add the
pvc namespace to the protected pvc, similar to the volrep code.

I did not modify the test to validate this change since I don't fully
understand the test. It seems that there is a missing `It` section for
testing the vrg.

Signed-off-by: Nir Soffer <[email protected]>
Recently the recipe CRD[1] became required if kube object protection is
enabled, so we must have it in the test environment unless we disable
the feature. Disabling the feature is not a good idea since we will want
to be able to test imperative apps or declarative apps that need recipes
(e.g. kubevirt with DataVolumeTemplate).

[1] https://github.com/RamenDR/recipe/blob/main/config/crd/bases/ramendr.openshift.io_recipes.yaml
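
For illustration, the CRD can be applied to a test cluster directly from the
repository; the raw URL below is derived from [1] and is an assumption:

    kubectl apply -f https://raw.githubusercontent.com/RamenDR/recipe/main/config/crd/bases/ramendr.openshift.io_recipes.yaml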

Signed-off-by: Nir Soffer <[email protected]>
This function looks wrong and confusing, and we need to amend it to also
handle the PVC namespace. Make it clearer before we change the behavior:

- Add function documentation explaining the purpose of the function
- Remove an unneeded temporary variable for clearer code
- Improve the comment when a mismatch is found

Signed-off-by: Nir Soffer <[email protected]>
When looking for mismatch between the source VRG and destination VRG
RDSpecs, consider both the name and namespace. If a PVC on the
destination is missing a namespace (e.g. created by older ramen) the
destination VRG will be updated.

Signed-off-by: Nir Soffer <[email protected]>
Upon hub recovery, it is essential to regenerate the VRG ManifestWorks. For CephFS workloads,
two VRG ManifestWorks are necessary—one for the primary cluster and another for the secondary
cluster. However, Ramen failed to regenerate the secondary VRG ManifestWork due to a bug. The
bug originated from a condition that determined whether to regenerate the ManifestWork based
solely on the number of VRGs found on the managed cluster. Consequently, if the count met the
required number, the regeneration was bypassed without considering the actual count of
ManifestWorks. This oversight led to ACM evicting the VRG from the secondary cluster after
the default one-hour eviction time.

Signed-off-by: Benamar Mekhissi <[email protected]>
Signed-off-by: Benamar Mekhissi <[email protected]>
To avoid idempotency issues in `clusteradm accept`[1] enable the
ManagedClusterAutoApproval feature gate, so `clusteradm accept` is not
needed.

Another way to solve this is to add the `--skip-approve-check` option to
`clusteradm accept`, but the approval step is not needed in the context of a
testing environment.

[1] open-cluster-management-io/clusteradm#395
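
As a sketch, enabling the feature gate when initializing the hub might look
like the following; the exact clusteradm flag syntax is an assumption and may
differ from what the drenv ocm addon does:

    clusteradm init --feature-gates="ManagedClusterAutoApproval=true"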

Thanks: Mike Ng <[email protected]>
Signed-off-by: Nir Soffer <[email protected]>
In OpenShift we deploy 2 s3 secrets (one per s3 store) on the hub, and
the secrets are propagated to the managed clusters using the policy
framework.

In ramenctl we deploy the secrets directly to the managed clusters. This
is much simpler and more reliable, but it bypasses the ramen code we want
to test, hiding issues in the real code path.

Change ramenctl to deploy the secrets in the same way as in OpenShift:
- Use 2 secrets, one per cluster s3 store
- Deploy the secrets only on the hub
- Wait until the secrets are propagated to the managed clusters by
  ramen.

With this, issues in ramen-related code or OCM will break ramenctl early.
Hopefully this will help to detect regressions before they reach QE or are
released in OpenShift.

Example run:

    $ ramenctl config $env
    2023-11-22 22:05:19,546 INFO    [ramenctl] Starting config
    2023-11-22 22:05:19,812 INFO    [ramenctl] Waiting until ramen-hub-operator is rolled out
    2023-11-22 22:05:19,889 INFO    [ramenctl] Creating ramen s3 secrets in cluster 'hub'
    2023-11-22 22:05:20,428 INFO    [ramenctl] Updating ramen config map in cluster 'hub'
    2023-11-22 22:05:20,716 INFO    [ramenctl] Creating dr-clusters for regional-dr
    2023-11-22 22:05:20,988 INFO    [ramenctl] Creating dr-policy for regional-dr
    2023-11-22 22:05:21,220 INFO    [ramenctl] Waiting until s3 secrets are propagated to managed clusters
    2023-11-22 22:05:22,800 INFO    [ramenctl] Waiting until DRClusters report phase
    2023-11-22 22:05:22,941 INFO    [ramenctl] Waiting until DRClusters phase is available
    2023-11-22 22:05:23,206 INFO    [ramenctl] Waiting until DRPolicy is validated
    2023-11-22 22:05:23,361 INFO    [ramenctl] Finished config in 3.82 seconds

Signed-off-by: Nir Soffer <[email protected]>
Version 0.7.0 was broken so we had to include detailed instructions for
installing 0.6.0. The issue was fixed in 0.7.1 by [1]. Update the
instructions to require the new version and remove the now unneeded
instructions.

Tested with ocm.yaml.

[1] open-cluster-management-io/clusteradm#388

Signed-off-by: Nir Soffer <[email protected]>
raghavendra-talur and others added 18 commits December 11, 2023 16:49
The mdl_version_test function was inserted in a place where it would run
before checking whether mdl was installed.

Now we have one function for each of the tools that checks whether the tool
exists and is of the right version, and then runs it.

Signed-off-by: Raghavendra Talur <[email protected]>
This is required for a future change where we will set a funlen limit
which is stricter for the main code. Test functions can be allowed to be
longer, especially because they have a tabular structure which is usually
long.

Signed-off-by: Raghavendra Talur <[email protected]>
So we can use "bin" directory in the source tree. For consistency ignore
also "testbin" at the root of the project.

Signed-off-by: Nir Soffer <[email protected]>
Like the busybox pod, but run as an init.d service that can be installed
in a cirros VM.

The service includes:
- /usr/bin/ramen - the logging "daemon" script
- /etc/init.d/ramen - the service script
- install - install script running inside the vm

The service can be installed by copying the `ramen` directory into the
VM and running the install script.
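
A minimal sketch of what the logging "daemon" in /usr/bin/ramen might look
like; the real script may differ (the 10-second interval is inferred from the
logs below):

    #!/bin/sh
    # Append a START marker, then an UPDATE marker every 10 seconds, and a
    # STOP marker on graceful shutdown (interval and shutdown handling are
    # assumptions).
    log=/var/log/ramen.log
    echo "$(date -u) START" >> $log
    trap 'echo "$(date -u) STOP" >> $log; exit 0' INT TERM
    while true; do
        sleep 10
        echo "$(date -u) UPDATE" >> $log
    done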

The log can be inspected by running:

    $ virtctl ssh cirros@sample-vm -n kubevirt-sample --context dr1 --command "tail -f /var/log/ramen.log"
    Wed Oct 25 23:43:12 UTC 2023 UPDATE
    Wed Oct 25 23:43:22 UTC 2023 UPDATE
    Wed Oct 25 23:43:32 UTC 2023 UPDATE

Example logs after failover:

    Sun Oct 29 20:23:40 UTC 2023 UPDATE
    Sun Oct 29 20:23:50 UTC 2023 UPDATE
    Sun Oct 29 20:26:24 UTC 2023 START
    Sun Oct 29 20:26:34 UTC 2023 UPDATE
    Sun Oct 29 20:26:44 UTC 2023 UPDATE

In failover we lost data written since the last snapshot (20:23:50)
until the VM was started on the failover cluster (20:26:24).

Example log after relocate:

    Sun Oct 29 20:28:14 UTC 2023 UPDATE
    Sun Oct 29 20:28:24 UTC 2023 UPDATE
    Sun Oct 29 20:28:33 UTC 2023 STOP
    Sun Oct 29 20:31:20 UTC 2023 START
    Sun Oct 29 20:31:31 UTC 2023 UPDATE
    Sun Oct 29 20:31:41 UTC 2023 UPDATE

During relocate the application was terminated gracefully at 20:28:33
and started on the other cluster at 20:31:20 without losing any data.

The STOP message depends on the cirros VM shutting down all processes
gracefully, so it may not be emitted in some cases. To ensure graceful
shutdown the user must stop the application gracefully.

Example graceful shutdown:

    virtctl ssh cirros@sample-vm -n kubevirt-sample --context dr1 --command 'sudo /etc/init.d/ramen stop'

Signed-off-by: Nir Soffer <[email protected]>
The cirros image is tiny (~30m) and starts to a login prompt in 4 seconds. It
is basically a bare-bones VM with busybox and some cloud-init support.

The cirros image is not compatible with virt-customize (used to inject
stuff into the image) since it packs the root file system inside the
initrd file system. We fix this by booting the image once with qemu-kvm.
During the first boot the cirros init system unpacks the root file system
into the root partition (/dev/sda1). From this point, the VM image is
compatible with virt-customize.
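
A hedged sketch of these two steps; the image and ISO names, memory size, and
the install path are assumptions (the nocloud ISO is described below):

    # First boot: let the cirros init system unpack the root file system
    # into /dev/sda1, then power the VM off from the console.
    qemu-kvm -m 256 -nographic \
        -drive file=cirros.qcow2,format=qcow2,if=virtio \
        -cdrom nocloud.iso

    # After the first boot the image can be customized as usual.
    virt-customize -a cirros.qcow2 \
        --copy-in ramen:/root \
        --run-command '/root/ramen/install'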

To start the VM quickly, avoiding a few minutes' delay looking up the
cloud-init metadata server, we provide a temporary instance-id via a
cloud-init nocloud ISO image.

We customize the VM with the ramen logger, to make the VM ready for DR
testing out of the box.

The VM image is packed inside a container (called a container disk in
kubevirt). Kubevirt can start VMs or import images from these container
disks.

The image is stored in my private quay.io repo for now, and will be moved
later to the ramendr repo.

Signed-off-by: Nir Soffer <[email protected]>
This verifies both ssh access and the ramen service inside the vm, and is a
good example of accessing the ramen log.

Signed-off-by: Nir Soffer <[email protected]>
We can have many releases using the same upstream version. Let's make this
easier to consume by adding a release number.

Signed-off-by: Nir Soffer <[email protected]>
This helps to understand downtime during failover and relocate. It can
also help to diagnose boot time.

Example log:

    Tue Oct 31 20:29:15 UTC 2023 START uptime=3.98

Boot time with different nested vm levels:

machine    nested     time
--------------------------
laptop          0     3.99s
laptop          1     6.11s
server          2    32.21s

Signed-off-by: Nir Soffer <[email protected]>
Starting the ramen logger first to minimize downtime when inspecting
failover or relocate flows.

This change decreases the ramen service boot time by 2.5 seconds with a
local vm and 1.4 seconds in a nested vm, but does not improve a
double-nested vm much.

Boot time with different nested vm levels:

machine    nested     before    after
-------------------------------------
laptop          0     3.99s     1.55s
laptop          1     6.11s     4.73s
server          2    33.21s    32.79s

Signed-off-by: Nir Soffer <[email protected]>
Adding the quiet[1] option disables most kernel log messages during boot.
These messages are not normally needed and are available in dmesg if needed
later.

With this change the ramen service boot time is up to 1.47 times faster.

machine    nested     before    after
-------------------------------------
laptop          0     1.55s     1.39s
laptop          1     4.73s     3.28s
server          2    32.79s    22.33s

[1] https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html

Thanks: Peter Lauterbach <[email protected]>
Signed-off-by: Nir Soffer <[email protected]>
We can generate the secret during kustomization, but it is not possible
to refer to `~/.ssh/id_rsa.pub`. We cannot use a static symlink, and if
we create a symlink we need to use

    kustomize build --load-restrictor=LoadRestrictionsNone

which makes it harder to apply the kustomization manually for testing.

To keep things simple, we copy the user's default public key to the vm
kustomization directory during the test. From this point the
kustomization can be applied manually.
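
For illustration, the copy amounts to something like this, where the
destination directory is a hypothetical placeholder:

    # Copy the user's default public key next to the vm kustomization
    # ($VM_KUSTOMIZATION_DIR is a hypothetical placeholder).
    cp ~/.ssh/id_rsa.pub "$VM_KUSTOMIZATION_DIR/id_rsa.pub"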

Signed-off-by: Nir Soffer <[email protected]>
If `minikube profile list` is called before `minikube start` it fails
because `$MINIKUBE_HOME/.minikube/profiles` does not exist[1]. I posted
a fix[2] on Jan 5 2023 but it was not accepted. Talur posted another
fix[3] on July 18 2023 but it is still waiting for review.

Fixing it in drenv until minikube maintainers find time to handle this.

[1] kubernetes/minikube#15593
[2] kubernetes/minikube#15594
[3] kubernetes/minikube#16900
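
The workaround amounts to ensuring the profiles directory exists before
listing profiles, roughly like this (how drenv implements it may differ):

    # minikube uses $MINIKUBE_HOME when set, otherwise $HOME
    mkdir -p "${MINIKUBE_HOME:-$HOME}/.minikube/profiles"
    minikube profile list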

Signed-off-by: Nir Soffer <[email protected]>
We used the $container driver (defaults to docker) to be able to run the
tests on github. This creates several issues:

- tests/scripts/drenv-selftest uses docker, so it will not reveal issues
  with the $vm driver, which is the one we really care about
- drenv tests use docker instead of vm
- developers need to install both docker (for the tests) and podman (for
  building images and ramenctl)
- users cannot use drenv-selftest (since we don't ask to install docker
  in the user guide)

Fix the issues by adding a new `vm.yaml` for testing with the $vm driver
and renaming `test.yaml` to `container.yaml`.

`test/scripts/drenv-selftest` now uses `vm.yaml`, so it tests what we
care about and can be used by users to validate the setup.

`conftest.py` now uses the vm driver by default. To use the container driver
you can run the tests with:

    DRIVER=container make test

The github workflow uses DRIVER=container to override the driver because
we don't have KVM hardware acceleration in github.

Signed-off-by: Nir Soffer <[email protected]>
We tested both $vm and $container, but we really care only about the $vm
driver. Simplify the configuration so we don't require docker.

Signed-off-by: Nir Soffer <[email protected]>
Environment was misspelled in many files.

Signed-off-by: Nir Soffer <[email protected]>
It was used only for running drenv tests since podman is broken in
github. We now use the $vm driver to run the drenv tests.

Signed-off-by: Nir Soffer <[email protected]>
Instead of starting and deleting the test cluster for each run, use an
existing cluster started by the developer (or CI). With this change
`make test` runs 6 times faster (10s vs 60s).

To start the test cluster before the test use:

    make cluster

To delete the test cluster use:

    make clean

Signed-off-by: Nir Soffer <[email protected]>
Using minikube directly is a good way to debug minikube, but for validating
drenv we have a selftest script ensuring that drenv works with minikube
on any supported platform.

Signed-off-by: Nir Soffer <[email protected]>
@raghavendra-talur merged commit 169e764 into main Dec 12, 2023
24 checks passed