boot VMIs from the checkup's setup #217

Closed
wants to merge 15 commits

Conversation

RamLavi
Collaborator

@RamLavi RamLavi commented Jan 18, 2024

The checkup currently waits until the VMI is "booted", i.e. until the guest-agent service is ready. This relies on the assumption that the service running tuned-adm + reboot runs before the guest-agent service starts. This assumption, however, is incorrect.

In order to make sure the tuned-adm commands run before the setup declares the VMIs ready, this PR moves to a new approach: polling for the existence of a marker file that is added only after tuned-adm is properly configured. The polling uses the guest-agent-ping probe in order to wait for the VMI to be ready.

Additionally, the PR changes the setup so that the two VMIs are set up and polled in parallel.

@RamLavi RamLavi changed the title Manually boot checup vm is Manually boot checkup VMIs Jan 18, 2024
@orelmisan
Member

We had an offline discussion:

  1. We are not interested in making the setup part more complex than it has to be.
  2. We are not interested in logging in to the serial console from Checkup.Setup().
  3. It was raised that we could try to explore the KubeVirt readiness probes.
  4. The new custom service introduced by Move first boot to new service #215 could potentially be removed, and its content could be moved to cloud-init.

There is no need to pass the tuned-adm-set-marker file to the function.
Remove it from the service and hard-code it into the service script on
both VM images.

Signed-off-by: Ram Lavi <[email protected]>
Mounting the hugepages folder does not work well across multiple reboots.
Move to mounting the hugepages folder via the /etc/fstab approach [0] on both VM
images.

[0] https://www.redhat.com/sysadmin/etc-fstab

Signed-off-by: Ram Lavi <[email protected]>
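An fstab-based hugepages mount as described above could look like the following sketch; the mount point and page size are illustrative assumptions, not taken from the PR:

```
# /etc/fstab — mount a hugetlbfs filesystem at every boot
# (mount point and pagesize are assumed values for illustration)
hugetlbfs  /mnt/huge  hugetlbfs  pagesize=1G  0  0
```

Because systemd generates mount units from /etc/fstab, the mount is re-established on each reboot without any custom service logic.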
The current approach to getting the VMI ready for the checkup is to run
the tuned-adm + reboot commands before the guest-agent starts. The reason
is that guest-agent readiness is the signal the checkup uses in order
to know whether the VMI has successfully booted.
Rebooting before the guest-agent is ready is important to avoid
a race where the checkup continues to the test before the tuned-adm
kernel args are set (which requires a reboot).
This approach is flawed. The ordering mentioned above applies only to
when the service starts, and does not ensure that it finishes before
the guest-agent service starts. This means that the reboot can actually
be performed later in the systemd boot sequence, well after the
guest-agent service has started. Setting the --force flag on the reboot
command in order to hasten the reboot does not change this behavior.

This commit removes the reboot from the dpdk-checkup-boot script on both
VM images, in favor of a manual reboot performed by the checkup itself,
which will be introduced in later commits.

Signed-off-by: Ram Lavi <[email protected]>
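The ordering pitfall described above can be illustrated with a hypothetical unit file (a sketch; the unit name and script path are assumptions). `Before=` only orders unit *activation*: with the default `Type=simple`, the unit counts as started as soon as its process is forked, so qemu-guest-agent may come up long before the script finishes.

```ini
# Hypothetical dpdk-checkup-boot.service illustrating the pitfall.
[Unit]
Description=DPDK first-boot configuration (sketch)
# Orders only when this unit STARTS relative to the agent,
# not when its ExecStart process FINISHES.
Before=qemu-guest-agent.service

[Service]
# With Type=simple (the default), the unit is "started" at fork time;
# Type=oneshot would delay dependents until the script exits.
Type=simple
ExecStart=/usr/local/bin/dpdk-checkup-boot.sh

[Install]
WantedBy=multi-user.target
```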
Currently the image is set up to add a service running the commands needed
to configure the guest for DPDK. This service is scheduled to start
before the guest-agent service, but in reality it takes longer to
finish running, and by then the guest-agent service is already up and
running.
Since guest-agent readiness is the criterion for considering the VMI
booted and for the checkup moving on to the test execution phase, this
service finishing after the guest-agent exposes the checkup to a
race where the checkup continues before the guest was properly
configured.
Hence, there is no point in running this service the way it currently does.
This commit removes dpdk-checkup-boot.service in favor of adding its
content to cloud-init in future commits. This has the added benefit of
simplifying the image build process.

Signed-off-by: Ram Lavi <[email protected]>
This commit adds the cloud-init script to the VMs, mounts it, and runs
it from a new runcmd section of the cloud-init configuration, using the
guest-agent-ping probe as an example [0].

[0]
https://kubevirt.io/user-guide/virtual_machines/liveness_and_readiness_probes/

Signed-off-by: Ram Lavi <[email protected]>
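The runcmd approach described above could be sketched as the following cloud-init user data; the script path and marker-file path are illustrative assumptions, not taken from the PR:

```yaml
#cloud-config
# Sketch: run the DPDK guest configuration from cloud-init and drop a
# marker file only after it succeeds (paths are assumed for illustration).
runcmd:
  - /usr/local/bin/dpdk-checkup-boot.sh
  - touch /var/lib/dpdk-checkup/tuned-adm-set-marker
```

Because runcmd executes on first boot as part of cloud-init, the marker file reliably signals that the configuration commands actually completed, independent of systemd service ordering.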
This commit adds a new ConfigMap that will be consumed by the
vmi-under-test.
The unit test is generalized in order to cover any ConfigMap being
created/deleted.

Signed-off-by: Ram Lavi <[email protected]>
This commit enables the guest-agent-exec option on the guest, in order
to use the probe polling mechanism that will be introduced in future
commits.

Signed-off-by: Ram Lavi <[email protected]>
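On RHEL-family guest images, qemu-guest-agent typically ships with the exec RPCs blocked by a deny list in its sysconfig file; enabling guest-agent-exec amounts to clearing that list. This is a sketch under that assumption (the variable name differs between qemu-guest-agent versions):

```
# /etc/sysconfig/qemu-ga (RHEL/CentOS-family guests) — sketch.
# Clearing the deny list allows guest-exec / guest-exec-status RPCs,
# which guest-side exec probes rely on. Older versions use BLACKLIST;
# newer ones use BLOCK_RPCS.
BLACKLIST=""
```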
Currently the setup function only waits for the VMI to boot, i.e. for
the guest-agent condition to become ready.
This commit moves the waitForVMIToBoot function into a new function,
setupVMIWaitReady, in preparation for the next commit, where more
actions will be taken on the VMIs.

Signed-off-by: Ram Lavi <[email protected]>
Currently the checkup setup only waits for the VMI to finish "booting",
i.e. for the guest-agent service to be ready.
However, this is not enough to ensure that the VMI has been configured,
a procedure currently done by the cloud-init service.
When the configuration is complete, the script configuring the guest in
the cloud-init service adds a marker file.

This commit:
- introduces a new waiting mechanism, using the guest-agent-ping probe [0],
that polls the guest until the file is present, and only then sets the
VMI ready condition to true.
- adds a wait for the VMI ready condition to become true.

[0]
https://kubevirt.io/user-guide/virtual_machines/liveness_and_readiness_probes/#defining-guest-agent-ping-probes

Signed-off-by: Ram Lavi <[email protected]>
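A guest-side readiness probe along the lines described in [0] could be sketched as follows. The marker-file path is an assumed illustration; an exec probe like this is executed through the qemu-guest-agent, which is why guest-agent-exec has to be enabled in the guest:

```yaml
# Sketch of a VMI readinessProbe polling for the marker file
# (path and timing values are illustrative assumptions).
spec:
  readinessProbe:
    exec:
      command: ["cat", "/var/lib/dpdk-checkup/tuned-adm-set-marker"]
    initialDelaySeconds: 10
    periodSeconds: 5
    failureThreshold: 30
```

The VMI's Ready condition then flips to true only once the command succeeds, i.e. once the configuration script has dropped the marker file.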
In order to allow a soft reboot of a VirtualMachineInstance object during
the checkup's setup phase, add the relevant Role object.

Signed-off-by: Ram Lavi <[email protected]>
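A Role granting the soft-reboot permission could look like the sketch below; the Role name is illustrative, while the API group and subresource follow KubeVirt's subresource RBAC convention:

```yaml
# Sketch: allow soft-rebooting VMIs in the checkup's namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kubevirt-dpdk-checkup-vmi-softreboot   # assumed name
rules:
  - apiGroups: ["subresources.kubevirt.io"]
    resources: ["virtualmachineinstances/softreboot"]
    verbs: ["update"]
```

A matching RoleBinding to the checkup's ServiceAccount would then be needed for the setup code to invoke the subresource.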
The guest-agent-ping poll waits until the marker file, which indicates
that the VMI has been properly configured, is present before the VMI's
ready condition is set.
This commit adds a soft reboot of the VMI after it becomes ready.
The reboot is necessary for the tuned-adm command, set by the
cloud-init service, to take effect on the kernel args.

Signed-off-by: Ram Lavi <[email protected]>
Currently, setup is performed on the VMIs serially.
In order to reduce the wait time, run waitForVMIReady in parallel.

Signed-off-by: Ram Lavi <[email protected]>
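The parallel setup described above could be sketched as a small fan-out helper. This is an illustrative stand-in: the real checkup's waitForVMIReady talks to the KubeVirt API and is not reproduced here.

```go
package main

import (
	"fmt"
	"sync"
)

// setupVMIsInParallel runs the given per-VMI setup function for every VMI
// name concurrently and collects one error slot per VMI, preserving order.
func setupVMIsInParallel(names []string, setup func(name string) error) []error {
	errs := make([]error, len(names))
	var wg sync.WaitGroup
	for i, name := range names {
		wg.Add(1)
		go func(i int, name string) {
			defer wg.Done()
			errs[i] = setup(name)
		}(i, name)
	}
	wg.Wait()
	return errs
}

func main() {
	// VMI names match those used by the checkup; the setup func is a
	// stand-in for the real waitForVMIReady.
	names := []string{"vmi-under-test", "dpdk-traffic-gen"}
	errs := setupVMIsInParallel(names, func(name string) error {
		fmt.Printf("VMI %q is ready\n", name)
		return nil
	})
	for _, err := range errs {
		if err != nil {
			fmt.Println("setup failed:", err)
			return
		}
	}
	fmt.Println("all VMIs ready")
}
```

Collecting errors into a pre-sized slice (one slot per goroutine) avoids the need for a mutex, since each goroutine writes only its own index.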
@RamLavi
Collaborator Author

RamLavi commented Jan 24, 2024

Passes e2e on a CNV4.15 cluster:

make test/e2e
podman run --rm \
           --volume /home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup:/home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup:Z \
           --volume /home/ralavi/.kube/sno01-cnvqe2-rdu2:/root/.kube:Z,ro \
           --workdir /home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup \
           -e KUBECONFIG=/root/.kube/kubeconfig \
           -e TEST_CHECKUP_IMAGE=quay.io/ramlavi/kubevirt-dpdk-checkup:latest \
           -e TEST_NAMESPACE=dpdk-checkup-ns-1 \
           -e NETWORK_ATTACHMENT_DEFINITION_NAME=dpdk-sriovnetwork-ns-1 \
           -e TRAFFIC_GEN_CONTAINER_DISK_IMAGE=quay.io/ramlavi/kubevirt-dpdk-checkup-traffic-gen:latest \
           -e VM_UNDER_TEST_CONTAINER_DISK_IMAGE=quay.io/ramlavi/kubevirt-dpdk-checkup-vm:latest \
           docker.io/library/golang:1.20.12 go test ./tests/... -test.v -test.timeout=1h -ginkgo.v -ginkgo.timeout=1h
=== RUN   TestKubevirtDpdkCheckup
Running Suite: KubevirtDpdkCheckup Suite - /home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup/tests
==============================================================================================================
Random Seed: 1706100921

Will run 1 of 1 specs
------------------------------
[BeforeSuite] 
/home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup/tests/test_suite_test.go:56
[BeforeSuite] PASSED [0.002 seconds]
------------------------------
Execute the checkup Job should complete successfully
/home/ralavi/go/src/github.com/kiagnose/kubevirt-dpdk-checkup/tests/checkup_test.go:83
• [315.344 seconds]
------------------------------

Ran 1 of 1 Specs in 315.346 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestKubevirtDpdkCheckup (315.35s)
PASS

Logs (verbose=false, as it adds no value for this PR):

2024/01/24 12:55:25 kubevirt-dpdk-checkup starting...
2024/01/24 12:55:25 Using the following config:
2024/01/24 12:55:25 "timeout": "1h0m0s"
2024/01/24 12:55:25 "networkAttachmentDefinitionName": "dpdk-sriovnetwork-ns-1"
2024/01/24 12:55:25 "trafficGenContainerDiskImage": "quay.io/ramlavi/kubevirt-dpdk-checkup-traffic-gen:latest"
2024/01/24 12:55:25 "trafficGenTargetNodeName": ""
2024/01/24 12:55:25 "trafficGenPacketsPerSecond": "8m"
2024/01/24 12:55:25 "TrafficGenEastMacAddress": "50:dc:ef:91:3b:01"
2024/01/24 12:55:25 "TrafficGenWestMacAddress": "50:f8:c5:e3:0a:02"
2024/01/24 12:55:25 "vmUnderTestContainerDiskImage": "quay.io/ramlavi/kubevirt-dpdk-checkup-vm:latest"
2024/01/24 12:55:25 "vmUnderTestTargetNodeName": ""
2024/01/24 12:55:25 "VMUnderTestEastMacAddress": "60:15:8b:11:54:01"
2024/01/24 12:55:25 "VMUnderTestWestMacAddress": "60:f6:f1:12:aa:02"
2024/01/24 12:55:25 "testDuration": "1m0s"
2024/01/24 12:55:25 "portBandwidthGbps": "10"
2024/01/24 12:55:25 "verbose": false
2024/01/24 12:55:25 Creating ConfigMap "dpdk-checkup-ns-1/dpdk-traffic-gen-config-pwr7x"...
2024/01/24 12:55:25 Creating ConfigMap "dpdk-checkup-ns-1/vmi-under-test-config-pwr7x"...
2024/01/24 12:55:25 Creating VMI "dpdk-checkup-ns-1/vmi-under-test-pwr7x"...
2024/01/24 12:55:25 Creating VMI "dpdk-checkup-ns-1/dpdk-traffic-gen-pwr7x"...
2024/01/24 12:55:25 Waiting for VMI "dpdk-checkup-ns-1/dpdk-traffic-gen-pwr7x" to boot...
2024/01/24 12:55:25 Waiting for VMI "dpdk-checkup-ns-1/vmi-under-test-pwr7x" to boot...
2024/01/24 12:56:35 VMI "dpdk-checkup-ns-1/vmi-under-test-pwr7x" had successfully booted
2024/01/24 12:56:35 Waiting for VMI "dpdk-checkup-ns-1/vmi-under-test-pwr7x" ready condition...
2024/01/24 12:56:40 VMI "dpdk-checkup-ns-1/dpdk-traffic-gen-pwr7x" had successfully booted
2024/01/24 12:56:40 Waiting for VMI "dpdk-checkup-ns-1/dpdk-traffic-gen-pwr7x" ready condition...
2024/01/24 12:57:10 VMI "dpdk-checkup-ns-1/vmi-under-test-pwr7x" has successfully reached ready condition
2024/01/24 12:57:10 Performing boot for tuned-adm profile to take affect on VMI "dpdk-checkup-ns-1/vmi-under-test-pwr7x"...
{"component":"","level":"info","msg":"SoftReboot VMI","pos":"vmi.go:268","timestamp":"2024-01-24T12:57:10.802933Z"}
2024/01/24 12:57:10 VMI "dpdk-checkup-ns-1/dpdk-traffic-gen-pwr7x" has successfully reached ready condition
2024/01/24 12:57:10 Performing boot for tuned-adm profile to take affect on VMI "dpdk-checkup-ns-1/dpdk-traffic-gen-pwr7x"...
{"component":"","level":"info","msg":"SoftReboot VMI","pos":"vmi.go:268","timestamp":"2024-01-24T12:57:10.830319Z"}
2024/01/24 12:57:11 Login to VMI under test...
2024/01/24 12:58:22 Login to traffic generator...
2024/01/24 12:58:34 Starting traffic generator Server Service...
2024/01/24 12:58:34 Waiting until traffic generator Server Service is ready...
2024/01/24 12:59:00 trex-server is now ready
2024/01/24 12:59:00 Starting testpmd in VMI...
2024/01/24 12:59:05 Clearing testpmd stats in VMI...
2024/01/24 12:59:05 Clearing Trex console stats before test...
2024/01/24 12:59:07 Running traffic for 1m0s...
2024/01/24 12:59:07 Monitoring traffic generator side drop rates every 10s during the test duration...
2024/01/24 13:00:07 finished polling for drop rates
2024/01/24 13:00:07 traffic Generator Max Drop Rate: 0.000000Bps
2024/01/24 13:00:12 traffic Generator port 0 Packet output errors: 0
2024/01/24 13:00:12 traffic Generator port 1 Packet output errors: 0
2024/01/24 13:00:12 traffic Generator packet sent via port 0: 480000004
2024/01/24 13:00:12 get testpmd stats in VM-Under-Test...
2024/01/24 13:00:14 VMI-Under-Test's side packets Dropped: Rx: 0; TX: 0
2024/01/24 13:00:14 VMI-Under-Test's side test packets received (including dropped, excluding non-related packets): 480000004
2024/01/24 13:00:14 Trying to delete VMI: "dpdk-checkup-ns-1/vmi-under-test-pwr7x"
2024/01/24 13:00:14 Trying to delete VMI: "dpdk-checkup-ns-1/dpdk-traffic-gen-pwr7x"
2024/01/24 13:00:14 Deleting ConfigMap "dpdk-checkup-ns-1/dpdk-traffic-gen-config-pwr7x"...
2024/01/24 13:00:14 Deleting ConfigMap "dpdk-checkup-ns-1/vmi-under-test-config-pwr7x"...
2024/01/24 13:00:14 Waiting for VMI "dpdk-checkup-ns-1/vmi-under-test-pwr7x" to be deleted...
2024/01/24 13:00:29 VMI "dpdk-checkup-ns-1/vmi-under-test-pwr7x" was deleted successfully
2024/01/24 13:00:29 Waiting for VMI "dpdk-checkup-ns-1/dpdk-traffic-gen-pwr7x" to be deleted...
2024/01/24 13:00:29 VMI "dpdk-checkup-ns-1/dpdk-traffic-gen-pwr7x" was deleted successfully

@RamLavi RamLavi changed the title Manually boot checkup VMIs boot VMIs from the checkuo's setup Jan 24, 2024
@RamLavi RamLavi changed the title boot VMIs from the checkuo's setup boot VMIs from the checkup's setup Jan 24, 2024
@RamLavi
Collaborator Author

RamLavi commented Jan 25, 2024

Broke this PR into multiple PRs: #218, #219, #220, #221, #223, #224. Closing this PR.

@RamLavi RamLavi closed this Jan 25, 2024