
Check whether build passes. Do not reuse builds across tests. #445

Merged: 1 commit merged into sclorg:master on Sep 30, 2024

Conversation

@SlouchyButton (Contributor) commented Aug 28, 2024

Fixes: #444

This contains the same change as in ruby container PR sclorg/s2i-ruby-container#556.

The other change is related to ruby container issue sclorg/s2i-ruby-container#557.

The ruby container issue has a more thorough description of what specifically the problem is.

The reason docker kill has to be changed to docker stop in test/test-lib-nodejs.sh (line 176) is that docker kill doesn't wait for the container to stop: it forcefully kills it and continues. This creates a race condition with the cleanup function. When docker rmi -f is called, the previous kill hasn't fully stopped the container yet, so docker rmi -f still sees the existing container. Because we are forcing image deletion, it tries to delete the container as well, but before it gets through its internal process, the container is stopped and removed as a consequence of the earlier docker kill.

When docker rmi -f then reaches the point where it wants to delete the container, the container no longer exists and the command fails. docker stop, on the other hand, does not work asynchronously: it tries to stop the container gracefully and, if the container doesn't respond, kills it forcefully, continuing only after the kill has succeeded.
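
In shell terms, the ordering problem and the fix look roughly like this (a sketch; the variable names are illustrative, not the exact code from test/test-lib-nodejs.sh):

# before: docker kill returns immediately, so the container is still being
# torn down while the image cleanup runs
docker kill "$(cat "$cid_file")"
docker rmi -f "$image_name"   # still sees the container, then fails mid-delete

# after: docker stop blocks until the container is down (SIGTERM first,
# SIGKILL after the timeout), so the cleanup runs against a settled state
docker stop "$(cat "$cid_file")"
docker rmi -f "$image_name"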

@hhorak (Member) left a comment

Please include the PR description in the commit as well, GitHub might not be here forever (if somebody from GitHub reads this, be aware that I wish it will be here forever, but you know... Microsoft... well, anyway...)

For the change itself, I do understand the kill/stop part, but I don't understand why you need to re-define the evaluate_build_result function. I understand what it does, but when do we need to remove a container image when the build failed? Then nothing should be built, no? Or when do we expect the build to fail?

@hhorak (Member) commented Aug 29, 2024

[test]

@SlouchyButton (Contributor, Author) commented Aug 30, 2024

why you need to re-define evaluate_build_result function

That is related to this: sclorg/s2i-ruby-container#556
I wanted both projects to use the same approach to detecting a failed build.

The problem with ct_check_testcase_result is that it will happily continue and ignore that the build failed. It also won't add any entry to the test results. And (in the case of the ruby container) it didn't cause the resulting return code of the test suite to be non-zero if the build failed. You pointing it out made me take another look at that code, and it needs a small change.

What I am trying to achieve with the evaluate_build_result function is for the tests to fail early and gracefully. If we rely only on ct_check_testcase_result, the build might fail and the suite would still continue and start a test on that build. In the ruby container, this resulted in a test reusing a built image from a previous test. Adding evaluate_build_result, together with skipping the test when the build failed, adds a new entry to the test results that informs you the build failed.

It would fail early and not run any tests on that image, since it doesn't make sense to even try to run tests on an image that failed to build. Otherwise, for every test that depends on that image, we would run the tests and fail on podman trying to start an image that doesn't exist. It would still work, but I don't think it's a clean solution. It also helps with debugging, as you immediately see that the build failed, not the tests themselves.

Both changes, checking that the build passes and cleaning the images between tests, are redundant with each other. Either one would fix the underlying issue, but in my opinion each has a slightly different purpose. Checking that the build passes lets us be more verbose for debugging and stop early, resulting in faster tests. Cleaning the images removes the possibility of starting an image that we didn't want to test at all.

The problem here is that I missed the part that should actually skip the test if the build failed; I will add that.
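
To make the intent concrete, the shape of the check is roughly this (a simplified sketch; the names and the exact way the failure is recorded differ in the actual change):

function evaluate_build_result() {
  # $1 is the exit code of the build, $2 an illustrative name of the test app
  local result="$1"
  local app_name="$2"
  if [ "$result" -ne 0 ]; then
    echo "Build of the '${app_name}' test image FAILED"
    # record the failure in the test results here, then let the caller
    # skip every test that depends on this image
    return 1
  fi
  return 0
}

# callers bail out early instead of running tests against a missing image:
# evaluate_build_result $? hw || return 1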

@hhorak (Member) commented Aug 30, 2024

@SlouchyButton thanks for explaining, I think at least part of this explanation would also be worth keeping in the commit messages.

More thoughts:

  • the problem with re-using an image from a previous test could likely be solved by removing the image (if it exists) before the build starts (in the build function directly) -- then we would end up with an image only when it was freshly built for the test case and the build succeeded; not sure whether this change would make the need for evaluate_build_result void, so that ct_check_testcase_result could be used instead (with the improvement below) -- see the sketch after this list
  • I like the idea of adding more info about which steps of the test passed and which failed -- what if we add this option to ct_check_testcase_result itself, so we could add such info in other places as well when we want (with some indentation, so we see the difference between the "final test result" and a "sub-result")
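
For the first point, the pre-build cleanup could be as simple as something like this (the image variable is illustrative):

# remove a possibly stale image left over from a previous test case
# before building a fresh one
if docker image inspect "${app_image}" >/dev/null 2>&1; then
  docker rmi -f "${app_image}"
fi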

@SlouchyButton (Contributor, Author)

  • what if we add this option to the ct_check_testcase_result itself, so we could add such info in other places as well

Love the idea, it would definitely help with debugging as we would be able to pinpoint exactly where the build failed. It would also allow us to continue with the 'fail early, fail gracefully' approach.

The thing is that since this is a test-lib function, we will have to go through every single container and check all uses of that function before changing it, as this change seems like it could be breaking.

Another thing I am a bit scared of is that we have already pushed Bash far beyond what it can handle in some places (sclorg/container-common-scripts#380), and we keep adding more and more stuff onto it. This specific change shouldn't really be that complicated, but since we would be changing test-lib, there is always a higher chance of breaking stuff, especially since this change would substantially modify how that function works.

I would still go ahead with this idea, honestly; I like it, and it will be a helpful feature to have. We should create a new issue for https://github.com/sclorg/container-common-scripts.

@SlouchyButton (Contributor, Author)

  • solved by removing the image (if it exists) before the build starts

Currently, this PR removes the images after each test, so after merging that will already be happening.

not sure whether this change would make the need of evaluate_build_result void, and ct_check_testcase_result could be used instead (with an improvement below)

From how I understand the improvement, we would basically make ct_check_testcase_result work similarly to how evaluate_build_result works now, so yes, there would be no need for evaluate_build_result anymore.

@hhorak (Member) commented Aug 30, 2024

The thing is that since this is a test-lib function, we will have to go through every single container and check all uses of that function before changing it, as this change seems like it could be breaking.

I don't think it would need to be breaking. The function ct_check_testcase_result currently accepts exactly one argument, so unless some container passes extra arguments that have no meaning (that could be double-checked to be sure), we could make the second argument optional, so that ignoring it stays legal and remains the default. We'd only pass the second argument when we want the extra info.
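
A minimal sketch of how the optional argument could be wired in (the existing body of the function is omitted here, only the new part is shown):

function ct_check_testcase_result() {
  local result="$1"
  local message="${2:-}"   # optional second argument, new in this sketch
  if [ "$result" -ne 0 ] && [ -n "$message" ]; then
    # indented so a sub-result is visually distinct from the final test result
    echo "    [FAILED] ${message}"
  fi
  # ...the existing behavior (recording the result, etc.) stays unchanged...
  return "$result"
}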

@SlouchyButton (Contributor, Author)

I don't think it would need to be breaking.

The thing I had in mind was whether we could be modifying the final result with such subtest results everywhere.

After looking at this again, I realized that evaluate_build_result calls ct_update_test_result, which actually adds the entry to the test log. That is also a test-lib function, so this change shouldn't depend on container-specific test code. I mistakenly thought that evaluate_build_result depended on local test code. This shouldn't be as dramatic a change as I made it seem.

@hhorak (Member) commented Aug 30, 2024

Cool. One more thing that bothers me is that the RHEL8 v18 tests fail repeatedly, while I don't see any nodejs failures in the nightly -- so I tried a different PR to see whether the failure is connected with this change or not: #446

@hhorak (Member) commented Aug 30, 2024

RHEL8 v18 tests fail repeatedly,

Or is it possible that it failed silently before this fix? That could explain it...

@SlouchyButton (Contributor, Author)

Or is it possible that it failed silently before this fix?

I don't think so. We added tests for builds failing, but the build passed. The only thing that failed was:

[FAILED] for 'hw' test_run_hw_application (00:00:11)

fec7c64b7b9910932279506c33aca66a0f9af9dc2b63e762a29883cc77861801
time="2024-08-29T11:39:00-04:00" level=warning msg="StopSignal SIGTERM failed to stop container laughing_agnesi in 10 seconds, resorting to SIGKILL"
fec7c64b7b9910932279506c33aca66a0f9af9dc2b63e762a29883cc77861801
rm: cannot remove '/tmp/tmp.UzRE2OjMVt/./tmp.nh4Dei6dEd.cid': No such file or directory
Test for image 'ubi8/nodejs-18:1' FAILED (exit code: 1)

And it seems it failed because the last command in the test function is kill_test_application, whose last command was rm. Normally this command is used in other places and works fine, but in this case the file wasn't found, so it failed, and the non-zero return code probably bubbled up through the functions since it was the last command.

The test function is interesting, though, and I am not sure it even does what it is supposed to do, since this code:

function test_run_hw_application() {
  # Verify that the HTTP connection can be established to test application container
  run_test_application hw
  # Wait for the container to write its CID file
  wait_for_cid
  ct_check_testcase_result $?
  kill_test_application
}

AFAIK it starts the test app, then waits for the CID, but then when ct_check_testcase_result is called we are getting the return code of wait_for_cid, which doesn't explicitly return anything, so I think we are effectively getting the return code of the sleep in wait_for_cid. And then we kill the app with no further testing.

Am I missing something here? @hhorak

Furthermore, the test on rhel8 for version 18 passes locally for me.
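
For comparison, an explicit check that the application actually answers could look roughly like this (container_ip, port 8080 and the expected status code are assumptions here, not the current test code):

function test_run_hw_application() {
  run_test_application hw
  # Wait for the container to write its CID file
  wait_for_cid
  ct_check_testcase_result $?   # today this effectively checks the sleep in wait_for_cid
  # hypothetical explicit HTTP check of the running application
  local response
  response=$(curl -s -o /dev/null -w '%{http_code}' "http://$(container_ip):8080/")
  [ "$response" = "200" ]
  ct_check_testcase_result $?
  kill_test_application
}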

@SlouchyButton (Contributor, Author)

[test]

@phracek (Member) left a comment

This pull request LGTM, but I have some nitpicks.

Review comment on test/test-lib-nodejs.sh (resolved)
@phracek (Member) commented Sep 25, 2024

The CentOS Stream 10 - 22 failure is tracked in issue #452

github-actions bot commented Sep 25, 2024

Pull Request validation

Failed

🔴 Failed or pending statuses: Testing Farm - CentOS Stream 10 - 22 [error]
🔴 Approval - missing or changes were requested

Success

🟢 Review - Reviewed by a member

@phracek (Member) commented Sep 26, 2024

[test]

github-actions bot commented Sep 26, 2024

Testing Farm results

name | compose | arch | status | started (UTC) | time | logs
Fedora - 20-minimal | Fedora-latest | x86_64 | ✅ passed | 30.09.2024 10:01:51 | 19min 34s | test pipeline
Fedora - 22-minimal | Fedora-latest | x86_64 | ✅ passed | 30.09.2024 10:02:15 | 10min 21s | test pipeline
CentOS Stream 9 - 20-minimal | CentOS-Stream-9 | x86_64 | ✅ passed | 30.09.2024 10:01:48 | 20min 3s | test pipeline
Fedora - 18-minimal | Fedora-latest | x86_64 | ✅ passed | 30.09.2024 10:01:51 | 10min 40s | test pipeline
CentOS Stream 10 - 22-minimal | CentOS-Stream-10 | x86_64 | ✅ passed | 30.09.2024 10:01:59 | 12min 5s | test pipeline
RHEL9 - 20-minimal | RHEL-9.4.0-Nightly | x86_64 | ✅ passed | 30.09.2024 10:01:49 | 26min 38s | test pipeline
RHEL9 - 18-minimal | RHEL-9.4.0-Nightly | x86_64 | ✅ passed | 30.09.2024 10:01:49 | 14min 37s | test pipeline
RHEL9 - 22-minimal | RHEL-9.4.0-Nightly | x86_64 | ✅ passed | 30.09.2024 10:02:31 | 15min 26s | test pipeline
RHEL8 - 18-minimal | RHEL-8.10.0-Nightly | x86_64 | ✅ passed | 30.09.2024 10:01:48 | 19min 52s | test pipeline
RHEL8 - 20-minimal | RHEL-8.10.0-Nightly | x86_64 | ✅ passed | 30.09.2024 10:01:51 | 29min 56s | test pipeline
Fedora - 22 | Fedora-latest | x86_64 | ✅ passed | 30.09.2024 10:01:59 | 13min 39s | test pipeline
CentOS Stream 10 - 22 | CentOS-Stream-10 | x86_64 | ❌ failed | 30.09.2024 10:02:01 | 13min 17s | test pipeline
CentOS Stream 9 - 20 | CentOS-Stream-9 | x86_64 | ✅ passed | 30.09.2024 10:01:50 | 16min 48s | test pipeline
Fedora - 18 | Fedora-latest | x86_64 | ✅ passed | 30.09.2024 10:01:46 | 14min 60s | test pipeline
RHEL9 - 18 | RHEL-9.4.0-Nightly | x86_64 | ✅ passed | 30.09.2024 10:01:46 | 18min 1s | test pipeline
RHEL9 - 20 | RHEL-9.4.0-Nightly | x86_64 | ✅ passed | 30.09.2024 10:01:49 | 18min 49s | test pipeline
RHEL9 - 22 | RHEL-9.4.0-Nightly | x86_64 | ✅ passed | 30.09.2024 10:02:03 | 19min 47s | test pipeline
RHEL8 - 20 | RHEL-8.10.0-Nightly | x86_64 | ✅ passed | 30.09.2024 10:01:52 | 24min 25s | test pipeline
RHEL8 - 18 | RHEL-8.10.0-Nightly | x86_64 | ✅ passed | 30.09.2024 10:01:46 | 20min 51s | test pipeline
Fedora - 20 | Fedora-latest | x86_64 | ✅ passed | 30.09.2024 10:01:50 | 15min 37s | test pipeline

@phracek (Member) commented Sep 27, 2024

Please rebase it against master.

@SlouchyButton (Contributor, Author)

Please rebase it against master.

@phracek done

@phracek (Member) commented Sep 30, 2024

[test]

@SlouchyButton (Contributor, Author) commented Sep 30, 2024

EDIT: my bad, I didn't notice this was already mentioned and linked to the issue, so disregard this.

In test_run_binary_application:

npm error In file included from rdkafka_int.h:116,
npm error                  from rdkafka.c:43:
npm error rdkafka_conf.h:40:10: fatal error: openssl/engine.h: No such file or directory
npm error    40 | #include <openssl/engine.h>
npm error       |          ^~~~~~~~~~~~~~~~~~
npm error compilation terminated.

It seems that the sample app uses the OpenSSL engine library, but that was deprecated/removed in F41. I'd guess that C10S has already embraced this change. The tests pass for Fedora because we are using an F40 image, which is not affected.

Ref: https://fedoraproject.org/wiki/Changes/OpensslDeprecateEngine

@phracek (Member) commented Sep 30, 2024

C10S issue is tracked here: #452

@phracek merged commit 4e2ef5a into sclorg:master on Sep 30, 2024 (20 of 21 checks passed).
Successfully merging this pull request may close these issues:

  • Tests reuse cached container even if container build for test group failed