Check whether build passes. Do not reuse builds across tests. #445
Conversation
Please include the PR description in the commit as well; GitHub might not be here forever (if somebody from GitHub reads this, be aware that I wish it will be here forever, but you know... Microsoft... well, anyway...)
For the change itself, I do understand the kill/stop part, but I don't understand why you need to re-define the evaluate_build_result function. I understand what it does, but why do we need to remove a container image when the build failed? Then nothing should be built, no? Or when should we expect the build to fail?
[test]
That is related to this: sclorg/s2i-ruby-container#556. The thing I am trying to achieve with this is failing early and not running any tests on that image, as it doesn't make sense to even try to run the tests on an image that failed to build. Otherwise, we would run the tests and fail on podman trying to start an image that doesn't exist, for every test that depends on that image. It would still work, but I don't think it's a clean solution. Also, it helps with debugging, as you will immediately see that the build failed, not the tests themselves. Both changes, checking that the build passed and cleaning the images between the tests, are redundant with each other. Either one would fix the underlying issue, but in my opinion each has a slightly different reason: checking that the build passed lets us be more verbose for debugging and stop early, resulting in faster tests; cleaning the images completely removes the possibility of starting an image that we didn't want to test. The problem here is that I missed the part that should actually skip the test if the build failed; I will add that.
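A rough sketch of the fail-early idea (illustrative only; the signature of evaluate_build_result and the variable names here are my assumption, not the actual test-lib code):

```bash
# Illustrative sketch: fail early when the image build failed, instead
# of letting every dependent test trip over a missing image. The
# signature and variable names are assumptions, not the real test-lib.
evaluate_build_result() {
  local build_status="$1" image_name="$2"
  if [ "$build_status" -ne 0 ]; then
    echo "ERROR: build of ${image_name} failed; skipping its tests."
    # Drop any stale image so later tests cannot accidentally use it.
    docker rmi -f "$image_name" 2>/dev/null || :
    return 1
  fi
  return 0
}

docker build -t "$IMAGE_NAME" .
evaluate_build_result "$?" "$IMAGE_NAME" || exit 1
```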
@SlouchyButton thanks for explaining, I think at least part of this explanation would also be worth keeping in the commit messages. More thoughts:
Love the idea, it would definitely help with debugging, as we would be able to pinpoint exactly where the build failed. It would also allow us to continue with the 'fail early, fail gracefully' approach. The thing is that since this is a test-lib function, we will have to go through every single container and check all uses of that function before changing it, as this change seems like it could be breaking. Another thing I am a bit scared about is that we have already pushed Bash far beyond what it can handle in some places (sclorg/container-common-scripts#380) and we keep adding more and more onto it. This specific change shouldn't really be that complicated, but since we are changing test-lib, there is always a higher chance of breaking stuff, especially since this change would substantially modify how that function works. I would still continue with this idea, honestly; I like it, and it will be a helpful feature to have. We should create a new issue for https://github.com/sclorg/container-common-scripts
Currently, this PR removes the images after the test, so that will happen after merging.
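Roughly, the between-tests cleanup could look like this (a sketch with hypothetical names; the real PR code may differ):

```bash
# Hypothetical sketch of per-test image cleanup: stop any container
# still running from the test image, then remove the image itself, so
# nothing from a previous test can leak into the next one.
cleanup_test_image() {
  local cids
  cids=$(docker ps -q --filter "ancestor=${IMAGE_NAME}")
  [ -n "$cids" ] && docker stop $cids
  docker rmi -f "${IMAGE_NAME}" 2>/dev/null || :
}
trap cleanup_test_image EXIT
```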
From how I understand that improvement, we would basically make
I don't think it would need to be breaking. The function
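For example, one common non-breaking way to extend a Bash function is an optional argument with a default, so existing callers keep working unchanged (purely illustrative; the function name is hypothetical):

```bash
# Purely illustrative: an optional argument keeps old call sites
# working while new call sites can pass extra context for reporting.
report_test_result() {
  local result="$1"
  local testcase="${2:-unknown}"   # new, optional; old callers omit it
  if [ "$result" -ne 0 ]; then
    echo "Test case '${testcase}' failed with status ${result}."
  fi
  return "$result"
}

report_test_result 0            # old-style call still works
report_test_result 1 "run_app"  # new-style call adds the test name
```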
The thing I had in mind is whether we could be modifying the final result with such subtest results everywhere. After looking at this again I realized the
Cool. One more thing that bothers me is that the RHEL8 v18 tests fail repeatedly, while we don't see any nodejs failures in the nightly -- so I tried a different PR to see whether the failure is connected with this change or not: #446
Or is it possible that it failed silently before this fix? That could explain it...
I don't think so. We added tests for builds failing, but the build passed. The only thing that failed was
And it seems it failed due to what the last command in the test function is. The test function is interesting, though, and I am not sure it even does what it is supposed to do, since this code:
AFAIK it starts the test app, then waits for the CID, but then... Am I missing something here? @hhorak Furthermore, the test on rhel8 for version 18 passes locally for me.
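For context on why the last command matters: a Bash function's exit status is the exit status of the last command it ran, so a trailing step can mask an earlier failure. A contrived demo:

```bash
#!/bin/bash
# Contrived demo: a function's exit status is that of its LAST command,
# so a trailing step that succeeds hides an earlier failure.
test_something() {
  false   # the actual check fails here...
  true    # ...but this final "cleanup" step succeeds
}
test_something && echo "reported PASS, even though the check failed"
```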
[test]
This pull request LGTM, but I have some nitpicks.
CentOS Stream 22 issue: #452
Pull Request validation
Failed: 🔴 Failed or pending statuses
Success: 🟢 Review - Reviewed by a member
[test]
Testing Farm results
Please rebase it against master.
Force-pushed from 84aac2c to b37e426
@phracek done
[test]
EDIT: My bad, I didn't notice it was already mentioned and linked to the issue, so disregard this.
It seems that the sample app uses the OpenSSL engine library, but that one got deprecated/removed in F41. I'd guess that C10S already embraced this change. Tests pass for Fedora as we are using an F40 image that is not affected. Ref: https://fedoraproject.org/wiki/Changes/OpensslDeprecateEngine
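If someone wants to verify this on a given image, one quick probe (assuming the `openssl` CLI is present in the image) is the `engine` subcommand, which is unavailable on builds compiled without engine support:

```bash
# Probe for OpenSSL engine support; on builds compiled without engine
# support (as proposed for F41) the 'engine' subcommand is unavailable.
if openssl engine >/dev/null 2>&1; then
  echo "engine support present"
else
  echo "no engine support"
fi
```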
C10S issue is tracked here: #452
Fixes: #444
This contains the same change as in ruby container PR sclorg/s2i-ruby-container#556.
The other change is related to ruby container issue sclorg/s2i-ruby-container#557.
The ruby container issue has a more thorough description of what specifically the problem is.
The reason why `docker kill` has to be changed to `docker stop` in `test/test-lib-nodejs.sh` L: 176 is that `docker kill` doesn't wait for the container to stop: it forcefully kills it and continues. This then creates a race condition with the clearing function. When `docker rmi -f` is called, the previous kill call hasn't fully stopped the container yet, so `docker rmi -f` still sees the existing container. It tries to delete it, because we are forcing image deletion, but before it gets through the internal process, the container is stopped and deleted by the previous `docker kill` call.

When `docker rmi -f` actually reaches the point where it wants to delete the container, the container no longer exists and the command fails. `docker stop`, on the other hand, does not work asynchronously: it tries to stop the container gracefully, and if the container doesn't respond, it kills it forcefully and continues only after the kill has succeeded.
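In other words, the two orderings look roughly like this (`$cid` and `$IMAGE_NAME` stand in for the values the test code actually uses):

```bash
# Racy: docker kill returns before the container is fully gone, so
# docker rmi -f can race with the container's teardown and fail.
docker kill "$cid"
docker rmi -f "$IMAGE_NAME"

# Safe: docker stop blocks until the container has actually stopped
# (escalating to SIGKILL after its timeout), so docker rmi -f sees a
# consistent state.
docker stop "$cid"
docker rmi -f "$IMAGE_NAME"
```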