fix(cleanup): cleanup sct-runner when logs collected #8912

soyacz · 2024-10-03T08:58:54Z

When a test fails, sct-runners are often kept for a long time, especially in perf tests. This is caused by broken/unclear logic for setting the sct-runner keep flags.

Fixed that logic to:

- when the test timed out and logs were collected, keep the sct-runner for another 6 h
- if logs were collected, regardless of test status, terminate the sct-runner immediately
- otherwise, keep it for an additional 48 h
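The three rules can be sketched as a small decision function. This is only an illustration under assumptions: `RunnerDecision` and `decide_runner_cleanup` are hypothetical names that do not come from `sdcm/sct_runner.py`, and the rules are evaluated in the order they are listed, so the timeout case takes precedence over the plain "logs collected" case.

```python
from typing import NamedTuple


class RunnerDecision(NamedTuple):
    """Hypothetical result type, used only for this sketch."""
    action: str      # "keep" or "terminate"
    keep_hours: int  # additional hours to keep the runner; 0 when terminating


def decide_runner_cleanup(timed_out: bool, logs_collected: bool) -> RunnerDecision:
    # Rule 1: test timed out and logs were collected -> keep for another 6 h.
    if timed_out and logs_collected:
        return RunnerDecision("keep", 6)
    # Rule 2: logs were collected, whatever the test status -> terminate now.
    if logs_collected:
        return RunnerDecision("terminate", 0)
    # Rule 3: no logs collected -> keep for an additional 48 h.
    return RunnerDecision("keep", 48)


if __name__ == "__main__":
    for timed_out in (False, True):
        for logs_collected in (False, True):
            print(timed_out, logs_collected,
                  decide_runner_cleanup(timed_out, logs_collected))
```

Rule order matters here: if the second rule were checked first, the 6 h extension for timed-out tests would never apply.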

Testing

[ ]

PR pre-checks (self review)

I added the relevant backport labels
I didn't leave commented-out/debugging code

Reminders

Add New configuration option and document them (in sdcm/sct_config.py)
Add unit tests to cover my changes (under unit-test/ folder)
Update the Readme/doc folder relevant to this change (if needed)


soyacz · 2024-10-03T08:59:54Z

This is an example test where the sct-runner was set to be kept for 64h, even though logs were collected and there was no reason to keep it: https://jenkins.scylladb.com/job/scylla-staging/job/lukasz/job/manager-restore-benchmark/3/

soyacz · 2024-10-03T09:01:35Z

I'm not sure if this logic is right, or where it should be backported.

fruch · 2024-10-06T11:55:59Z

sdcm/sct_runner.py

-    LOGGER.info("No changes to make to runner tags.")
+    if sct_runner_info.logs_collected:
+        if not dry_run:
+            sct_runner_info.sct_runner_class.set_tags(sct_runner_info, {"keep": "0", "keep-action": "terminate"})


Is this code running on this runner? Isn't there a possibility it would get cleared right away?
Maybe 1 is a safer option?

Why are perf tests not getting into the first if?

soyacz · 2024-10-06

That's the intention: clear it right away if logs are collected (it's the last step in the pipeline, part of the clean_sct_runners method).

Perf tests often fail with an error event and so don't get into the first if. The second one seemed to have broken logic: if not timeout_flag means that only when the test did not time out (basically always, since our tests rarely time out recently) we prolong by another 6 hours, which makes no sense.

Two changes:

- Inverted the logic behind timeout_flag: when the test timed out, add another 6h in case more info needs to be obtained from the runner (e.g. to see which process froze).
- Changed the logic to a simpler one: if logs are collected, terminate the runner, otherwise prolong its duration by 48h.
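The effect of inverting the condition can be seen in a two-row truth table. This is a minimal sketch: only the name `timeout_flag` comes from the discussion above, and both helper functions are hypothetical and not part of the actual code.

```python
def prolongs_by_6h_before(timeout_flag: bool) -> bool:
    # Old condition as described above: `if not timeout_flag`.
    # It is true for every test that did NOT time out, i.e. the usual case.
    return not timeout_flag


def prolongs_by_6h_after(timeout_flag: bool) -> bool:
    # Inverted condition: extend only when the test actually timed out.
    return timeout_flag


if __name__ == "__main__":
    print("timeout_flag  before  after")
    for timeout_flag in (False, True):
        print(timeout_flag,
              prolongs_by_6h_before(timeout_flag),
              prolongs_by_6h_after(timeout_flag))
```

With the old condition the common non-timeout run received the extra 6 hours, while a real timeout did not.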

soyacz · 2024-10-16T06:30:44Z

basically this PR is about cleaning sct-runners if logs are collected right after the test. Do we want that behavior?

fruch · 2024-10-16T06:46:37Z

> basically this PR is about cleaning sct-runners if logs are collected right after the test. Do we want that behavior?

Do we have a way during the test to prevent this cleanup?

If the machine is manually marked as keep, would it still be deleted?

I.e. we need to have some escape hatch if during the test we decide we want to keep the resources.

Right now we have a few hours after the end of the test to decide to keep it.

soyacz · 2024-10-16T07:00:59Z

> > basically this PR is about cleaning sct-runners if logs are collected right after the test. Do we want that behavior?
>
> Do we have a way during the test to prevent this cleanup?
>
> If the machine is manually marked as keep, would it still be deleted?

Yes, but from my understanding it's also going to be removed with the current code if the test didn't time out (the usual case).
Possibly we should change this logic and preserve the 'keep: alive' tag if present (and add it if db nodes are marked as keep, e.g. in the reuse cluster case?).

> I.e. we need to have some escape hatch if during the test we decide we want to keep the resources.
>
> Right now we have a few hours after the end of the test to decide to keep it.

Yes. That's the question about this PR: whether (and why) we need to keep the SCT runner after the test.
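The escape hatch floated above (leave a runner alone when it already carries a 'keep: alive' tag) could look roughly like the sketch below. This is only a sketch under assumptions: `tags_to_apply` is a hypothetical helper, the tags are modelled as a plain dict, and the only values taken from this thread are `"keep": "alive"` and the `{"keep": "0", "keep-action": "terminate"}` pair from the diff.

```python
from typing import Dict

# Tag values taken from the diff under review.
TERMINATE_TAGS = {"keep": "0", "keep-action": "terminate"}


def tags_to_apply(current_tags: Dict[str, str], logs_collected: bool) -> Dict[str, str]:
    """Return the tag updates for a runner; an empty dict means 'leave as is'."""
    # Escape hatch: a runner manually marked keep=alive is never touched.
    if current_tags.get("keep") == "alive":
        return {}
    # Logs are safe, so the runner can be terminated right away.
    if logs_collected:
        return dict(TERMINATE_TAGS)
    # Anything else is outside the scope of this sketch.
    return {}


if __name__ == "__main__":
    print(tags_to_apply({"keep": "alive"}, logs_collected=True))
    print(tags_to_apply({"keep": "12"}, logs_collected=True))
```

Checking the existing tag first means a manual decision made during the test wins over the automatic cleanup.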

soyacz requested review from fruch and juliayakovlev October 3, 2024 08:58

github-actions bot assigned soyacz Oct 3, 2024

soyacz added the backport/perf-v15 label Oct 3, 2024

soyacz requested a review from roydahan October 3, 2024 09:01

fruch reviewed Oct 6, 2024


soyacz Oct 6, 2024
