-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(cleanup): cleanup sct-runner when logs collected #8912
base: master
Are you sure you want to change the base?
fix(cleanup): cleanup sct-runner when logs collected #8912
Conversation
When test fail, often sct-runners are kept for long - especially in perf tests. This is because of broken/unclear logic behind setting sct-runner keep flags. Fixed that logic to: - when test timeouted and logs collected, keep sct-runner for another 6 h - if logs were collected, regardless of test status, terminate sct-runner immediately - otherwise keep for additional 48h
this is example test, where sct-runner was set to be kept for 64h, while logs were collected and no reason to keep it: https://jenkins.scylladb.com/job/scylla-staging/job/lukasz/job/manager-restore-benchmark/3/ |
I'm not sure if this logic is right, and where should be backported. |
LOGGER.info("No changes to make to runner tags.") | ||
if sct_runner_info.logs_collected: | ||
if not dry_run: | ||
sct_runner_info.sct_runner_class.set_tags(sct_runner_info, {"keep": "0", "keep-action": "terminate"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code is running on this runner ? isn't there a possibility it would get cleared right away ?
maybe 1 is a safer option ?
why in perf tests are not getting into the first if ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's the intention - clear right away (it's the last step in the pipeline, part of clean_sct_runners
method) if logs are collected.
Perf tests often fail with an error event, and not getting into the first if. The second seemed to have broken logic: if not timeout_flag
- meaning only if not timed out (basically always - our tests rarely timeout recently) we prolong by another 6 hours which makes no sense.
Two changes:
- invert logic behind timeout_flag - when test timed out, add another 6h just in case some more info are required to be obtained from runner (e.g. see which process frozen).
- Logic was changed to simpler one: if logs are collected, terminate runner, otherwise prolong duration by 48h.
basically this PR is about cleaning sct-runners if logs are collected right after the test. Do we want that behavior? |
we we have a way during the tests to prevent this cleanup ? if the machine would be manually marked as keep, would it still gonna be deleted ? i.e. we need to have some escape hatch if during the test we decided we want to keep the resources. now we have a few hours after the end of the test to keep it. |
Yes, but from my understanding it's going to be removed also with current code if test didn't timeout (usual case).
yes. That's the question about this PR, whether we need (and why) to keep SCT runner after the test. |
When test fail, often sct-runners are kept for long - especially in perf tests. This is because of broken/unclear logic behind setting sct-runner keep flags.
Fixed that logic to:
Testing
PR pre-checks (self review)
backport
labelsReminders
sdcm/sct_config.py
)unit-test/
folder)