Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(cleanup): cleanup sct-runner when logs collected #8912

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

soyacz
Copy link
Contributor

@soyacz soyacz commented Oct 3, 2024

When test fail, often sct-runners are kept for long - especially in perf tests. This is because of broken/unclear logic behind setting sct-runner keep flags.

Fixed that logic to:

  • when test timeouted and logs collected, keep sct-runner for another 6 h
  • if logs were collected, regardless of test status, terminate sct-runner immediately
  • otherwise keep for additional 48h

Testing

  • [ ]

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

When test fail, often sct-runners are kept for long - especially in perf
tests. This is because of broken/unclear logic behind setting sct-runner
keep flags.

Fixed that logic to:
- when test timeouted and logs collected, keep sct-runner for another 6 h
- if logs were collected, regardless of test status, terminate
  sct-runner immediately
- otherwise keep for additional 48h
@soyacz
Copy link
Contributor Author

soyacz commented Oct 3, 2024

this is example test, where sct-runner was set to be kept for 64h, while logs were collected and no reason to keep it: https://jenkins.scylladb.com/job/scylla-staging/job/lukasz/job/manager-restore-benchmark/3/

@soyacz
Copy link
Contributor Author

soyacz commented Oct 3, 2024

I'm not sure if this logic is right, and where should be backported.

@soyacz soyacz requested a review from roydahan October 3, 2024 09:01
LOGGER.info("No changes to make to runner tags.")
if sct_runner_info.logs_collected:
if not dry_run:
sct_runner_info.sct_runner_class.set_tags(sct_runner_info, {"keep": "0", "keep-action": "terminate"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code is running on this runner ? isn't there a possibility it would get cleared right away ?
maybe 1 is a safer option ?

why in perf tests are not getting into the first if ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's the intention - clear right away (it's the last step in the pipeline, part of clean_sct_runners method) if logs are collected.

Perf tests often fail with an error event, and not getting into the first if. The second seemed to have broken logic: if not timeout_flag - meaning only if not timed out (basically always - our tests rarely timeout recently) we prolong by another 6 hours which makes no sense.

Two changes:

  1. invert logic behind timeout_flag - when test timed out, add another 6h just in case some more info are required to be obtained from runner (e.g. see which process frozen).
  2. Logic was changed to simpler one: if logs are collected, terminate runner, otherwise prolong duration by 48h.

@soyacz
Copy link
Contributor Author

soyacz commented Oct 16, 2024

basically this PR is about cleaning sct-runners if logs are collected right after the test. Do we want that behavior?

@fruch
Copy link
Contributor

fruch commented Oct 16, 2024

basically this PR is about cleaning sct-runners if logs are collected right after the test. Do we want that behavior?

we we have a way during the tests to prevent this cleanup ?

if the machine would be manually marked as keep, would it still gonna be deleted ?

i.e. we need to have some escape hatch if during the test we decided we want to keep the resources.

now we have a few hours after the end of the test to keep it.

@soyacz
Copy link
Contributor Author

soyacz commented Oct 16, 2024

basically this PR is about cleaning sct-runners if logs are collected right after the test. Do we want that behavior?

we we have a way during the tests to prevent this cleanup ?

if the machine would be manually marked as keep, would it still gonna be deleted ?

Yes, but from my understanding it's going to be removed also with current code if test didn't timeout (usual case).
Possibly we should change this logic and keep the 'keep: alive' tag if present (and add it if db nodes are marked as keep? E.g. in reuse cluster case)

i.e. we need to have some escape hatch if during the test we decided we want to keep the resources.

now we have a few hours after the end of the test to keep it.

yes. That's the question about this PR, whether we need (and why) to keep SCT runner after the test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants