Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reject time losses mid-run, take 2 #1571

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

dubslow
Copy link
Contributor

@dubslow dubslow commented Feb 16, 2023

Considerably more sophisticated than my prior attempt some months ago. There's a lot going on here.

The first 6 commits are very nearly non functional, being solely cleanups and refactorings which were necessary for me to even hope to be able to understand what I was modifying (and even then I'm only like 80% sure). Even if the functionality of this PR is rejected, the first 6 commits of cleanups and refactoring should be considered/merged independently. (The third commit is slightly functional, but only the slightest bit, I believe only cosmetically.)

The last two commits are the meat of the proposed new functionality. Instead of waiting until the run is "over" to autopurge crash_or_time tasks (if autopurge is even set, which it usually is not), we purge them immediately, mid-run, if the worker has a recent history of spewing garbage.

My usual caveat applies: this is completely untested, and even more than that, I'm not even entirely sure that the last two commits are correct, in the sense of doing what I intended. In any case, I believe it should be easier to understand each commit individually than the grand sum diff. All kinds of review/feedback is requested and appreciated.

(Also, I think I found two other potential bugs, not directly related; rather than change them, I merely marked them with comments, to be reconsidered later.)

(Also also, does the rename "bad" --> "purged" imply a db upgrade is required?)

@dubslow dubslow force-pushed the rejectTimeLosses2 branch 6 times, most recently from dab0923 to 0a38d45 Compare February 16, 2023 14:13
Factor out task_mark_failed
Factor out task_mark_bad and task_mark_excessive_residual
Cleanups, change behavior of handle_crash_or_time to automatically mark high residual
Reorder update_residuals and get_bad_workers
Cosmetic: rename bad tasks as purged tasks
Slightly tighten allowable timelosses bound
Bad history: 3 of 8 recent tasks are bad

I'm not sure about the correctness of this commit, especially regarding locks and concurrency.
@dubslow
Copy link
Contributor Author

dubslow commented Feb 19, 2023

I've stored the original, more detailed commit history here: master...dubslow:fishtest:rejectTimeLosses2Orig

and instead have forcepushed a simplified history more suitable for merging.

@dubslow
Copy link
Contributor Author

dubslow commented Feb 19, 2023

I think it's probably a benefit to have "this run was purged of n tasks" as an event log, in parallel with the time or crashes log event, but I'll put that idea off until this one is complete in some form.

@dubslow
Copy link
Contributor Author

dubslow commented Apr 5, 2023

@vdbergh btw I would very much appreciate your review/feedback of this patch, since among other reasons you're the most familiar with the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant