Handling job teardown hangs #3708
-
On the ☕ hour call, @grondo and I landed on a proposal where both the scheduler and exec system add a "fudge factor" to the expiration time of a job. This would give the exec system time to tear things down. In the case where the teardown exceeds that "fudge factor", the exec system would mark that node as …
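A minimal sketch of that proposal, assuming a shared grace constant; the names here (`TEARDOWN_GRACE`, `mark_node_down`) are hypothetical and not part of any flux-core or flux-sched API:

```python
# Hypothetical sketch only: names and values are illustrative, not a
# reference to any actual flux interface.

TEARDOWN_GRACE = 30.0  # seconds of slack both sched and exec agree to add


def effective_expiration(expiration: float) -> float:
    """The time the scheduler actually plans around: walltime plus slack."""
    return expiration + TEARDOWN_GRACE


def on_teardown_tick(now: float, expiration: float, node, mark_node_down):
    """Exec side: if cleanup outlives the grace window, take the node out
    of service instead of letting it silently block future allocations."""
    if now > effective_expiration(expiration):
        mark_node_down(node)  # e.g., drain the node and notify the scheduler
```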
-
I also have in my notes from the ☕ call that it would be good to get input from @ryanday36 on how this is handled currently on our systems.
-
Currently, when a scheduler makes an allocation for a job, it embeds within R an absolute expiration time at which the job should no longer be on the system (i.e., the current time + the job's walltime). This way the exec system and the scheduler can both plan on a node being free at a given time. Later, after the job has been terminated for exceeding its walltime, the exec system confirms the node is free with a `release` event during the `cleanup` state.

This is all great for schedulers that do not consider the future or attempt to backfill. The trouble comes in when a scheduler plans to launch a job on a node immediately after its expiration time and the exec system is slow to release that node (e.g., due to an epilogue script (or process) that hangs during job teardown).
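For concreteness, here is a rough sketch of an R payload carrying such an absolute expiration, loosely following the shape described in flux RFC 20; the exact fields below are a simplification for illustration, not a spec reference:

```python
import time

# Simplified, illustrative R (resource set) with an absolute expiration.
walltime = 3600  # job's walltime in seconds
now = time.time()

R = {
    "version": 1,
    "execution": {
        "R_lite": [{"rank": "0-3", "children": {"core": "0-7"}}],
        "starttime": now,
        "expiration": now + walltime,  # absolute time the node should be free
    },
}
```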
It is my understanding that flux-sched's Fluxion will keep the hung node marked as allocated in its graph data structure and not allocate any new jobs on the node, but the node won't be marked as allocated/reserved in the future, so new reservations will be made on the node starting at the current time + 1 second. This is probably suboptimal from a scheduling perspective. We should come up with a better way to handle this scenario and coordinate the node's hung state between `exec` and `sched`.
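One way the coordination could look, sketched under the assumption of a shared grace window; this is hypothetical and does not reflect Fluxion's actual graph logic:

```python
# Hypothetical sketch of how sched might plan around a hung teardown.

GRACE = 30.0  # the same "fudge factor" agreed on with the exec system


def earliest_reservation_time(now, expiration, release_seen):
    """When may a new job be reserved on this node?"""
    if release_seen:
        return now  # release event arrived: node is genuinely free
    if now <= expiration + GRACE:
        # Teardown may still complete; plan for expiration + grace rather
        # than reserving optimistically at now + 1 second.
        return expiration + GRACE
    # Grace exceeded: treat the node as down until exec reports otherwise.
    return None
```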