Handling job teardown hangs #3708
-
On the ☕ hour call, @grondo and I landed on a proposal where both the scheduler and exec system add a "fudge factor" to the expiration time of a job. This would give the exec system time to tear things down. In the case where the teardown exceeds that "fudge factor", the exec system would mark that node as …
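A minimal sketch of that proposal, assuming a shared grace constant; the names here (`TEARDOWN_GRACE`, `mark_node_down`) are hypothetical and not part of any flux-core or flux-sched API:

```python
# Hypothetical sketch only: names and values are illustrative, not a
# reference to any actual flux interface.

TEARDOWN_GRACE = 30.0  # seconds of slack both sched and exec agree to add


def effective_expiration(expiration: float) -> float:
    """The time the scheduler actually plans around: walltime plus slack."""
    return expiration + TEARDOWN_GRACE


def on_teardown_tick(now: float, expiration: float, node, mark_node_down):
    """Exec side: if cleanup outlives the grace window, take the node out
    of service instead of letting it silently block future allocations."""
    if now > effective_expiration(expiration):
        mark_node_down(node)  # e.g., drain the node and notify the scheduler
```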
-
I also have in my notes from the ☕ call that it would be good to get input from @ryanday36 on how this is handled currently on our systems.
-
Currently, when a scheduler makes an allocation for a job, it embeds within R an absolute expiration time at which the job should no longer be on the system (i.e., the current time + the job's walltime). This way the exec system and the scheduler can both plan on a node being free at a given time. Later, after the job has been terminated for exceeding its walltime, the exec system confirms the node is free with a `release` event during the `cleanup` state.

This is all great for schedulers that do not consider the future or attempt to backfill. The trouble comes in when a scheduler plans to launch a job on a node immediately after its expiration time and the exec system is slow to release that node (e.g., due to an epilogue script (or process) that hangs during job teardown).
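For concreteness, here is a rough sketch of an R payload carrying such an absolute expiration, loosely following the shape described in flux RFC 20; the exact fields below are a simplification for illustration, not a spec reference:

```python
import time

# Simplified, illustrative R (resource set) with an absolute expiration.
walltime = 3600  # job's walltime in seconds
now = time.time()

R = {
    "version": 1,
    "execution": {
        "R_lite": [{"rank": "0-3", "children": {"core": "0-7"}}],
        "starttime": now,
        "expiration": now + walltime,  # absolute time the node should be free
    },
}
```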
It is my understanding that flux-sched's Fluxion will keep the hung node marked as allocated in its graph data structure and not allocate any new jobs on the node, but the node won't be marked as allocated/reserved in the future, so new reservations will be made on the node starting at the current time + 1 second. This is probably suboptimal from a scheduling perspective. We should come up with a better way to handle this scenario and coordinate the node's hung state between `exec` and `sched`.
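One way the coordination could look, sketched under the assumption of a shared grace window; this is hypothetical and does not reflect Fluxion's actual graph logic:

```python
# Hypothetical sketch of how sched might plan around a hung teardown.

GRACE = 30.0  # the same "fudge factor" agreed on with the exec system


def earliest_reservation_time(now, expiration, release_seen):
    """When may a new job be reserved on this node?"""
    if release_seen:
        return now  # release event arrived: node is genuinely free
    if now <= expiration + GRACE:
        # Teardown may still complete; plan for expiration + grace rather
        # than reserving optimistically at now + 1 second.
        return expiration + GRACE
    # Grace exceeded: treat the node as down until exec reports otherwise.
    return None
```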