As we start to work on defining preemption support, we have to consider several aspects of the problem:
- How to specify the local preemption policies. I suggest that this is out of scope for this organization: it is really a matter for the local host environment, each of which may choose to handle it differently.
- How to query what the local preemption policy is, which options the app/tool can set, and how preemption is implemented. This is largely a question of defining attributes for query support. The query should indicate whether preemption is a “ctrl-z” (i.e., the job is paused but remains in memory), a complete shutdown and removal, or both (perhaps selectable by the app, or as required by the replacement job).
- How to communicate an app’s preemption support to the host environment. For example, if I can support preemption, what handshake do I understand, and what constraints exist on my support?
- The preemption handshake itself. How does the host alert the app to a proposed preemption? Can the app respond with a counterproposal (e.g., take part of my allocation but leave some part of me running, or I need N seconds to prepare, …)? What is the desired/required restart mechanism (e.g., restore from checkpoint)?
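To make the handshake question concrete, here is a minimal sketch of how an app might answer a preemption proposal. Nothing in it is PMIx API: every type and function name (`app_preempt_caps_t`, `preempt_proposal_t`, `app_respond`, and so on) is hypothetical, and the negotiation rules are assumptions chosen only to illustrate the "counterproposal" idea (partial release of the allocation, extra preparation time, restart from checkpoint).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: the two preemption styles discussed above. */
typedef enum {
    PREEMPT_SUSPEND,   /* "ctrl-z": paused but remaining in memory */
    PREEMPT_TERMINATE  /* complete shutdown and removal */
} preempt_mode_t;

/* Hypothetical: what the app told the host it can support. */
typedef struct {
    bool can_suspend;
    bool can_terminate;
    unsigned prep_seconds; /* time the app needs to prepare */
    size_t min_nodes;      /* smallest allocation it can keep running on */
} app_preempt_caps_t;

/* Hypothetical: the host's proposal. */
typedef struct {
    preempt_mode_t mode;
    size_t nodes_requested; /* nodes the host wants back */
    unsigned grace_seconds; /* time the host offers before acting */
} preempt_proposal_t;

typedef enum { REPLY_ACCEPT, REPLY_COUNTER, REPLY_REJECT } reply_kind_t;

/* Hypothetical: the app's answer, possibly a counterproposal. */
typedef struct {
    reply_kind_t kind;
    size_t nodes_offered;
    unsigned grace_seconds;
    bool restart_from_checkpoint;
} preempt_reply_t;

preempt_reply_t app_respond(const app_preempt_caps_t *caps, size_t nodes_held,
                            const preempt_proposal_t *p)
{
    preempt_reply_t r = { REPLY_ACCEPT, 0, p->grace_seconds, false };

    /* Reject outright if the app does not support the proposed mode. */
    bool supported = (p->mode == PREEMPT_SUSPEND) ? caps->can_suspend
                                                  : caps->can_terminate;
    if (!supported) {
        r.kind = REPLY_REJECT;
        return r;
    }

    if (p->nodes_requested >= nodes_held) {
        /* Full preemption: everything the app holds is released. */
        r.nodes_offered = nodes_held;
    } else {
        /* Partial preemption: keep at least min_nodes running. */
        size_t max_release =
            (nodes_held > caps->min_nodes) ? nodes_held - caps->min_nodes : 0;
        r.nodes_offered = p->nodes_requested;
        if (r.nodes_offered > max_release) {
            r.nodes_offered = max_release;
            r.kind = REPLY_COUNTER;
        }
    }

    /* Ask for more time if the offered grace period is too short. */
    if (p->grace_seconds < caps->prep_seconds) {
        r.grace_seconds = caps->prep_seconds;
        r.kind = REPLY_COUNTER;
    }

    /* Assumption: a terminated app wants to be restored from checkpoint. */
    r.restart_from_checkpoint = (p->mode == PREEMPT_TERMINATE);
    return r;
}
```

In this model the capabilities struct stands in for the third bullet (what the app communicates up front) and the reply stands in for the fourth (what it negotiates at preemption time). Whether those should be separate exchanges is exactly the kind of question this issue is meant to settle.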
We currently have the following relevant definitions in pmix_common.h:
Session control attributes:
PMIX_SESSION_PREEMPT (bool) preempt indicated jobs (given in accompanying pmix_info_t via the PMIX_NSPACE attribute) in the specified session and recover all their resources. If no PMIX_NSPACE is specified, then preempt all jobs in the session.
Attributes relating to allocation requests:
PMIX_ALLOC_PREEMPTIBLE (bool) by default, all jobs in the resulting allocation are to be considered preemptible (overridable at per-job level)
Attributes relating to spawn requests:
PMIX_JOB_CTRL_PREEMPTIBLE (bool) job can be pre-empted
Events:
PMIX_JCTRL_PREEMPT_ALERT monitored by client to detect that the RM intends to preempt
PMIX_JCTRL_CHECKPOINT_COMPLETE sent by client and monitored by server to notify that a checkpoint operation has completed
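As a rough illustration of how the existing attributes interact, the sketch below models two rules stated in the definitions: the per-job setting overrides the allocation-wide default, and a session preempt request with no namespace targets every job in the session. It is a standalone model, not PMIx code: the structs and functions (`session_t`, `job_t`, `job_is_preemptible`, `select_preempt_targets`) are hypothetical and only mirror the semantics of `PMIX_ALLOC_PREEMPTIBLE`, `PMIX_JOB_CTRL_PREEMPTIBLE`, and `PMIX_SESSION_PREEMPT` as described here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: whether a per-job setting was given at spawn time. */
typedef enum { JOB_INHERIT, JOB_PREEMPTIBLE, JOB_NOT_PREEMPTIBLE } job_override_t;

typedef struct {
    const char *nspace;      /* job namespace */
    job_override_t override; /* models PMIX_JOB_CTRL_PREEMPTIBLE, if given */
} job_t;

typedef struct {
    bool alloc_preemptible;  /* models PMIX_ALLOC_PREEMPTIBLE */
    const job_t *jobs;
    size_t njobs;
} session_t;

/* The per-job setting wins; otherwise fall back to the allocation default. */
bool job_is_preemptible(const session_t *s, const job_t *j)
{
    if (j->override == JOB_INHERIT) {
        return s->alloc_preemptible;
    }
    return j->override == JOB_PREEMPTIBLE;
}

/* Models PMIX_SESSION_PREEMPT targeting: with no namespaces given, every
 * job in the session is selected; otherwise only the named jobs are.
 * Writes at most max_out pointers to out and returns how many it wrote. */
size_t select_preempt_targets(const session_t *s,
                              const char *const *nspaces, size_t nnspaces,
                              const job_t **out, size_t max_out)
{
    size_t n = 0;
    for (size_t i = 0; i < s->njobs && n < max_out; i++) {
        bool match = (nnspaces == 0);
        for (size_t k = 0; k < nnspaces && !match; k++) {
            match = (strcmp(s->jobs[i].nspace, nspaces[k]) == 0);
        }
        if (match) {
            out[n++] = &s->jobs[i];
        }
    }
    return n;
}
```

The two functions are kept separate on purpose: the definitions above do not say whether a session preempt request should skip jobs marked non-preemptible, so a host would have to decide whether to filter the selected targets through `job_is_preemptible`.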