Preemption support #504

Open
rhc54 opened this issue Jan 23, 2024 · 0 comments

rhc54 commented Jan 23, 2024

As we start to work on defining preemption support, we have to consider several aspects of the problem:

  • how to specify the local preemption policies. I suggest that this is out of scope for this organization: it is a matter for the local host environment, and each environment may choose to handle it differently.

  • how to query what the local preemption policy is, what options are definable by the app/tool, and how it is implemented. This is largely a question of attribute definition for query support. The query results should indicate whether preemption is a “ctrl-z” (i.e., pause but remain in memory), a complete shutdown and removal, or both (perhaps selectable by the app, maybe as required by the replacement job).

  • how to communicate an app’s preemption support to the host environment. For example, if I can support preemption, what handshake do I understand, and what constraints exist on my support?

  • the preemption handshake itself. How does the host alert the app to a proposed preemption? Can the app respond with a counterproposal (e.g., take part of my allocation but leave some part of me running, I need N seconds to prepare, …)? What is the desired/required restart mechanism (e.g., restore from checkpoint)?
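To make the handshake question more concrete, here is a minimal sketch of what a proposal/counterproposal exchange could carry. Everything in it is hypothetical: the type names, fields, and the `preempt_negotiate` function are invented for illustration and are not part of PMIx. It assumes the host offers a set of preemption modes, a node count, and a grace period, and that the app answers with the mode it accepts, how many nodes it can release, and how long it needs to prepare.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical preemption modes (not PMIx definitions). */
typedef enum {
    PREEMPT_MODE_SUSPEND   = 1 << 0, /* "ctrl-z": pause but remain in memory */
    PREEMPT_MODE_TERMINATE = 1 << 1  /* complete shutdown and removal */
} preempt_mode_t;

/* Hypothetical proposal sent by the host environment. */
typedef struct {
    unsigned modes_offered;   /* bitmask of preempt_mode_t */
    uint32_t nodes_requested; /* nodes the host wants back */
    uint32_t grace_secs;      /* time the host is offering */
} preempt_proposal_t;

/* Hypothetical description of what the app can support. */
typedef struct {
    unsigned modes_supported;     /* bitmask of preempt_mode_t */
    uint32_t min_nodes_to_keep;   /* 0 means the app can give up everything */
    uint32_t prep_secs;           /* time needed to prepare, e.g. to checkpoint */
    bool restart_from_checkpoint; /* desired restart mechanism */
} app_preempt_caps_t;

/* Hypothetical reply, which doubles as the counterproposal. */
typedef struct {
    bool accepted;            /* false if no mode is common to both sides */
    preempt_mode_t mode;      /* valid only if accepted */
    uint32_t nodes_released;  /* may be fewer than requested */
    uint32_t prep_secs;       /* may exceed the offered grace period */
    bool restart_from_checkpoint;
} preempt_reply_t;

/* Build the app's reply to a host proposal, given the nodes it holds. */
preempt_reply_t preempt_negotiate(const preempt_proposal_t *p,
                                  const app_preempt_caps_t *caps,
                                  uint32_t nodes_held)
{
    preempt_reply_t r = {0};
    unsigned common = p->modes_offered & caps->modes_supported;
    if (common == 0) {
        return r; /* nothing both sides understand */
    }
    /* Arbitrary choice for this sketch: prefer suspend when both are possible. */
    r.mode = (common & PREEMPT_MODE_SUSPEND) ? PREEMPT_MODE_SUSPEND
                                             : PREEMPT_MODE_TERMINATE;
    uint32_t releasable = nodes_held > caps->min_nodes_to_keep
                              ? nodes_held - caps->min_nodes_to_keep
                              : 0;
    r.nodes_released = p->nodes_requested < releasable ? p->nodes_requested
                                                       : releasable;
    r.prep_secs = caps->prep_secs;
    r.restart_from_checkpoint = caps->restart_from_checkpoint;
    r.accepted = true;
    return r;
}
```

In a real design these fields would presumably become attributes passed in `pmix_info_t` arrays rather than fixed structs; the structs here only enumerate the kinds of information the bullets above ask about.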

We currently have the following relevant definitions in pmix_common.h:

Session control attributes:

PMIX_SESSION_PREEMPT   (bool) preempt indicated jobs (given in accompanying pmix_info_t
                       via the PMIX_NSPACE attribute) in the specified session and recover
                       all their resources. If no PMIX_NSPACE is specified, then preempt
                       all jobs in the session.

Attributes relating to allocation requests:

PMIX_ALLOC_PREEMPTIBLE   (bool) by default, all jobs in the resulting allocation are
                         to be considered preemptible (overridable at per-job level)

Attributes relating to spawn requests:

PMIX_JOB_CTRL_PREEMPTIBLE    (bool) job can be pre-empted

Events:

PMIX_JCTRL_PREEMPT_ALERT    monitored by client to detect RM intends to preempt

PMIX_JCTRL_CHECKPOINT_COMPLETE   sent by client and monitored by server to
                                 notify that a checkpoint operation has completed
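
One possible way the two events could fit together is sketched below as a small state machine. This is an assumption for discussion, not existing PMIx behavior: the enum values are local stand-ins for PMIX_JCTRL_PREEMPT_ALERT and PMIX_JCTRL_CHECKPOINT_COMPLETE, the function names are hypothetical, and the idea that an alert triggers a checkpoint before resources are reclaimed is only one of the policies the handshake might allow.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the PMIx event codes (not the real values). */
typedef enum {
    EV_NONE,
    EV_PREEMPT_ALERT,        /* stands in for PMIX_JCTRL_PREEMPT_ALERT */
    EV_CHECKPOINT_COMPLETE   /* stands in for PMIX_JCTRL_CHECKPOINT_COMPLETE */
} event_t;

typedef enum {
    APP_RUNNING,
    APP_CHECKPOINTING,
    APP_READY_FOR_PREEMPT
} app_state_t;

typedef struct {
    app_state_t state;
} client_t;

/* Client side: an alert received while running starts a checkpoint. */
void client_on_event(client_t *c, event_t ev)
{
    if (ev == EV_PREEMPT_ALERT && c->state == APP_RUNNING) {
        c->state = APP_CHECKPOINTING;
    }
}

/* Client side: when the checkpoint finishes, return the event to send
 * to the server. Returns EV_NONE if no checkpoint was in progress. */
event_t client_finish_checkpoint(client_t *c)
{
    if (c->state != APP_CHECKPOINTING) {
        return EV_NONE;
    }
    c->state = APP_READY_FOR_PREEMPT;
    return EV_CHECKPOINT_COMPLETE;
}

/* Server side: in this sketch, resources are reclaimed only after the
 * client reports that its checkpoint has completed. */
bool server_may_reclaim(event_t ev)
{
    return ev == EV_CHECKPOINT_COMPLETE;
}
```

In an actual client, the alert would arrive through a registered event handler rather than a direct function call; the sketch only shows the ordering of the two notifications.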