GitHub is NOT the preferred viewer for this file. Please visit https://flux-framework.rtfd.io/projects/flux-rfc/en/latest/spec_27.html
This specification describes Version 1 of the Flux Resource Allocation Protocol implemented by the job manager and a compliant Flux scheduler.
Name | github.com/flux-framework/rfc/spec_27.rst |
Editor | Jim Garlick <[email protected]> |
State | raw |
The Flux job manager's role is managing the queue of pending job requests and transitioning jobs through the job states defined in RFC 21, actively initiating actions in each state. A Flux scheduler's role is passively fulfilling resource allocation requests that the job manager makes on behalf of jobs.
A scheduler implementation registers the generic service name sched
and provides several well known service methods. The job manager requests
resources from the scheduler with a sched.alloc
request when a job enters
SCHED state. It releases resources with one or more sched.free
requests
after a job enters CLEANUP state.
The simplest imaginable scheduler satisfies sched.alloc
requests in order
until it is out of resources, then blocks until sched.free
requests
release enough resources to satisfy the next sched.alloc
request.
More complex schedulers consider multiple sched.alloc
requests and
satisfy them out of order to prioritize or balance measures of success
such as resource utilization or fairness.
Abstract resource allocation requests are expressed as a jobspec object (RFC 14). Concrete resources assignments are expressed a an R object (RFC 20). These objects are stored in the KVS per the job schema (RFC 16).
This RFC describes the RPC messages outlined above. It also describes the initialization messages used to establish parameters for scheduler operation and identify resources that are already allocated at scheduler startup. It does not cover the mechanism by which a scheduler discovers the initial inventory of resources.
- Support multiple scheduler implementations, minimizing repeated code in schedulers.
- Allow the maximum number of outstanding allocation requests sent by the job manager to be controlled by the scheduler.
- Allow the scheduler module to be reloaded with recovery of resource allocations of running jobs.
- Allow the scheduler module to abort (return from its module thread unexpectedly) without impacting running work.
- Send allocation requests to scheduler in priority, submit time order.
- Inform scheduler of job priority and submit time so it can reorder requests internally, combining these factors with others.
- Support job cancellation in SCHED state.
- Support job priority change in SCHED state.
- The resource allocation protocol should not present obstacles to scaling to O(1M) jobs in SCHED state.
- The protocol should not inhibit scaling job throughput to O(100) jobs per second.
- Capture scheduler specific job annotations for display by the job listing tool (e.g. start time estimates).
- Allow the expiration time of a resource allocation to be adjusted.
- Detect unsatisfiable job requests at submission time.
To escape scalability limitations of the Flux "tag pool", sched.alloc
and
sched.free
RPCs use the job ID to match requests and responses, and set the
RFC 6 matchtag message field to zero. It follows that:
- The job ID MUST appear in the
sched.alloc
andsched.free
request and response message payloads. - There SHALL NOT be more than one
sched.alloc
orsched.free
request in flight for each job, since otherwise a request could not be uniquely matched to a response using the job ID. - The errnum field in
sched.alloc
andsched.free
response messages MUST be set to zero, even if the response indicates an error. Otherwise, the message payload could not include the job ID since RFC 6 defines the payload of an error response to be optional error text. - The job manager SHALL treat a conventional Flux error response to
sched.alloc
orsched.free
with a nonzero errnum field as a scheduler fatality, and SHALL not send further requests to the scheduler until it receives a newjob-manager.sched-ready
request (see Finalization below).
The other RPCs behave conventionally.
A detailed description of these RPCs follows.
Before any other RPCs are sent to the job manager, the scheduler SHALL
send a request to job-manager.sched-hello
with the FLUX_MSGFLAG_STREAMING
flag set. The request payload SHALL either be empty or consist of a JSON
object with the following OPTIONAL keys:
- partial-ok
- (boolean) The scheduler SHALL set this flag to
true
if it can handle thefree
key in hello responses. That is, it can process jobs with partially released resource sets. If this key is missing it SHALL be interpreted asfalse
.
The job manager SHALL send one response message for each job with allocated resources. Each response payload SHALL consist of a JSON object with the following REQUIRED keys:
- id
- (integer) job ID
- priority
- (integer) priority in the range of 0 through 4294967295
- userid
- (integer) job owner
- t_submit
- (double) job submission time
and the following OPTIONAL key:
- free
- (string) An RFC 22 idset representing the ranks (execution targets) of this job's resource set that have already been freed. If this key is omitted, the scheduler SHALL assume the empty set.
Example:
{
"id": 1552593348,
"priority": 43444,
"userid": 5588,
"t_submit": 1552593348.073045,
"free":"1-16,18",
}
For each job response, the scheduler SHALL mark its assigned resources allocated internally. It MAY look up R in the KVS by job ID according to the job schema (RFC 16).
The scheduler SHALL wait for an error response with ENODATA set, indicating the stream of responses has completed (RFC 6).
If an error response other than ENODATA is returned to the
job-manager.sched-hello
request, the scheduler SHALL log the error
and exit its module thread.
Once the scheduler has processed the job-manager.sched-hello
handshake,
it SHALL notify the job manager that it is ready to accept allocation requests
by sending a request to job-manager.sched-ready
.
The request payload SHALL consist of a JSON object with the following REQUIRED key:
- mode
- (string) selected concurrency mode
The mode string SHALL be one of the following:
- unlimited
- The job manager SHALL send a
sched.alloc
request for all jobs in SCHED state, with no limit on concurrency. - limited
- The job manager SHALL limit the number of concurrent
sched.alloc
requests to value specified by thelimit
key (described below).
The following key is REQUIRED for limited
mode only:
- limit
- (integer) The number of concurrent
sched.alloc
requests that can be sent.limit
can be in the range of 1 to 2147483647.
Example:
{"mode":"limited","limit":42}
The response payload is a JSON object with the following REQUIRED keys:
- count
- (integer) current queue depth
After responding to the job-manager.sched-ready
request, the job manager
MAY immediately begin sending sched.alloc
and sched.free
requests.
If an error response is returned to the job-manager.sched-ready
request,
the scheduler SHALL log the error and exit its module thread.
The job manager SHALL send a sched.alloc
request when a job enters SCHED
state, and concurrency criteria established by the initialization handshake
are met. The request payload consists of a JSON object with the following
REQUIRED keys:
- id
- (integer) job ID
- priority
- (integer) priority in the range of 0 through 4294967295
- userid
- (integer) job owner
- jobspec
- (object) jobspec object (RFC 14)
Example:
{
"id": 1552593348,
"priority": 53444,
"userid": 5588,
"jobspec": {
"resources": [
{
"type": "slot",
"count": 1,
"with": [{"type": "core", "count": 1}], "label": "task"
}
],
"tasks": [
{
"command": ["/bin//true"],
"slot": "task",
"count": {"per_slot": 1}
}
],
"attributes": {
"system": {
"duration": 0,
"cwd": "/home/user/project",
}
},
"version": 1
}
}
The jobspec sent with sched.alloc
MAY have its environment section
redacted to reduce its size, since the environment is not needed by the
scheduler. Should it be needed, the full jobspec SHALL be stable in the
KVS per the job schema (RFC 16) when the sched.alloc
request is received.
The response payload is a JSON object with the following REQUIRED keys:
- id
- (integer) job ID
- type
- (integer) response type in the range of 0 through 3
There are four response types:
- SUCCESS (0)
- Resources have been allocated
- ANNOTATE (1)
- The scheduler wishes to annotate the job (see below)
- DENY (2)
- The job cannot be scheduled
- CANCEL (3)
- The alloc request was canceled by a
sched.cancel
request (see below).
The alloc
request MAY receive multiple responses.
If resources can be allocated, the scheduler SHALL ensure that R has been successfully committed to the KVS per the job schema (RFC 16) before responding.
In addition to the above REQUIRED keys, the SUCCESS response includes the OPTIONAL key:
- annotations
- (object) key value pairs
Example:
{
"id": 1552593348,
"type": 0,
"annotations": {
"sched": {
"resource_summary":"rank[0-1]/core0"
}
}
}
If present, the job manager SHALL update the job's annotation dictionary
as described in the next section. The scheduler MAY delete annotations
such as sched.t_estimate
that are not relevant now that the allocation
request has been satisfied.
The job manager posts an alloc
event in response to the successful
allocation of resources. A snapshot of job's annotation dictionary, after
the above update, is included in the alloc
event context per RFC 21,
thus preserving it in job record when the allocation is successful.
After the SUCCESS response, the sched.alloc
request is complete and may be
retired by the job manager and scheduler.
While a job is in SCHED state, the scheduler MAY send multiple ANNOTATE
type responses to the sched.alloc
request to update scheduler-defined
information for display by the job listing tool.
In addition to the above REQUIRED keys, the ANNOTATE response includes the REQUIRED key:
- annotations
- (object) key value pairs
The job manager SHALL maintain a dictionary of annotations for each job.
Each ANNOTATE response and the SUCCESS response (if it contains annotations) SHALL update the dictionary according to the following rules:
- If a key exists and is a dictionary, and the new value is a dictionary, the rules below SHALL be applied to the dictionary recursively.
- If a key exists, its value SHALL be replaced with the new value.
- If a key exists and the new value is JSON null, the key SHALL be removed.
- If a key does not exist, the key SHALL be added with the new value.
The key MAY be one of the following:
- sched
- (dictionary) dictionary object containing scheduler specific annotations
- sched.t_estimate
- (double) estimated absolute start time in seconds since UNIX epoch
- sched.reason_pending
- (string) human readable reason job is pending
- sched.resource_summary
- (string) human readable overview of assigned resources
- sched.queue
- (string) human readable identification of job queue
- user
- (dictionary) dictionary object containing user specific annotations
A scheduler MAY define additional sched
keys as needed.
A value MAY be any valid JSON value.
Example:
{
"id": 1552593348,
"type": 1,
"annotations": {
"sched": {
"t_estimate": 593016000.0,
"reason_pending": "requested GPUs are unavailable"
}
}
}
Annotations SHALL be considered volatile until a SUCCESS response is received
to the sched.alloc
request, as described in Alloc Success above.
Annotations SHALL be discarded by the job manager if the allocation fails.
If the resource request can never be fulfilled, the scheduler SHALL
respond to the sched.alloc
with a DENY type response.
In addition to the above REQUIRED Keys, the DENY response includes the OPTIONAL key:
- note
- (string) the reason why the allocation cannot ever be granted
Example:
{
"id": 1552593348,
"type": 2,
"note": "more nodes requested than configured"
}
If present, the note SHALL be added to the exception
event context
generated by the job manager when processing the allocation failure.
After the DENY response, the sched.alloc
request is complete and may be
retired by the job manager and scheduler.
When the scheduler receives a sched.cancel
request for a job (see below),
it SHALL respond to the corresponding sched.alloc
request with response
type CANCEL. Only the REQUIRED keys above are allowed in a CANCEL response.
Example:
{
"id": 1552593348,
"type": 3
}
After the CANCEL response, the sched.alloc
request is complete and may be
retired by the job manager and scheduler.
The job manager may cancel a pending sched.alloc
request by sending
a request to sched.cancel
with payload consisting of a JSON object
with the following REQUIRED key:
- id
- (integer) job ID
Example:
{
"id": 1552593348
}
The scheduler SHALL NOT respond directly to the sched.cancel
request.
Instead, if a sched.alloc
request is pending for the specified job,
it SHALL respond to the sched.alloc
request with a CANCEL response
as described above. If the specified job does not have a pending
sched.alloc
request, the request SHALL be ignored by the scheduler.
Note that receipt of a sched.cancel
does not necessarily indicate
that the job is canceled. For example, the job manager may cancel all
outstanding sched.alloc
requests in response to the queue being
administratively disabled, or to make room for higher priority jobs
in single
mode.
When jobs with outstanding sched.alloc
requests are re-prioritized,
the job manager notifies the scheduler by sending a sched.prioritize
request. The request payload consists of a JSON object with the following
REQUIRED key:
- jobs
- (array) list of [id, priority] tuples
Each tuple SHALL consist of a two element array, containing:
- [0]
- (integer) job ID
- [1]
- (integer) priority in the range of 0 through 4294967295
Example:
{
"jobs":[
[49056579584, 444],
[57428410368, 298],
[63988301824, 343205],
[69675778048, 99]
]
}
Job IDs which cannot be correlated to a pending sched.alloc
request
may be safely ignored.
No response is sent to the sched.prioritize
request.
Note
A job manager priority plugin MAY initiate a priority update of many
jobs at once. The job manager captures these updates in a single
sched.prioritize
request.
The job manager MAY request an adjustment to the expiration time of an
existing allocation by sending a sched.expiration
request. The request
payload consists of a JSON object with the following REQUIRED keys:
- id
- (integer) job ID
- expiration
- (integer) the proposed new expiration time, in seconds since the Unix Epoch (1970-01-01 UTC). It MAY reduce or extend the current expiration time.
The response consists of an empty payload on success.
The request MAY fail, for example if:
- The job ID is invalid or does not currently have an allocation.
- The new expiration time would invalidate an advance reservation.
- The new expiration is less than the instance lifetime.
- The scheduler does not implement
sched.expiration
.
Note
The job-manager SHALL interpret an ENOSYS error response to
sched.expiration
as "not implemented" and process the new
expiration time as though the request were successful. Schedulers
that require accurate expiration times SHOULD implement this RPC to
avoid making schedules that are based on outdated information.
Note that enforcement of expiration times is the responsibility of the
execution system, not the scheduler.
The job manager SHALL send one or more sched.free
requests to release
allocated resources to the scheduler. The request payload consists of
a JSON object with the following REQUIRED keys:
- id
- (integer) job ID
- R
- (object) RFC 20 resource set from which the
scheduling
key MAY be omitted. This resource set MAY be a subset of the original allocation.
and the following OPTIONAL key:
- final
- (boolean) If present and true, this free request is the final one for the job. If absent or false, more free requests are forthcoming for the job.
Example:
{
"id": 1552593348,
"R": {
"version": 1,
"execution": {
"R_lite": [
{ "rank": "0", "children": { "core": "0-3" } }
],
"nodelist": [ "test0" ],
"starttime": 1710076092,
"expiration": 1710076122
}
}
"final":true
}
Upon receipt of a sched.free
request, the scheduler SHOULD mark the
designated resources as available for reuse.
When a job's resources are released with multiple sched.free
requests,
the scheduler MAY assume that the resources associated with a given rank
in the original allocation are never split over multiple free requests.
There is no response.
If the job manager receives a conventional Flux error response to
a sched.alloc
or sched.free
request, it SHALL log the error
and suspend scheduling operations. This ensures that, if the scheduler
is not loaded, and the broker responds with an ENOSYS error on its behalf,
the job manager behaves appropriately.
Similarly, if the job manager receives a disconnect
request from the
scheduler, it SHALL suspend scheduler operations.
Operations MAY resume if the scheduler re-establishes itself with the
job-manager.sched-hello
and job-manager.sched-ready
handshakes.
When a job encounters a fatal exception, the job manager transitions it to CLEANUP state.
Upon the job entering CLEANUP state, the job manager sends a sched.cancel
request on its behalf if the job has an outstanding sched.alloc
request.
If the job is holding resources when it enters CLEANUP, the job manager sends
a sched.free
request.
If the scheduler is monitoring job exceptions, it SHOULD NOT react in ways that might conflict with the job manager's actions.
A scheduler or other entity MAY register a generic feasibility
service
name through which unsatisfiable jobs may be detected at job submission.
The feasibility
service MAY be registered on one or more nodes to
distribute the load of feasibility checks. The job ingest validator (the
main user of the feasibility
service) runs on all ranks and issues the
feasibility.check
RPC with FLUX_NODEID_ANY
. The request is routed to
a local feasibility
service if available, or an upstream one per RFC 3.
To determine the feasibility of a job request, a feasibility.check
request is sent with the following REQUIRED key:
- jobspec
- (object) The jobspec for which feasibility should be checked.
If the included jobspec could not ever be satisfied, even if all resources
were available and ready, then the feasibility.check
service SHALL
respond with a error response including a human readable error string.
The response SHALL consist of an empty payload on success.