-
Notifications
You must be signed in to change notification settings - Fork 13
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
rfc: add notification service design doc
Problem: no design currently exists for the Flux email service as noted in flux-framework/flux-core#4435. Add a RFC-style document detailing this.
- Loading branch information
Showing
6 changed files
with
275 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
attributes: | ||
system: | ||
notify: | ||
include: "{id.f58} {event} {return_code}" | ||
service: "slack" | ||
handle: "elvis" | ||
events: "FINISH" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
attributes: | ||
system: | ||
notify: "default" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,253 @@ | ||
.. github display | ||
GitHub is NOT the preferred viewer for this file. Please visit | ||
https://flux-framework.rtfd.io/projects/flux-rfc/en/latest/spec_28.html | ||
44/Flux Library for Adaptable Notifications Version 1 | ||
########################################################### | ||
|
||
This specification describes the Flux service that allows users to | ||
receive external notifications for events in a Flux job. | ||
|
||
.. list-table:: | ||
:widths: 25 75 | ||
|
||
* - **Name** | ||
- github.com/flux-framework/rfc/spec_44.rst | ||
* - **Editor** | ||
- William Hobbs <[email protected]> | ||
* - **State** | ||
- raw | ||
|
||
Language | ||
******** | ||
|
||
.. include:: common/language.rst | ||
|
||
Related Standards | ||
***************** | ||
|
||
- :doc:`spec_19` | ||
- :doc:`spec_21` | ||
- :doc:`spec_25` | ||
|
||
Background | ||
********** | ||
|
||
Toward the goal of supporting users who run batch jobs with variable end time | ||
dependent on queues, runtime, and other factors, the Flux Library for Adaptable | ||
Notifications (FLAN) provides event-driven functionality that sends external | ||
notifications of job events. | ||
|
||
Terminology | ||
*********** | ||
|
||
These terms may have broader meaning in other RFCs or the Flux project. To | ||
avoid confusion, below is a glossary of terms as they apply in this document. | ||
|
||
Notification | ||
An email, Slack message, Mattermost message, etc. triggered by FLAN but | ||
ultimately external to the FLAN service. | ||
|
||
Notification-enabled jobs | ||
Jobs that include a jobspec attribute requesting a notification for certain | ||
events in the job's life cycle. For a more detailed definition of job events, | ||
refer to :doc:`spec_21`. | ||
|
||
|
||
Requirements | ||
************ | ||
|
||
- By default in a system-instance, do not notify a user of any job events. | ||
Allow the user to override this default with a jobspec attribute, | ||
system.notify. | ||
- Support notification after any event of the job, where events are defined in | ||
:doc:`spec_21`. | ||
- Support email as the primary end user notification delivery. | ||
- Build a driver capable of sending POST requests to chat services for | ||
notification delivery, provided they have an API capable of accepting such | ||
requests. Examples include but are not limited to Mattermost and Slack. | ||
- Utilize as few resources as possible in the Flux job-manager. Under no | ||
circumstances will a notification block any stage or event of a Flux job. | ||
- Provide configurable rate-limiting to ensure users can never be overwhelmed | ||
by notifications. | ||
|
||
Implementation | ||
************** | ||
|
||
FLAN SHALL be implemented in two parts: | ||
|
||
The jobtap plugin | ||
A shared library based on the API defined in | ||
`flux-jobtap-plugins(7) <https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man7/flux-jobtap-plugins.html>`_ | ||
that streams the jobids of notification-enabled jobs to the Python driver. | ||
|
||
The Python driver | ||
A Python process used for tracking notification-enabled jobs through the job | ||
life cycle. Started by the flux user on the node containing the rank 0 broker | ||
in a cluster, it asynchronously monitors the events of all notification-enabled | ||
jobs. It attaches callbacks to certain events and sends notifications. | ||
|
||
Initial Request | ||
--------------- | ||
|
||
After the jobtap plugin has been loaded in the job-manager, the Python driver | ||
SHALL send a ``notify.enable`` streaming RPC request to the jobtap plugin | ||
at initialization. | ||
|
||
The ``notify.enable`` request has no payload. | ||
|
||
At initialization the Python driver SHALL create a kvs subdirectory, ``notify``. | ||
Should this directory already exist, the Python driver SHALL NOT crash. The | ||
Python driver SHALL traverse the existing directory and record the jobids in it. | ||
The Python driver SHALL reconcile the jobids in the KVS with the jobids in the | ||
responses to the initial RPC. | ||
|
||
Initial Responses | ||
----------------- | ||
|
||
The jobtap plugin SHALL keep a record of jobids for jobs that are ACTIVE and | ||
notification-enabled. On initialization, all of the jobids in this record | ||
SHALL be sent as individual responses to the Python driver. | ||
|
||
jobid | ||
As defined in :doc:`spec_19`, a single jobid for a notification-enabled job. | ||
|
||
.. note:: | ||
The initial responses are intended to restore state should the Python driver | ||
crash. | ||
|
||
Additional Responses | ||
-------------------- | ||
|
||
The jobtap plugin SHALL continue to send responses to the initial | ||
``notify.enable`` RPC request whenever notification-enabled jobs enter the | ||
DEPEND state. The jobtap plugin SHALL add these jobids to its record | ||
of ACTIVE, notification-enabled jobs. | ||
|
||
For each response received by the Python driver, the driver SHALL create a | ||
KVS subdirectory, ``notify.<jobid>``. In this directory the driver SHALL | ||
insert keys representing the job events for which users have requested a | ||
notification. These keys values SHALL be empty. The key SHALL be deleted | ||
after the corresponding notification is sent. | ||
|
||
The Python driver MUST then asynchronously monitor the job as it reaches | ||
events of interest. | ||
|
||
When the job reaches an event of interest, the Python driver SHALL | ||
generate an email and send it to the user. (FLAN MAY eventually support | ||
other means of notification delivery, such as chat services.) The Python | ||
driver SHALL subsequently delete the corresponding key in the KVS, | ||
``notify.<jobid>.<state>``. | ||
|
||
The ``notify.<jobid>`` KVS subdirectory SHALL be deleted when the job reaches | ||
an INACTIVE state. If the ``notify.<jobid>`` directory is non-empty upon | ||
reaching the INACTIVE state, this indicates some notifications have been missed. | ||
The Python driver SHALL send a final notification to the user documenting | ||
that their notification-enabled job has reached an INACTIVE state. | ||
|
||
.. note:: | ||
This design is intended to ensure that no double-notifications are sent upon | ||
the restart of the Python script, the jobtap plugin, or the job-manager. | ||
|
||
Error Response | ||
-------------- | ||
|
||
If an error response is returned to ``notify.enable``, this indicates that the | ||
jobtap plugin is not loaded in the job-manager. The Python driver SHALL exit | ||
immediately, and print an appropriate error message. | ||
|
||
Disconnect Request | ||
------------------ | ||
|
||
If a disconnect request is received by the jobtap plugin, this indicates the | ||
Python driver has exited. The jobtap plugin SHALL continue to add notification- | ||
enabled jobs to its record as they enter the DEPEND state. When the Python | ||
driver reconnects, the jobtap plugin SHALL respond to its initial ``notify.enable`` | ||
RPC request with a response RPC for each jobid that is notification-enabled. | ||
|
||
User Interface | ||
************** | ||
|
||
Users SHALL create notification-enabled jobs by specifying an attribute in their | ||
job's jobspec. Jobspec attributes are defined in :doc:`spec_25` | ||
|
||
Basic Use Case | ||
-------------- | ||
|
||
Users SHALL add the following attribute to their jobspec: | ||
|
||
.. literalinclude:: data/spec_44/example2.yaml | ||
:language: yaml | ||
|
||
The default behavior SHALL be to send a notification via email to the user | ||
when the job reaches the START and FINISH events. | ||
|
||
Advanced Use Cases | ||
------------------ | ||
|
||
Only the basic use case SHALL be supported in v1. | ||
|
||
The ``system.notify`` jobspec attribute SHALL accept a dictionary containing some | ||
or all of the following values: | ||
|
||
.. literalinclude:: data/spec_44/example1.yaml | ||
:language: yaml | ||
|
||
For System Administrators | ||
------------------------- | ||
|
||
The webhooks and other secrets required to connect to chat services SHALL be included | ||
in a ``config.toml`` file. The path to this file MUST be provided to the FLAN | ||
Python driver on initialization. Note that best practice for managing webhooks is | ||
to keep them secret. | ||
|
||
Example Life Cycle of a Notification-Enabled Job | ||
************************************************ | ||
|
||
Coming soon! | ||
|
||
Edge Cases | ||
********** | ||
|
||
These edge cases MAY be supported in FLAN v1. | ||
|
||
Restarting the job-manager | ||
-------------------------- | ||
|
||
In the event the job-manager crashes or is shut down the Python driver SHALL exit | ||
immediately and log an error. | ||
|
||
Flux does not currently support restarting with running jobs. However, on a system | ||
restart, all events for all ACTIVE jobs are replayed. This means that when each | ||
notification-enabled job reaches the DEPEND event, the jobtap plugin SHALL | ||
send a streaming RPC response and record the jobid. The Python driver, upon | ||
receiving a new jobid MUST ensure that the jobid does not have | ||
a previous entry in the KVS. Since the KVS is reloaded on a restart, any outstanding | ||
notifications SHALL have corresponding keys there. If a jobid received by the Python | ||
driver already has a KVS subdirectory, the Python driver SHALL ignore the job's | ||
event notification requests in the jobspec and only send notifications that | ||
correspond with the keys in the KVS. This prevents a double-notification of the user | ||
for the same job state on a restart of the job-manger or FLAN service. | ||
|
||
Expiration of notifications | ||
--------------------------- | ||
|
||
In certain cases, a restart of the service may be delayed such that events of interest | ||
on notification-enabled jobs are long past. FLAN MAY support an "expiration" setting | ||
which would stop any notification from final delivery if a set amount of time had | ||
passed since the event. | ||
|
||
Subinstance notifications | ||
------------------------- | ||
|
||
Due to the recursive launch feature of Flux, users may wish to have notifications | ||
for states of batch jobs that are not at the system-instance level. This MAY NOT | ||
be supported in FLAN v1. | ||
|
||
Invalid jobspec attributes | ||
-------------------------- | ||
|
||
FLAN MAY eventually provide a plugin for validating the advanced use | ||
cases detailed above. In the interim, if a user tries to utilize the advanced | ||
case and provide junk keys or values, FLAN SHALL defer to default mode. | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters