From 066fc56c844ec3befa2a8f999a3d6f8a9255a5b1 Mon Sep 17 00:00:00 2001 From: Hobbs Date: Mon, 13 May 2024 08:11:10 -0700 Subject: [PATCH] rfc: add notification service design doc Problem: no design currently exists for the Flux email service as noted in flux-framework/flux-core#4435. Add a RFC-style document detailing this. --- README.md | 1 + index.rst | 7 ++ spec_44.rst | 204 +++++++++++++++++++++++++++++++++++++++++++++++++++ spell.en.pws | 2 + 4 files changed, 214 insertions(+) create mode 100644 spec_44.rst diff --git a/README.md b/README.md index 4fee37b..5e935a2 100644 --- a/README.md +++ b/README.md @@ -53,6 +53,7 @@ Table of Contents - [41/Job Information Service](spec_41.rst) - [42/Subprocess Server Protocol](spec_42.rst) - [43/Job List Service](spec_43.rst) +- [44/Flux Library for Adaptable Notifications](spec_44.rst) Build Instructions ------------------ diff --git a/index.rst b/index.rst index d3c7feb..7f00c87 100644 --- a/index.rst +++ b/index.rst @@ -283,6 +283,12 @@ standard I/O management of remote processes. The Flux Job List Service provides read-only summary information for jobs. +:doc:`spec_44` +~~~~~~~~~~~~~~ + +This specification describes the Flux service that allows users to +receive external notifications for events in a Flux job. + .. Each file must appear in a toctree .. toctree:: :hidden: @@ -328,3 +334,4 @@ The Flux Job List Service provides read-only summary information for jobs. spec_41 spec_42 spec_43 + spec_44 diff --git a/spec_44.rst b/spec_44.rst new file mode 100644 index 0000000..f7d6d3b --- /dev/null +++ b/spec_44.rst @@ -0,0 +1,204 @@ +.. github display + GitHub is NOT the preferred viewer for this file. Please visit + https://flux-framework.rtfd.io/projects/flux-rfc/en/latest/spec_28.html + +44/Flux Library for Adaptable Notifications Version 1 +########################################################### + +This specification describes the Flux service that allows users to +receive external notifications for events in a Flux job. + +.. list-table:: + :widths: 25 75 + + * - **Name** + - github.com/flux-framework/rfc/spec_44.rst + * - **Editor** + - William Hobbs + * - **State** + - raw + +Language +******** + +.. include:: common/language.rst + +Related Standards +***************** + +- :doc:`spec_14` +- :doc:`spec_21` +- :doc:`spec_25` + +Background +********** + +The Flux Library for Adaptable Notifications (FLAN) provides event-driven +functionality that sends external notifications of job events. + +Terminology +*********** + +These terms may have broader meaning in other RFCs or the Flux project. To +avoid confusion, below is a glossary of terms as they apply in this document. + +Notification + An email or other notification triggered by FLAN but whose ultimate delivery + is handled by an external service. + +Notification-enabled jobs + Jobs that include a jobspec attribute requesting a notification for certain + events in the job's life cycle. For a more detailed definition of job events, + refer to :doc:`spec_21`. + +Requirements +************ + + - By default, do not notify a user of any job events. + - Allow the user to override this default with a jobspec attribute, + ``system.notify``. + - Support notification after any event of the job, where events are defined in + :doc:`spec_21`. + - Support email for end user notification delivery. + - Allow for extensibility via plugins to support more end user notification + delivery services, such as Slack and Mattermost. The implementation of + plugins for any service other than email is not a requirement. + - Utilize as few resources as possible in the Flux job-manager. Under no + circumstances will a notification block any stage or event of a Flux job. + - Provide configurable rate-limiting to ensure users can never be overwhelmed + by notifications. + +Implementation +************** + +FLAN SHALL be implemented by a service that MAY be started under ``systemd``. + +Introduction +============ + +The Flux job-manager journal of events (JoE) is an interface that streams job +events in real-time for jobs in a Flux instance. The JoE can be configured to +send all completed events in addition to streaming real-time events. The JoE +includes annotations such as jobspec and R where appropriate. + +The Flux Library for Adaptable Notifications (FLAN) provides a server that +opens a streaming RPC request to the JoE, receives events from the JoE, stores +jobspec, event logs, and resource sets by jobid for all active jobs, and allows +for clients to asynchronously perform operations (such as send emails) based on +these events. + +FLAN implements an event dispatcher to handle batches of events +based on a timer, allowing for a massive number of events to be handled +semi-synchronously with rate limiting for massive job throughput. Since +FLAN is run as a separate process alongside a Flux instance, it can never +block the job-manager or other critical Flux services. FLAN must be run under +the instance owner credentials but needn't be run on the same node as the +rank 0 Flux broker. + +Initial Request +=============== + +FLAN SHALL open a streaming RPC request to the JoE. FLAN SHALL request the full +journal, including completed events. + +Initial Response(s) +=================== + +An "initial response" is any response prior to the JoE's "sentinel," which +indicates that the backlog has completed transmission. + +Initial responses are per-jobid and can include multiple events. FLAN SHALL +store the annotations (jobspec, R) per jobid and process each event. + +Event Dispatcher +================ + +Instead of handling each event sequentially, events shall be queued and handled +in batches by the event dispatcher. The event dispatcher SHALL contain a queue +of events to process. The event dispatcher SHALL process these events after +receiving a signal from the reactor's timer watcher. The timer watcher SHALL +have a configurable delta (time between wake-ups). + +Upon waking up, the event dispatcher shall determine if an event is "of +interest," and process the event if so. Only the most recent event for a given +job SHALL be processed. Processing of the event involves clients of the FLAN +server completing any process they specify, such as sending an email. + +On the initial run of the event dispatcher, FLAN SHALL compare the events in +its queue to a record in the KVS of "handled" events, and ignore "handled" +events. The initial run of the event dispatcher SHALL block subsequent runs. +For each subsequent iteration of the event dispatcher, FLAN SHALL write to the +KVS a record of the events it has processed before the event dispatcher goes to +sleep. + +Subsequent Responses +==================== + +Subsequent responses from the JoE shall be queued in the event dispatcher in +real-time, and processed when the dispatcher wakes up. A record of jobspec, R, +and eventlog for each event SHALL be stored, and the record removed by the +event dispatcher when it receives the ``clean`` event for a job. + +User Interface +************** + +Users SHALL create notification-enabled jobs by specifying an attribute in their +job's jobspec. Jobspec attributes are defined in :doc:`spec_25`. + +Basic Use Case +============== + +Users SHALL add the following attribute to their jobspec: + +.. code-block:: json + + { + "attributes": { + "system": { + "notify": "default" + } + } + } + +The default behavior SHALL be to send a notification via email to the user +when the job reaches the START and FINISH events. + +A future update to the jobspec API will make this jobspec attribute easily +accessible via a single argument to a command, ``--notify``. + +Advanced Use Cases +================== + +Only the basic use case SHALL be supported in v1. + +The ``system.notify`` jobspec attribute SHALL accept a dictionary containing some +or all of the following values: + +.. code-block:: json + + { + "attributes": { + "system": { + "notify": { + "service": "slack", + "handle": "elvis", + "include": ["R", "eventlog", "return_code"], + "states": ["start", "prolog_finish"] + } + } + } + } + +Edge Cases +********** + +These edge cases MAY be supported in FLAN v1. + +Expiration of notifications +=========================== + +In certain cases, a restart of the service may be delayed such that events of interest +on notification-enabled jobs are long past. FLAN MAY support an "expiration" setting +which would stop any notification from final delivery if a set amount of time had +passed since the event. + diff --git a/spell.en.pws b/spell.en.pws index 26b80fe..d48fd31 100644 --- a/spell.en.pws +++ b/spell.en.pws @@ -8,6 +8,8 @@ org tarball tarballs adoc +Mattermost +JoE api env github