rfc: add notification service design doc #414
Open: wihobbs wants to merge 1 commit into flux-framework:master from wihobbs:flanrfc
@@ -0,0 +1,204 @@
.. github display
   GitHub is NOT the preferred viewer for this file. Please visit
   https://flux-framework.rtfd.io/projects/flux-rfc/en/latest/spec_28.html

44/Flux Library for Adaptable Notifications Version 1
###########################################################

This specification describes the Flux service that allows users to
receive external notifications for events in a Flux job.

.. list-table::
   :widths: 25 75

   * - **Name**
     - github.com/flux-framework/rfc/spec_44.rst
   * - **Editor**
     - William Hobbs <[email protected]>
   * - **State**
     - raw

Language
********

.. include:: common/language.rst

Related Standards
*****************

- :doc:`spec_14`
- :doc:`spec_21`
- :doc:`spec_25`

Background
**********

The Flux Library for Adaptable Notifications (FLAN) provides event-driven
functionality that sends external notifications of job events.

Terminology
***********

These terms may have broader meaning in other RFCs or the Flux project. To
avoid confusion, below is a glossary of terms as they apply in this document.

Notification
  An email or other notification triggered by FLAN but whose ultimate delivery
  is handled by an external service.

Notification-enabled jobs
  Jobs that include a jobspec attribute requesting a notification for certain
  events in the job's life cycle. For a more detailed definition of job events,
  refer to :doc:`spec_21`.

Requirements
************

- By default, do not notify a user of any job events.
- Allow the user to override this default with a jobspec attribute,
  ``system.notify``.
- Support notification after any event of the job, where events are defined in
  :doc:`spec_21`.
- Support email for end user notification delivery.
- Allow for extensibility via plugins to support more end user notification
  delivery services, such as Slack and Mattermost. The implementation of
  plugins for any service other than email is not a requirement.
- Utilize as few resources as possible in the Flux job-manager. Under no
  circumstances will a notification block any stage or event of a Flux job.
- Provide configurable rate-limiting to ensure users can never be overwhelmed
  by notifications.

Implementation
**************

FLAN SHALL be implemented by a service that MAY be started under ``systemd``.

Introduction
============

The Flux job-manager journal of events (JoE) is an interface that streams job
events in real time for jobs in a Flux instance. The JoE can be configured to
send all completed events in addition to streaming real-time events. The JoE
includes annotations such as jobspec and R where appropriate.

The Flux Library for Adaptable Notifications (FLAN) provides a server that
opens a streaming RPC request to the JoE, receives events from the JoE, stores
jobspec, event logs, and resource sets by jobid for all active jobs, and allows
clients to asynchronously perform operations (such as sending emails) based on
these events.

FLAN implements an event dispatcher that handles events in batches on a
timer, so that a very large number of events can be processed
semi-synchronously, with rate limiting, under high job throughput. Since
FLAN runs as a separate process alongside a Flux instance, it can never
block the job-manager or other critical Flux services. FLAN MUST be run under
the instance owner credentials but need not be run on the same node as the
rank 0 Flux broker.

Initial Request
===============

FLAN SHALL open a streaming RPC request to the JoE. FLAN SHALL request the full
journal, including completed events.

Initial Response(s)
===================

An "initial response" is any response prior to the JoE's "sentinel," which
indicates that the backlog has completed transmission.

Initial responses are per-jobid and can include multiple events. FLAN SHALL
store the annotations (jobspec, R) per jobid and process each event.
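
As a rough illustration of the backlog phase, the sketch below consumes
responses until a sentinel is seen, storing annotations per jobid and queueing
each event. The response layout (``id``, ``events``, and optional ``jobspec``
and ``R`` keys), the ``is_sentinel`` test, and all function names are
assumptions made for this example; this RFC does not define them.

.. code-block:: python

   def is_sentinel(response):
       # Assumption: the sentinel is a response that carries no events.
       return not response.get("events")


   def consume_backlog(responses, store, queue):
       """Read initial responses until the sentinel; return True if it was seen."""
       for response in responses:
           if is_sentinel(response):
               return True
           jobid = response["id"]
           record = store.setdefault(jobid, {"jobspec": None, "R": None})
           # Annotations appear only on some responses, so keep the last seen.
           for key in ("jobspec", "R"):
               if key in response:
                   record[key] = response[key]
           for event in response["events"]:
               queue.append((jobid, event))
       return False

Responses that arrive after the sentinel are left unread by
``consume_backlog`` so that they can be treated as real-time events.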

Event Dispatcher
================

Instead of handling each event sequentially, events SHALL be queued and handled
in batches by the event dispatcher. The event dispatcher SHALL contain a queue
of events to process. The event dispatcher SHALL process these events after
receiving a signal from the reactor's timer watcher. The timer watcher SHALL
have a configurable delta (time between wake-ups).
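
A minimal sketch of the queue-and-batch idea follows. It does not model the
Flux reactor or its timer watcher: ``wake_up`` stands in for whatever callback
the timer would invoke, and the class name, method names, and ``handler``
callback are all hypothetical.

.. code-block:: python

   from collections import deque


   class EventDispatcher:
       """Queue events as they arrive and handle them in batches on wake-up."""

       def __init__(self, handler):
           self.queue = deque()
           self.handler = handler  # hypothetical callback taking a list of events

       def enqueue(self, jobid, event):
           self.queue.append((jobid, event))

       def wake_up(self):
           """Drain the queue and hand its contents over as one batch."""
           batch = list(self.queue)
           self.queue.clear()
           if batch:
               self.handler(batch)
           return len(batch)

Because ``enqueue`` only appends to a queue, the cost of receiving an event
stays small no matter how slow the handler is; the work happens once per
wake-up.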

Upon waking up, the event dispatcher SHALL determine whether an event is "of
interest," and process the event if so. Only the most recent event for a given
job SHALL be processed. Processing an event involves clients of the FLAN
server carrying out whatever action they specify, such as sending an email.
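
One possible reading of this selection step is sketched below: events are
first filtered by interest, and then only the newest remaining event per job
is kept. That ordering is an interpretation, not something the text above
settles, and the ``select_events`` name and the ``interest`` mapping are
hypothetical.

.. code-block:: python

   def select_events(batch, interest):
       """Return {jobid: event} holding the newest event of interest per job.

       batch    -- list of (jobid, event) pairs in arrival order
       interest -- dict mapping jobid to the set of event names requested
       """
       selected = {}
       for jobid, event in batch:
           if event["name"] in interest.get(jobid, ()):
               selected[jobid] = event  # later events replace earlier ones
       return selected

Jobs with no entry in ``interest`` never appear in the result, which matches
the requirement that users are not notified by default.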

On the initial run of the event dispatcher, FLAN SHALL compare the events in
its queue to a record in the KVS of "handled" events, and ignore "handled"
events. The initial run of the event dispatcher SHALL block subsequent runs.
For each subsequent iteration of the event dispatcher, FLAN SHALL write to the
KVS a record of the events it has processed before the event dispatcher goes to
sleep.
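
The bookkeeping described here could look roughly like the following sketch.
A plain ``dict`` stands in for the Flux KVS, and the ``flan.handled`` key, the
JSON encoding, and the choice of jobid, event name, and timestamp as an event's
identity are all assumptions made for illustration.

.. code-block:: python

   import json


   def event_key(jobid, event):
       # Assumption: jobid, event name, and timestamp identify an event.
       return f"{jobid}:{event['name']}:{event['timestamp']}"


   def drop_handled(queue, kvs, key="flan.handled"):
       """Initial run: filter out events already recorded as handled."""
       handled = set(json.loads(kvs.get(key, "[]")))
       return [(j, e) for j, e in queue if event_key(j, e) not in handled]


   def record_handled(processed, kvs, key="flan.handled"):
       """Later runs: persist what was processed before going back to sleep."""
       handled = set(json.loads(kvs.get(key, "[]")))
       handled.update(event_key(j, e) for j, e in processed)
       kvs[key] = json.dumps(sorted(handled))

With this record in place, a restarted service that replays the full journal
can skip events it already acted on instead of notifying twice.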

Subsequent Responses
====================

Subsequent responses from the JoE SHALL be queued in the event dispatcher in
real time and processed when the dispatcher wakes up. A record of jobspec, R,
and eventlog SHALL be stored for each job, and the record SHALL be removed by
the event dispatcher when it receives the ``clean`` event for that job.
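
The per-job record and its removal on ``clean`` might be sketched as below.
The ``JobRecords`` class and its method names are hypothetical, and the record
layout is only one plausible choice.

.. code-block:: python

   class JobRecords:
       """Per-job storage of jobspec, R, and eventlog for active jobs."""

       def __init__(self):
           self.records = {}

       def add_event(self, jobid, event, jobspec=None, R=None):
           record = self.records.setdefault(
               jobid, {"jobspec": None, "R": None, "eventlog": []}
           )
           if jobspec is not None:
               record["jobspec"] = jobspec
           if R is not None:
               record["R"] = R
           record["eventlog"].append(event)

       def dispatch(self, jobid, event):
           """Called by the dispatcher per event; drops the record on clean."""
           if event["name"] == "clean":
               self.records.pop(jobid, None)

Removing the record only when the dispatcher sees ``clean`` keeps the data
available to any notification that is still waiting in the queue for that job.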

User Interface
**************

Users SHALL create notification-enabled jobs by specifying an attribute in their
job's jobspec. Jobspec attributes are defined in :doc:`spec_25`.

Basic Use Case
==============

Users SHALL add the following attribute to their jobspec:

.. code-block:: json

   {
     "attributes": {
       "system": {
         "notify": "default"
       }
     }
   }

The default behavior SHALL be to send a notification via email to the user
when the job reaches the ``start`` and ``finish`` events.
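
To make the default concrete, the sketch below maps a jobspec to the set of
events that would trigger a notification. The ``requested_events`` helper is
hypothetical, and rejecting any value other than ``"default"`` is an
assumption about v1 behavior rather than a stated rule.

.. code-block:: python

   DEFAULT_EVENTS = frozenset({"start", "finish"})


   def requested_events(jobspec):
       """Return the event names a job asked to be notified about."""
       attributes = jobspec.get("attributes") or {}
       system = attributes.get("system") or {}
       notify = system.get("notify")
       if notify is None:
           return frozenset()  # no attribute: never notify
       if notify == "default":
           return DEFAULT_EVENTS
       # Assumption: anything else is unsupported in this sketch.
       raise ValueError(f"unsupported system.notify value: {notify!r}")

A jobspec without ``system.notify`` yields an empty set, so such jobs can be
skipped early by the dispatcher.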

A future update to the jobspec API will make this jobspec attribute easily
accessible via a single argument to a command, ``--notify``.

Advanced Use Cases
==================

Only the basic use case SHALL be supported in v1.

The ``system.notify`` jobspec attribute SHALL accept a dictionary containing some
or all of the following values:

.. code-block:: json

   {
     "attributes": {
       "system": {
         "notify": {
           "service": "slack",
           "handle": "elvis",
           "include": ["R", "eventlog", "return_code"],
           "states": ["start", "prolog_finish"]
         }
       }
     }
   }
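
A validator for the dictionary form might look like the sketch below. Only
the four key names come from the example above; the expected types, the
rejection of unknown keys, and the ``validate_notify`` name are assumptions.

.. code-block:: python

   # Key names are taken from the example; the types are assumed.
   ALLOWED_KEYS = {"service": str, "handle": str, "include": list, "states": list}


   def validate_notify(notify):
       """Check a system.notify dictionary and return a shallow copy of it."""
       if not isinstance(notify, dict):
           raise TypeError("system.notify must be a dictionary")
       for key, value in notify.items():
           if key not in ALLOWED_KEYS:
               raise ValueError(f"unknown system.notify key: {key}")
           expected = ALLOWED_KEYS[key]
           if not isinstance(value, expected):
               raise TypeError(f"{key} must be of type {expected.__name__}")
           if expected is list and not all(isinstance(v, str) for v in value):
               raise TypeError(f"{key} must be a list of strings")
       return dict(notify)

Since the dictionary may contain "some or all" of the keys, the validator
accepts any subset and leaves the choice of defaults to the caller.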

Edge Cases
**********

These edge cases MAY be supported in FLAN v1.

Expiration of notifications
===========================

In certain cases, a restart of the service may be delayed such that events of
interest on notification-enabled jobs are long past. FLAN MAY support an
"expiration" setting that prevents final delivery of any notification if a set
amount of time has passed since the event.
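
An expiration check of this kind could be as small as the sketch below. It
assumes each event carries a ``timestamp`` in seconds and that the setting is
a number of seconds, with ``None`` meaning expiration is disabled; the
``is_expired`` name is hypothetical.

.. code-block:: python

   def is_expired(event, now, expiration):
       """Return True if the event is too old for its notification to be sent."""
       if expiration is None:
           return False  # expiration disabled
       return now - event["timestamp"] > expiration

The dispatcher would call this just before handing an event to a delivery
client, passing the current time as ``now``.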

@@ -8,6 +8,8 @@ org
tarball
tarballs
adoc
Mattermost
JoE
api
env
github
Can you explain what you mean by rate limiting here? Arguably we can have it, but we shouldn't ever find ourselves in a scenario where "the same" event is triggering notifications so many times as to need it. When would this happen?
@grondo and I discussed this yesterday.
A user could submit 1000+ jobs at the system instance and just reuse jobspecs for all of them, not realizing they had requested notification. We don't want to spam them with an individual email for each of these, or overwhelm the email server.
Rate limiting could be implemented with a reactor-like design (maybe an "event dispatcher" is the term). As notifications are generated, they could be pushed into a list/queue. The dispatcher could wake up at a set interval, check for new notifications, and batch all of the user's jobs into just one notification. It's not critical that the notifications be instantaneous.
Take 1000+ to mean "some large number for which we don't want to send individual notifications."