rfc: add notification service design doc #414

1 change: 1 addition & 0 deletions README.md
@@ -53,6 +53,7 @@ Table of Contents
- [41/Job Information Service](spec_41.rst)
- [42/Subprocess Server Protocol](spec_42.rst)
- [43/Job List Service](spec_43.rst)
- [44/Flux Library for Adaptable Notifications](spec_44.rst)

Build Instructions
------------------
7 changes: 7 additions & 0 deletions index.rst
@@ -283,6 +283,12 @@ standard I/O management of remote processes.

The Flux Job List Service provides read-only summary information for jobs.

:doc:`spec_44`
~~~~~~~~~~~~~~

This specification describes the Flux service that allows users to
receive external notifications for events in a Flux job.

.. Each file must appear in a toctree
.. toctree::
:hidden:
@@ -328,3 +334,4 @@ The Flux Job List Service provides read-only summary information for jobs.
spec_41
spec_42
spec_43
spec_44
204 changes: 204 additions & 0 deletions spec_44.rst
@@ -0,0 +1,204 @@
.. github display
GitHub is NOT the preferred viewer for this file. Please visit
https://flux-framework.rtfd.io/projects/flux-rfc/en/latest/spec_44.html

44/Flux Library for Adaptable Notifications Version 1
###########################################################

This specification describes the Flux service that allows users to
receive external notifications for events in a Flux job.

.. list-table::
:widths: 25 75

* - **Name**
- github.com/flux-framework/rfc/spec_44.rst
* - **Editor**
- William Hobbs <[email protected]>
* - **State**
- raw

Language
********

.. include:: common/language.rst

Related Standards
*****************

- :doc:`spec_14`
- :doc:`spec_21`
- :doc:`spec_25`

Background
**********

The Flux Library for Adaptable Notifications (FLAN) provides event-driven
functionality that sends external notifications of job events.

Terminology
***********

These terms may have broader meaning in other RFCs or the Flux project. To
avoid confusion, below is a glossary of terms as they apply in this document.

Notification
An email or other notification triggered by FLAN but whose ultimate delivery
is handled by an external service.

Notification-enabled jobs
Jobs that include a jobspec attribute requesting a notification for certain
events in the job's life cycle. For a more detailed definition of job events,
refer to :doc:`spec_21`.

Requirements
************

- By default, do not notify a user of any job events.
- Allow the user to override this default with a jobspec attribute,
``system.notify``.
- Support notification after any event of the job, where events are defined in
:doc:`spec_21`.
- Support email for end user notification delivery.
- Allow for extensibility via plugins to support more end user notification
delivery services, such as Slack and Mattermost. The implementation of
plugins for any service other than email is not a requirement.
- Utilize as few resources as possible in the Flux job-manager. Under no
circumstances will a notification block any stage or event of a Flux job.
- Provide configurable rate-limiting to ensure users can never be overwhelmed
by notifications.

Review comment (Member):

Can you explain what you mean by rate limiting here? Arguably we can have it, but we shouldn't ever find ourselves in a scenario where "the same" event is triggering notifications so many times as to need it. When would this happen?

Reply (Member Author):

@grondo and I discussed this yesterday.

A user could submit 1000+ jobs at the system instance and just reuse jobspecs for all of them, not realizing they had requested notification. We don't want to spam them with an individual email for each of these, or overwhelm the email server.

Rate limiting could be implemented with a reactor-like design (maybe an "event dispatcher" is the term). As notifications are generated, they could be pushed into a list/queue. The dispatcher could wake up at a set interval, check for new notifications, and batch all of the user's jobs into just one notification. It's not critical that the notifications be instantaneous.

Reply (Member Author):

Take 1000+ to mean "some large number for which we don't want to send individual notifications."
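
The batching idea behind the rate-limiting requirement can be sketched as below. This is only an illustration under stated assumptions, not FLAN's implementation: ``PendingNotification``, ``batch_by_user``, ``render_digest``, and the digest wording are all hypothetical names invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingNotification:
    """Hypothetical record of one notification waiting for delivery."""
    userid: int
    jobid: int
    event: str


def batch_by_user(pending):
    """Group queued notifications so each user gets one message per wake-up."""
    batches = defaultdict(list)
    for note in pending:
        batches[note.userid].append(note)
    return dict(batches)


def render_digest(notes):
    """Summarize one user's batch as a single message body."""
    lines = [f"job {n.jobid}: {n.event}" for n in notes]
    return f"{len(notes)} job event(s):\n" + "\n".join(lines)


# 1000 jobs submitted by one user collapse into a single digest.
queue = [PendingNotification(userid=1001, jobid=j, event="finish")
         for j in range(1000)]
batches = batch_by_user(queue)
digest = render_digest(batches[1001])
```

With this shape, the number of outgoing messages per wake-up is bounded by the number of distinct users rather than the number of jobs.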

Implementation
**************

FLAN SHALL be implemented by a service that MAY be started under ``systemd``.

Introduction
============

The Flux job-manager journal of events (JoE) is an interface that streams job
events in real-time for jobs in a Flux instance. The JoE can be configured to
send all completed events in addition to streaming real-time events. The JoE
includes annotations such as jobspec and R where appropriate.

The Flux Library for Adaptable Notifications (FLAN) provides a server that
opens a streaming RPC request to the JoE, receives events from the JoE, stores
jobspec, event logs, and resource sets by jobid for all active jobs, and allows
for clients to asynchronously perform operations (such as send emails) based on
these events.

FLAN implements an event dispatcher that handles events in batches on a
timer. Batching lets FLAN rate-limit notifications and keep up with a large
number of events under high job throughput. Because FLAN runs as a separate
process alongside a Flux instance, it cannot block the job-manager or other
critical Flux services. FLAN MUST run under the instance owner's credentials
but need not run on the same node as the rank 0 Flux broker.

Initial Request
===============

FLAN SHALL open a streaming RPC request to the JoE. FLAN SHALL request the full
journal, including completed events.

Initial Response(s)
===================

An "initial response" is any response prior to the JoE's "sentinel," which
indicates that the backlog has completed transmission.

Initial responses are per-jobid and can include multiple events. FLAN SHALL
store the annotations (jobspec, R) per jobid and process each event.
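
A minimal sketch of backlog handling follows. The response layout is an assumption made for illustration: each response is modeled as a dict with ``id`` and ``events`` keys plus optional ``jobspec`` and ``R`` annotations, and the sentinel is modeled as a response whose ``id`` is ``-1``. The function name ``ingest_backlog`` is hypothetical.

```python
SENTINEL_ID = -1  # assumption: how this sketch marks the sentinel response


def ingest_backlog(responses, store, queue):
    """Consume responses up to the sentinel; return True once it is seen."""
    for resp in responses:
        if resp["id"] == SENTINEL_ID:
            return True
        record = store.setdefault(
            resp["id"], {"jobspec": None, "R": None, "eventlog": []}
        )
        # Annotations appear only on some responses; keep whatever arrives.
        for key in ("jobspec", "R"):
            if key in resp:
                record[key] = resp[key]
        # A single response may carry several events for the same jobid.
        for event in resp["events"]:
            record["eventlog"].append(event)
            queue.append((resp["id"], event))
    return False


store, queue = {}, []
backlog = [
    {"id": 1, "jobspec": {"attributes": {"system": {"notify": "default"}}},
     "events": [{"name": "submit"}]},
    {"id": 1, "R": {"version": 1},
     "events": [{"name": "alloc"}, {"name": "start"}]},
    {"id": SENTINEL_ID, "events": []},
    {"id": 2, "events": [{"name": "submit"}]},  # arrives after the sentinel
]
done = ingest_backlog(backlog, store, queue)
```

Here the response for job 2 is left unconsumed because it follows the sentinel and therefore counts as a subsequent response.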

Event Dispatcher
================

Rather than handling each event as it arrives, FLAN SHALL queue events and
handle them in batches in the event dispatcher. The event dispatcher SHALL
contain a queue of events to process. The event dispatcher SHALL process these
events after receiving a signal from the reactor's timer watcher. The timer
watcher SHALL have a configurable delta (time between wake-ups).

Upon waking up, the event dispatcher SHALL determine whether each queued event
is "of interest" and, if so, process it. Only the most recent event for a given
job SHALL be processed. Processing an event means that clients of the FLAN
server carry out the action they have specified, such as sending an email.
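
One possible reading of the selection rule is sketched here. It assumes the queue holds ``(jobid, event_name)`` pairs in arrival order and that "of interest" means the event name is in a per-job set of requested events. Both the data shapes and the name ``select_events`` are hypothetical.

```python
def select_events(queue, wanted):
    """Return {jobid: event_name} for the newest queued event of each job,
    keeping only events named in that job's `wanted` set."""
    latest = {}
    for jobid, name in queue:   # queue is in arrival order
        latest[jobid] = name    # later entries overwrite earlier ones
    return {
        jobid: name
        for jobid, name in latest.items()
        if name in wanted.get(jobid, set())
    }


queue = [(1, "submit"), (2, "start"), (1, "start"), (2, "finish"), (3, "submit")]
wanted = {1: {"start", "finish"}, 2: {"start", "finish"}}  # job 3 did not ask
selected = select_events(queue, wanted)
```

Under this reading, job 2's ``start`` is superseded by ``finish`` within the same batch, so only ``finish`` is processed for that job.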

On its initial run, the event dispatcher SHALL compare the events in its queue
against a record of "handled" events in the KVS and ignore those already
handled. Subsequent runs SHALL NOT start until the initial run has completed.
On each subsequent run, FLAN SHALL write a record of the events it has
processed to the KVS before the event dispatcher goes back to sleep.
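
The "handled" record could work roughly as follows. This sketch stands in a plain dict for the KVS and does not use any Flux API. The key name ``flan.handled``, the JSON encoding, and the function names are assumptions made for illustration.

```python
import json

HANDLED_KEY = "flan.handled"  # hypothetical key name


def load_handled(kvs):
    """Read the set of (jobid, event) pairs recorded as already handled."""
    raw = kvs.get(HANDLED_KEY)
    return {tuple(item) for item in json.loads(raw)} if raw else set()


def run_dispatcher(kvs, queue, handled, process):
    """Process unhandled events, then persist the record before sleeping."""
    for entry in queue:
        if entry in handled:
            continue            # already delivered before a restart
        process(entry)
        handled.add(entry)
    kvs[HANDLED_KEY] = json.dumps(sorted(handled))
    queue.clear()


kvs = {HANDLED_KEY: json.dumps([[1, "start"]])}  # left behind by an earlier run
sent = []
handled = load_handled(kvs)
run_dispatcher(kvs, [(1, "start"), (1, "finish")], handled, sent.append)
```

After a restart, the replayed ``start`` event for job 1 is skipped and only ``finish`` is delivered.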

Subsequent Responses
====================

Subsequent responses from the JoE SHALL be queued in the event dispatcher as
they arrive and processed when the dispatcher next wakes up. A record of the
jobspec, R, and eventlog SHALL be stored for each job, and the event dispatcher
SHALL remove the record when it receives the ``clean`` event for that job.
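
Record cleanup might look like the sketch below, which the dispatcher would call after processing a batch. The store layout (a dict keyed by jobid) and the name ``retire_cleaned_jobs`` are hypothetical.

```python
def retire_cleaned_jobs(store, batch):
    """Forget the stored record of every job whose `clean` event is in the batch."""
    retired = []
    for jobid, name in batch:
        # pop() tolerates a repeated or unknown jobid without raising.
        if name == "clean" and store.pop(jobid, None) is not None:
            retired.append(jobid)
    return retired


store = {
    7: {"jobspec": {}, "R": {}, "eventlog": ["submit", "start", "finish"]},
    8: {"jobspec": {}, "R": {}, "eventlog": ["submit"]},
}
batch = [(7, "finish"), (7, "clean"), (8, "start")]
retired = retire_cleaned_jobs(store, batch)
```

Removing records only at ``clean`` keeps memory use proportional to the number of active jobs.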

User Interface
**************

Users SHALL create notification-enabled jobs by specifying an attribute in their
job's jobspec. Jobspec attributes are defined in :doc:`spec_25`.

Basic Use Case
==============

Users SHALL add the following attribute to their jobspec:

.. code-block:: json

   {
     "attributes": {
       "system": {
         "notify": "default"
       }
     }
   }

The default behavior SHALL be to send a notification via email to the user
when the job reaches the ``start`` and ``finish`` events.

A future update to the jobspec API will make this jobspec attribute easily
accessible via a single argument to a command, ``--notify``.

Advanced Use Cases
==================

Only the basic use case SHALL be supported in v1.

The ``system.notify`` jobspec attribute SHALL accept a dictionary containing some
or all of the following keys:

.. code-block:: json

   {
     "attributes": {
       "system": {
         "notify": {
           "service": "slack",
           "handle": "elvis",
           "include": ["R", "eventlog", "return_code"],
           "states": ["start", "prolog_finish"]
         }
       }
     }
   }
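
Both forms of ``system.notify`` could be normalized into one settings dictionary, as in the sketch below. The defaults for ``service`` and ``states`` follow the basic use case. The defaults for ``handle`` (``None``, meaning the submitting user) and ``include`` (empty) are assumptions, and ``resolve_notify`` is a hypothetical helper, not part of any Flux API.

```python
DEFAULTS = {
    "service": "email",
    "handle": None,                 # assumption: None means the submitting user
    "include": [],                  # assumption: nothing extra by default
    "states": ["start", "finish"],
}


def resolve_notify(jobspec):
    """Return the effective notification settings, or None if not requested."""
    notify = jobspec.get("attributes", {}).get("system", {}).get("notify")
    if notify is None:
        return None                         # default: no notifications
    if notify == "default":
        return dict(DEFAULTS)
    if isinstance(notify, dict):
        unknown = set(notify) - set(DEFAULTS)
        if unknown:
            raise ValueError(f"unknown notify keys: {sorted(unknown)}")
        return {**DEFAULTS, **notify}       # "some or all" keys may be given
    raise ValueError(f"unsupported notify value: {notify!r}")


basic = resolve_notify({"attributes": {"system": {"notify": "default"}}})
advanced = resolve_notify(
    {"attributes": {"system": {"notify": {"service": "slack", "handle": "elvis"}}}}
)
```

Rejecting unknown keys early gives the user feedback at submission time rather than a silently missing notification.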

Edge Cases
**********

These edge cases MAY be supported in FLAN v1.

Expiration of notifications
===========================

In certain cases, a restart of the service may be delayed long enough that events
of interest for notification-enabled jobs are long past. FLAN MAY support an
"expiration" setting that suppresses final delivery of any notification once a
set amount of time has passed since the event.
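
An expiration check could be as simple as the following sketch. It assumes event timestamps and the expiration setting are both expressed in seconds, and that an unset expiration (``None``) disables the check. The name ``is_expired`` is hypothetical.

```python
def is_expired(event_timestamp, now, expiration):
    """True if the event is older than the expiration window, in seconds."""
    if expiration is None:      # assumption: no setting means never expire
        return False
    return (now - event_timestamp) > expiration


event_time = 1_700_000_000.0
stale = is_expired(event_time, now=event_time + 7200, expiration=3600)
fresh = is_expired(event_time, now=event_time + 60, expiration=3600)
unset = is_expired(event_time, now=event_time + 7200, expiration=None)
```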

2 changes: 2 additions & 0 deletions spell.en.pws
@@ -8,6 +8,8 @@ org
tarball
tarballs
adoc
Mattermost
JoE
api
env
github