rfc: add notification service design doc #414
base: master
Conversation
Force-pushed from a56d23d to 60db4cf.
Great start @wihobbs! I've left some comments mostly just looking at grammar; feel free to take them or leave them. Sorry in advance if I am misunderstanding some of the RFC language here!
Excellent!
I had a few minor suggestions about the flow of the document. Feel free to ignore if it doesn't make sense to you!
Just a side question: is the python driver typically going to be associated with the system instance, so we'd run this python driver as a side daemon?
Or is the python driver something the user would start with their subinstance? How would we expect them to generally start the process? Like adding a python myscript.py & in their batch file?
spec_44.rst
The default behavior SHALL be to send a notification to the users' primary email
address, as provided by an LDAP query, when the job reaches the START and FINISH
events.
I initially read this and thought this implementation was way too specific. But then I thought maybe you're just giving a use case example. Perhaps the words "The default behavior" threw me off. Maybe "For example, the default behavior could be ..."
The default behavior is supposed to be detailing what happens when a user says flux batch --setattr=system.notify=default. The eventual goal is to support doing crazy things like --setattr=system.notify.service=slack --setattr=system.notify.events=validate,epilog-start --setattr=system.notify.include="{id.f58} {R} {eventlog}", etc., so I wanted to give the behavior in the general case (which is probably all we'll actually support in v1).
I do think that maybe specifying LDAP is a tad specific, though I've generally tried to be as specific as possible in this RFC since it's more of a design document for a service than a traditional RFC...
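To make the attribute handling concrete, here is a rough sketch of how the driver could pull these (entirely hypothetical) system.notify settings out of a jobspec dict; the key names and defaults are assumptions for illustration, not a settled interface:
import json

def parse_notify_attrs(jobspec):
    """Pull the hypothetical system.notify settings out of a jobspec dict."""
    notify = jobspec.get("attributes", {}).get("system", {}).get("notify")
    if notify is None:
        return None  # notifications were not requested for this job
    if isinstance(notify, str):
        # --setattr=system.notify=default: email on START and FINISH
        return {"service": "email", "events": ["start", "finish"]}
    # richer (hypothetical) form: system.notify.{service,events,include} sub-keys
    return {
        "service": notify.get("service", "email"),
        "events": notify.get("events", "start,finish").split(","),
        "include": notify.get("include", "{id.f58}"),
    }

print(parse_notify_attrs(json.loads('{"attributes": {"system": {"notify": "default"}}}')))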
We may want to think through the ergonomics of requesting notifications before codifying them into an RFC. Requiring a user to specify multiple --setattr options on the command line might not go over that well.
On second thought, it would be fine to have something in here; it can always be updated in the future. However, just keep in the back of your mind that users would appreciate a single --setattr, so perhaps comma-delimited options or other ways to provide options in a single go should be considered.
For the longer term, maybe a design that is readable and simple, something like --notify.service=slack, that I can read from left to right.
The idea was that a sysadmin would load the jobtap plugin (maybe just include it in the config?) and then start the Python driver under systemd on the management node. Still TBD is how a user would be able to start this service themselves to get notifications for batch jobs in sub-instances (which is currently how some DATs are organized).
Force-pushed from e227539 to a350340.
Ok, I just took a quick first pass 👍
spec_44.rst
The ``notify.enable`` request has no payload.

At initialization the python driver SHALL create a kvs subdirectory, ``notify``.
What if the directory already exists? :-)
The Python driver can traverse the existing directory for jobid keys.
If the jobtap plugin could insert and remove jobid keys itself, that would eliminate the need for the record of jobids in the jobtap plugin ("hash table"). The jobtap plugin could remove the key in an INACTIVE callback and add it in DEPEND. Currently the KVS is only being used to ensure uniqueness of the notification being sent.
Thoughts on this design?
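Here is a rough sketch of the startup traversal I have in mind for the driver. The flux.kvs helper names (exists, get_dir, put_mkdir, commit) are written from memory and should be checked against the actual flux-core Python bindings; the "notify" directory name is just the draft's convention:
import flux
import flux.kvs

NOTIFY_DIR = "notify"  # KVS subdirectory from the draft

def startup_scan(h):
    """Return the jobids that already have notification records in the KVS."""
    known = set()
    if flux.kvs.exists(h, NOTIFY_DIR):
        # Walk the existing directory instead of recreating it, so a restarted
        # driver picks up jobs that were registered before it went down.
        for name in flux.kvs.get_dir(h, NOTIFY_DIR).keys():
            known.add(name)
    else:
        # First start: create the directory so later per-job keys have a home.
        flux.kvs.put_mkdir(h, NOTIFY_DIR)
        flux.kvs.commit(h)
    return known

known_jobids = startup_scan(flux.Flux())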
spec_44.rst
- Support notification after any event of the job, where events are defined in
  :doc:`spec_21`.
- Support email as the primary end user notification delivery.
- Build a driver capable of sending POST requests to chat services for
Where do you see tokens or credentials for these posts going? I assume it's on the level of the user? So the user needs to set up their own slack app to make it work? But what about an organization or group - how would that be handled, permissions-wise, for the credentials?
Actually, I had envisioned credentials/tokens being stored in a config file on the management node. The only thing a user would need to pass to the jobspec would be their handle, e.g. hobbs17 on the LLNL slack. The slack admins would set up a bot to handle delivering the message to the user, and the sysadmins would store the webhooks for this bot in a toml file on the management node.
An edge case I haven't considered yet is "what if we need to support multiple 'teams' on Slack," e.g. LLNL and llnl-performance.
I should probably go read the Slack/Mattermost API docs and make sure this approach will actually work. I've built very little with MM and nothing with Slack...
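For the record, the rough shape I'm imagining is sketched below. The file path, key names, and payload are hypothetical, not a settled interface; it just codes up the "secrets live on the management node, the user only supplies a handle" idea using a Slack-style incoming webhook body:
import json
import tomllib          # Python 3.11+
import urllib.request

def send_chat_notification(handle, text, config_path="/etc/flux/notify.toml"):
    # Sysadmin-owned config maps a service name to its webhook secret.
    with open(config_path, "rb") as f:
        cfg = tomllib.load(f)
    webhook = cfg["slack"]["webhook_url"]
    # The only user-supplied piece is their chat handle.
    payload = {"text": f"@{handle}: {text}"}
    req = urllib.request.Request(
        webhook,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)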
In any event I would probably place the configuration and location of third-party tokens, etc. outside the scope of this document, which should focus on how the service monitors jobs for events of interest to turn them into generic notifications (i.e. the protocol between the Python service and a jobtap plugin, how restarts are handled, etc.).
➕
spec_44.rst
.. note::
   This design is intended to ensure that no double-notifications are sent upon
   the restart of the Python script, the jobtap plugin, or the job-manager.
Unless the directory deletion is missed?
spec_44.rst
If a disconnect request is received by the jobtap plugin, this indicates the
python driver has exited. The jobtap plugin SHALL continue to add notification-
enabled jobs to its hash table as they enter the DEPEND state. When the python
driver reconnects, the jobtap plugin SHALL respond to its initial ``notify.enable``
What if it doesn't reconnect?
"reconnect" is probably the wrong word. The Python driver if disconnected from the jobtap plugin will exit immediately and log an error. Eventually, systemd or an admin will start a new Python driver which sends a new initial notify.enable
RPC request. No notifications can be sent while there's not an active Python driver process.
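Roughly what I picture the driver's startup/exit behavior looking like is sketched below. This is a sketch only: the topic string job-manager.notify.enable and the empty payload are assumptions based on the draft, and the flux Python calls (Flux.rpc with FLUX_RPC_STREAMING, Future.then/reset, reactor_run) should be double-checked against the bindings:
import sys
import flux
import flux.constants

def notify_cb(future):
    try:
        job = future.get()      # next notification-enabled jobid from the plugin
        print("watching job", job)
        future.reset()          # re-arm for the next streaming response
    except OSError as exc:
        # Any error or disconnect from the jobtap plugin is fatal: log and exit
        # so systemd or an admin can start a fresh driver, which will send a new
        # notify.enable request.
        print(f"lost connection to jobtap plugin: {exc}", file=sys.stderr)
        sys.exit(1)

h = flux.Flux()
rpc = h.rpc("job-manager.notify.enable",
            flags=flux.constants.FLUX_RPC_STREAMING)
rpc.then(notify_cb)
h.reactor_run()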
My question then is - what if the admin doesn't start a new Python driver? The hash table keeps filling up and explodes?
This is a really good point. I hadn't thought of this. Maybe the jobtap plugin should remove jobids when they reach an inactive state... the notifications would be missed, though. Hmm. 🤔
There might be a way to log a warning reminding the user/admins to start the Python driver, but I'll have to look into that more.
Let's add this to the list of things to investigate. It shouldn't be an issue to keep a hash/list of all jobids with notifications, but we should not assume there is not a better way at this point in the design.
Given this, my question above is becoming a bit more urgent. Is there a reason to keep these things separate? If so, is there a reason the jobtap plugin shouldn't manage the lifetime of the driver as a subprocess or systemd unit?
spec_44.rst
-------------------------

The webhooks and other secrets required to connect to chat services SHALL be included
in a ``config.toml`` file. The path to this file MUST be provided to the FLAN
I would either place this in a directory that has a meaningful path (for the developer to see) or give it a name that is more specific and describes what it is. And I am wondering who has ownership of this file - a single user, or whether a center needs to take responsibility, for example, to provide a single credential for all users to use (is that a good idea)?
Likely both use cases could be wanted (the "I am a user with my own custom plugin and private credential" and the "I am a center providing this to my users").
Oh, maybe for the "center providing for users" case this needs an /etc/flux/notify.d/ or similar. Probably an implementation detail outside this RFC, but I had not thought about standardizing the config file path. Good idea!
spec_44.rst
In the event the job-manager crashes or is shut down the python driver SHALL exit
immediately and log an error.

Flux does not currently support restarting with running jobs. However, on a system
Trying to restore state sounds messy - is there any reason to not recreate notifications on restart so there aren't any mismatches between state brought up and state known to the plugin?
Notifications would be recreated internally in the Python driver, since losing its connection to Flux is a fatal error (and that would happen in a system-instance restart).
Restoring state is messy...but I want to try for it anyway :). Here's an idea for how:
- Python driver crashes because job-manager is shut down.
- Job-manager is restarted (and so is the KVS). KVS comes back; jobtap plugin is loaded on start. All job eventlogs are replayed.
- All of the jobtap plugin depend callbacks are hit again. Jobtap plugin records these jobids.
- A "new" Python driver starts and gets a bunch of jobids sent to it. It watches the events [that may be somewhat stale.] (This was @chu11's point in a message above.) It does not add jobid folders to the KVS because they are already there.
- Before it sends a notification, it checks the KVS notify.<jobid> folder to see if the __ (event) key is there. If the key is not there, that means a notification has already been sent. (On reflection, maybe it should be flipped -- i.e. if the key is there, that means a notification has already been sent; see the sketch after this list.)
- Stop! Do not pass go, do not collect $200! Because we know the job is being watched, but a notification has already been sent, keep chugging and watching for others that may not have been sent.
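Here's the check from those last two bullets as a sketch, using the flipped convention (key present means "already sent"). The flux.kvs call names are assumptions to be verified against the bindings, and send_notification() is a hypothetical helper:
import flux
import flux.kvs

def maybe_notify(h, jobid, event, send_notification):
    key = f"notify.{jobid}.{event}"
    if flux.kvs.exists(h, key):
        # Already notified for this event before the restart: skip it, but
        # keep watching the job for later events that haven't been sent yet.
        return False
    send_notification(jobid, event)
    flux.kvs.put(h, key, 1)   # record the send so a future replay skips it
    flux.kvs.commit(h)
    return True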
It is a bit unclear here what is meant by the "depend callbacks are hit again".
Note that there are two jobtap callbacks related to dependencies and the DEPEND state:
- job.state.depend: One of the generic job.state.* callbacks, called when the job reaches the DEPEND state.
- job.dependency.SCHEME: Called for each dependency SCHEME in the jobspec of a submitted job.
I don't think the job.state.* callbacks are called on reload. You can test this by writing a jobtap plugin that subscribes to all these states and prints what it sees on a restart.
The job.dependency.SCHEME callbacks are specifically issued on a restart for jobs with dependencies defined in jobspec.
I know we talked about this before, but I don't think I properly understood that you were talking about one of these callback types and not the other.
Probably the best way for plugins to be notified of new jobs is to use the job.new callback, which is called for new jobs after they've been validated, on plugin reload, and on job manager restart.
For lack of a better place to document this (tagging @garlick as well):
It appears the job.state.priority callback is re-triggered when flux start is run in recovery mode with pending jobs, but no other job.state.* callbacks.
#include <flux/core.h>
#include <flux/jobtap.h>
#include <stdio.h>
#include <jansson.h>
#include <stdlib.h>

/* Log every job.state.* callback the plugin sees, along with the eventlog
 * entry that triggered it. */
static int print_cb (flux_plugin_t *p,
                     const char *topic,
                     flux_plugin_arg_t *args,
                     void *arg)
{
    flux_jobid_t id;
    long unsigned int state;
    flux_t *h = flux_jobtap_get_flux (p);
    json_t *event;

    if (flux_plugin_arg_unpack (args,
                                FLUX_PLUGIN_ARG_IN,
                                "{s:I s:i s:o}",
                                "id", &id,
                                "state", &state,
                                "entry", &event) < 0)
        return -1;
    if (id) {
        char *json_string = json_dumps (event, JSON_ENCODE_ANY);
        flux_log (h, 0, "jobid %ld has entered state %lu %s\n",
                  id, state, json_string);
        free (json_string);  /* json_dumps() allocates */
    }
    return 0;
}

static const struct flux_plugin_handler tab[] = {
    {"job.state.*", print_cb, NULL},
    {0},
};

int flux_plugin_init (flux_plugin_t *p)
{
    if (flux_plugin_register (p, "jobtap-print", tab) < 0)
        return -1;
    return 0;
}
And on shutdown/restart:
auk108 ~/scripts/jobtap-test $ flux start -o,--config-path=$(pwd)/config.toml
(s=1,d=0) auk108 ~/scripts/jobtap-test $ flux submit --urgency=0 hostname
May 14 15:19:03.251059 job-manager.emerg[0]: jobid 330880253952 has entered state 140651589009410 {"timestamp": 1715725143.2404392, "name": "validate"}
May 14 15:19:03.251083 job-manager.emerg[0]: jobid 330880253952 has entered state 140651589009412 {"timestamp": 1715725143.2510674, "name": "depend"}
May 14 15:19:03.251104 job-manager.emerg[0]: jobid 330880253952 has entered state 140651589009416 {"timestamp": 1715725143.251091, "name": "priority", "context": {"priority": 0}}
ƒ9h7knoh
(s=1,d=0) auk108 ~/scripts/jobtap-test $ flux shutdown --dump=$(pwd)/dump.tar
auk108 ~/scripts/jobtap-test $ flux start -o,--config-path=$(pwd)/config.toml --recovery=$(pwd)/dump.tar
May 14 15:22:46.751219 job-manager.emerg[0]: jobid 330880253952 has entered state 139796890517508 {"timestamp": 1715725366.7511699, "name": "flux-restart"}
May 14 15:22:46.751270 job-manager.emerg[0]: jobid 330880253952 has entered state 139796890517512 {"timestamp": 1715725366.7512314, "name": "priority", "context": {"priority": 0}}
+-----------------------------------------------------
| Entering Flux recovery mode.
| All resources will be offline during recovery.
| Any rc1 failures noted above may result in
| reduced functionality until manually corrected.
|
| broker.rc1_path /home/hobbs17/flux-core/../bin/etc/flux/rc1
| broker.rc3_path /home/hobbs17/flux-core/../bin/etc/flux/rc3
| config.path /home/hobbs17/scripts/jobtap-test/config.toml
| statedir changes will not be preserved
|
| Exit this shell when finished.
+-----------------------------------------------------
I accept the suggestion to tie the streaming RPC response to job.new, which is replayed on restart (tweaked the test above and verified).
We will have to make the ability to restart the service dependent on a requirement that this plugin be included in the job-manager's plugins table in the config.
job.new is also called for every active job when a plugin is loaded (on the assumption that this is the method that "introduces" jobs to jobtap plugins). This should allow the plugin to be reloaded (e.g. to get a new version) and get back to the same place it was before it was unloaded.
Nice job on the test plugin, btw. You could tweak it to register callbacks for * if you want to see everything.
Glad I posted this because I lost it in my home directory somewhere over the last 3 months!
Trying to restore state sounds messy
And this is correct. A thought for the new design based on the job-manager journal of events -- I wonder if always storing the most recently received event would be enough to restore state? This could be stored somewhere persistent and overwritten each time a new event was received. Then, on a restart, you could inspect that event and throw out any event that took place before it. I'll think on this more...
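Something like this toy sketch is what I mean by "store only the most recent event". The state-file path is arbitrary, and the event is assumed to be a dict carrying a timestamp field, as in the journal entries:
import json
import os

STATE_FILE = "/var/lib/flux-notify/last-event.json"

def record_event(event):
    """Overwrite the stored timestamp each time a new event is handled."""
    os.makedirs(os.path.dirname(STATE_FILE), exist_ok=True)
    with open(STATE_FILE, "w") as f:
        json.dump({"timestamp": event["timestamp"]}, f)

def is_stale(event):
    """On restart, throw out replayed events at or before the stored timestamp."""
    try:
        with open(STATE_FILE) as f:
            last = json.load(f)["timestamp"]
    except (FileNotFoundError, KeyError, ValueError):
        return False
    return event["timestamp"] <= last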
With a dispatcher handling actual notification transmission there's a chance a few notifications would be missed on an abrupt shutdown, but maybe that's not so bad for a first pass.
Note that the ^ push is mostly grammar/clarity fixes and I'm still working through the [excellent, but involved :) ] design feedback.
@wihobbs you've probably already exceeded most of us in knowledge of jobtap plugins, but in case you are interested we are having a little (hopefully fun) hackathon this week on Thursday - grandmaster @trws is going to show us the ropes for making plugins, and ❗ I think there might be rust involved! If you are interested, even for the fun or sharing your expertise, I can send along the invite.
Force-pushed from b6366cc to 2883259.
Tons of great feedback on this. Thanks to everybody for participating and sharing your expertise. I tried to close or consolidate duplicate issues. I apologize in advance if something was inadvertently closed -- please reopen if so. I'm making a high-level summary of the outstanding design todos:
spec_44.rst
`flux-jobtap-plugins(7) <https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man7/flux-jobtap-plugins.html>`_
that streams the jobids of notification-enabled jobs to the Python driver.

The Python driver
Is this documenting a thing that already exists? If not, is there a reason these two should be separate? If it's a language issue, a python-based jobtap plugin shouldn't be an issue.
It's largely because the jobtap plugin resides in the job manager and we want it to do as little as possible since it would be in the critical path.
The jobtap plugin (at the time) seemed like the most efficient way to avoid having to fetch and examine the jobspec of every job since it can just peek at the jobspec that the job manager already has in memory and send the jobid along to the Python process only when the user has requested notification.
However, I'm glad you asked because I wonder if there's a better way to do this using the job-manager journal or some other more widely applicable method. It sure would be nice if it was easy for people to develop services like this and not have to install yet another jobtap plugin to identify jobs of interest.
Something to consider...
Ah, that's a good point. Blocking the job manager would be unfortunate.
The journal is a good thought, or maybe even a pub-sub event from the plugin so anyone could subscribe to it? Not sure, there are definitely tradeoffs here, but it sounds like the current approach has some potential to cause lost or repeated notifications as well, especially racing with the event delivery and kvs removal vs jobtap replay if flux is restarted.
I need to reread some of the previous comments before asking for another set of reviews. This is a pretty substantial update to the proposed design. Some concerns will stay and some will go, but a major advantage of subscribing to the journal of events is that it absolutely cannot block the job manager, because it's a completely separate process. I also noticed that we've ended up in a race over who gets RFC no. 44 😉😆
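For the record, the rough shape of a journal-based driver is sketched below. This is heavily hedged: the topic string job-manager.events-journal, the request payload, and the response fields are assumptions from memory and need to be checked against the flux-core journal protocol (or any flux.job journal helpers, if available) before relying on them:
import flux
import flux.constants

def journal_cb(future):
    resp = future.get()                  # assumed shape: {"id": ..., "events": [...]}
    jobid = resp.get("id")
    for entry in resp.get("events", []):
        # Filter for the events a notification was requested on.
        if entry.get("name") in ("start", "finish"):
            print(f"job {jobid}: {entry['name']} at {entry['timestamp']}")
    future.reset()                       # wait for the next journal response

h = flux.Flux()
rpc = h.rpc(
    "job-manager.events-journal",
    {"full": True},                      # assumed: replay events for existing jobs
    flags=flux.constants.FLUX_RPC_STREAMING,
)
rpc.then(journal_cb)
h.reactor_run()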
Force-pushed from b364b0d to 082ef36.
Problem: no design currently exists for the Flux email service as noted in flux-framework/flux-core#4435. Add an RFC-style document detailing this.
Force-pushed from d775154 to 066fc56.
Ideas that were discussed, but not included in the RFC:
Haven't read through the new RFC yet, but just as a counterpoint here: There are some nice benefits to using an event in the eventlog. For example when jobs are purged those events won't appear on the next restart (the notification service will have to purge its own records otherwise, adding unnecessary complexity). Also, you're already getting all job events through the journal, so this would limit the number of sources the notification service has to deal with, again simplifying the implementation. That being said, I think we discussed that perhaps an event posted by the notification service would appear in the job eventlog after it is needed (e.g. the
This is an RFC-style design document being put up to document the design we have for the flux job notification service.
I'm looking to get feedback on the clarity of this document and make sure we have as many edge cases as possible covered. If this isn't an appropriate RFC for the whole project it can be moved somewhere else, but the github PR interface provides a nice way for us to engage in a feedback loop.
I wanted to add a flow chart detailing some of the interactions but am still working on that. I think there's enough here for a first pass review, though.