rfc: add notification service design doc #414
base: master
Conversation
Force-pushed from a56d23d to 60db4cf.
Great start @wihobbs! I've left some comments mostly just looking at grammar; feel free to take them or leave them. Sorry in advance if I am misunderstanding some of the RFC language here!
Excellent!
I had a few minor suggestions about the flow of the document. Feel free to ignore if it doesn't make sense to you!
Just a side question: is the python driver typically going to be associated with the system instance, so we'd run this python driver as a side daemon?
Or is the python driver something the user would start with their subinstance? How would we expect them to generally start the process? Like adding a python myscript.py & in their batch file?
spec_44.rst
The default behavior SHALL be to send a notification to the users' primary email
address, as provided by an LDAP query, when the job reaches the START and FINISH
events.
I initially read this and thought this implementation was way too specific. But then I thought maybe you're just giving a use case example. Perhaps the words "The default behavior" threw me off. Maybe "For example, the default behavior could be ..."
The default behavior is supposed to be detailing what happens when a user says flux batch --setattr=system.notify=default. The eventual goal is to support doing crazy things like --setattr=system.notify.service=slack --setattr=system.notify.events=validate,epilog-start --setattr=system.notify.include="{id.f58} {R} {eventlog}", etc., so I wanted to give the behavior in the general case (which is probably all we'll actually support in v1).
I do think that maybe specifying LDAP is a tad specific, though I've generally tried to be as specific as possible in this RFC since it's more of a design document for a service than a traditional RFC...
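To make the attribute handling concrete, here is a rough sketch of how the driver could pull these (entirely hypothetical) system.notify settings out of a jobspec dict; the key names and defaults are assumptions for illustration, not a settled interface:
import json

def parse_notify_attrs(jobspec):
    """Pull the hypothetical system.notify settings out of a jobspec dict."""
    notify = jobspec.get("attributes", {}).get("system", {}).get("notify")
    if notify is None:
        return None  # notifications were not requested for this job
    if isinstance(notify, str):
        # --setattr=system.notify=default: email on START and FINISH
        return {"service": "email", "events": ["start", "finish"]}
    # richer (hypothetical) form: system.notify.{service,events,include} sub-keys
    return {
        "service": notify.get("service", "email"),
        "events": notify.get("events", "start,finish").split(","),
        "include": notify.get("include", "{id.f58}"),
    }

print(parse_notify_attrs(json.loads('{"attributes": {"system": {"notify": "default"}}}')))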
We may want to think through the ergonomics of requesting notifications before codifying them into an RFC. Requiring a user to specify multiple --setattr options on the command line might not go over that well.
On second thought, it would be fine to have something in here; it can always be updated in the future. However, just keep in the back of your mind that users would appreciate a single --setattr, so perhaps comma-delimited options or other ways to provide options in a single go should be considered.
For the longer term, maybe a design that is readable and simple, something like --notify.service=slack, that I can read from left to right.
The idea was that a sysadmin would load the jobtap plugin (maybe just include it in the config?) and then start the Python driver under systemd on the management node. Still TBD is how a user would be able to start this service themselves to get notifications for batch jobs in sub-instances (which is currently how some DATs are organized).
Force-pushed from e227539 to a350340.
Ok, I just took a quick first pass 👍
spec_44.rst
The ``notify.enable`` request has no payload.

At initialization the python driver SHALL create a kvs subdirectory, ``notify``.
What if the directory already exists? :-)
The Python driver can traverse the existing directory for jobid keys.
If the jobtap plugin could insert and remove jobid keys itself, that would eliminate the need for the record of jobids in the jobtap plugin ("hash table"). The jobtap plugin could remove the key in an INACTIVE callback and add it in DEPEND. Currently the KVS is only being used to ensure uniqueness of the notification being sent.
Thoughts on this design?
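Here is a rough sketch of the startup traversal I have in mind for the driver. The flux.kvs helper names (exists, get_dir, put_mkdir, commit) are written from memory and should be checked against the actual flux-core Python bindings; the "notify" directory name is just the draft's convention:
import flux
import flux.kvs

NOTIFY_DIR = "notify"  # KVS subdirectory from the draft

def startup_scan(h):
    """Return the jobids that already have notification records in the KVS."""
    known = set()
    if flux.kvs.exists(h, NOTIFY_DIR):
        # Walk the existing directory instead of recreating it, so a restarted
        # driver picks up jobs that were registered before it went down.
        for name in flux.kvs.get_dir(h, NOTIFY_DIR).keys():
            known.add(name)
    else:
        # First start: create the directory so later per-job keys have a home.
        flux.kvs.put_mkdir(h, NOTIFY_DIR)
        flux.kvs.commit(h)
    return known

known_jobids = startup_scan(flux.Flux())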
spec_44.rst
- Support notification after any event of the job, where events are defined in
  :doc:`spec_21`.
- Support email as the primary end user notification delivery.
- Build a driver capable of sending POST requests to chat services for
Where do you see tokens or credentials for these posts going? I assume it's on the level of the user? So the user needs to set up their own slack app to make it work? But what about an organization or group - how would that be handled, permissions-wise, for the credentials?
Actually, I had envisioned credentials/tokens being stored in a config file on the management node. The only thing a user would need to pass to the jobspec would be their handle, e.g. hobbs17 on the LLNL slack. The slack admins would set up a bot to handle delivering the message to the user, and the sysadmins would store the webhooks for this bot in a toml file on the management node.
An edge case I haven't considered yet is "what if we need to support multiple 'teams' on Slack," e.g. LLNL and llnl-performance.
I should probably go read the Slack/Mattermost API docs and make sure this approach will actually work. I've built very little with MM and nothing with Slack...
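For the record, the rough shape I'm imagining is sketched below. The file path, key names, and payload are hypothetical, not a settled interface; it just codes up the "secrets live on the management node, the user only supplies a handle" idea using a Slack-style incoming webhook body:
import json
import tomllib          # Python 3.11+
import urllib.request

def send_chat_notification(handle, text, config_path="/etc/flux/notify.toml"):
    # Sysadmin-owned config maps a service name to its webhook secret.
    with open(config_path, "rb") as f:
        cfg = tomllib.load(f)
    webhook = cfg["slack"]["webhook_url"]
    # The only user-supplied piece is their chat handle.
    payload = {"text": f"@{handle}: {text}"}
    req = urllib.request.Request(
        webhook,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)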
In any event I would probably place the configuration and location of third-party tokens, etc. outside the scope of this document, which should focus on how the service monitors jobs for events of interest to turn them into generic notifications (i.e. the protocol between the Python service and a jobtap plugin, how restarts are handled, etc.).
➕
spec_44.rst
.. note::
   This design is intended to ensure that no double-notifications are sent upon
   the restart of the Python script, the jobtap plugin, or the job-manager.
Unless the directory deletion is missed?
spec_44.rst
If a disconnect request is received by the jobtap plugin, this indicates the
python driver has exited. The jobtap plugin SHALL continue to add notification-
enabled jobs to its hash table as they enter the DEPEND state. When the python
driver reconnects, the jobtap plugin SHALL respond to its initial ``notify.enable``
What if it doesn't reconnect?
"reconnect" is probably the wrong word. The Python driver if disconnected from the jobtap plugin will exit immediately and log an error. Eventually, systemd or an admin will start a new Python driver which sends a new initial notify.enable
RPC request. No notifications can be sent while there's not an active Python driver process.
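Roughly what I picture the driver's startup/exit behavior looking like is sketched below. This is a sketch only: the topic string job-manager.notify.enable and the empty payload are assumptions based on the draft, and the flux Python calls (Flux.rpc with FLUX_RPC_STREAMING, Future.then/reset, reactor_run) should be double-checked against the bindings:
import sys
import flux
import flux.constants

def notify_cb(future):
    try:
        job = future.get()      # next notification-enabled jobid from the plugin
        print("watching job", job)
        future.reset()          # re-arm for the next streaming response
    except OSError as exc:
        # Any error or disconnect from the jobtap plugin is fatal: log and exit
        # so systemd or an admin can start a fresh driver, which will send a new
        # notify.enable request.
        print(f"lost connection to jobtap plugin: {exc}", file=sys.stderr)
        sys.exit(1)

h = flux.Flux()
rpc = h.rpc("job-manager.notify.enable",
            flags=flux.constants.FLUX_RPC_STREAMING)
rpc.then(notify_cb)
h.reactor_run()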
My question then is - what if the admin doesn't start a new Python driver? The hash table keeps filling up and explodes?
This is a really good point. I hadn't thought of this. Maybe the jobtap plugin should remove jobids when they reach an inactive state... the notifications would be missed, though. Hmm. 🤔
There might be a way to log a warning reminding the user/admins to start the Python driver, but I'll have to look into that more.
Let's add this to the list of things to investigate. It shouldn't be an issue to keep a hash/list of all jobids with notifications, but we should not assume there is not a better way at this point in the design.
Given this, my question above is becoming a bit more urgent. Is there a reason to keep these things separate? If so, is there a reason the jobtap plugin shouldn't manage the lifetime of the driver as a subprocess or systemd unit?
spec_44.rst
-------------------------

The webhooks and other secrets required to connect to chat services SHALL be included
in a ``config.toml`` file. The path to this file MUST be provided to the FLAN
I would either place this in a directory that has a meaningful path (for the developer to see) or give it a name that is more specific and describes what it is. And I am wondering who has ownership of this file - a single user, or whether a center needs to take responsibility, for example, to provide a single credential for all users to use (is that a good idea)?
Likely both use cases could be wanted (the "I am a user with my own custom plugin and private credential" and the "I am a center providing this to my users").
Oh, maybe for the "center providing for users" case this needs an /etc/flux/notify.d/ or similar. Probably an implementation detail outside this RFC, but I had not thought about standardizing the config file path. Good idea!
spec_44.rst
In the event the job-manager crashes or is shut down the python driver SHALL exit
immediately and log an error.

Flux does not currently support restarting with running jobs. However, on a system
Trying to restore state sounds messy - is there any reason to not recreate notifications on restart so there aren't any mismatches between state brought up and state known to the plugin?
Notifications would be recreated internally in the Python driver, since losing its connection to Flux is a fatal error (and that would happen in a system-instance restart).
Restoring state is messy...but I want to try for it anyway :). Here's an idea for how:
- Python driver crashes because job-manager is shut down.
- Job-manager is restarted (and so is the KVS). KVS comes back; jobtap plugin is loaded on start. All job eventlogs are replayed.
- All of the jobtap plugin depend callbacks are hit again. Jobtap plugin records these jobids.
- A "new" Python driver starts and gets a bunch of jobids sent to it. It watches the events [that may be somewhat stale.] (This was @chu11's point in a message above.) It does not add jobid folders to the KVS because they are already there.
- Before it sends a notification, it checks the KVS notify.<jobid> folder to see if the __ (event) key is there. If the key is not there, that means a notification has already been sent. (On reflection, maybe it should be flipped -- i.e. if the key is there, that means a notification has already been sent; see the sketch after this list.)
- Stop! Do not pass go, do not collect $200! Because we know the job is being watched, but a notification has already been sent, keep chugging and watching for others that may not have been sent.
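Here's the check from those last two bullets as a sketch, using the flipped convention (key present means "already sent"). The flux.kvs call names are assumptions to be verified against the bindings, and send_notification() is a hypothetical helper:
import flux
import flux.kvs

def maybe_notify(h, jobid, event, send_notification):
    key = f"notify.{jobid}.{event}"
    if flux.kvs.exists(h, key):
        # Already notified for this event before the restart: skip it, but
        # keep watching the job for later events that haven't been sent yet.
        return False
    send_notification(jobid, event)
    flux.kvs.put(h, key, 1)   # record the send so a future replay skips it
    flux.kvs.commit(h)
    return True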
It is a bit unclear here what is meant by the "depend callbacks are hit again".
Note that there are two jobtap callbacks related to dependencies and the DEPEND state:
- job.state.depend: One of the generic job.state.* callbacks, called when the job reaches the DEPEND state.
- job.dependency.SCHEME: Called for each dependency SCHEME in the jobspec of a submitted job.
I don't think the job.state.* callbacks are called on reload. You can test this by writing a jobtap plugin that subscribes to all these states and prints what it sees on a restart.
The job.dependency.SCHEME callbacks are specifically issued on a restart for jobs with dependencies defined in jobspec.
I know we talked about this before, but I don't think I properly understood that you were talking about one of these callback types and not the other.
Probably the best way for plugins to be notified of new jobs is to use the job.new callback, which is called for new jobs after they've been validated, on plugin reload, and on job manager restart.
For lack of a better place to document this (tagging @garlick as well):
It appears the job.state.priority callback is re-triggered when flux start is run in recovery mode with pending jobs, but no other job.state.* callbacks.
#include <flux/core.h>
#include <flux/jobtap.h>
#include <stdio.h>
#include <jansson.h>
#include <stdlib.h>

/* Log every job.state.* callback the plugin sees, along with the eventlog
 * entry that triggered it. */
static int print_cb (flux_plugin_t *p,
                     const char *topic,
                     flux_plugin_arg_t *args,
                     void *arg)
{
    flux_jobid_t id;
    long unsigned int state;
    flux_t *h = flux_jobtap_get_flux (p);
    json_t *event;

    if (flux_plugin_arg_unpack (args,
                                FLUX_PLUGIN_ARG_IN,
                                "{s:I s:i s:o}",
                                "id", &id,
                                "state", &state,
                                "entry", &event) < 0)
        return -1;
    if (id) {
        char *json_string = json_dumps (event, JSON_ENCODE_ANY);
        flux_log (h, 0, "jobid %ld has entered state %lu %s\n",
                  id, state, json_string);
        free (json_string);  /* json_dumps() allocates */
    }
    return 0;
}

static const struct flux_plugin_handler tab[] = {
    {"job.state.*", print_cb, NULL},
    {0},
};

int flux_plugin_init (flux_plugin_t *p)
{
    if (flux_plugin_register (p, "jobtap-print", tab) < 0)
        return -1;
    return 0;
}
And on shutdown/restart:
auk108 ~/scripts/jobtap-test $ flux start -o,--config-path=$(pwd)/config.toml
(s=1,d=0) auk108 ~/scripts/jobtap-test $ flux submit --urgency=0 hostname
May 14 15:19:03.251059 job-manager.emerg[0]: jobid 330880253952 has entered state 140651589009410 {"timestamp": 1715725143.2404392, "name": "validate"}
May 14 15:19:03.251083 job-manager.emerg[0]: jobid 330880253952 has entered state 140651589009412 {"timestamp": 1715725143.2510674, "name": "depend"}
May 14 15:19:03.251104 job-manager.emerg[0]: jobid 330880253952 has entered state 140651589009416 {"timestamp": 1715725143.251091, "name": "priority", "context": {"priority": 0}}
ƒ9h7knoh
(s=1,d=0) auk108 ~/scripts/jobtap-test $ flux shutdown --dump=$(pwd)/dump.tar
auk108 ~/scripts/jobtap-test $ flux start -o,--config-path=$(pwd)/config.toml --recovery=$(pwd)/dump.tar
May 14 15:22:46.751219 job-manager.emerg[0]: jobid 330880253952 has entered state 139796890517508 {"timestamp": 1715725366.7511699, "name": "flux-restart"}
May 14 15:22:46.751270 job-manager.emerg[0]: jobid 330880253952 has entered state 139796890517512 {"timestamp": 1715725366.7512314, "name": "priority", "context": {"priority": 0}}
+-----------------------------------------------------
| Entering Flux recovery mode.
| All resources will be offline during recovery.
| Any rc1 failures noted above may result in
| reduced functionality until manually corrected.
|
| broker.rc1_path /home/hobbs17/flux-core/../bin/etc/flux/rc1
| broker.rc3_path /home/hobbs17/flux-core/../bin/etc/flux/rc3
| config.path /home/hobbs17/scripts/jobtap-test/config.toml
| statedir changes will not be preserved
|
| Exit this shell when finished.
+-----------------------------------------------------
I accept the suggestion to tie the streaming RPC response to job.new, which is replayed on restart (tweaked the test above and verified).
We will have to make the ability to restart the service dependent on a requirement that this plugin be included in the job-manager's plugins table in the config.
job.new is also called for every active job when a plugin is loaded (on the assumption that this is the method that "introduces" jobs to jobtap plugins). This should allow the plugin to be reloaded (e.g. to get a new version) and get back to the same place it was before it was unloaded.
Nice job on the test plugin, btw. You could tweak it to register callbacks for * if you want to see everything.
Glad I posted this because I lost it in my home directory somewhere over the last 3 months!
Trying to restore state sounds messy
And this is correct. A thought for the new design based on the job-manager journal of events -- I wonder if always storing the most recently received event would be enough to restore state? This could be stored somewhere persistent and overwritten each time a new event was received. Then, on a restart, you could inspect that event and throw out any event that took place before it. I'll think on this more...
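Something like this toy sketch is what I mean by "store only the most recent event". The state-file path is arbitrary, and the event is assumed to be a dict carrying a timestamp field, as in the journal entries:
import json
import os

STATE_FILE = "/var/lib/flux-notify/last-event.json"

def record_event(event):
    """Overwrite the stored timestamp each time a new event is handled."""
    os.makedirs(os.path.dirname(STATE_FILE), exist_ok=True)
    with open(STATE_FILE, "w") as f:
        json.dump({"timestamp": event["timestamp"]}, f)

def is_stale(event):
    """On restart, throw out replayed events at or before the stored timestamp."""
    try:
        with open(STATE_FILE) as f:
            last = json.load(f)["timestamp"]
    except (FileNotFoundError, KeyError, ValueError):
        return False
    return event["timestamp"] <= last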
With a dispatcher handling actual notification transmission there's a chance a few notifications would be missed on an abrupt shutdown, but maybe that's not so bad for a first pass.
Note that the ^ push is mostly grammar/clarity fixes and I'm still working through the [excellent, but involved :) ] design feedback.
@wihobbs you've probably already exceeded most of us in knowledge of jobtap plugins, but in case you are interested we are having a little (hopefully fun) hackathon this week on Thursday - grandmaster @trws is going to show us the ropes for making plugins, and ❗ I think there might be rust involved! If you are interested, even for the fun or sharing your expertise, I can send along the invite.
Force-pushed from b6366cc to 2883259.
Tons of great feedback on this. Thanks to everybody for participating and sharing your expertise. I tried to close or consolidate duplicate issues. I apologize in advance if something was inadvertently closed -- please reopen if so. I'm making a high-level summary of the outstanding design todos:
spec_44.rst
`flux-jobtap-plugins(7) <https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man7/flux-jobtap-plugins.html>`_
that streams the jobids of notification-enabled jobs to the Python driver.

The Python driver
Is this documenting a thing that already exists? If not, is there a reason these two should be separate? If it's a language issue, a python-based jobtap plugin shouldn't be an issue.
It's largely because the jobtap plugin resides in the job manager and we want it to do as little as possible since it would be in the critical path.
The jobtap plugin (at the time) seemed like the most efficient way to avoid having to fetch and examine the jobspec of every job since it can just peek at the jobspec that the job manager already has in memory and send the jobid along to the Python process only when the user has requested notification.
However, I'm glad you asked because I wonder if there's a better way to do this using the job-manager journal or some other more widely applicable method. It sure would be nice if it was easy for people to develop services like this and not have to install yet another jobtap plugin to identify jobs of interest.
Something to consider...
Ah, that's a good point. Blocking the job manager would be unfortunate.
The journal is a good thought, or maybe even a pub-sub event from the plugin so anyone could subscribe to it? Not sure, there are definitely tradeoffs here, but it sounds like the current approach has some potential to cause lost or repeated notifications as well, especially racing with the event delivery and kvs removal vs jobtap replay if flux is restarted.
I need to reread some of the previous comments before asking for another set of reviews. This is a pretty substantial update to the proposed design. Some concerns will stay and some will go, but a major advantage of subscribing to the journal of events is that it absolutely cannot block the job manager, because it's a completely separate process. I also noticed that we've ended up in a race over who gets RFC no. 44 😉😆
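For the record, the rough shape of a journal-based driver is sketched below. This is heavily hedged: the topic string job-manager.events-journal, the request payload, and the response fields are assumptions from memory and need to be checked against the flux-core journal protocol (or any flux.job journal helpers, if available) before relying on them:
import flux
import flux.constants

def journal_cb(future):
    resp = future.get()                  # assumed shape: {"id": ..., "events": [...]}
    jobid = resp.get("id")
    for entry in resp.get("events", []):
        # Filter for the events a notification was requested on.
        if entry.get("name") in ("start", "finish"):
            print(f"job {jobid}: {entry['name']} at {entry['timestamp']}")
    future.reset()                       # wait for the next journal response

h = flux.Flux()
rpc = h.rpc(
    "job-manager.events-journal",
    {"full": True},                      # assumed: replay events for existing jobs
    flags=flux.constants.FLUX_RPC_STREAMING,
)
rpc.then(journal_cb)
h.reactor_run()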
Force-pushed from b364b0d to 082ef36.
Problem: no design currently exists for the Flux email service as noted in flux-framework/flux-core#4435. Add an RFC-style document detailing this.
Force-pushed from d775154 to 066fc56.
Ideas that were discussed, but not included in the RFC:
Haven't read through the new RFC yet, but just as a counterpoint here: There are some nice benefits to using an event in the eventlog. For example when jobs are purged those events won't appear on the next restart (the notification service will have to purge its own records otherwise, adding unnecessary complexity). Also, you're already getting all job events through the journal, so this would limit the number of sources the notification service has to deal with, again simplifying the implementation. That being said, I think we discussed that perhaps an event posted by the notification service would appear in the job eventlog after it is needed (e.g. the
This is an RFC-style design document being put up to document the design we have for the flux job notification service.
I'm looking to get feedback on the clarity of this document and make sure we have as many edge cases as possible covered. If this isn't an appropriate RFC for the whole project it can be moved somewhere else, but the github PR interface provides a nice way for us to engage in a feedback loop.
I wanted to add a flow chart detailing some of the interactions but am still working on that. I think there's enough here for a first pass review, though.