CY2021 Q1-2 system instance planning
Jim Garlick edited this page Jan 21, 2021 · 2 revisions
System instance development CY21 Q1-2
Note: In the descriptions below, an idle node is one that has not communicated for a configurable number of heartbeat periods. A down node is one that can no longer communicate, either because it has disconnected or because it has been denied access after being idle for too long.
Feb release
- drain idle nodes, undrain nodes that become unidle again
- mark nodes down that have been idle for some period
- drain down nodes, require manual undrain on reconnect
- (prolog/epilog design placeholder)
- (partial resource release design placeholder)
- (rpc failure on down broker design placeholder)
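The idle/down rules in the Feb release items can be sketched as a small state tracker. This is only an illustration of the policy as described on this page, not Flux code: the names `NodeTracker`, `idle_after`, `down_after`, and `drain_reason` are hypothetical, and the thresholds are arbitrary counts of heartbeat periods.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class State(Enum):
    ONLINE = "online"
    IDLE = "idle"
    DOWN = "down"


@dataclass
class Node:
    rank: int
    last_seen: int = 0                  # heartbeat epoch of the last message
    state: State = State.ONLINE
    drain_reason: Optional[str] = None  # None means the node is not drained


class NodeTracker:
    """Hypothetical tracker; thresholds are counted in heartbeat periods."""

    def __init__(self, idle_after: int = 3, down_after: int = 10):
        self.idle_after = idle_after
        self.down_after = down_after
        self.nodes: Dict[int, Node] = {}

    def add(self, rank: int, epoch: int = 0) -> None:
        self.nodes[rank] = Node(rank, last_seen=epoch)

    def message(self, rank: int, epoch: int) -> None:
        """Record communication from a node."""
        node = self.nodes[rank]
        if node.state is State.DOWN:
            return  # a down node has been denied access; it must reconnect
        node.last_seen = epoch
        if node.state is State.IDLE:
            # The node is no longer idle: undo the automatic drain.
            node.state = State.ONLINE
            if node.drain_reason == "idle":
                node.drain_reason = None

    def heartbeat(self, epoch: int) -> None:
        """Re-evaluate every node at a heartbeat."""
        for node in self.nodes.values():
            if node.state is State.DOWN:
                continue
            silent = epoch - node.last_seen
            if silent >= self.down_after:
                node.state = State.DOWN
                node.drain_reason = "down"
            elif silent >= self.idle_after:
                node.state = State.IDLE
                if node.drain_reason is None:
                    node.drain_reason = "idle"

    def reconnect(self, rank: int, epoch: int) -> None:
        """A down node rejoins; it stays drained until manually undrained."""
        node = self.nodes[rank]
        node.state = State.ONLINE
        node.last_seen = epoch

    def undrain(self, rank: int) -> None:
        """Manual undrain by an administrator."""
        self.nodes[rank].drain_reason = None
```

Recording a drain reason is one way to distinguish the automatic idle drain, which is lifted when the node communicates again, from the down drain, which persists across a reconnect until someone undrains the node manually.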
Mar release
- implement prolog/epilog
- drain node on prolog/epilog failure
- (partial resource release design placeholder)
- (rpc failure on down broker design placeholder)
Apr release
- raise job exception(s) when nodes fail
- implement partial resource release
- (rpc failure on down broker design placeholder)
May release
- RPCs to down nodes eventually fail
Milestone: level 1 resiliency
- nodes are automatically drained when they fail
- Flux remains responsive despite compute node failures
- jobs are killed when nodes they are running on fail
- allocated resources can be partially reclaimed on node failure
(More releases TBD)
Milestone: level 2 resiliency
- overlay routing can be "restarted"
- job shells can survive their brokers being restarted
- rolling software upgrade