forked from LLNL/LaunchMON
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.ERROR_HANDLING
72 lines (57 loc) · 3.9 KB
/
README.ERROR_HANDLING
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
LaunchMON's Error Handling Semantics
A. When the LaunchMON engine fails, the cleanup semantics of LaunchMON is to
detach from the job while keeping the client tool running.
A.1. The LaunchMON engine fails. Unless it is due to SIGKILL, the
engine performs the cleanup in its signal handler. The cleanup is
A.1.1. to detach from the RM_job process, to keep the RM_daemon pro-
cess running (A.1.2), and and to notify this error event to the front-
end tool client before it exits (A.1.3.). RM_job refers to the
resource manager launcher process that monitors the target job.
RM_daemon refers to the resource manager launcher process that moni-
tors the deamons. [BlueGene Note] As BlueGene's control system
implements daemon co-spawning service as part of its Automatic Process
Acquisition Interface (APAI), RM_job is equal to RM_daemon.
B. FE API extension to be able to communciate a session's status to the to-
ol FEN.
B.1. interrupt interface
lmon_rc_e LMON_fe_regStatusCB ( int sessionHandle, int (*func) (int *status) );
The status is returned through the "status" argument.
C. When a client tool component (whether it is FE or daemons) fails, the
basic semantics of the LaunchMON cleanup procedure is to detach from th-
e target job, and kill the daemons (when FE fails) or notify the tool
front-end client (when a daemon failure is detected).
C.1. If the tool FE fails, the engine first detects the socket discon-
nection, at which point it tries to kill the RM_daemon process and
detaches from the RM_job process. However, if for some reason the
engine also gets into trouble, the engine would perform A.1 instead;
obviously in this case, the failing launchmon engine will keep the
RM_daemon process running, and won't be able to do A.1.3.
[BlueGene Note] As RM_daemon is equal to RM_job on BlueGene, and
the system control system doesn't offer a mechanism to clean up dae-
mons, LaunchMON does not currently enforce killing of daemons for this
condition.
C.2. One or more BE daemons fail. This fatal event gets propagated to
the RM_daemon process and the daemons are already cleaned up by the RM
by this time. Next, the engine gets notified and will begin the
cleanup. It will detach from the RM_job, notify the tool front-end,
and exit. However, for some reason, if the engine also gets into
trouble, it will perform A.1; obviously in this case, it doesn't need
to perform A.1.2.
[IBM BlueGene Note] As RM_daemon is equal to RM_job on BlueGene,
and the system control system does not offer a mechanism to
detect failures that occurred in the back-end daemons, LaunchMON does not
currently enforce this semantics on this platform.
[OpenRTE Note] Identical to the IBM BlueGene, RM_daemon is equal to RM_job
on OpenRTE and the system control system does not offer a mechanism to
detect failures that occurred in the back-end daemons. Hence, LaunchMON
does not currently enforce this semantics on this platform.
[Cray ALPS Note] The ALPS tool helper service currently does not offer
a mechanism to detect failures in the daemons; LaunchMON does not enforce
this semantics on this platform.
D. When the job fails, the basic cleanup semantics of LaunchMON is to notify
the FEN tool while keeping the daemons running.
D.1. The target job fails, and this fatal event first gets propagated
to the RM_job process. Next, the engine gets notified of this event,
and in turn notifies the front-end tool of this condition before it
exits. LaunchMON relies on the failure handling to the tool in this
case, thereby leaving the RM_daemons running.