job-exec: fix potential hang after exec kill error
Problem: When the job-exec module handles an error from the wait-all
composite future created from exec_kill(), it makes an assumption that
all child futures are also fulfilled, and calls flux_future_get()
on each individual future in bulk_exec_kill_log_error() to print a
specific error. However, if the bulk kill operation times out, then one
or more of the child futures may not be fulfilled, and the module
could block forever.

Check that each child future is ready in bulk_exec_kill_log_error()
before calling flux_future_get() to avoid the hang.

Fixes flux-framework#5523
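
The pattern described above, checking readiness before a blocking get, can be sketched without libflux. This is a minimal illustration under stated assumptions, not the module's code: `struct mock_future`, `mock_future_get()`, and `log_kill_errors()` are hypothetical stand-ins for the child futures, `flux_future_get()`, and the logging loop in `bulk_exec_kill_log_error()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a child future (not the flux-core type). */
struct mock_future {
    bool ready;     /* true once the future has been fulfilled */
    int result;     /* 0 on success, -1 on error; valid only when ready */
    unsigned rank;  /* rank the kill request was sent to */
};

/* Stand-in for a blocking get. A real blocking get would wait here if the
 * future were still pending, so this mock asserts that it never is. */
static int mock_future_get (const struct mock_future *f)
{
    assert (f->ready);
    return f->result;
}

/* Log one error per fulfilled child that failed, skipping pending children.
 * Returns the number of errors logged. */
int log_kill_errors (const struct mock_future *children, size_t n)
{
    int logged = 0;
    for (size_t i = 0; i < n; i++) {
        const struct mock_future *cf = &children[i];
        /* The readiness check short-circuits, so the blocking get is
         * never reached for a child that has not been fulfilled. */
        if (cf->ready && mock_future_get (cf) < 0) {
            fprintf (stderr, "exec_kill failed (rank %u)\n", cf->rank);
            logged++;
        }
    }
    return logged;
}
```

With one successful child, one failed child, and one still-pending child, only the failed child is reported and the pending one is never waited on.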
grondo committed Nov 6, 2023
1 parent a06600f commit d87308d
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion src/modules/job-exec/bulk-exec.c
@@ -541,7 +541,8 @@ void bulk_exec_kill_log_error (flux_future_t *f, flux_jobid_t id)
         const char *name = flux_future_first_child (f);
         while (name) {
             flux_future_t *cf = flux_future_get_child (f, name);
-            if (flux_future_get (cf, NULL) < 0) {
+            if (flux_future_is_ready (cf)
+                && flux_future_get (cf, NULL) < 0) {
                 uint32_t rank = flux_rpc_get_nodeid (cf);
                 flux_log_error (h,
                                 "%s: exec_kill: %s (rank %lu)",