From d87308d161b2bf39d7b8db7b3ddcc12139223a49 Mon Sep 17 00:00:00 2001
From: "Mark A. Grondona"
Date: Mon, 6 Nov 2023 12:57:32 -0800
Subject: [PATCH] job-exec: fix potential hang after exec kill error

Problem: When the job-exec module handles an error from the wait-all
composite future created from exec_kill(), it assumes that all child
futures are also fulfilled, and calls flux_future_get() on each
individual future in bulk_exec_kill_log_error() to print a specific
error. However, if the bulk kill operation times out, then one or more
of the child futures may not be fulfilled, and the module could block
forever.

Check that each child future is ready in bulk_exec_kill_log_error()
before calling flux_future_get() to avoid the hang.

Fixes #5523
---
 src/modules/job-exec/bulk-exec.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/modules/job-exec/bulk-exec.c b/src/modules/job-exec/bulk-exec.c
index 52c879ed57e5..4ec1aed62841 100644
--- a/src/modules/job-exec/bulk-exec.c
+++ b/src/modules/job-exec/bulk-exec.c
@@ -541,7 +541,8 @@ void bulk_exec_kill_log_error (flux_future_t *f, flux_jobid_t id)
     const char *name = flux_future_first_child (f);
     while (name) {
         flux_future_t *cf = flux_future_get_child (f, name);
-        if (flux_future_get (cf, NULL) < 0) {
+        if (flux_future_is_ready (cf)
+            && flux_future_get (cf, NULL) < 0) {
             uint32_t rank = flux_rpc_get_nodeid (cf);
             flux_log_error (h,
                             "%s: exec_kill: %s (rank %lu)",