-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
not ok 45 - stdin flow control works (all ranks, one rank will exit early) #6542
Comments
I hit this as well recently and what's interesting is that EPIPE is returned from
suggesting perhaps the remote subprocess already closed/exited when the write was about to occur? Lemme take a look. With the recent credits change, perhaps there is a small racy scenario I didn't consider in Edit: oh and you could EPIPE too if the stream is already closed |
hmmmm in a bit of a surprise, in
we seem to already be protecting against the EPIPE case .... huh ... . |
looking through code, something racy isn't immediately coming to mind. The check for FLUX_SUBPROCESS_INIT and FLUX_SUBPROCESS_RUNNING in Reasonable efforts couldn't get this to reproduce locally. gonna see if I can get debug in CI and gimme a clue. |
also as an aside, i noticed in the above output
The input is supposed to be 5 megs, that |
I'm going under the assumption that somehow the streams are closed and then attempted to be written to again. This would explain the output above where we (presumably) sending < 5M of data and some processes appear to be exiting. This could theoretically happen if the Trying to prove this is the case via debug in #6544, but have yet to reproduce this again after many CI runs. |
finally hit the issue again in CI, it confirmed my suspicions
the Still a little perplexed how this could be happening. I'm wondering if it's possible the |
That's a new one on me if so! I'm pretty sure a stopped watcher cannot run. |
It's possible I'm confusing myself. Maybe I'm getting confused w/ priority stuff that we've done before. Going to the
and later
what if a watcher is stopped after it has already been queued? does it get booted from the queue? It would think so? hope so? |
Ok, I'm probably confusing myself over something from before. I wrote a quick experiment and if you stop a watcher when it is queued up, it will get properly dequeued. So all is good in the world. |
Whew 👍 |
Ok, I think I have a theory on what is happening. I don't have definitive "proof", but think this race exists. High level code info as reminder
I think the race exists here
The
But this assumes we actually wrote the same number of stdin bytes to each subprocess. But as can be seen above we check if a subprocess is INIT/RUNNING before writing to it. What if some subprocess was (lets say) EXITED, then the above invariant would not hold. Under normal operations, the BUT I think this is racy. It requires that the subprocess exit notify And that can explain getting us into the situation we're now seeing:
There are obvious fixes that would not be decently performing (i.e. |
Problem: When a subprocess exits / fails, it is removed from the "subprocess credits" list. However, it is possible this exit / failure is not seen yet before the subprocess credits is updated during a stdin callback. This can lead to the subprocess credits list calculating an an invalid "min credits". This can lead to the stdin callback being started when it should not be. Solution: If the number of "active" subprocesses does not equal the number of subprocesses on the credits list, remove exited/failed credits from the credits list in stdin_cb(). Fixes flux-framework#6542
Problem: When a subprocess exits / fails, it is removed from the "subprocess credits" list. However, it is possible this exit / failure is not seen yet before the subprocess credits is updated during a stdin callback. This can lead to the subprocess credits list calculating an an invalid "min credits". This can lead to the stdin callback being started when it should not be. Solution: If the number of "active" subprocesses does not equal the number of subprocesses on the credits list, remove exited/failed credits from the credits list in stdin_cb(). Fixes flux-framework#6542
I've been seeing this test fail often in CI (maybe once every 3 runs of the entire CI workflow)
The text was updated successfully, but these errors were encountered: