Retrieving the status of a submitted batch job returns NEW unless the user waits a few seconds
#399
This is... a known issue that we are working on. Invoking too many `qstat`s in a short period of time can overwhelm the queuing system, so we do one every n seconds¹ for all jobs tracked by an executor (actually, there's a single polling thread for all executor instances of the same kind in a process).

If you can do it, using notifications is the way to go. If you really have to use […]. The current situation is not particularly nice, and there is more demand for […].

¹ I lie a bit: there's an initial delay (2 s by default) and a separate polling interval (30 s by default).
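For illustration, a minimal sketch of the notification-based approach (the executor name and job ID are placeholders, and this assumes the executor-level `set_job_status_callback` accepts a plain callable):

```python
from psij import Job, JobExecutor, JobStatus

def on_status_change(job: Job, status: JobStatus) -> None:
    # Invoked by the executor's polling thread whenever a tracked job changes state.
    print(f"{job.native_id}: {status.state}")

ex = JobExecutor.get_instance("slurm")        # placeholder executor
ex.set_job_status_callback(on_status_change)

job = Job()
ex.attach(job, "123456")                      # hypothetical native job ID
job.wait()                                    # block until the job reaches a final state
```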
Thanks again, @hategan!! Very clear response. I greatly appreciate it!
Won't calling […]?
That would certainly be welcome! Of course, I understand the challenge of not just bombarding the system with `qstat` calls. Regardless of the ultimate plan here, this is super helpful, because now I know that this delay period is intentional, so I can always be sure that if I […].
@hategan: This is perhaps worthy of a separate discussion, but it seems that there is currently no way to get the status of a job (via […]).
In principle, […].
That was indeed the case earlier, until it became clear that we were forcing a race condition, so the behaviour above is now in effect. In other words, […].
Yes, remote PSI/J should address this.
Awesome, thanks for the tip! Didn't know about the […]. Very much looking forward to remote PSI/J!!
In general, that's the problem with synchronous status: you can miss states, and some of the mechanisms that PSI/J uses to detect the exit code and other conditions are lost with […]. But yes, something to think about...
Agree wholeheartedly. And I believe your understanding is correct; the purge time for the queue history can be modified by the administrator, which means the utility of this approach could be highly variable. I agree it's probably not a sustainable solution unless […].

Once remote PSI/J exists and is out in the world, though, I imagine there'll be much less need for […]. Thanks again!
So I started updating the specification based on an earlier discussion we had. The idea was to make the […]. But then I realized a few things: […]
In short, a waiting `attach()` […]. It is also reasonable to assume that, with proper documentation, users almost always do the right thing, which would eliminate use case #4, but that does not significantly affect the conclusion. So I'm inclined to address this with a documentation update for […].

@andre-merzky, @arosen93: thoughts?
Thanks @hategan! I would be really hesitant to embrace a waiting attach. In the almost trivial usage of 'attach to […]':

```python
import sys, psij

ex = psij.JobExecutor.get_instance("slurm")  # executor type assumed for illustration
for native_id in sys.argv[1:]:
    job = psij.Job()
    ex.attach(job, native_id)
```

When the attach for the first job starts, it will have to wait until the executor polls for the state update. Then it takes a fraction of a second to loop around and attach the second job. At that point the executor has just completed the state poll and will have to wait a full polling interval.

I agree with a documentation fix being the more appropriate solution.
@hategan: Thanks for the writeup! I also agree that the documentation fix is probably the most viable solution here. That said, one small comment.
Based on my reading of the code, it seems that adding […]:

```python
job.wait(
    target_states=[
        JobState.QUEUED,
        JobState.CANCELED,
        JobState.ACTIVE,
        JobState.FAILED,
        JobState.COMPLETED,
    ]
)
```

My assumption is based on this code block (lines 193 to 199 in `331d446`), where the […].
Of course, as we noted earlier, it's possible that the […]. Regardless, this is a minor detail. The key thing is simply to highlight that this kind of user-defined […].
You are correct. It turns out that psij-python does not correctly implement the specification, which basically says that […].

I'm also seeing that the ordering of states is broken even in the specification, since […].

If the job is not in the queue any more and PSI/J cannot find certain files related to it, it is supposed to be marked as […].
Ah, okay! Glad we were able to put the pieces together and spot some things in need of an update. Thanks for your attention to detail there!
I'm not sure that's the behavior I currently see (although maybe it's related to the above issues). If I query for a job after it's done and gone from the queue, I get a […].

Edit: I guess this is expected because there's no way for […].
Yes, please. I started #401 for this.
If you have a native ID, it made it to the queue at some point. If it's not in the queue any more, it either finished or failed at some point. So the code is supposed to assume that a missing job is COMPLETED (or FAILED if that level of detail is otherwise available).
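To make that fallback concrete, here is a hypothetical sketch of the inference; the helper inputs are invented for illustration and do not mirror psij-python's actual internals:

```python
from typing import Optional, Set

from psij import JobState

def infer_state(native_id: str, queued_ids: Set[str], exit_code: Optional[int]) -> JobState:
    """Infer a job's state once the scheduler no longer reports it (illustrative only)."""
    if native_id in queued_ids:
        # Still known to the scheduler; normal polling will report QUEUED/ACTIVE.
        return JobState.QUEUED
    if exit_code is not None and exit_code != 0:
        # Extra detail is available (e.g. a recorded nonzero exit code): mark it FAILED.
        return JobState.FAILED
    # Missing from the queue with no evidence of failure: assume it completed.
    return JobState.COMPLETED
```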
Super useful to know! Thanks! I'll report more info in #401 later this afternoon when I can do some more thorough tests with reproducible examples.
I tried submitting a Slurm job "manually" and getting the ID. Using this native ID, I used the following code to get the job state.
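A minimal sketch of such a query, with the executor name and the job ID as placeholders:

```python
from psij import Job, JobExecutor

ex = JobExecutor.get_instance("slurm")
job = Job()
ex.attach(job, "123456")    # placeholder: the Slurm job ID obtained from sbatch
print(job.status.state)
```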
When doing this within the first ~2-3 seconds after the job was submitted, I get back `NEW`, even though the job is marked `Q` in the queue (and it was submitted, because otherwise I wouldn't have had the native ID). If I add a sleep timer of 4 seconds, it returns `QUEUED` every time, as expected, but I'm worried that this might not be a general solution, because if the filesystem is slow the required delay might be longer.

Here is a complete demonstration that works on Perlmutter:
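A sketch of such a demonstration follows; the `sbatch` invocation, the job-ID parsing, and the 4-second sleep are assumptions, and site-specific flags (account, queue, constraint) may be needed:

```python
import re
import subprocess
import time

from psij import Job, JobExecutor

# Submit a trivial job with sbatch and capture the native job ID from
# "Submitted batch job <id>".
out = subprocess.run(
    ["sbatch", "--wrap", "sleep 60"],
    check=True, capture_output=True, text=True,
).stdout
native_id = re.search(r"\d+", out).group(0)

# time.sleep(4)  # with this delay the state comes back as QUEUED

ex = JobExecutor.get_instance("slurm")
job = Job()
ex.attach(job, native_id)
print(job.status.state)    # NEW within the first ~2-3 seconds, QUEUED afterwards
```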
Is this the expected behavior, or is this something to be addressed? I, naturally, get the same behavior when using a separate Python process that uses PSI/J to submit the Slurm job.
Sidenote: This feature of retrieving the job state should also be added to the documentation somewhere.