Destroy runner on allocation stopped #401
Conversation
Force-pushed from 7f0384e to aa26cd8.
Codecov Report
@@            Coverage Diff             @@
##             main     #401      +/-   ##
==========================================
+ Coverage   75.28%   75.65%   +0.37%
==========================================
  Files          37       37
  Lines        3415     3488      +73
==========================================
+ Hits         2571     2639      +68
- Misses        681      685       +4
- Partials     163      164       +1
Thanks for investigating and finding a solution here! I just tested this PR on staging, but am unsure whether everything works as expected. I attached the poseidon.log and codeocean.log for further inspection.
My main concerns / questions would be:
- Why do we see that the runner 10-41ce448a-1526-11ee-9ed6-fa163e5d9f68 used for my code execution is first gone and cannot be deleted (410 for DELETE), but shortly after is restarted? Would I end up with too many runners here?
- Why do I see "Execution canceled by context" and "Execution terminated after SIGQUIT" in the Poseidon log?
- Can we improve the error being sent to CodeOcean? Currently, CodeOcean receives {"type":"error","data":"Error executing the request"} and therefore states "Runner error while running a submission: Execution terminated with an unknown reason". Therefore, CodeOcean also informs the learner about a generic runner error, rather than choosing the specific "OOM killed" error message.
Force-pushed from 5af71d3 to 5189fb5.
Yes, this is the one case I know of where we currently tolerate this behavior. The additional runner will be corrected by the check of #380.
This is a Debug statement that informs us that the context for the execution has been canceled. We see it because the
When we cancel an execution, we send
Yes, although it includes some forwarding through the layers (5189fb5).
Nice, thanks for your improvements and response. I tested the execution once again and was able to confirm that CodeOcean received the cause for the runner error. Now, the next step is probably to identify this error in CodeOcean correctly, so that users are informed, right? I am currently thinking of having specific error handling there:
What do you think? During my testing, I also noticed another thing that changed: Today, I not only received the
I also find this more than surprising. Sure, the debug message doesn't hurt (despite making a wrong statement), but do you have an idea whether we could / should prevent this? Is it something to worry about, because some other side effects are being triggered?
Okay, makes sense. And of course, we still have #381 and #383 in case we wanted to avoid this increase.
Force-pushed from 5189fb5 to bc04609.
👍
CodeOcean-side?
The exit code 137 is typical for a SIGKILL (137 = 128 + 9), as sent by the OOM killer. Therefore, it is not 100% precise. Alternatively, we could also try to match a part of the Nomad output.
It's not 100% precise, but the best solution so far. I'll add a test for Poseidon.
Actually yes, because this debug message is quite low-level. It is thrown on the level of the actual Nomad Execution request. It is thrown twice because we have two connections to Nomad: one for the StdOut and one for the StdErr.
No, actually we perform only additional side effects (requesting Nomad to delete the runner) when this message is not shown.
We could prevent this, as we introduced the destroy reason. We can add a condition in handleExitOrContextDone.
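As a rough illustration of the exit-code heuristic described above (Poseidon is written in Go; the names below are invented for this sketch and are not Poseidon's actual code), mapping the exit code to an OOM-killed verdict could look like this:

```go
package main

import "fmt"

// ExitCodeSIGKILL is 128 + 9: the conventional exit code for a process
// terminated by SIGKILL, which is also what the OOM killer sends. Other
// senders of SIGKILL look identical, hence the "not 100% precise" caveat.
const ExitCodeSIGKILL = 128 + 9 // 137

// looksLikeOOMKill is a heuristic classification of an execution result.
func looksLikeOOMKill(exitCode int) bool {
	return exitCode == ExitCodeSIGKILL
}

func main() {
	for _, code := range []int{0, 137, 128} {
		fmt.Printf("exit code %d -> treated as OOM kill: %v\n", code, looksLikeOOMKill(code))
	}
}
```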
Alright, thanks! This comment, once again, contains multiple points, so I am adding some numbers:
1. "Now, the next step is probably to identify this error in CodeOcean correctly, so that users are informed, right?"
Yes, I am referring to the CodeOcean side and made a proposal with Handle OutOfMemory runner errors with gVisor (codeocean#1766).
2. "Therefore, it is not 100% precise."
a) Can you detail why it is not 100% precise? Just because the SIGKILL received might be caused by something else as well (meaning that we would "overestimate" the OOM cases)?
b) During my testing, however, I also noticed that the current OOM "discovery" is not 100% precise. I deployed my aforementioned PR to staging and tested with this exercise, containing your Python example: import os; os.system("dd if=/dev/zero of=/dev/null bs=500M"). However, when running (or scoring) it multiple times, some runs get correctly identified as OOM, while others just show an exit code of 128 (as received by Poseidon, so no special handling involved in CodeOcean). Are we just out of luck or are we overlooking something?
(PS: Don't wonder about the flash message shown in CodeOcean being too light; this is a known issue and tracked somewhere else.)
3. "One for the StdOut and one for the StdErr."
Thanks for detailing, this makes sense!
4. "We can add a condition in handleExitOrContextDone ..."
To me, this sounds pretty logical (and thus like a good idea), but you are not completely convinced, are you? My reason behind this would be to ensure we are not further interacting with the allocation being already deleted / dead / ..., thus reducing our impact if something unexpected happens thereafter (roughly as sketched below).
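For illustration only (hypothetical names and simplified types, not the actual Poseidon implementation), such a guard in a handleExitOrContextDone-style function could check whether the runner's context has already been canceled before triggering any further interaction with the allocation:

```go
package main

import (
	"context"
	"fmt"
)

// runner is a stand-in type; the real structure lives in Poseidon's runner
// package and looks different.
type runner struct {
	ctx context.Context
}

// handleExitOrContextDone sketches the guard discussed above: if the
// allocation was already stopped (context canceled with a destroy reason),
// skip any further side effects against it.
func (r *runner) handleExitOrContextDone(exitCode int) {
	select {
	case <-r.ctx.Done():
		fmt.Println("allocation already stopped, skipping cleanup side effects")
		return
	default:
	}
	fmt.Printf("execution exited with code %d, requesting runner deletion from Nomad\n", exitCode)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	r := &runner{ctx: ctx}
	r.handleExitOrContextDone(0) // normal exit: cleanup runs
	cancel()                     // simulate the allocation being stopped
	r.handleExitOrContextDone(0) // guard prevents a second cleanup
}
```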
Force-pushed from bc04609 to abfdbc8.
Yes, I thought of just this unusual case.
Here, you discovered a race condition between the Nomad execution returning and the Nomad allocation being stopped. The expected behavior is that the allocation is stopped first; the discovered behavior is that the execution returns first. The second case is unfortunate, as Nomad returns to us just the exit code 128 without a further error.
2a) Okay, fair point. But for now, I would just say: leave a suitable comment in the code (that this might happen) and that's it. Usually, no one else should kill the process, so I would accept this possibility.
2b) My first idea was similar to your first one (just mapping exit code 128), since the exit code 128 is not that usual... Nevertheless, there might be some false positives. That's why I would prefer a solution that's a bit more clever, and the ones you suggested fit into this category. Still, I am not sure about solution 2; to me it sounds more like a potential combination with solution 4, for example. How would you identify the OOM killed reason for solution 2? And what should we do when this Sentry error occurs? I have the feeling that there won't be much new information we would gain.
Otherwise, I think solution 3 or 4 would be fine; is it possible to combine these somehow? I wouldn't like to wait a fixed time of 1 second (but up to one second or so would be fine). Therefore, what about keeping the WebSocket connection to CodeOcean open in case an execution returned with exit code 128? Then, wait up to a second for the runner to get destroyed / the context to be canceled. If this is the case, we pretend that the execution got OOM killed. If not, we just mirror the exit code 128 to CodeOcean and accept this (potential) delay. As I see from the log file, the context should usually be canceled after about 200 ms, so that should be fine.
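To make the proposal concrete, here is a minimal sketch (hypothetical names; Poseidon's real code is structured differently): when an execution returns exit code 128 without further information, wait up to a grace period for the runner's context to be canceled, i.e. for the allocation-stopped event, before answering CodeOcean.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// classifyExit128 waits up to `grace` for the runner context to be canceled.
// If the allocation stop arrives in time, the exit is reported as OOM killed;
// otherwise, the plain exit code 128 is mirrored to the client.
func classifyExit128(runnerCtx context.Context, grace time.Duration) string {
	select {
	case <-runnerCtx.Done():
		return "OOM killed (allocation stopped shortly after the execution returned)"
	case <-time.After(grace):
		return "exit code 128 (no allocation stop observed within the grace period)"
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	// Simulate the allocation being stopped ~200 ms after the execution
	// returns, matching the delay observed in the logs.
	go func() {
		time.Sleep(200 * time.Millisecond)
		cancel()
	}()
	fmt.Println(classifyExit128(ctx, time.Second))
}
```

The upper bound only delays the response in the rare exit-code-128 case; all other executions are answered immediately.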
Force-pushed from abfdbc8 to 53e01be.
However, I'm happy with our current implementation (53e01be) combining solutions 1, 3, and 4 :)
I like that too, and it works as expected -- thanks! However, during testing of 53e01be deployed to staging, I also saw another error message: {"type":"error","data":"Error executing the request"}. This trace (poseidon.log, codeocean.log) and this one are examples of said behavior. Despite trying further, I was not able to reproduce it again, but was still wondering about it. Am I just out of luck? Actually, the error cases were my first and maybe my fifth try; the following 20 worked without an error. Still, when checking the error details, it seems like the removal of the execution might have happened too early / "unexpectedly".
Force-pushed from 53e01be to 9d2ed1c.
I have the feeling I am out of luck reproducing this error. In my recent 100 executions using our OOM payload, it did not appear :/
Force-pushed from 9d2ed1c to a41b62e.
It's not just you. As I've said before, I only saw this error twice during my first five or ten executions. Even after trying multiple times again, I wasn't able to reproduce it either. I used a similar OOM payload and have no idea what triggered the error.
That sounds great. I would suggest waiting for another occurrence of the bug and then using the additional debug data to identify the root cause!
Destroying the runner when Nomad informs us about its allocation being stopped fixes the error of executions running into their timeout even if the allocation was stopped long ago.
Before, Nomad executions often got stopped because the runner was deleted. With the previous commit, we cover the exception to this behaviour by stopping the execution Poseidon-side. These different approaches lead to different context error messages. In this commit, we move the check of the passed timeout so that we respond with the corresponding client message again.
in order to return a specific error for OOM Killed Executions.
due to the Nomad request exiting before the allocation is stopped. We catch this behavior by introducing a time period for the allocation being stopped iff the exit code is 128.
Force-pushed from a41b62e to 8c4c125.
Closes #397