
Destroy runner on allocation stopped #401

Merged
merged 4 commits into main from fix/#397-oom-killed on Jul 21, 2023

Conversation


@mpass99 mpass99 commented Jun 26, 2023

Closes #397

Destroying the runner when Nomad informs us that its allocation has been stopped fixes the error of executions running into their timeout even though the allocation was stopped long ago.

@mpass99 mpass99 mentioned this pull request Jun 26, 2023
@mpass99 mpass99 self-assigned this Jun 26, 2023
@mpass99 mpass99 requested a review from MrSerth June 26, 2023 16:10
@mpass99 mpass99 marked this pull request as draft June 26, 2023 16:55

codecov bot commented Jun 27, 2023

Codecov Report

Merging #401 (8c4c125) into main (40a5f2e) will increase coverage by 0.37%.
The diff coverage is 91.05%.

@@            Coverage Diff             @@
##             main     #401      +/-   ##
==========================================
+ Coverage   75.28%   75.65%   +0.37%     
==========================================
  Files          37       37              
  Lines        3415     3488      +73     
==========================================
+ Hits         2571     2639      +68     
- Misses        681      685       +4     
- Partials      163      164       +1     
Impacted Files                          Coverage Δ
internal/runner/inactivity_timer.go      87.03% <ø> (ø)
internal/runner/runner.go               100.00% <ø> (ø)
internal/runner/aws_runner.go            58.33% <50.00%> (ø)
internal/runner/nomad_manager.go         82.78% <79.16%> (-2.22%) ⬇️
internal/runner/nomad_runner.go          82.13% <92.98%> (+2.60%) ⬆️
internal/nomad/nomad.go                  83.18% <96.15%> (+0.43%) ⬆️
internal/api/ws/codeocean_writer.go      78.20% <100.00%> (+0.57%) ⬆️
internal/nomad/job.go                    95.04% <100.00%> (+0.54%) ⬆️


@mpass99 mpass99 marked this pull request as ready for review June 27, 2023 14:50

@MrSerth MrSerth left a comment


Thanks for investigating and finding a solution here! I just tested this PR on staging, but am unsure whether everything works as expected. I attached the poseidon.log
and codeocean.log for further inspection.

My main concerns / questions would be:

  • Why do we see that the runner 10-41ce448a-1526-11ee-9ed6-fa163e5d9f68 used for my code execution is first gone and cannot be deleted (410 for DELETE), but is restarted shortly after? Would I end up with too many runners here?
  • Why do I see Execution canceled by context and Execution terminated after SIGQUIT in the Poseidon log?
  • Can we improve the error being sent to CodeOcean? Currently, CodeOcean receives {"type":"error","data":"Error executing the request"} and therefore states Runner error while running a submission: Execution terminated with an unknown reason. Therefore, CodeOcean also informs the learner about a generic runner error, rather than choosing the specific "OOM killed" error message.


mpass99 commented Jun 28, 2023

Why do we see that the runner 10-41ce448a-1526-11ee-9ed6-fa163e5d9f68 used for my code execution first is gone and cannot be deleted (410 for DELETE), but shortly after is restarted?

  • When the allocation is OOM killed, we receive the Runner stopped event. In this case, we remove the runner from Poseidon's memory (see the sketch below).
  • The DELETE request is unsuccessful because the runner has already been removed.
  • Then, directly after the failure, the allocation is restarted. We handle this case as we would handle a new runner.
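For illustration, here is a minimal Go sketch of that flow. It is not Poseidon's actual code: the manager type, the runner interface, and the map of used runners are simplified assumptions; the error value mirrors the destroy reason discussed later in this thread.

package runner

import "errors"

// ErrAllocationStopped is an assumed sentinel; the discussion below refers to a destroy reason of this name.
var ErrAllocationStopped = errors.New("the allocation was stopped")

// Runner is a reduced stand-in for Poseidon's runner abstraction.
type Runner interface {
	Destroy(reason error) error
}

// NomadRunnerManager is a reduced sketch of the manager tracking used runners.
type NomadRunnerManager struct {
	usedRunners map[string]Runner
}

// onAllocationStopped removes the runner from Poseidon's memory and destroys it,
// which cancels its open execution contexts ("Execution canceled by context").
func (m *NomadRunnerManager) onAllocationStopped(runnerID string, reason error) {
	r, ok := m.usedRunners[runnerID]
	if !ok {
		// Already removed: a subsequent DELETE from CodeOcean returns 410 Gone.
		return
	}
	delete(m.usedRunners, runnerID)
	_ = r.Destroy(reason)
}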

Would I end up with too many runners here?

Yes, this is the one case I know of where we currently tolerate this behavior. The additional runner will be corrected by the check of #380.

Why do I see Execution canceled by context

This is a debug statement informing us that the execution's context has been canceled. We see it because the Runner stopped handling removes the runner (and therefore cancels its open contexts).

and Execution terminated after SIGQUIT in the Poseidon log?

When we cancel an execution, we send SIGQUIT to it. Then, the process either stops itself or ignores the signal. It's a good question why the execution stopped after SIGQUIT in this case, where the allocation is not running. A guess would be that the input triggered some check/cleanup functionality within Nomad.

Can we improve the error being sent to CodeOcean?

Yes, although it includes some forwarding through the layers (5189fb5).
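As a rough illustration of what such forwarding could look like on the WebSocket output side, here is a hedged Go sketch; the message struct, the function, and the error text are assumptions, not the actual code of internal/api/ws/codeocean_writer.go or commit 5189fb5.

package ws

import (
	"encoding/json"
	"errors"
)

// ErrOOMKilled stands in for the runner's destroy reason; the exact error text is an assumption.
var ErrOOMKilled = errors.New("the allocation was OOM Killed")

type webSocketMessage struct {
	Type string `json:"type"`
	Data string `json:"data"`
}

// errorMessage builds the error frame sent to CodeOcean. Instead of always
// sending the generic "Error executing the request", it forwards a specific
// reason when the runner was destroyed because of an OOM kill.
func errorMessage(executionErr error) ([]byte, error) {
	data := "Error executing the request"
	if errors.Is(executionErr, ErrOOMKilled) {
		data = ErrOOMKilled.Error()
	}
	return json.Marshal(webSocketMessage{Type: "error", Data: data})
}

With a mapping like this, CodeOcean would receive the specific reason in the error data instead of the generic runner error.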


MrSerth commented Jun 29, 2023

Nice, thanks for your improvements and response. I tested the execution once again and was able to confirm that CodeOcean received the cause of the runner error. Now, the next step is probably to identify this error in CodeOcean correctly, so that users are informed, right? I am currently thinking of having specific error handling there:

  • If the execution fails with an error and the error data is the allocation was OOM Killed, I would probably "transform" this into {"type": "exit", "data": 137}, which would match the behavior without gVisor. Still, I am not sure whether I like the string comparison with the error data (I would need an exact match, so we should also add a very specific test case to Poseidon (with a comment!) that would fail otherwise).
  • This transformation might not be ideal, but it would also prevent CodeOcean from requesting a DELETE.

What do you think?


During my testing, I also noticed another thing that changed: Today, I not only received the Execution canceled by context and Execution terminated after SIGQUIT, but the first one (Execution canceled by context) actually appeared twice. Is this expected (I assume it's not)?

time="2023-06-29T15:11:22.089079Z" level=debug msg="Execution canceled by context" package=nomad runnerID=10-de1f9785-168e-11ee-b9b1-fa163e5d9f68
time="2023-06-29T15:11:22.089218Z" level=debug msg="Execution canceled by context" package=nomad runnerID=10-de1f9785-168e-11ee-b9b1-fa163e5d9f68
time="2023-06-29T15:11:22.089252Z" level=debug msg="Execution terminated after SIGQUIT" package=runner runner=10-de1f9785-168e-11ee-b9b1-fa163e5d9f68

It's a good question why the execution stopped after SIGQUIT in this case, where the allocation is not running. A guess would be that the input triggered some check/cleanup functionality within Nomad.

I also find this more than surprising. Sure, the debug message doesn't hurt (despite making a wrong statement), but do you have an idea whether we could / should prevent this? Is it something to worry about, because some other side effects are being triggered?

The additional runner will be corrected by the check of #380.

Okay, makes sense. And of course, we still have #381 and #383 in case we wanted to avoid this increase.


mpass99 commented Jul 5, 2023

Now, the next step is probably to identify this error in CodeOcean correctly, so that users are informed, right?

👍

I would probably "transform" this

CodeOcean-side?

to an {"type": "exit", "data": 137}

The exit code 137 is typical for an OOM-killed process, but it only means that the process received a SIGKILL (137 = 128 + 9, the signal number of SIGKILL).

Therefore, it is not 100% precise. Alternatively, we could also try to match a part of the Nomad output waiting on pid 22: waiting on PID 22 in sandbox "78a639259beb2416fc0661b05bf41c977e57383925c9b6b06a1f1f843a3acf45": urpc method \"containerManager.WaitPID\" failed: EOF\r\n. But that would be quite unreliable (in case of changes in Nomad, other causes of this error, and different error messages).

What do you think?

It's not 100% precise, but the best solution so far. I'll add a test for Poseidon.


Is this expected?

Actually, yes, because this debug message is quite low-level. It is logged at the level of the actual Nomad execution request, and it appears twice because we have two connections to Nomad: one for StdOut and one for StdErr.

Is it something to worry about, because some other side effects are being triggered?

No; in fact, we only perform additional side effects (requesting Nomad to delete the runner) when this message is not shown.

do you have an idea whether we could / should prevent this?

We could prevent this now that we have introduced the destroy reason: we can add a condition in handleExitOrContextDone that returns early if the reason is ErrAllocationStopped or ErrOOMKilled, as we then know that the allocation is not running anymore. I'm unsure whether we should, as this condition would only prevent the SIGQUIT from being sent to the allocation and the log message 🤷 If you like, I'll add it 👌
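A sketch of that condition, assuming a simplified signature; the real handleExitOrContextDone in internal/nomad/nomad.go takes different parameters and also handles the exit path:

package nomad

import (
	"context"
	"errors"
)

// Sentinel errors as mentioned above; their exact definitions here are assumptions.
var (
	ErrAllocationStopped = errors.New("the allocation was stopped")
	ErrOOMKilled         = errors.New("the allocation was OOM Killed")
)

// handleExitOrContextDone waits for the execution context to end and normally
// sends SIGQUIT to the remote process. With the proposed condition, it returns
// early when the allocation is already gone, skipping the SIGQUIT and the
// misleading "Execution terminated after SIGQUIT" log line.
func handleExitOrContextDone(ctx context.Context, destroyReason error, sendSIGQUIT func()) {
	<-ctx.Done()
	if errors.Is(destroyReason, ErrAllocationStopped) || errors.Is(destroyReason, ErrOOMKilled) {
		return
	}
	sendSIGQUIT()
}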


@MrSerth MrSerth left a comment


Alright, thanks! This comment, once again, contains multiple points, so I am adding some numbers:

  1. Now, the next step is probably to identify this error in CodeOcean correctly, so that users are informed, right?

    Yes, I am referring to the CodeOcean side and made a proposal with Handle OutOfMemory runner errors with gVisor codeocean#1766.

  2. Therefore, it is not 100% precise.

    a) Can you detail why it is not 100% precise? Just because the SIGKILL received might be caused by something else as well (meaning that we would "overestimate" the OOM cases)?

    b) During my testing, however, I also noticed that the current OOM "discovery" is not 100% precise. I deployed my aforementioned PR to staging and tested with this exercise, containing your Python example:

    import os

    # dd allocates a 500 MB block buffer (bs=500M), exceeding the runner's
    # memory limit and triggering the OOM killer.
    os.system("dd if=/dev/zero of=/dev/null bs=500M")

    However, when running (or scoring) it multiple times, some runs get correctly identified as OOM, while others just show an exit code of 128 (as received by Poseidon, so no special handling is involved in CodeOcean). Are we just out of luck or are we overlooking something?

    (PS: Don't be surprised that the flash message shown in CodeOcean is too light; this is a known issue and tracked somewhere else.)

  3. One for the StdOut and one for the StdErr.

    Thanks for detailing, this makes sense!

  4. We can add a condition in handleExitOrContextDone ...

    To me, this sounds pretty logical (and thus like a good idea), but you are not completely convinced, are you? My reason behind this would be to ensure we do not further interact with an allocation that is already deleted / dead / ..., thus reducing our impact if something unexpected happens thereafter.

internal/nomad/nomad.go (outdated review comment, resolved)

mpass99 commented Jul 7, 2023

2a) Can you detail why it is not 100% precise? Just because the SIGKILL received might be caused by something else as well (meaning that we would "overestimate" the OOM cases)?

Yes, I was thinking of exactly this unusual case.

2b) During my testing, however, I also noticed that the current OOM "discovery" is not 100% precise.

Here, you discovered a race condition between the Nomad execution returning and the Nomad allocation being stopped.

Expected behavior (Allocation is stopped first):

time="2023-07-07T16:05:42.285565Z" level=info msg="Running execution" executionID=ade0a214-23dd-4d15-98ea-f3a5748b31ec package=api runnerId=10-00b34dec-1ce0-11ee-9b89-fa163e5d9f68
time="2023-07-07T16:05:43.383688Z" level=debug msg="Handle Allocation Event" ClientStatus=pending DesiredStatus=run NextAllocation= PrevAllocation= alloc_id=f024d484-2aeb-f3f7-5a5d-5cee3549faab package=nomad
time="2023-07-07T16:05:43.383793Z" level=debug msg="Runner stopped" id=10-00b34dec-1ce0-11ee-9b89-fa163e5d9f68 package=runner
time="2023-07-07T16:05:43.384094Z" level=info msg="Execution returned" package=api
time="2023-07-07T16:05:43.384162Z" level=debug code=200 duration=1.09869301s method=GET path=/api/v1/runners/10-00b34dec-1ce0-11ee-9b89-fa163e5d9f68/websocket user_agent=

Discovered behavior (Execution returns first):

time="2023-07-07T16:06:13.381294Z" level=info msg="Running execution" executionID=41e9b65a-0861-41ac-b2bd-336940cf396e package=api runnerId=10-1247aad9-1ce0-11ee-9b89-fa163e5d9f68
time="2023-07-07T16:06:14.008184Z" level=info msg="Execution returned" package=api
time="2023-07-07T16:06:14.008693Z" level=debug code=200 duration=629.300799ms method=GET path=/api/v1/runners/10-1247aad9-1ce0-11ee-9b89-fa163e5d9f68/websocket user_agent=
time="2023-07-07T16:06:14.197372Z" level=debug msg="Handle Allocation Event" ClientStatus=pending DesiredStatus=run NextAllocation= PrevAllocation= alloc_id=4c157a69-9314-656b-63c6-ede7f657e8a6 package=nomad
time="2023-07-07T16:06:14.197506Z" level=debug msg="Runner stopped" id=10-1247aad9-1ce0-11ee-9b89-fa163e5d9f68 package=runner

The second case is unfortunate, as Nomad just returns exit code 128 to us without any further error.
I see the following solutions:

  1. Map every execution returning with exit code 128 to OOM Killed.
    • (The user might write a program that returns exit code 128. This will lead to displaying an OOM Killed error, but it does not exploit a vulnerability: the runner's inactivity timer is not stopped, and the allocation will be removed when it expires.)
  2. Solution 1, plus an alert (Sentry) when the runner is, in the end, destroyed for a reason other than OOM Killed.
  3. Ignore the exitInfo when it has the exit code 128. Then, Poseidon will wait for the context to be canceled.
  4. Wait 1 second for the allocation to be stopped when the execution returns exit code 128.


MrSerth commented Jul 14, 2023

2a) Okay, fair point. But for now, I would just say: Leave a suitable comment in the code (that this might happen) and that's it. Usually, no one else should kill the process, so I would accept this possibility.

2b) My first idea was similar to your first one (just mapping exit code 128), since the exit code 128 is not that usual... Nevertheless, there might be some false positives. That's why I would prefer a solution that's a bit more clever, and the ones you suggested fit into this category.

Still, I am not sure about solution 2; to me it sounds more like a potential combination with solution 4, for example. How would you identify the OOM killed reason for solution 2? And what should we do when this Sentry error occurs? I have the feeling that there won't be much new information we would gain.

Otherwise, I think solution 3 or 4 would be fine; is it possible to combine these somehow? I wouldn't like to wait a fixed time of 1 second (but up to one second or so would be fine). Therefore, what about keeping the WebSocket connection to CodeOcean open when an execution returns with exit code 128? Then, wait up to a second to see whether the runner gets destroyed / the context canceled. If this is the case, we pretend that the execution got OOM killed. If not, we just mirror the exit code 128 to CodeOcean and accept this (potential) delay. As I see from the log file, the context is usually canceled after about 200 ms, so that should be fine.
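A minimal Go sketch of this idea, under the assumption of a hypothetical helper and a one-second window; this is not the actual implementation that later landed in 53e01be.

package runner

import (
	"context"
	"errors"
	"time"
)

// ErrOOMKilled stands in for the runner's destroy reason; its definition here is an assumption.
var ErrOOMKilled = errors.New("the allocation was OOM Killed")

// interpretExitCode waits up to one second after an exit code 128 to see
// whether the runner's context gets canceled (allocation stopped). If so, the
// exit is reported as an OOM kill; otherwise exit code 128 is mirrored to
// CodeOcean, accepting the (potential) delay.
func interpretExitCode(runnerCtx context.Context, exitCode int) (int, error) {
	if exitCode != 128 {
		return exitCode, nil
	}
	select {
	case <-runnerCtx.Done():
		// The allocation was stopped right after the execution returned.
		return exitCode, ErrOOMKilled
	case <-time.After(time.Second):
		// No stop event arrived in time; report the plain exit code.
		return exitCode, nil
	}
}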


mpass99 commented Jul 16, 2023

Still, I am not sure about solution 2

Neither am I, by now 😅

How would you identify the OOM killed reason for solution 2?

When the allocation is stopped, we receive the OOM Killed reason from Nomad. However, either the Nomad runner would need to wait for subsequent Nomad events, or the Nomad event handling would need to know whether an execution exited with code 128. Neither would be a clean implementation.

However, I'm happy with our current implementation (53e01be) combining solutions 1, 3, and 4 :)


MrSerth commented Jul 17, 2023

However, I'm happy with our current implementation (53e01be) combining solutions 1, 3, and 4 :)

I like that too, and it works as expected -- thanks!

However, during testing of 53e01be deployed to staging, I also saw another error message: {"type":"error","data":"Error executing the request"}

This trace (poseidon.log, codeocean.log) and this one are examples of said behavior. Despite trying further, I was not able to reproduce it again, but I am still wondering about it. Am I just out of luck? Actually, the error cases were my first and maybe fifth try; the following 20 worked without an error. Still, when checking the error details, it seems like the removal of the execution might have happened too early / "unexpectedly".


mpass99 commented Jul 21, 2023

another error message: {"type":"error","data":"Error executing the request"}

I have the feeling I am out of luck reproducing this error. In my recent 100 executions using our OOM payload, it did not appear :/
Do you have any further ideas on how to reproduce it? Otherwise, we could also proceed, but carefully watch POSEIDON-Z #157, which is triggered by this error. With the added (previously missing) debug statement about the internal reason for the allocation stopping, it should be possible to further identify the cause of the error when it occurs.


MrSerth commented Jul 21, 2023

I have the feeling I am out of luck reproducing this error. In my recent 100 executions using our OOM payload, it did not appear :/
Do you have any further ideas on how to reproduce it?

It's not just you. As I've said before, I only saw this error twice during my first five or ten executions. Even after trying multiple times again, I wasn't able to reproduce it either. I used a similar OOM payload and have no idea what triggered the error.

With the added (previously missing) debug statement about the internal reason for the allocation stopping, it should be possible to further identify the cause of the error when it occurs.

That sounds great! I would suggest waiting for another occurrence of the bug and then using the additional debug data to identify the root cause!

Destroying the runner when Nomad informs us that its allocation has been stopped fixes the error of executions running into their timeout even though the allocation was stopped long ago.
Before, Nomad executions often got stopped because the runner was deleted.
With the previous commit, we cover the exception to this behaviour by stopping the execution Poseidon-side.
These different approaches lead to different context error messages.
In this commit, we move the check of the passed timeout to respond with the corresponding client message again.
in order to return a specific error for OOM Killed executions.
due to the Nomad request exiting before the allocation is stopped. We catch this behavior by introducing a waiting period for the allocation to be stopped iff the exit code is 128.
@MrSerth MrSerth enabled auto-merge (rebase) July 21, 2023 13:23
@MrSerth MrSerth merged commit 8ef5f4e into main Jul 21, 2023
10 checks passed
@MrSerth MrSerth deleted the fix/#397-oom-killed branch July 21, 2023 13:30