unexpected worker exit doesn't cause proper kernel panic #5480
Comments
I can't think of an obvious way to expose the worker's PID to the test harness. (I can think of one weird way: announce worker startup/shutdown to the slog, including the PID, and then have the test harness provide a slog publisher hook that ignores everything except that worker-startup event). If that form of test proves too difficult, I'd probably be content with a manual test: start a kernel with one |
I have that exact infrastructure in the loadgen, except I find subprocess and match them to vats based on cmd args. Linux only for now unfortunately, as it currently relies on |
Maybe change |
might already be fixed, @warner to confirm. |
Describe the bug
@michaelfig did an experiment where he used SIGABRT to kill a worker (v2:http, in particular). He observed at least one unhelpful error message (the slog code complaining that the vat did a syscall while in the IDLE state), and then saw that the kernel did not panic as expected.
One message he saw was:
another was:
I think we're handling unexpected worker death during a delivery, but not while outside of a delivery.
The slogger tracks the state of each vat, and asserts that e.g. syscalls only happen while the vat is in the DELIVERY state, not the IDLE state. (This might be a bit aggressive, especially because the slogger should not be able to crash the kernel, but it is a useful diagnostic to know that the vat is doing something out-of-turn).
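As a rough illustration of that state check, here is a minimal sketch of a per-vat tracker that reports out-of-turn syscalls instead of throwing, so the diagnostic survives without letting the slogger crash the kernel. This is not the actual slogger code: `makeVatStateTracker`, its methods, and the `report` callback are all hypothetical names.

```javascript
// Hypothetical per-vat state tracker; NOT the real slogger implementation.
const IDLE = 'idle';
const DELIVERY = 'delivery';

function makeVatStateTracker(vatID, report = console.error) {
  let state = IDLE;
  return {
    startDelivery() {
      if (state !== IDLE) {
        report(`${vatID}: delivery started while in state ${state}`);
      }
      state = DELIVERY;
    },
    finishDelivery() {
      if (state !== DELIVERY) {
        report(`${vatID}: delivery finished while in state ${state}`);
      }
      state = IDLE;
    },
    // Returns true if the syscall arrived in-turn. An out-of-turn syscall
    // is reported (not thrown) and the function returns false.
    syscall(name) {
      if (state !== DELIVERY) {
        report(`${vatID}: syscall ${name} while in state ${state}`);
        return false;
      }
      return true;
    },
    getState: () => state,
  };
}
```

Reporting rather than asserting is a design choice made for this sketch only; whether the real slogger should do the same is the open question raised above.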
I suspect the first error message happens because the code reading from the worker pipe gets confused: when the worker exits, the pipe becomes "readable", but the next read will return EOF or an empty string, or throw an error, or something similar. I suspect that the empty string is somehow parsed enough to be treated as a syscall, at least enough to trigger the slogging code, which then complains that we aren't in the right state.
The manager-subprocess-xsnap.js code has some sort of async iterator loop that waits for data from the worker subprocess. That loop has some way to signal an error, like "child exited". We need to give that manager code a way to panic the kernel when it sees the child exit (as long as we weren't trying to deliberately exit the child at that time, i.e. while we were paging the worker out).
Then we want to make sure we don't trigger the slog syscall pathway if/when the "data is available from the child" code gets triggered.
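One possible shape for this, sketched under assumptions: the manager marks deliberate shutdowns ahead of time, treats any other exit as a reason to panic, and drops empty or post-exit reads before they reach the syscall/slog path. The names `watchWorker`, `expectExit`, `handleData`, `panic`, and `onMessage` are hypothetical and do not come from manager-subprocess-xsnap.js. The only real API relied on is the `'exit'` event of a Node.js `ChildProcess`, which passes `(code, signal)`.

```javascript
// Hypothetical helper. `child` is anything that emits 'exit' with
// (code, signal), such as a Node.js ChildProcess. `panic` stands in for
// whatever hook the kernel would hand to the manager.
function watchWorker(child, { panic, onMessage }) {
  let exitExpected = false;
  let exited = false;

  child.on('exit', (code, signal) => {
    exited = true;
    if (!exitExpected) {
      panic(
        new Error(
          `worker exited unexpectedly (code=${code}, signal=${signal})`,
        ),
      );
    }
  });

  return {
    // Call before deliberately stopping the worker, e.g. when paging it out.
    expectExit() {
      exitExpected = true;
    },
    // Route data read from the worker pipe. Empty/EOF-style reads and
    // anything arriving after exit are dropped, so they never reach the
    // syscall handling (and therefore never reach the slog state check).
    handleData(chunk) {
      if (exited || chunk === null || chunk === undefined) return false;
      if (chunk.length === 0) return false;
      onMessage(chunk);
      return true;
    },
  };
}
```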
Testing
I think we want a unit test that can start a small kernel, with a single xsnap worker, figure out the PID of the worker, kill it with a SIGABRT, then attempt to do a controller.run() and confirm that it rejects. We should manually look at the output of this test and confirm that it doesn't emit any spurious slog errors.
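The kill-and-check pattern can be tried in isolation before wiring it into a kernel test. The sketch below is not a SwingSet test: it uses an idle Node.js process as a stand-in for the xsnap worker, and a promise named `done` as a stand-in for the rejection we would expect from controller.run(). `startFakeWorker` and `killAndCheck` are hypothetical names, and signal delivery is assumed to behave as on Linux or macOS.

```javascript
const { spawn } = require('node:child_process');

// Stand-in for an xsnap worker: a Node.js process that idles until killed.
function startFakeWorker() {
  const child = spawn(
    process.execPath,
    ['-e', 'setInterval(() => {}, 1000)'],
    { stdio: 'ignore' },
  );
  // Rejects if the worker dies from a signal or exits with a nonzero code.
  const done = new Promise((resolve, reject) => {
    child.once('error', reject);
    child.once('exit', (code, signal) => {
      if (signal || code !== 0) {
        reject(
          new Error(`worker exited unexpectedly: code=${code} signal=${signal}`),
        );
      } else {
        resolve();
      }
    });
  });
  return { pid: child.pid, done };
}

// Kill the worker by PID and confirm that the pending promise rejects.
// Resolves to the rejection message, or rejects if `done` resolved.
async function killAndCheck() {
  const { pid, done } = startFakeWorker();
  process.kill(pid, 'SIGABRT');
  try {
    await done;
  } catch (err) {
    return err.message;
  }
  throw new Error('expected the worker promise to reject');
}
```

In a real test the PID would have to come from the kernel or the worker manager, which is the part this sketch leaves open.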