Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unexpected worker exit doesn't cause proper kernel panic #5480

Closed
warner opened this issue May 31, 2022 · 5 comments · Fixed by #5890
Closed

unexpected worker exit doesn't cause proper kernel panic #5480

warner opened this issue May 31, 2022 · 5 comments · Fixed by #5890
Assignees
Labels
bug Something isn't working SwingSet package: SwingSet
Milestone

Comments

@warner
Copy link
Member

warner commented May 31, 2022

Describe the bug

@michaelfig did an experiment where he used SIGABRT to kill a worker (v2:http, in particular). He observed at least one unhelpful error message (the slog code complaining that the vat did a syscall while in the IDLE state), and then saw that the kernel did not panic as expected.

One message he saw was:

UnhandledPromiseRejectionWarning: vat (a string) in (a string), not (a string): (a string)
(node:54) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 21
4)
2022-05-31T20:50:08.842Z SwingSet: kernel: error in kernel.deliver: (RangeError#137)
RangeError#137: vat v2 in delivery , not idle : reentrant delivery?

  at assertOldState (.../swingset-vat/src/kernel/slogger.js:123:8)
  at Object.delivery (.../swingset-vat/src/kernel/slogger.js:154:1)
  at Object.deliverToVat (.../swingset-vat/src/kernel/vat-warehouse.js:275:21)
  at async deliverAndLogToVat (.../swingset-vat/src/kernel/kernel.js:383:22)
  at async deliverRunQueueEvent (.../swingset-vat/src/kernel/kernel.js:938:12)
  at async processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:975:14)
  at async Object.run (.../swingset-vat/src/kernel/kernel.js:1532:19)
  at async crankScheduler (packages/cosmic-swingset/src/kernel-stats.js:315:5)
  at async processKernel (packages/solo/src/start.js:261:5)
  at async deliverInboundCommand (packages/solo/src/start.js:293:7)

another was:

2022-05-31T20:48:34.309Z SwingSet: kernel: error in kernel.deliver: (ExitSignal#1)
ExitSignal#1: v2:http exited due to signal SIGABRT

  at new ErrorSignal (packages/xsnap/api.js:37:5)
  at ChildProcess.<anonymous> (packages/xsnap/src/xsnap.js:116:22)
  at ChildProcess.emit (node:events:527:28)
  at ChildProcess.emit (node:domain:475:12)

I think we're handling unexpected worker death during a delivery, but not while outside of a delivery.

The slogger tracks the state of each vat, and asserts that e.g. syscalls only happen while the vat is in the DELIVERY state, not the IDLE state. (This might be a bit aggressive, especially because the slogger should not be able to crash the kernel, but it is a useful diagnostic to know that the vat is doing something out-of-turn).

I suspect the first error message is happens because the code reading from the worker pipe gets confused: when the worker exits, the pipe becomes "readable", but the next read will be EOF or an empty string or will throw an error, or something. I suspect that the empty string is somehow parsed enough to be treated as a syscall, at least enough to trigger the slogging code, which then complains that we aren't in the right state.

The manager-subprocess-xsnap.js code has some sort of async iterator loop that waits for data from the worker subprocess. That loop has some way to signal an error, like "child exited". We need to give that manager code a way to panic the kernel when it sees the child exit (as long as we weren't trying to deliberately exit the child at that time, i.e. we were paging the worker out).

Then we want to make sure we don't trigger the slog syscall pathway if/when the "data is available from the child" code gets triggered.

Testing

I think we want a unit test that can start a small kernel, with a single xsnap worker, figure out the PID of the worker, kill it with a SIGABRT, then attempt to do a controller.run() and confirm that it rejects. We should manually look at the output of this test and confirm that it doesn't emit any spurious slog errors.

@warner warner added bug Something isn't working SwingSet package: SwingSet labels May 31, 2022
@warner
Copy link
Member Author

warner commented Jun 1, 2022

I can't think of an obvious way to expose the worker's PID to the test harness. (I can think of one weird way: announce worker startup/shutdown to the slog, including the PID, and then have the test harness provide a slog publisher hook that ignores everything except that worker-startup event).

If that form of test proves too difficult, I'd probably be content with a manual test: start a kernel with one xs-worker vat, sleep for 20s, while it's sleeping go to a different shell and locate the worker process, kill it, then watch the kernel's behavior when the sleep finishes and it tries to do a new c.run().

@mhofman
Copy link
Member

mhofman commented Jun 1, 2022

I can't think of an obvious way to expose the worker's PID to the test harness. (I can think of one weird way: announce worker startup/shutdown to the slog, including the PID, and then have the test harness provide a slog publisher hook that ignores everything except that worker-startup event).

I have that exact infrastructure in the loadgen, except I find subprocess and match them to vats based on cmd args. Linux only for now unfortunately, as it currently relies on /proc (Open issue to relax this constraint: Agoric/testnet-load-generator#41)

@warner
Copy link
Member Author

warner commented Jun 8, 2022

Maybe change packages/xsnap/ to report the PID, and then slog the PID, and then a test could kill() that PID.

@Tartuffo
Copy link
Contributor

might already be fixed, @warner to confirm.

@FUDCo
Copy link
Contributor

FUDCo commented Aug 4, 2022

The tests I did for #2958 seem to confirm that this is fixed. I tried replicating those tests using SIGABRT rather than SIGKILL (not that the signal should matter for this) and got the same (successful) results as before. Once #5890 lands we should mark this issue closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants