unexpected worker exit doesn't cause proper kernel panic #5480

warner · 2022-05-31T21:26:52Z

Describe the bug

@michaelfig did an experiment where he used SIGABRT to kill a worker (v2:http, in particular). He observed at least one unhelpful error message (the slog code complaining that the vat did a syscall while in the IDLE state), and then saw that the kernel did not panic as expected.

One message he saw was:

UnhandledPromiseRejectionWarning: vat (a string) in (a string), not (a string): (a string)
(node:54) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 21
4)
2022-05-31T20:50:08.842Z SwingSet: kernel: error in kernel.deliver: (RangeError#137)
RangeError#137: vat v2 in delivery , not idle : reentrant delivery?

  at assertOldState (.../swingset-vat/src/kernel/slogger.js:123:8)
  at Object.delivery (.../swingset-vat/src/kernel/slogger.js:154:1)
  at Object.deliverToVat (.../swingset-vat/src/kernel/vat-warehouse.js:275:21)
  at async deliverAndLogToVat (.../swingset-vat/src/kernel/kernel.js:383:22)
  at async deliverRunQueueEvent (.../swingset-vat/src/kernel/kernel.js:938:12)
  at async processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:975:14)
  at async Object.run (.../swingset-vat/src/kernel/kernel.js:1532:19)
  at async crankScheduler (packages/cosmic-swingset/src/kernel-stats.js:315:5)
  at async processKernel (packages/solo/src/start.js:261:5)
  at async deliverInboundCommand (packages/solo/src/start.js:293:7)

another was:

2022-05-31T20:48:34.309Z SwingSet: kernel: error in kernel.deliver: (ExitSignal#1)
ExitSignal#1: v2:http exited due to signal SIGABRT

  at new ErrorSignal (packages/xsnap/api.js:37:5)
  at ChildProcess.<anonymous> (packages/xsnap/src/xsnap.js:116:22)
  at ChildProcess.emit (node:events:527:28)
  at ChildProcess.emit (node:domain:475:12)

I think we're handling unexpected worker death during a delivery, but not while outside of a delivery.

The slogger tracks the state of each vat, and asserts that e.g. syscalls only happen while the vat is in the DELIVERY state, not the IDLE state. (This might be a bit aggressive, especially because the slogger should not be able to crash the kernel, but it is a useful diagnostic to know that the vat is doing something out-of-turn).

I suspect the first error message is happens because the code reading from the worker pipe gets confused: when the worker exits, the pipe becomes "readable", but the next read will be EOF or an empty string or will throw an error, or something. I suspect that the empty string is somehow parsed enough to be treated as a syscall, at least enough to trigger the slogging code, which then complains that we aren't in the right state.

The manager-subprocess-xsnap.js code has some sort of async iterator loop that waits for data from the worker subprocess. That loop has some way to signal an error, like "child exited". We need to give that manager code a way to panic the kernel when it sees the child exit (as long as we weren't trying to deliberately exit the child at that time, i.e. we were paging the worker out).

Then we want to make sure we don't trigger the slog syscall pathway if/when the "data is available from the child" code gets triggered.

Testing

I think we want a unit test that can start a small kernel, with a single xsnap worker, figure out the PID of the worker, kill it with a SIGABRT, then attempt to do a controller.run() and confirm that it rejects. We should manually look at the output of this test and confirm that it doesn't emit any spurious slog errors.

The text was updated successfully, but these errors were encountered:

warner · 2022-06-01T18:07:10Z

I can't think of an obvious way to expose the worker's PID to the test harness. (I can think of one weird way: announce worker startup/shutdown to the slog, including the PID, and then have the test harness provide a slog publisher hook that ignores everything except that worker-startup event).

If that form of test proves too difficult, I'd probably be content with a manual test: start a kernel with one xs-worker vat, sleep for 20s, while it's sleeping go to a different shell and locate the worker process, kill it, then watch the kernel's behavior when the sleep finishes and it tries to do a new c.run().

mhofman · 2022-06-01T18:10:11Z

I can't think of an obvious way to expose the worker's PID to the test harness. (I can think of one weird way: announce worker startup/shutdown to the slog, including the PID, and then have the test harness provide a slog publisher hook that ignores everything except that worker-startup event).

I have that exact infrastructure in the loadgen, except I find subprocess and match them to vats based on cmd args. Linux only for now unfortunately, as it currently relies on /proc (Open issue to relax this constraint: Agoric/testnet-load-generator#41)

warner · 2022-06-08T21:14:06Z

Maybe change packages/xsnap/ to report the PID, and then slog the PID, and then a test could kill() that PID.

Tartuffo · 2022-07-20T18:34:00Z

might already be fixed, @warner to confirm.

FUDCo · 2022-08-04T08:06:47Z

The tests I did for #2958 seem to confirm that this is fixed. I tried replicating those tests using SIGABRT rather than SIGKILL (not that the signal should matter for this) and got the same (successful) results as before. Once #5890 lands we should mark this issue closed.

warner added bug Something isn't working SwingSet package: SwingSet labels May 31, 2022

warner assigned mhofman Jun 8, 2022

Tartuffo added this to the Mainnet 1 milestone Jul 19, 2022

Tartuffo added the MUST-HAVE label Jul 19, 2022

warner mentioned this issue Jul 19, 2022

test our policy to respond to a vat worker subprocess crashing #2958

Closed

FUDCo mentioned this issue Aug 4, 2022

Implement swingset-runner test to validate vat process failure behavior #5890

Merged

mergify bot closed this as completed in #5890 Aug 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unexpected worker exit doesn't cause proper kernel panic #5480

unexpected worker exit doesn't cause proper kernel panic #5480

warner commented May 31, 2022

warner commented Jun 1, 2022

mhofman commented Jun 1, 2022

warner commented Jun 8, 2022

Tartuffo commented Jul 20, 2022

FUDCo commented Aug 4, 2022

unexpected worker exit doesn't cause proper kernel panic #5480

unexpected worker exit doesn't cause proper kernel panic #5480

Comments

warner commented May 31, 2022

Describe the bug

Testing

warner commented Jun 1, 2022

mhofman commented Jun 1, 2022

warner commented Jun 8, 2022

Tartuffo commented Jul 20, 2022

FUDCo commented Aug 4, 2022