Make some host errors non-recoverable in `try_call`. #945

dmkozh · 2023-07-13T00:46:07Z

What

Make some host errors non-recoverable in try_call.

Resolves #876

Why

Internal errors that indicate host implementation/setup bugs - a good chunk of it indicates severely broken host state and we probably don't want to continue execution if that happens. We also would want to learn about these ASAP
Running out of budget - while this would be handled 'automatically' during the normal execution, this makes test experience a bit more consistent (as compiled contracts are not metered)
Accessing entries outside of the footprint - this is probably the most controversial change, but there is really no indication/expectation that storage interface doesn't behave as a key-value database and hence it's pretty easy to write logic that relies on that fact without considering the existence of the footprint. try_call recovering from footprint errors is functionally equivalent to get_contract_data recovering from footprint errors (imagine a simple storage wrapper contract) and we really don't want allowing that.

Known limitations

N/A

- Internal errors that indicate host implementation/setup bugs - a good chunk of it indicates severely broken host state and we probably don't want to continue execution if that happens. We also would want to learn about these ASAP - Running out of budget - while this would be handled 'automatically' during the normal execution, this makes test experience a bit more consistent (as compiled contracts are not metered) - Accessing entries outside of the footprint - this is probably the most controversial change, but there is really no indication/expectation that storage interface doesn't behave as a key-value database and hence it's pretty easy to write logic that relies on that fact without considering the existence of the footprint. `try_call` recovering from footprint errors is functionally equivalent to `get_contract_data` recovering from footprint errors (imagine a simple storage wrapper contract) and we really don't want allowing that.

soroban-env-host/src/host/error.rs

graydon · 2023-07-14T01:04:07Z

Hmm. I'm a bit uneasy about this. A few separate comments:

The budget-exhaustion errors don't need this since rollback doesn't buy you more budget. Once out you're out. So we're going to fault all the way back to the top level in that case anyways.
Footprint errors do not, to me, feel at all different from other unexpected circumstances. It's not always (or necessarily often) a coding bug. Things might just have changed a bit since you did the footprint-preflight -- footprints are a concurrency-control mechanism that offloads racing/retrying to the client, and in a race you lose sometimes. And sometimes you might not want one failed sub-call to trap your transaction. Note that nobody has to call try_call in the first place! It's not the normal code path. Most calls are call and that will propagate failures. We put in try_call specifically for cases where someone wants to isolate a call's errors for some reason. I think it is not our place to decide "this error is a super-duper error and you didn't really mean try_call".
I think it's reasonable to change the wasmi::core::TrapCode::UnreachableCodeReached error mapping (in common) from ScErrorCode::InternalError to ScErrorCode::InvalidAction and reserve InternalError for bugs in wasmi.

I'm a bit on the fence about whether "a true wasmi bug should be recoverable or not". On the one hand it's probably adequately contained by us disposing of the VM (as we'll do in any case). On the other, it's plausible the VM did something to the host that the try_call-using user didn't (or can't) anticipate / compensate for, and maybe it's worth erring on the side of caution.

I think we should get input from more people before landing this. It implies subtle differences in the user's ability to write robust code, either way. cc @MonsieurNicolas, @tomerweller, @paulbellamy

dmkozh · 2023-07-14T01:37:29Z

Once out you're out. So we're going to fault all the way back to the top level in that case anyways.

That's not true at the very least in tests and I'm also not 100% sure if doing something like return other_contract.try_call(...); would necessarily consume any more budget (as at that stage we're just popping the frames). Likely this still will get caught somewhere, but I think it's worth to have this just for the sake of clarity.

Footprint errors do not, to me, feel at all different from other unexpected circumstances.

I think they are different. Accessing a value that is not in the footprint is not a contract programming error. It's an error of whoever has created the transaction. From the contract writer standpoint footprints do not exist and storage has map-like semantics. We don't intentionally expose the state where "the key is maybe in storage, but we don't really know because it's not in the footprint", we just panic. I think that's the correct design choice, but we do break it if try_call allows to handle such failure gracefully.

We put in try_call specifically for cases where someone wants to isolate a call's errors for some reason.

Nothing in the interface prevents users from e.g. using try_call to just handle the Result value coming from the contract they're calling without thinking much about host errors, panics etc. I wouldn't be surprised if it's used in this way without much thought. I really don't want to create a possibility to manipulate the contract logic by modifying the side input that is completely transparent to the contract. Thus I think it's safer to just never let this to succeed.

Another reason is that footprint is not only a concurrency tool, but also a paid resource (as it defines the ledger reads). I think it's safer to enforce it strictly in the same fashion as we strictly enforce every other resource. I don't have a good idea on whether this can be exploited, but I would err on the more conservative side here. I think sandboxing subcontract calls is something we may or may not want to do in vN, but as it's not happening in v1, I'm not sure we should sandbox one arbitrary resource and not the others (e.g. why then not allow to handle running out of memory etc.)

MonsieurNicolas · 2023-07-14T22:56:08Z

FWIW I'm with Dima on this one: we can in a future protocol upgrade relax things, so starting with the more conservative stuff seems fine to me.

On the specifics here:

out of date footprints -> bouncing back to the client (failed transaction) and ask to re-preflight & submit seems to be the safe thing to do.
If footprints are wrong because of buggy or malicious contract, this is no different than contracts that use too much resources. Currently try_call does not protect the caller against this class of problems. We may want to look into this in the future, but this will probably require a different type of try_call with more arguments.
right now try_call is really targeting "expected failures" from the contract (like a payment that fails for some reason)

This allows reserving internal errors for wasm-related logic errors on the host side.

dmkozh · 2023-07-17T19:08:44Z

I think ideally try_call should only recover from contract-induced failures (so return Result(Err(...)) or panic). Getting a host error that is not explicitly generated by a contract should not be recoverable. In order to recover from the host errors we probably need some call_in_sandbox that runs a contract with limited budget and recovers from any errors. The reason is that currently we're mixing two use cases for try_call: handling Result<> return contract value (I would imagine this is what users will want 95+% of the time) and just running arbitrary contracts without breaking (I think this is only useful when unrelated stuff is batched together; FWIW we don't really have an analogue for this in classic). I guess it's currently possible to distinguish between these two via return codes, so that change is not something totally necessary, but I'm also pretty sure that this is a bit of a footgun, as it's very easy to assume that every error comes from the contract logic.

dmkozh requested review from graydon and sisuresh as code owners July 13, 2023 00:46

!fixup cargo fmt

101bbeb

sisuresh reviewed Jul 13, 2023

View reviewed changes

soroban-env-host/src/host/error.rs Outdated Show resolved Hide resolved

Make generic wasmi errors InvalidAction.

dcd9daf

This allows reserving internal errors for wasm-related logic errors on the host side.

dmkozh requested a review from a team as a code owner July 14, 2023 23:27

Merge branch 'main' into host_errors

1889899

sisuresh approved these changes Jul 20, 2023

View reviewed changes

dmkozh merged commit d5df310 into stellar:main Jul 20, 2023
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make some host errors non-recoverable in `try_call`. #945

Make some host errors non-recoverable in `try_call`. #945

dmkozh commented Jul 13, 2023

graydon commented Jul 14, 2023

dmkozh commented Jul 14, 2023 •

edited

Loading

MonsieurNicolas commented Jul 14, 2023

dmkozh commented Jul 17, 2023

Make some host errors non-recoverable in try_call. #945

Make some host errors non-recoverable in try_call. #945

Conversation

dmkozh commented Jul 13, 2023

What

Why

Known limitations

graydon commented Jul 14, 2023

dmkozh commented Jul 14, 2023 • edited Loading

MonsieurNicolas commented Jul 14, 2023

dmkozh commented Jul 17, 2023

Make some host errors non-recoverable in `try_call`. #945

Make some host errors non-recoverable in `try_call`. #945

dmkozh commented Jul 14, 2023 •

edited

Loading