Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make some host errors non-recoverable in try_call. #945

Merged
merged 4 commits into from
Jul 20, 2023

Conversation

dmkozh
Copy link
Contributor

@dmkozh dmkozh commented Jul 13, 2023

What

Make some host errors non-recoverable in try_call.

Resolves #876

Why

  • Internal errors that indicate host implementation/setup bugs - a good chunk of it indicates severely broken host state and we probably don't want to continue execution if that happens. We also would want to learn about these ASAP
  • Running out of budget - while this would be handled 'automatically' during the normal execution, this makes test experience a bit more consistent (as compiled contracts are not metered)
  • Accessing entries outside of the footprint - this is probably the most controversial change, but there is really no indication/expectation that storage interface doesn't behave as a key-value database and hence it's pretty easy to write logic that relies on that fact without considering the existence of the footprint. try_call recovering from footprint errors is functionally equivalent to get_contract_data recovering from footprint errors (imagine a simple storage wrapper contract) and we really don't want allowing that.

Known limitations

N/A

- Internal errors that indicate host implementation/setup bugs - a good chunk of it indicates severely broken host state and we probably don't want to continue execution if that happens. We also would want to learn about these ASAP
- Running out of budget - while this would be handled 'automatically' during the normal execution, this makes test experience a bit more consistent (as compiled contracts are not metered)
- Accessing entries outside of the footprint - this is probably the most controversial change, but there is really no indication/expectation that storage interface doesn't behave as a key-value database and hence it's pretty easy to write logic that relies on that fact without considering the existence of the footprint. `try_call` recovering from footprint errors is functionally equivalent to `get_contract_data` recovering from footprint errors (imagine a simple storage wrapper contract) and we really don't want allowing that.
@graydon
Copy link
Contributor

graydon commented Jul 14, 2023

Hmm. I'm a bit uneasy about this. A few separate comments:

  • The budget-exhaustion errors don't need this since rollback doesn't buy you more budget. Once out you're out. So we're going to fault all the way back to the top level in that case anyways.
  • Footprint errors do not, to me, feel at all different from other unexpected circumstances. It's not always (or necessarily often) a coding bug. Things might just have changed a bit since you did the footprint-preflight -- footprints are a concurrency-control mechanism that offloads racing/retrying to the client, and in a race you lose sometimes. And sometimes you might not want one failed sub-call to trap your transaction. Note that nobody has to call try_call in the first place! It's not the normal code path. Most calls are call and that will propagate failures. We put in try_call specifically for cases where someone wants to isolate a call's errors for some reason. I think it is not our place to decide "this error is a super-duper error and you didn't really mean try_call".
  • I think it's reasonable to change the wasmi::core::TrapCode::UnreachableCodeReached error mapping (in common) from ScErrorCode::InternalError to ScErrorCode::InvalidAction and reserve InternalError for bugs in wasmi.

I'm a bit on the fence about whether "a true wasmi bug should be recoverable or not". On the one hand it's probably adequately contained by us disposing of the VM (as we'll do in any case). On the other, it's plausible the VM did something to the host that the try_call-using user didn't (or can't) anticipate / compensate for, and maybe it's worth erring on the side of caution.

I think we should get input from more people before landing this. It implies subtle differences in the user's ability to write robust code, either way. cc @MonsieurNicolas, @tomerweller, @paulbellamy

@dmkozh
Copy link
Contributor Author

dmkozh commented Jul 14, 2023

Once out you're out. So we're going to fault all the way back to the top level in that case anyways.

That's not true at the very least in tests and I'm also not 100% sure if doing something like return other_contract.try_call(...); would necessarily consume any more budget (as at that stage we're just popping the frames). Likely this still will get caught somewhere, but I think it's worth to have this just for the sake of clarity.

Footprint errors do not, to me, feel at all different from other unexpected circumstances.

I think they are different. Accessing a value that is not in the footprint is not a contract programming error. It's an error of whoever has created the transaction. From the contract writer standpoint footprints do not exist and storage has map-like semantics. We don't intentionally expose the state where "the key is maybe in storage, but we don't really know because it's not in the footprint", we just panic. I think that's the correct design choice, but we do break it if try_call allows to handle such failure gracefully.

We put in try_call specifically for cases where someone wants to isolate a call's errors for some reason.

Nothing in the interface prevents users from e.g. using try_call to just handle the Result value coming from the contract they're calling without thinking much about host errors, panics etc. I wouldn't be surprised if it's used in this way without much thought. I really don't want to create a possibility to manipulate the contract logic by modifying the side input that is completely transparent to the contract. Thus I think it's safer to just never let this to succeed.

Another reason is that footprint is not only a concurrency tool, but also a paid resource (as it defines the ledger reads). I think it's safer to enforce it strictly in the same fashion as we strictly enforce every other resource. I don't have a good idea on whether this can be exploited, but I would err on the more conservative side here. I think sandboxing subcontract calls is something we may or may not want to do in vN, but as it's not happening in v1, I'm not sure we should sandbox one arbitrary resource and not the others (e.g. why then not allow to handle running out of memory etc.)

@MonsieurNicolas
Copy link
Contributor

FWIW I'm with Dima on this one: we can in a future protocol upgrade relax things, so starting with the more conservative stuff seems fine to me.

On the specifics here:

  • out of date footprints -> bouncing back to the client (failed transaction) and ask to re-preflight & submit seems to be the safe thing to do.
  • If footprints are wrong because of buggy or malicious contract, this is no different than contracts that use too much resources. Currently try_call does not protect the caller against this class of problems. We may want to look into this in the future, but this will probably require a different type of try_call with more arguments.
  • right now try_call is really targeting "expected failures" from the contract (like a payment that fails for some reason)

This allows reserving internal errors for wasm-related logic errors on the host side.
@dmkozh dmkozh requested a review from a team as a code owner July 14, 2023 23:27
@dmkozh
Copy link
Contributor Author

dmkozh commented Jul 17, 2023

I think ideally try_call should only recover from contract-induced failures (so return Result(Err(...)) or panic). Getting a host error that is not explicitly generated by a contract should not be recoverable. In order to recover from the host errors we probably need some call_in_sandbox that runs a contract with limited budget and recovers from any errors. The reason is that currently we're mixing two use cases for try_call: handling Result<> return contract value (I would imagine this is what users will want 95+% of the time) and just running arbitrary contracts without breaking (I think this is only useful when unrelated stuff is batched together; FWIW we don't really have an analogue for this in classic). I guess it's currently possible to distinguish between these two via return codes, so that change is not something totally necessary, but I'm also pretty sure that this is a bit of a footgun, as it's very easy to assume that every error comes from the contract logic.

@dmkozh dmkozh merged commit d5df310 into stellar:main Jul 20, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make some host errors non-recoverable
5 participants