Production runtime should use a VM for isolation/security #742
I don't think this is relevant now that we have moved to Kube as the primary target.
It is still relevant, because the runner and the user code are separate security domains, and the runner is a policy enforcement point. The user code should not have access to anything but the runner "proxy" ports.
In that case, from a Kube PoV you probably want the enforcement part of the runner in one container, and the actual user runtime in a different container inside the pod. AFAIK you can't really do this sort of isolation if they are all inside the same container.
That could be a good first step, and maybe sufficient long term, but we should run that past @AlexSzlavik. From everything I've read, containers are not a reliable security boundary. But perhaps now that we're all-in on Kubernetes we can combine this with other Kubernetes security approaches, like routing policies, and, coupled with our own policy enforcement, that might be fine.
Chatted to Alex about this earlier today and we think the Runner could be a sidecar, with the user code proxying everything through it. Presumably the user container can be locked down such that it can't access anything except for the Runner. One issue is that because we currently route everything through the Controller, the Runner needs to be able to differentiate between traffic originating from the user module and all other traffic in order to avoid routing loops. There's code in place to do that, but it's likely bitrotted, and it also requires changes to each runtime, so the JVM runtime probably doesn't support this currently. Needs testing.
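A minimal sketch of what that differentiation could look like, assuming a hypothetical `X-Ftl-Origin` marker header and placeholder addresses/ports (none of which are FTL's actual mechanism):

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// originHeader is a hypothetical marker header; the real mechanism in FTL may differ.
const originHeader = "X-Ftl-Origin"

func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		log.Fatal(err)
	}
	return u
}

func main() {
	// Assumed addresses: the upstream Controller and the co-located user module.
	toController := httputil.NewSingleHostReverseProxy(mustParse("http://ftl-controller:8892"))
	toUserModule := httputil.NewSingleHostReverseProxy(mustParse("http://127.0.0.1:8894"))

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get(originHeader) == "user-module" {
			// Outbound call from the local user module: rewrite the marker so the
			// next runner treats it as inbound, then forward it upstream.
			r.Header.Set(originHeader, "runner")
			toController.ServeHTTP(w, r)
			return
		}
		// Everything else is inbound traffic destined for the local user module.
		toUserModule.ServeHTTP(w, r)
	})

	// Assumed runner proxy port; the user code would only be allowed to reach this.
	log.Fatal(http.ListenAndServe(":8893", nil))
}
```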
One other thing that occurred to me is that we'll need to split the "runner" into two containers: the ftl-runner itself, and the image that user code runs on (i.e. what is currently the ftl-runner image).
I feel like running VMs in place of containers directly would be a challenge in K8s. Some googling around seemed to indicate that it's doable, but it would definitely be a specialized deployment strategy, and I'd be concerned that this approach would get in the way of adoption of FTL.

If we aren't doing this, a sidecar model (a la Envoy) makes sense to me. I guess this means we'd have to split the runner image into two, right? The "edge" runner sidecar and the main "workload" runner: the former is responsible for interfacing with the cluster, while the latter is responsible for launching the user code and would probably also act as a bridge to the "edge" runner. The main reason to separate these is that we want isolation of user code from FTL cluster-internal components; we wouldn't want user code to be able to assume the capabilities of an FTL component.

Have we considered what a future FTL deployment in a common production-grade K8s cluster might look like? If the state of the art involves Istio or other additional components, should we design for them now, or at least make sure that we can interoperate with them?
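For illustration, a rough sketch of what the split pod could look like, expressed with the Kubernetes Go API types. The image names, ports, and security settings are assumptions, not FTL's actual manifests:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// runnerPod sketches the two-container split: an "edge" ftl-runner sidecar and a
// locked-down workload container for user code.
func runnerPod() *corev1.Pod {
	noToken := false
	nonRoot := true
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "ftl-runner",
			Labels: map[string]string{"app": "ftl-runner"},
		},
		Spec: corev1.PodSpec{
			// The user workload must not receive cluster credentials.
			AutomountServiceAccountToken: &noToken,
			Containers: []corev1.Container{
				{
					// "Edge" sidecar: interfaces with the Controller and enforces policy.
					Name:  "ftl-runner",
					Image: "ftl-runner:latest", // assumed image name
					Ports: []corev1.ContainerPort{{ContainerPort: 8893}},
				},
				{
					// Workload container: runs only the user code, reaching nothing
					// but the sidecar over localhost.
					Name:  "user-module",
					Image: "ftl-module-example:latest", // assumed image name
					SecurityContext: &corev1.SecurityContext{
						RunAsNonRoot: &nonRoot,
					},
				},
			},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(runnerPod(), "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}
```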
All the approaches out there are fairly immature, and definitely specialized. I evaluated this a couple of years ago and ended up needing to write my own VM provisioner to support multi-platform builds on Kube rather than using existing systems. Things may have gotten better since then, but it is still not something that we could require.
This is doable. We want to avoid image building by the user, so it is slightly tricky, but it is doable. If we are going to require an OCI registry for artifacts anyway, one possibility is to have the controller generate the image (not via a docker build, but directly through the OCI registry). Another possibility is to have a shared volume between the sidecar and the runner container, and transfer the user code over the shared volume.
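A sketch of the "generate the image directly through the OCI registry" option, assuming github.com/google/go-containerregistry; the image references and the deployment tarball path are placeholders:

```go
package main

import (
	"log"

	"github.com/google/go-containerregistry/pkg/crane"
)

func main() {
	// Pull the base runtime image (what is currently the ftl-runner image).
	base, err := crane.Pull("registry.example.com/ftl-runtime-base:latest")
	if err != nil {
		log.Fatal(err)
	}

	// Append the user's deployment artefacts as a new layer, with no docker
	// build: the tarball is simply laid down on top of the base filesystem.
	img, err := crane.Append(base, "deployment.tar")
	if err != nil {
		log.Fatal(err)
	}

	// Push the resulting image straight to the registry for the pod to pull.
	if err := crane.Push(img, "registry.example.com/deployments/my-module:v1"); err != nil {
		log.Fatal(err)
	}
}
```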
We will definitely need Istio, so we should be thinking about this.
Looping in @tlongwell-block for his thoughts.
Runners currently execute user code directly on the same host that they run on. In k8s this is not terrible, but ideally FTL would execute user code inside a VM to completely isolate it. This would also allow us to restrict inbound/outbound network, and so on.
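As one example of the kind of network restriction mentioned above, a NetworkPolicy could confine runner pods to talking only to the Controller. This is a sketch only: the labels and port are assumptions, and a NetworkPolicy operates at the pod level, so it does not isolate containers within the same pod from each other.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// runnerNetworkPolicy restricts runner pods to exchanging traffic with the Controller only.
func runnerNetworkPolicy() *networkingv1.NetworkPolicy {
	tcp := corev1.ProtocolTCP
	controllerPort := intstr.FromInt(8892) // assumed Controller port
	controller := metav1.LabelSelector{MatchLabels: map[string]string{"app": "ftl-controller"}}
	return &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "ftl-runner-lockdown"},
		Spec: networkingv1.NetworkPolicySpec{
			// Apply to runner pods only.
			PodSelector: metav1.LabelSelector{MatchLabels: map[string]string{"app": "ftl-runner"}},
			PolicyTypes: []networkingv1.PolicyType{
				networkingv1.PolicyTypeIngress,
				networkingv1.PolicyTypeEgress,
			},
			// Only the Controller may connect in...
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{PodSelector: &controller}},
			}},
			// ...and only outbound connections to the Controller are allowed.
			Egress: []networkingv1.NetworkPolicyEgressRule{{
				To:    []networkingv1.NetworkPolicyPeer{{PodSelector: &controller}},
				Ports: []networkingv1.NetworkPolicyPort{{Protocol: &tcp, Port: &controllerPort}},
			}},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(runnerNetworkPolicy(), "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}
```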
Useful references: