This section shows system container security features by way of example.
The User Guide describes this topic deeper detail.
We show the following system container isolation features:
-
Linux user namespace
-
Linux capabilities
-
Immutable Mountpoints
-
Exclusive Userns ID Mappings (Sysbox-EE only)
First let's deploy a system container:
$ docker run --runtime=sysbox-runc --rm -it --hostname syscont debian:latest
root@syscont:/#
Nestybox system containers always use the Linux user-namespace (in fact they use all Linux namespaces) to provide strong isolation between the system container and the host.
Let's verify this by comparing the namespaces between a process inside the system container and a process on the host.
From the system container:
root@syscont:/# ls -l /proc/self/ns/
total 0
lrwxrwxrwx 1 root root 0 Oct 23 22:06 cgroup -> 'cgroup:[4026532563]'
lrwxrwxrwx 1 root root 0 Oct 23 22:06 ipc -> 'ipc:[4026532506]'
lrwxrwxrwx 1 root root 0 Oct 23 22:06 mnt -> 'mnt:[4026532504]'
lrwxrwxrwx 1 root root 0 Oct 23 22:06 net -> 'net:[4026532509]'
lrwxrwxrwx 1 root root 0 Oct 23 22:06 pid -> 'pid:[4026532507]'
lrwxrwxrwx 1 root root 0 Oct 23 22:06 pid_for_children -> 'pid:[4026532507]'
lrwxrwxrwx 1 root root 0 Oct 23 22:06 user -> 'user:[4026532503]'
lrwxrwxrwx 1 root root 0 Oct 23 22:06 uts -> 'uts:[4026532505]'
Now from the host:
ls -l /proc/self/ns
total 0
lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 net -> 'net:[4026531992]'
lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 user -> 'user:[4026531837]'
lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 uts -> 'uts:[4026531838]'
You can see the system container uses dedicated namespaces, including the user and cgroup namespaces. It has no namespaces in common with the host, which gives it stronger isolation compared to regular Docker containers.
Because system containers use the Linux user-namespace, it means that the root user in the container only has privileges on resources assigned to the container, but none otherwise.
You can verify this by typing the following inside the system container:
root@syscont:/# cat /proc/self/uid_map
0 268994208 65536
root@syscont:/# cat /proc/self/gid_map
0 268994208 65536
This means that user-IDs in the range [0:65535] inside the container are mapped to a range of unprivileged user-IDs on the host (chosen by Sysbox). In this example they map to the host user-ID range [268994208 : 268994208+65535].
Sysbox assigns all containers the same user-ID mappings. Sysbox Enterprise Edition (Sysbox-EE) improves on this as described in the next section.
Now, let's check the capabilities of a process created by the root user inside the system container:
root@syscont:/# grep Cap /proc/self/status
CapInh: 0000003fffffffff
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000003fffffffff
As shown, a root process inside the system container has all capabilities enabled, but those capabilities only take effect with respect to host resources assigned to the system container.
Contrast this to a regular Docker container. A root process in such a
container has a reduced set of capabilities (typically CapEff: 00000000a80425fb
) and does not use the Linux user namespace. This has
two drawbacks:
-
The container's root process is limited in what it can do within the container.
-
The container's root process has those same capabilities on the host, which poses a higher security risk should the process escape the container's chroot jail.
System containers overcome both of these drawbacks.
Filesystem mounts that make up the system container's rootfs jail (i.e., mounts setup at container creation time into the container's root filesystem) are considered special, meaning that Sysbox places restrictions on the operations that may be done on them from within the container.
This ensures processes inside the container can't modify those mounts in a way
that would weaken or break container isolation, even though these processes may
be running as root with full capabilities inside the container and thus have
access to the mount
and umount
syscalls.
We call these "immutable mounts". The User Guide has a full description of this feature. Here we just show an example of what this feature does.
- Deploy a system container with a read-only mount of host volume
myvol
:
$ docker run --runtime=sysbox-runc -it --rm --hostname=syscont -v myvol:/mnt/myvol:ro ubuntu
root@syscont:/#
- List the container's initial mounts; these are all considered immutable because they are setup at container creation time by Sysbox.
root@syscont:/# findmnt
TARGET SOURCE FSTYPE OPTIONS
/ . shiftfs rw,relatime
|-/sys sysfs sysfs rw,nosuid,nodev,noexec,relatime
| |-/sys/firmware tmpfs tmpfs ro,relatime,uid=165536,gid=165536
| |-/sys/fs/cgroup tmpfs tmpfs rw,nosuid,nodev,noexec,relatime,mode=755,uid=165536,gid=165536
| | |-/sys/fs/cgroup/systemd systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,name=systemd
| | |-/sys/fs/cgroup/memory cgroup cgroup rw,nosuid,nodev,noexec,relatime,memory
| | |-/sys/fs/cgroup/cpuset cgroup cgroup rw,nosuid,nodev,noexec,relatime,cpuset
| | |-/sys/fs/cgroup/blkio cgroup cgroup rw,nosuid,nodev,noexec,relatime,blkio
| | |-/sys/fs/cgroup/net_cls,net_prio cgroup cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio
| | |-/sys/fs/cgroup/perf_event cgroup cgroup rw,nosuid,nodev,noexec,relatime,perf_event
| | |-/sys/fs/cgroup/hugetlb cgroup cgroup rw,nosuid,nodev,noexec,relatime,hugetlb
| | |-/sys/fs/cgroup/cpu,cpuacct cgroup cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct
| | |-/sys/fs/cgroup/freezer cgroup cgroup rw,nosuid,nodev,noexec,relatime,freezer
| | |-/sys/fs/cgroup/devices cgroup cgroup rw,nosuid,nodev,noexec,relatime,devices
| | |-/sys/fs/cgroup/pids cgroup cgroup rw,nosuid,nodev,noexec,relatime,pids
| | `-/sys/fs/cgroup/rdma cgroup cgroup rw,nosuid,nodev,noexec,relatime,rdma
| |-/sys/kernel/config tmpfs tmpfs rw,nosuid,nodev,noexec,relatime,size=1024k,uid=165536,gid=165536
| |-/sys/kernel/debug tmpfs tmpfs rw,nosuid,nodev,noexec,relatime,size=1024k,uid=165536,gid=165536
| |-/sys/kernel/tracing tmpfs tmpfs rw,nosuid,nodev,noexec,relatime,size=1024k,uid=165536,gid=165536
| `-/sys/module/nf_conntrack/parameters/hashsize sysboxfs[/sys/module/nf_conntrack/parameters/hashsize] fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
|-/proc proc proc rw,nosuid,nodev,noexec,relatime
| |-/proc/bus proc[/bus] proc ro,relatime
| |-/proc/fs proc[/fs] proc ro,relatime
| |-/proc/irq proc[/irq] proc ro,relatime
| |-/proc/sysrq-trigger proc[/sysrq-trigger] proc ro,relatime
| |-/proc/asound tmpfs tmpfs ro,relatime,uid=165536,gid=165536
| |-/proc/acpi tmpfs tmpfs ro,relatime,uid=165536,gid=165536
| |-/proc/keys udev[/null] devtmpfs rw,nosuid,noexec,relatime,size=1971180k,nr_inodes=492795,mode=755
| |-/proc/timer_list udev[/null] devtmpfs rw,nosuid,noexec,relatime,size=1971180k,nr_inodes=492795,mode=755
| |-/proc/sched_debug udev[/null] devtmpfs rw,nosuid,noexec,relatime,size=1971180k,nr_inodes=492795,mode=755
| |-/proc/scsi tmpfs tmpfs ro,relatime,uid=165536,gid=165536
| |-/proc/swaps sysboxfs[/proc/swaps] fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
| |-/proc/sys sysboxfs[/proc/sys] fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
| `-/proc/uptime sysboxfs[/proc/uptime] fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
|-/dev tmpfs tmpfs rw,nosuid,size=65536k,mode=755,uid=165536,gid=165536
| |-/dev/console devpts[/0] devpts rw,nosuid,noexec,relatime,gid=165541,mode=620,ptmxmode=666
| |-/dev/mqueue mqueue mqueue rw,nosuid,nodev,noexec,relatime
| |-/dev/pts devpts devpts rw,nosuid,noexec,relatime,gid=165541,mode=620,ptmxmode=666
| |-/dev/shm shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k,uid=165536,gid=165536
| |-/dev/kmsg udev[/null] devtmpfs rw,nosuid,noexec,relatime,size=1971180k,nr_inodes=492795,mode=755
| |-/dev/null udev[/null] devtmpfs rw,nosuid,noexec,relatime,size=1971180k,nr_inodes=492795,mode=755
| |-/dev/random udev[/random] devtmpfs rw,nosuid,noexec,relatime,size=1971180k,nr_inodes=492795,mode=755
| |-/dev/full udev[/full] devtmpfs rw,nosuid,noexec,relatime,size=1971180k,nr_inodes=492795,mode=755
| |-/dev/tty udev[/tty] devtmpfs rw,nosuid,noexec,relatime,size=1971180k,nr_inodes=492795,mode=755
| |-/dev/zero udev[/zero] devtmpfs rw,nosuid,noexec,relatime,size=1971180k,nr_inodes=492795,mode=755
| `-/dev/urandom udev[/urandom] devtmpfs rw,nosuid,noexec,relatime,size=1971180k,nr_inodes=492795,mode=755
|-/mnt/myvol /var/lib/docker/volumes/myvol/_data shiftfs ro,relatime
|-/etc/resolv.conf /var/lib/docker/containers/080fb5dbe347a947accf7ba27a545ce11d937f02f02ec0059535cfc065d04ea0[/resolv.conf] shiftfs rw,relatime
|-/etc/hostname /var/lib/docker/containers/080fb5dbe347a947accf7ba27a545ce11d937f02f02ec0059535cfc065d04ea0[/hostname] shiftfs rw,relatime
|-/etc/hosts /var/lib/docker/containers/080fb5dbe347a947accf7ba27a545ce11d937f02f02ec0059535cfc065d04ea0[/hosts] shiftfs rw,relatime
|-/var/lib/docker /dev/sda2[/var/lib/sysbox/docker/080fb5dbe347a947accf7ba27a545ce11d937f02f02ec0059535cfc065d04ea0] ext4 rw,relatime
|-/var/lib/kubelet /dev/sda2[/var/lib/sysbox/kubelet/080fb5dbe347a947accf7ba27a545ce11d937f02f02ec0059535cfc065d04ea0] ext4 rw,relatime
|-/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs /dev/sda2[/var/lib/sysbox/containerd/080fb5dbe347a947accf7ba27a545ce11d937f02f02ec0059535cfc065d04ea0] ext4 rw,relatime
|-/usr/src/linux-headers-5.4.0-65 /usr/src/linux-headers-5.4.0-65 shiftfs ro,relatime
|-/usr/src/linux-headers-5.4.0-65-generic /usr/src/linux-headers-5.4.0-65-generic shiftfs ro,relatime
`-/usr/lib/modules/5.4.0-65-generic /lib/modules/5.4.0-65-generic shiftfs ro,relatime
- Let's try to remount one of those immutable read-only mounts to read-write:
root@syscont:/# mount -o remount,rw,bind /mnt/myvol
mount: /mnt/myvol: permission denied.
This operation fails, even though the process doing the remount is root inside the container with all capabilities enabled:
root@syscont:/# cat /proc/self/status | egrep "Uid|Gid|CapEff"
Uid: 0 0 0 0
Gid: 0 0 0 0
CapEff: 0000003fffffffff
The reason the operation fails is that Sysbox detects the remount operation occurs over an immutable read-only mount. Thus, it can't be remounted read-write from inside the container (as doing so would essentially break container isolation).
Under the covers, Sysbox does this by trapping the mount system call and vetting the action.
- On the other hand, immutable read-write mounts can be remounted read-only. This is allowed because such a remount places a stronger restriction on the immutable mount, rather than a weaker one:
root@syscont:/# mount -o remount,ro,bind /etc/resolv.conf
root@syscont:/# findmnt | grep resolv.conf
|-/etc/resolv.conf /var/lib/docker/containers/080fb5dbe347a947accf7ba27a545ce11d937f02f02ec0059535cfc065d04ea0[/resolv.conf] shiftfs ro,relatime
- The mount restrictions also apply to bind-mounts sourced from initial mounts inside the container. For example:
root@syscont:/# mkdir /root/headers
root@syscont:/# mount --bind /usr/src/linux-headers-5.4.0-65 /root/headers
root@syscont:/# mount -o remount,bind,rw /root/headers
mount: /root/headers: permission denied.
This fails because the source of the bind mount (/usr/src/linux-headers-5.4.0-65
)
is an initial read-only mount of the container. Since the bind mount simply
makes that initial read-only mount show up somewhere else in the container's
filesystem, the bind mount is also considered immutable.
- Finally, the mount restrictions do not apply to new mounts created inside the container. Those are not initial mounts, so no restrictions are placed on them:
root@syscont:/# mkdir /root/tmp
root@syscont:/# mount -t tmpfs -o ro,size=10M tmpfs /root/tmp
root@syscont:/# mount -o remount,rw,bind /root/tmp
This works without problem because the mount at /root/tmp
is a new mount
(i.e., it was not setup by Sysbox at container creation time). Thus it's
not considered immutable.
In the prior example with only looked at remount operations. But how about unmounts?
By default, unmounts of initial mounts are not restricted in the same way that remount operations are. The reason for this is that system container images often have a process manager inside, and some process managers (in particular systemd) unmount all mountpoints inside the container during container stop. If Sysbox where to restrict these unmounts, the process manager will report errors during container stop such as:
[FAILED] Failed unmounting /etc/hostname.
[FAILED] Failed unmounting /etc/hosts.
[FAILED] Failed unmounting /etc/resolv.conf.
[FAILED] Failed unmounting /usr/lib/modules/5.4.0-62-generic.
[FAILED] Failed unmounting /usr/src/linux-headers-5.4.0-62.
[FAILED] Failed unmounting /usr/src/linux-headers-5.4.0-62-generic.
Note that allowing unmounts of immutable mounts is typically not a security concern, because the unmount normally exposes the underlying contents of the system container's image, and this image will likely not have sensitive data that is masked by the mounts. For example:
root@syscont:/# grep nameserver /etc/resolv.conf
nameserver 75.75.75.75
nameserver 75.75.76.76
root@syscont:/# umount /etc/resolv.conf
root@syscont:/# grep nameserver /etc/resolv.conf
(empty)
Having said this, this behavior can be changed by setting the sysbox-fs config
option allow-immutable-unmounts=false
. When this option is set, Sysbox does
restrict unmounts on all immutable mounts:
root@syscont:/# umount /etc/resolv.conf
umount: /etc/resolv.conf: must be superuser to unmount.
Of course, mounts that are not immutable are not affected by this restriction:
root@syscont:/# mkdir /root/tmp
root@syscont:/# mount -t tmpfs -o ro,size=10M tmpfs /root/tmp
root@syscont:/# umount /root/tmp
The user-guide has more info on this.
Sysbox-EE improves on Sysbox-CE by automatically assigning each system container exclusive user-ID and group-ID mappings. This further isolates system containers from the host and from each other.
For example, if you launch a couple of containers with Docker and Sysbox-EE, you'll see they are both assigned exclusive user-ID and group-ID mappings.
In the first container:
root@syscont:/# cat /proc/self/uid_map
0 268994208 65536
root@syscont:/# cat /proc/self/gid_map
0 268994208 65536
In the second container:
$ docker run --runtime=sysbox-runc --rm -it --hostname syscont2 debian:latest
root@syscont2:/# cat /proc/self/uid_map
0 269059744 65536
root@syscont2:/# cat /proc/self/gid_map
0 269059744 65536
This capability of Sysbox-EE provides enhanced cross-container isolation: if a process in one container somehow escapes the container, it will find itself with no permissions to access any data in other containers or on the host.
More info on this can be found in the Sysbox User Guide.