SELinux protects the file system and the host from attacks that originate inside a container.
The initial SELinux policy for containers was written for a tool called virt-sandbox, which used libvirt, specifically libvirt-lxc, to launch containers. This first type was called svirt_lxc_t and it is not allowed to have network access.
The successor of svirt_lxc_t is called svirt_lxc_net_t and allows full network access.
The type for content that the svirt_lxc types could manage is named svirt_sandbox_file_t.
This SELinux policy was later adopted by Docker, and the aliases container_t and container_file_t were created.
The container policy is defined in the container-selinux package.
By default, containers run with the SELinux type container_t, regardless of which container engine launched them (e.g. podman, cri-o, docker, buildah, moby).
SELinux only allows a container_t process to read/write/execute files labeled container_file_t.
The Docker daemon and Podman usually run as container_runtime_t, and the default label for content in /var/lib/docker and /var/lib/containers is container_var_lib_t.
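The labels described above can be inspected programmatically. The following is a minimal Python sketch, not part of CRIU or container-selinux: it splits a context string into its user, role, type and level fields and reads a file's label from the security.selinux extended attribute. The names SELinuxContext, parse_context, file_context and is_container_content are hypothetical helpers introduced for illustration, and the example context strings are illustrative only.

```python
import os
from typing import NamedTuple


class SELinuxContext(NamedTuple):
    user: str
    role: str
    type: str
    level: str  # empty when the context carries no MLS/MCS level


def parse_context(raw: str) -> SELinuxContext:
    """Split 'user:role:type[:level]' into its fields."""
    # Contexts returned by the kernel may end with a NUL byte or a newline.
    cleaned = raw.rstrip("\x00\n")
    # The level itself may contain ':' (e.g. 's0:c1,c2'), so split at most 3 times.
    parts = cleaned.split(":", 3)
    if len(parts) < 3:
        raise ValueError(f"not an SELinux context: {raw!r}")
    level = parts[3] if len(parts) == 4 else ""
    return SELinuxContext(parts[0], parts[1], parts[2], level)


def file_context(path: str) -> SELinuxContext:
    """Read a file's label from its security.selinux xattr (Linux only)."""
    return parse_context(os.getxattr(path, "security.selinux").decode())


def is_container_content(path: str) -> bool:
    """True if the file carries the type that container_t may access."""
    return file_context(path).type == "container_file_t"
```

For example, parse_context("system_u:system_r:container_t:s0:c1,c2").type yields container_t. The file_context helper only works on a Linux file system that carries SELinux labels.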
Using the correct SELinux label for the parasite socket
On a system with SELinux enabled, the socket used for communication between the parasite daemon and the main CRIU process needs to be correctly labeled.
In the case of Podman, CRIU is started from runc and runs as container_runtime_t.
The parasite code runs with the same context as the container process (container_t).
CRIU interacts with the parasite code via a Unix socket, but allowing a container process to connect via a socket to the outside of the container is not desirable. Thus, CRIU first obtains the context of the root container process and tells SELinux to label the created socket with the same label as the root container process.
For this to work, the correct SELinux policies must be installed. On Fedora-based systems they are part of the container-selinux package.
Note that the current implementation assumes that all processes to be checkpointed by CRIU are labeled with the same SELinux context, which is the default behaviour for most container engines.
If a child process has a different label, additional SELinux policies might be required.
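The socket labeling step described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not CRIU's actual code (CRIU is written in C and uses libselinux). The sketch talks to the kernel's /proc attribute files directly: it reads the context of the root container process from attr/current and writes it to attr/sockcreate before creating the socket. The function names and the proc_root parameter are hypothetical; proc_root exists only so the sketch can be exercised against a fake directory tree.

```python
import socket


def read_process_context(pid, proc_root="/proc"):
    """Return the SELinux context of `pid` as reported by the kernel."""
    with open(f"{proc_root}/{pid}/attr/current", "rb") as attr:
        # The kernel may terminate the context with a NUL byte or newline.
        return attr.read().rstrip(b"\x00\n").decode()


def set_socket_create_context(context, proc_root="/proc"):
    """Ask SELinux to label sockets this thread creates next with `context`."""
    # buffering=0 hands the whole context to the kernel in a single write().
    with open(f"{proc_root}/thread-self/attr/sockcreate", "wb", buffering=0) as attr:
        attr.write(context.encode() + b"\x00")


def create_parasite_socket(root_pid, proc_root="/proc"):
    """Create a Unix socket labeled like the container's root process."""
    context = read_process_context(root_pid, proc_root)
    set_socket_create_context(context, proc_root)
    # The socket is created after the label is set, so it inherits that label.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    return sock, context
```

On a real host this needs SELinux enabled and a policy that permits the calling process to set its socket-creation context. A complete implementation would also reset the socket-creation context once the socket exists; that step is omitted here.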
Checkpoint and restore any SELinux process label
For a successful container checkpoint and restore on an SELinux-enabled host, the restored container must have the same process context as before checkpointing.
During dump, CRIU stores the label of each process so that it can be restored. For processes started from the command line, which usually run as unconfined_t, this just works. For containers, an additional policy is needed, which is provided by the latest container-selinux package. This policy allows CRIU (when running as container_runtime_t) to transition the restored process to container_t.
Restoring a process that is running under systemd's control (unconfined_service_t) without additional policies is likely to fail because CRIU will not be allowed to change the context of the restored process.
For every checkpoint/restore use case on SELinux-enabled systems other than container processes and command-line/shell processes, a dyntransition permission must be granted between the old and new security contexts.
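To make the dyntransition requirement concrete, the following Python sketch derives the allow rule that would grant it for a given pair of contexts. It assumes that the old context is the one CRIU runs in during restore and the new context is the one recorded at dump time. The helper names context_type and dyntransition_rule are hypothetical, and a working policy module may need further rules beyond this single line.

```python
def context_type(context: str) -> str:
    """Extract the type field from 'user:role:type[:level]'."""
    parts = context.split(":")
    if len(parts) < 3:
        raise ValueError(f"not an SELinux context: {context!r}")
    return parts[2]


def dyntransition_rule(old_context: str, new_context: str) -> str:
    """Build the policy rule that lets a process switch from old to new type."""
    old_type = context_type(old_context)
    new_type = context_type(new_context)
    return f"allow {old_type} {new_type}:process dyntransition;"


# Container case: CRIU runs as container_runtime_t, restores container_t.
print(dyntransition_rule(
    "system_u:system_r:container_runtime_t:s0",
    "system_u:system_r:container_t:s0:c1,c2",
))
```

For the container case this prints the rule "allow container_runtime_t container_t:process dyntransition;". The same helper applied to CRIU's own context and unconfined_service_t shows which rule is missing when restoring a systemd service.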
Restoring a multi-threaded process with SELinux
SELinux does not always support changing the process context of a multi-threaded process. The context change of a running multi-threaded process is allowed only if the new security context is bounded by the old security context.
To be able to restore a process without the need to have the new security context bounded by the old security context, CRIU sets the SELinux process context before creating the threads. Thus, all threads are created with the process context of the main process.
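The ordering described above can be sketched in Python. This is only an illustration of the sequence, not CRIU's restore code: the context is written to attr/current while the process is still single-threaded, and the threads are created afterwards. The function names and the proc_root parameter are hypothetical; proc_root allows the sketch to run against a fake directory instead of the real /proc.

```python
import threading


def set_process_context(context, proc_root="/proc"):
    """Switch the calling thread's SELinux context by writing attr/current."""
    # buffering=0 hands the whole context to the kernel in a single write().
    with open(f"{proc_root}/thread-self/attr/current", "wb", buffering=0) as attr:
        attr.write(context.encode() + b"\x00")


def restore_threads(context, thread_count, proc_root="/proc"):
    """Set the context first, then create the threads of the restored task."""
    # Step 1: still single-threaded, so no bounded-context check is involved.
    set_process_context(context, proc_root)

    seen = []
    lock = threading.Lock()

    def worker():
        # Each thread reports the context it was created with.
        with open(f"{proc_root}/thread-self/attr/current", "rb") as attr:
            current = attr.read().rstrip(b"\x00\n").decode()
        with lock:
            seen.append(current)

    # Step 2: threads created now start out with the main process context.
    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return seen
```

On a real SELinux-enabled host every entry of the returned list should equal the context set in step 1, provided the policy grants the required dyntransition permission.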