Changes

4,382 bytes added ,  15:03, 27 January 2022
There is also the case where master_id is non-zero but no corresponding parent sharing group exists. This means that a mount with the matching shared_id exists outside of the dumped container, i.e. external slavery is detected. In this case we just collect the sibling sharing groups in a list with an empty parent link. We also detect the source path from which the master_id would be inherited: either some mountpoint-external mount or the container root mount.
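The parent lookup and the external-slavery case could be sketched roughly as follows. This is a simplified model, not CRIU's code: struct sharing_group here is a hypothetical stand-in, resolve_parents() is an illustrative name, and an external_slave flag replaces the list of siblings with an empty parent link.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a sharing group record. */
struct sharing_group {
	int shared_id;                /* 0 means "not shared" */
	int master_id;                /* 0 means "not a slave" */
	struct sharing_group *parent; /* group whose shared_id equals our master_id */
	int external_slave;           /* set when the master lives outside the container */
};

/*
 * Link every group to its parent group and flag groups whose master
 * is not among the dumped groups. Returns the number of externally
 * slaved groups found.
 */
static int resolve_parents(struct sharing_group *sg, size_t n)
{
	int external = 0;

	assert(sg || !n);

	for (size_t i = 0; i < n; i++) {
		sg[i].parent = NULL;
		sg[i].external_slave = 0;

		/* No master_id: this is a root sharing group. */
		if (!sg[i].master_id)
			continue;

		/* Look for a dumped group that shares under our master_id. */
		for (size_t j = 0; j < n; j++) {
			if (j != i && sg[j].shared_id == sg[i].master_id) {
				sg[i].parent = &sg[j];
				break;
			}
		}

		/* Non-zero master_id without a parent: external slavery. */
		if (!sg[i].parent) {
			sg[i].external_slave = 1;
			external++;
		}
	}

	return external;
}
```

Calling resolve_parents() on an array of groups leaves root groups and external slaves with a NULL parent; only the external_slave flag tells the two apart, which mirrors the "empty parent link" described above.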
 
==== Actual restore of mounts ====
 
In the original mount engine, the actual restore of mounts starts with the prepare_mnt_ns() function; when mount-v2 is enabled, control passes from it to prepare_mnt_ns_v2() instead. The restore consists of several stages:

1) We pre-create a mount namespace for each restored mount namespace in pre_create_mount_namespaces(). These namespaces are almost empty: each has a tmpfs as its root, the root yard path created in it with another tmpfs mounted on it, and a "namespace" path, created in the corresponding subdirectory of the root yard mount, in which the tree of mounts is assembled. We also save an nsfs fd for each mount namespace so that we can re-enter it later.
 
2) In populate_mnt_ns_v2() we reuse the mnt_tree_for_each() walk over the mount tree from the original mount engine, so mounts are walked in tree order. In addition, can_mount_now_v2() temporarily skips mounts (together with their descendants) that depend on other mounts, and the walk is restarted for them later. Basically, can_mount_now_v2() skips mounts that should be restored as bindmounts but whose source is not ready yet; this is the case for bindmounts of root, external or plugin mounts, or for non-fsroot mounts.

3) During this walk over the mount forest, do_mount_one_v2() determines in detect_is_dir() whether the newly created mount is a directory mount or a file mount: we just open its mountpoint path relative to the parent's "plain" mountpoint and stat it. This is why it is important to use mnt_tree_for_each(), as it ensures that the parent is already "plain" mounted.
 
4) During the same walk, do_mount_one_v2() creates the "plain" mountpoint for the new mount: an empty file or an empty directory, based on the previous step.

5) During the same walk, do_mount_one_v2() actually creates the new mount. We either create a completely new mount or a device-external mount in do_new_mount_v2() if that is supported, or bind the container root mount in do_mount_root_v2() from the still visible host mount tree, or bind a mountpoint-external mount in do_bind_mount_v2(). Similarly, any mount whose superblock has already been created by another mount is simply bound in do_bind_mount_v2(). These functions act similarly to the ones in the original mount engine, but they are simpler because they do not need to care about inheriting sharing groups.
 
6) do_bind_mount_v2() is improved to do the bindmount via open_tree() + move_mount(), with flags that avoid traversing symlinks or autofs mounts.

7) We also cross-namespace bindmount the newly created mount into the restored mount namespace, onto the same "plain" mountpoint, in do_mount_in_right_mntns(). This way we have, from the start, a mount that will be visible after restore; this will be required in the future to restore bindmounted unix sockets on the right mount.
 
8) After the walk we do not plan to do any more bindmounts, so we set the unbindable flag on the mounts.

9) Next we assemble the mount trees in each restored mount namespace in assemble_mount_namespaces(), again reusing the tree-order walk with move_mount_to_tree(), so that mounts are moved into their proper places in the mount tree in tree order. We also open fds on each mountpoint: one, mp_fd_id, before the move and another, mnt_fd_id, after it, so that files on each mount can later be accessed from the final mntns via those fds.
 
10) Finally we restore sharing groups on the assembled mount forest in restore_mount_sharing_options(). It walks each root sharing group and its descendants with a DFS tree walk. It creates the sharing for the first mount in a sharing group and then sets the same sharing on all other mounts in the group.

Sharing creation for the first mount takes two steps:
 
a) If the mount has a master_id, we copy the shared_id either from the parent sharing group or from the external source, and then make the mount a slave, which converts that shared_id into the right master_id.
b) Next, if the mount has a shared_id, we just make it shared, which creates the right shared_id.
 
We need to use userns_call() for MOVE_MOUNT_SET_GROUP to have all the permissions needed for copying sharing (move_mount_set_group()). We also need to resolve external paths given by the user to their actual mountpoints; we do so with openat2(RESOLVE_NO_XDEV) in resolve_mountpoint(), which also only works from userns_call().

11) We remove the sources of deleted mounts (from the "service" mount namespace), making those mounts actually deleted. Moving deleted mounts is not allowed, so to keep things simple we do this as the last step.
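
The deferred walk from step 2 can be illustrated with a small standalone model. This is only a sketch under stated assumptions: struct mnt, can_mount_now() and mount_all() are hypothetical names rather than CRIU's actual structures or functions, the mounts are assumed to be stored in tree order (a parent always precedes its children), and "mounting" is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of a mount to restore. */
struct mnt {
	int parent;   /* index of the parent mount, -1 for the root */
	int bind_src; /* index of the mount we must bind from, -1 if none */
	bool mounted; /* set once the mount has been "restored" */
};

/* A mount is ready when its parent and its bind source are both restored. */
static bool can_mount_now(const struct mnt *m, int i)
{
	if (m[i].parent >= 0 && !m[m[i].parent].mounted)
		return false; /* descendant of a skipped mount */
	if (m[i].bind_src >= 0 && !m[m[i].bind_src].mounted)
		return false; /* bind source is not ready yet */
	return true;
}

/*
 * Walk the mounts in tree order, skip the ones that are not ready,
 * and restart the walk until everything is mounted. order[] receives
 * the indices in the order the mounts were restored. Returns 0 on
 * success, -1 if a pass makes no progress (unresolvable dependency).
 */
static int mount_all(struct mnt *m, int n, int *order)
{
	int done = 0;

	assert(m && order && n >= 0);

	while (done < n) {
		bool progress = false;

		for (int i = 0; i < n; i++) {
			if (m[i].mounted || !can_mount_now(m, i))
				continue;
			m[i].mounted = true;
			order[done++] = i;
			progress = true;
		}

		if (!progress)
			return -1;
	}

	return 0;
}
```

In this model a child of a skipped mount is skipped automatically, because its parent is not yet marked as mounted; it is picked up on a later pass, once the parent's bind source has been restored.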