Mount-v2

From CRIU
Revision as of 11:33, 27 January 2022 by Ptikhomirov (talk | contribs)

Mount-v2 CRIU algorithm

Introduction

Now that the MOVE_MOUNT_SET_GROUP feature has been merged into mainline Linux v5.15 (torvalds/linux@9ffb14e), we can use it to restore the sharing groups of mounts without having to care about inheriting those groups when the mounts are created: we first construct the mount trees out of private mounts and set the sharing groups at a later stage.
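As a rough illustration of the "set sharing groups at a later stage" step, here is a minimal sketch of a wrapper around the move_mount() system call with the MOVE_MOUNT_SET_GROUP flag. The helper name copy_sharing_group() is hypothetical and is not taken from the CRIU sources. The fallback definitions are only used when the libc or kernel headers are too old to provide them. The call needs a v5.15+ kernel and CAP_SYS_ADMIN, and the target mount is expected to be private.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>       /* AT_FDCWD */
#include <sys/syscall.h> /* SYS_move_mount, if the libc headers know it */
#include <unistd.h>      /* syscall() */

/* Fallbacks for old headers; 429 is the generic syscall number (not valid on alpha). */
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif
#ifndef MOVE_MOUNT_SET_GROUP
#define MOVE_MOUNT_SET_GROUP 0x00000100 /* value from the kernel's uapi linux/mount.h */
#endif

/*
 * Hypothetical helper: make the mount at path "to" join the sharing group
 * of the mount at path "from". Nothing is moved; only the propagation
 * settings of "to" change.
 *
 * Returns 0 on success, -1 with errno set on failure.
 */
static int copy_sharing_group(const char *from, const char *to)
{
	assert(from && to);
	return (int)syscall(SYS_move_mount, AT_FDCWD, from, AT_FDCWD, to,
			    MOVE_MOUNT_SET_GROUP);
}
```

With such a helper, a restorer could first create every mount as private and afterwards call something like copy_sharing_group("/yard/mnt-999", "/yard/mnt-1000") for each mount that has to join an already restored group; the paths here are purely illustrative.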

Restoring propagation correctly with the conservative approach, where mounts are created and propagation groups are inherited at the same time, looks like a mission-impossible task for us because of many problems:

  • CRIU knows nothing about the initial history or order of mount tree creation;
  • Propagation can create tons of mounts;
  • Propagation may change parent mounts of an existing mount tree;
  • "Mount trap" - propagation may cover the initial mount;
  • "Non-uniform" propagation - there are various tricks with mount order and temporary child "lock" mounts that produce mount trees which can't be restored without the same tricks;
  • Creation of "cross-namespace" sharing groups needs to be ordered correctly with respect to mount namespace creation;
  • Sharing groups vs mount tree order inversion can be very complex to restore and may require multiple auxiliary mounts (see the example below).

See my talks about it on Linux Plumbers Conference:

And here is an example of order inversion where multiple temporary mounts are needed to achieve the result:

Mounts-inverse-order-example.gif

Mount-v2 description

The new mount-v2 algorithm is deeply integrated into the original one, so mounts are dumped in exactly the same way for the original mount engine and the new one. For this reason the mount-v2 series contains preparatory steps related to bindmount detection, external mount detection and helper mount handling, which make the original mount code more robust and easier to reuse in mount-v2.

Plain mountpoints

One of the main differences between mount-v2 and the original engine is that mounts are initially created "plain". For instance, suppose we have MOUNT with mnt_id=1000 and ns_mountpoint="/mount/point/path", whose parent mount PARENT has mnt_id=999 and ns_mountpoint="/mount/point". The original mount engine would mount MOUNT directly into the mount tree at <criu_root_yard>/<mntns>/mount/point/path and PARENT at <criu_root_yard>/<mntns>/mount/point, thus restoring the parent-child relationship between them from the start. With mount-v2, MOUNT is first mounted at <criu_root_yard>/mnt-1000 and PARENT at <criu_root_yard>/mnt-999, so the first stage only creates mounts and a separate second stage assembles the tree. This allows useful ordering heuristics: on the first stage we can create external mounts before their bindmounts, on the second stage we can create overmounts after the mounts they cover, and these two orderings do not conflict with each other.

But it is not actually that simple, because we do not want to rewrite all the code that used mountpoint paths in "tree" format, for instance the code restoring mount content or ghost and remap files. So in all places where it does not matter (where we do not access <criu_root_yard>/<mntns>/... paths) we switched from mount_info->mountpoint to mount_info->ns_mountpoint, and in all places that actually need "tree" format paths we replaced them with the service_mountpoint() helper, which returns "tree" paths for the original mount engine and "plain" paths for mount-v2. This way we can safely switch from one engine to the other.
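The difference between the two path formats can be sketched as follows. This is a simplified illustration and not the CRIU implementation: struct mnt_path_sketch, tree_path(), plain_path() and service_path_sketch() are hypothetical names, the yard and mntns components are passed in as plain strings, and mapping the namespace root "/" to the namespace directory itself is a choice made only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, stripped-down stand-in for the fields of mount_info used here. */
struct mnt_path_sketch {
	int mnt_id;
	const char *ns_mountpoint; /* absolute path inside the mount namespace */
};

/* "tree" format: <yard>/<mntns><ns_mountpoint>; 0 on success, -1 on truncation. */
static int tree_path(char *buf, size_t len, const char *yard, const char *mntns,
		     const struct mnt_path_sketch *m)
{
	/* The namespace root "/" maps to <yard>/<mntns> itself (sketch-only choice). */
	const char *rel = strcmp(m->ns_mountpoint, "/") ? m->ns_mountpoint : "";
	int n = snprintf(buf, len, "%s/%s%s", yard, mntns, rel);

	return (n >= 0 && (size_t)n < len) ? 0 : -1;
}

/* "plain" format: <yard>/mnt-<mnt_id>; 0 on success, -1 on truncation. */
static int plain_path(char *buf, size_t len, const char *yard,
		      const struct mnt_path_sketch *m)
{
	int n = snprintf(buf, len, "%s/mnt-%d", yard, m->mnt_id);

	return (n >= 0 && (size_t)n < len) ? 0 : -1;
}

/* Pick the format by engine, in the spirit of the helper described above. */
static int service_path_sketch(char *buf, size_t len, bool mount_v2,
			       const char *yard, const char *mntns,
			       const struct mnt_path_sketch *m)
{
	assert(buf && yard && mntns && m && m->ns_mountpoint[0] == '/');
	return mount_v2 ? plain_path(buf, len, yard, m)
			: tree_path(buf, len, yard, mntns, m);
}
```

For MOUNT from the example (mnt_id=1000, ns_mountpoint="/mount/point/path") with a yard of "/yard" and a namespace component of "ns", the sketch produces "/yard/ns/mount/point/path" for the original engine and "/yard/mnt-1000" for mount-v2. Code that goes through a single helper of this kind does not need to know which engine is active.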

Resolving sharing groups

Right after the mounts are read from the images in read_mnt_ns_img(), when mount-v2 is enabled, there is an additional step that collects the sharing group information from the mounts and turns it into a forest of sharing groups. First, we walk over all mounts and create a sharing group for each unique shared_id + master_id pair, linking every mount to the sharing group with the same id pair. Second, we walk over all sharing groups that have a non-zero master_id, look up the corresponding parent sharing groups and connect them into a tree.

There is also the case where master_id is non-zero but there is no corresponding parent sharing group. This means that a mount with a matching shared_id exists outside of the dumped container: external slavery is detected. In this case we just collect the sibling sharing groups in a list with an empty parent link. We also detect the source path from which the master_id will be inherited, either from some mountpoint-external mount or from the root container mount.
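The two passes and the external-slavery check described in this section can be sketched roughly as below. This is a simplified illustration under several assumptions, not the CRIU code: all type and function names are hypothetical, groups live in a fixed-size array instead of linked lists, mounts with both ids equal to zero are treated as private and get no group, and an external master is only flagged rather than collected into a sibling list with a source path.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GROUPS 64

/* Hypothetical sharing group: one per unique (shared_id, master_id) pair. */
struct sg_sketch {
	int shared_id;
	int master_id;
	int nr_mounts;            /* number of mounts linked to this group */
	struct sg_sketch *parent; /* group whose shared_id equals our master_id */
	int external_master;      /* master_id != 0, but no parent was dumped */
};

/* Hypothetical, stripped-down mount description. */
struct mnt_sketch {
	int mnt_id;
	int shared_id;
	int master_id;
	struct sg_sketch *sg;
};

struct sg_forest {
	struct sg_sketch groups[MAX_GROUPS];
	int nr;
};

static struct sg_sketch *find_group(struct sg_forest *f, int shared_id, int master_id)
{
	for (int i = 0; i < f->nr; i++)
		if (f->groups[i].shared_id == shared_id &&
		    f->groups[i].master_id == master_id)
			return &f->groups[i];
	return NULL;
}

static struct sg_sketch *find_master(struct sg_forest *f, int master_id)
{
	for (int i = 0; i < f->nr; i++)
		if (f->groups[i].shared_id == master_id)
			return &f->groups[i];
	return NULL;
}

/* Returns 0 on success, -1 if there are more groups than MAX_GROUPS. */
static int resolve_sharing(struct sg_forest *f, struct mnt_sketch *mnts, int nr_mnts)
{
	assert(f && mnts);
	f->nr = 0;

	/* Pass 1: one group per unique id pair, every mount linked to its group. */
	for (int i = 0; i < nr_mnts; i++) {
		struct mnt_sketch *m = &mnts[i];
		struct sg_sketch *sg;

		m->sg = NULL;
		if (!m->shared_id && !m->master_id)
			continue; /* private mount (assumption of this sketch) */

		sg = find_group(f, m->shared_id, m->master_id);
		if (!sg) {
			if (f->nr == MAX_GROUPS)
				return -1;
			sg = &f->groups[f->nr++];
			*sg = (struct sg_sketch){ .shared_id = m->shared_id,
						  .master_id = m->master_id };
		}
		sg->nr_mounts++;
		m->sg = sg;
	}

	/* Pass 2: connect slave groups to their parents, flag external masters. */
	for (int i = 0; i < f->nr; i++) {
		struct sg_sketch *sg = &f->groups[i];

		if (!sg->master_id)
			continue;
		sg->parent = find_master(f, sg->master_id);
		sg->external_master = (sg->parent == NULL);
	}
	return 0;
}
```

For example, mounts with (shared_id, master_id) pairs (10, 0), (10, 0), (0, 10) and (20, 99) yield three groups: the first two mounts share one group, the third mount's group has that group as its parent, and the fourth mount's group has no parent and is flagged as having an external master, because no dumped mount has shared_id 99.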

to be continued...