Mount-v2


Mount-v2 CRIU algorithm

Introduction

Since the MOVE_MOUNT_SET_GROUP feature was merged into mainline Linux v5.15 (torvalds/linux@9ffb14e), we can use it to restore sharing groups of mounts without having to care about inheriting those groups when the mounts are created: we first construct the mount trees from private mounts and set the sharing groups at a later stage.

Restoring propagation correctly with the conservative approach, where mounts are created and propagation groups are inherited at the same time, looks like a mission-impossible task because of many problems:

  • CRIU knows nothing about the initial history or order of mount tree creation;
  • Propagation can create tons of mounts;
  • Propagation may change parent mounts of an existing mount tree;
  • "Mount trap": propagation may cover the initial mount;
  • "Non-uniform" propagation: there are different tricks with mount order and temporary children "lock" mounts, and they create mount trees which can't be restored without the same tricks;
  • "Cross-namespace" sharing group creation needs to be ordered correctly with respect to mount namespace creation;
  • Sharing groups vs mount tree order inversion can be very complex to restore and can require multiple auxiliary mounts (see example below).

See my talks about it at the Linux Plumbers Conference.

And here is an example of order inversion, where multiple temporary mounts are needed to achieve the result:

[Figure: Mounts-inverse-order-example.gif]

Mount-v2 description

The new mount-v2 algorithm is deeply integrated into the original one, so dumping of mounts is done in exactly the same way for the original mount engine and the new one. For that reason the mount-v2 series has preparatory steps related to bindmount detection, external mount detection and helper mount handling. They make the original mount code more robust and easier to reuse in mount-v2.

One of the main differences between mount-v2 and the original engine is that mounts are initially created "plain". For instance, take a mount MOUNT with mnt_id=1000 and ns_mountpoint="/mount/point/path" whose parent mount PARENT has mnt_id=999 and ns_mountpoint="/mount/point". The original mount engine would mount MOUNT directly in the mount tree at <criu_root_yard>/<mntns>/mount/point/path and PARENT at <criu_root_yard>/<mntns>/mount/point, thus restoring the parent-child relationship between them from the start. Mount-v2 instead first mounts MOUNT at <criu_root_yard>/mnt-1000 and PARENT at <criu_root_yard>/mnt-999, so the first stage only creates mounts and a separate second stage assembles the tree. This allows useful heuristics: on the second stage we can create overmounts after the mounts they overmount, and on the first stage we can create external mounts before their bindmounts, and these two orderings do not conflict with each other.
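The two path layouts can be illustrated with a small sketch. The helpers tree_path() and plain_path() are hypothetical and only follow the notation used above; the exact format strings in the CRIU sources may differ.

```c
#include <stdio.h>

/*
 * Hypothetical helper: "tree" layout used by the original engine,
 * <criu_root_yard>/<mntns>/<ns_mountpoint>.
 * ns_mountpoint is expected to start with '/'.
 */
static int tree_path(char *buf, size_t len, const char *yard,
		     const char *mntns, const char *ns_mountpoint)
{
	return snprintf(buf, len, "%s/%s%s", yard, mntns, ns_mountpoint);
}

/*
 * Hypothetical helper: "plain" layout used by mount-v2 on the first stage,
 * <criu_root_yard>/mnt-<mnt_id>. The path does not depend on the parent.
 */
static int plain_path(char *buf, size_t len, const char *yard, int mnt_id)
{
	return snprintf(buf, len, "%s/mnt-%d", yard, mnt_id);
}
```

Because a plain path depends only on mnt_id, MOUNT and PARENT can be created in any order on the first stage; the parent-child relationship only matters once the second stage moves them into the tree.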

But it is not that simple, because we do not want to rewrite all the code, for instance the code restoring mount content or ghost and remap files, which used mountpoint paths in the "tree" format. So in all places where it does not matter (where we do not access <criu_root_yard>/<mntns>/... paths) we switched from mount_info->mountpoint to mount_info->ns_mountpoint, and in all places where we actually need "tree" format paths we replaced them with the service_mountpoint() helper, which returns "tree" paths for the original mount engine and "plain" paths for mount-v2. This way we can safely switch from one engine to the other.
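The idea behind such a helper can be sketched as follows. This is an illustration only, not the CRIU implementation: the struct mount_info_sketch, its plain_mountpoint field, the function name service_mountpoint_sketch() and the use_mount_v2 flag are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for CRIU's struct mount_info. */
struct mount_info_sketch {
	const char *ns_mountpoint;    /* path as seen inside the mount namespace */
	const char *mountpoint;       /* "tree" path under <criu_root_yard>/<mntns> */
	const char *plain_mountpoint; /* "plain" path, <criu_root_yard>/mnt-<mnt_id> */
};

/*
 * Return the path at which the mount is reachable for service operations
 * (restoring content, ghost files and so on): the "plain" path when the
 * mount-v2 engine is in use, the "tree" path otherwise.
 */
static const char *service_mountpoint_sketch(const struct mount_info_sketch *mi,
					     bool use_mount_v2)
{
	return use_mount_v2 ? mi->plain_mountpoint : mi->mountpoint;
}
```

Callers that go through one accessor like this do not need to know which engine created the mount, which is what makes switching between the engines safe.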

to be continued...