Filesystems peculiarities

From CRIU
Revision as of 15:25, 21 August 2014 by Xemul (talk | contribs) (Added nfs problem)

"All filesystems are equal, but some filesystems are more equal than others."

This page describes how much filesystems differ from one another and how this affects the CRIU dump (and restore) process.

BTRFS

When we stat() a file, we can find out which device it resides on by checking the st_dev value. However, the kernel exposes the device number in several other places as well. In particular, it is shown in the /proc/$pid/[mounts|mountinfo] files and in the /proc/$pid/s?maps ones. Moreover, the sock-diag subsystem, recently added to the kernel, reveals the device (and inode) on which a unix socket is bound.
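To make the comparison concrete, here is a minimal sketch of reading the device number from two of the sources mentioned above. The helper names stat_dev and parse_maps_dev are made up for this illustration and are not CRIU functions. Note that /proc/$pid/maps prints the device as hexadecimal major:minor, while mountinfo prints it in decimal.

```python
import os


def stat_dev(path):
    """Return the (major, minor) pair that stat() reports for path."""
    st = os.stat(path)
    return os.major(st.st_dev), os.minor(st.st_dev)


def parse_maps_dev(line):
    """Return the (major, minor) pair from one /proc/$pid/maps line.

    A maps line looks like:
        address perms offset dev inode pathname
    where dev is written as hexadecimal major:minor.
    """
    dev_field = line.split()[3]
    major, minor = dev_field.split(":")
    return int(major, 16), int(minor, 16)
```

On most filesystems the two pairs agree for the same file. On btrfs they are expected to differ, which is the problem described next.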

The problem with btrfs is that it substitutes the real device number with a virtual one in the stat() system call. Once we get this value, we cannot compare it to a device number obtained from any other source: they will always differ (these virtual device numbers are unique).

In order to address this issue, CRIU performs path-to-device resolution in user-space by analysing the information obtained from the /proc/$pid/mountinfo files. The routine in question is mount.c:phys_stat_resolve_dev().
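The idea of path-to-device resolution can be sketched as follows. This is not the code of phys_stat_resolve_dev(); it is a simplified illustration that assumes the path is absolute and already normalized, and that the mount covering it is the one with the longest matching mount point. The names parse_mountinfo and resolve_dev are hypothetical.

```python
import re


def _unescape(field):
    # mountinfo writes space, tab, newline and backslash as \ooo octal escapes
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def parse_mountinfo(text):
    """Return a list of (mount_point, (major, minor), fs_type) tuples."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        try:
            # Optional fields end at a lone "-"; the fs type follows it.
            sep = fields.index("-", 6)
            major, minor = (int(x) for x in fields[2].split(":"))
            fs_type = fields[sep + 1]
        except (ValueError, IndexError):
            continue  # skip malformed lines
        mounts.append((_unescape(fields[4]), (major, minor), fs_type))
    return mounts


def resolve_dev(path, mounts):
    """Return (major, minor) of the mount that covers path, or None."""
    best = None
    for mount_point, dev, _fs_type in mounts:
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount on the same point shadow an earlier one
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, dev)
    return best[1] if best else None
```

For a live process the input would be the contents of /proc/$pid/mountinfo, for example resolve_dev(path, parse_mountinfo(open("/proc/self/mountinfo").read())).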

NFS

In Linux, files have an attribute called st_nlink -- the number of names (hard links) the file has. When a file is removed (which is done with the unlink system call), this counter is decremented and, if it hits zero, the file itself can be removed from disk. Not "is removed" but "can be removed", since a file can be held open while someone unlinks it. In that case the physical removal of the file is delayed until the file is closed.
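The behaviour described above can be observed with a short experiment. The sketch below assumes the temporary directory is on an ordinary local filesystem (not NFS); the function name is made up for this illustration.

```python
import os
import tempfile


def nlink_before_and_after_unlink(directory=None):
    """Create a file, unlink it while it is open, and report st_nlink."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        before = os.fstat(fd).st_nlink  # one name: the temp file itself
        os.unlink(path)                 # drop the only name
        after = os.fstat(fd).st_nlink   # the open descriptor keeps the file alive
        os.write(fd, b"still usable")   # data can still be written and read
        return before, after
    finally:
        os.close(fd)                    # now the file can really be removed
```

On a local filesystem the counter is expected to go from 1 to 0 while the descriptor stays usable.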

NFS does special handling when the nlink value is about to hit zero. The thing is, if the NFS client sent the last unlink request to the server, the latter would just go and delete the file physically, since it doesn't "know" that someone holds this file open (this information is owned by the client). Instead, the client marks the file as "to be deleted on close" and doesn't perform the last nlink decrement immediately. Only when the file is closed and the mentioned flag is seen is the last unlink sent to the server. One more thing: to prevent naming collisions (in simple words, open() by the old name shouldn't find this old file), NFS also renames the file, giving it a special name ".nfsXXX", where XXX is some unique identifier.

How does this affect CRIU? The article "How hard is it to open a file?" says that CRIU should be able to dump and restore files that are unlinked but still open. Briefly: for such a file CRIU cannot just save the file's path, because once the dumped tasks are killed the file would stop existing. So CRIU takes these files into the images. But on NFS there's no such thing as an "unlinked" file, since NFS prevents the nlink count from dropping to zero. To CRIU, all NFS files look alive.

To handle this, CRIU checks whether a file it dumps resides on NFS (simply by checking the f_type field returned by statfs). If so, CRIU then checks whether its name is of the ".nfsXXX" form. If both checks succeed, the file is treated as an "opened and unlinked" one.
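The two checks can be sketched like this. It is an illustration only, not CRIU's code: the function name is hypothetical, and because the Python standard library does not expose the statfs f_type field, the sketch assumes the caller has already obtained it by other means and passes it in. NFS_SUPER_MAGIC is the value defined in the kernel's linux/magic.h.

```python
import os

NFS_SUPER_MAGIC = 0x6969  # from linux/magic.h


def looks_opened_and_unlinked_on_nfs(path, f_type):
    """Apply both checks: the file is on NFS and carries a ".nfsXXX" name.

    f_type is assumed to be the statfs f_type value for the filesystem
    that holds path; obtaining it is left to the caller.
    """
    if f_type != NFS_SUPER_MAGIC:
        return False
    name = os.path.basename(path)
    # Require something after the ".nfs" prefix (the unique identifier).
    return name.startswith(".nfs") and len(name) > len(".nfs")
```

A file that passes both checks would then be handled the same way as an unlinked but still open file on a local filesystem.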

AUFS