How to open a file without open system call
Sometimes CRIU encounters an inode object (see this article for details on what an inode is) that has no name. This article describes when this happens and what CRIU does about it.
When this happens
There are two nasty API calls in the Linux kernel: inotify_add_watch and fanotify_mark. Both take a file path as an argument and then lose it. They find an inode object using this path, put an event generator on it, and then forget the path completely. Afterwards the generator is reachable only through the inotify or fanotify file descriptor (the one returned by inotify_init or fanotify_init), and that descriptor carries no path.
When CRIU meets an inotify/fanotify FD (called fsnotify below for convenience), it has to find out which file the generator sits on. But since the inode's path is lost, this cannot be done in the general case.
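The loss of the path can be observed from user space. The following is a minimal sketch, assuming a Linux system whose libc is reachable through ctypes; the file name `watched.conf` and the helper `watch_and_describe` are made up for illustration. It puts an inotify watch on a file and then reads back what the kernel reports about the descriptor.

```python
import ctypes
import os
import tempfile

IN_MODIFY = 0x00000002  # value from <sys/inotify.h>

# Assumption: the C library of the running process exports the inotify calls.
libc = ctypes.CDLL(None, use_errno=True)


def watch_and_describe(path):
    """Watch `path` and return (inotify_fd, watch_descriptor, fdinfo_text)."""
    ifd = libc.inotify_init1(0)
    if ifd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    wd = libc.inotify_add_watch(ifd, os.fsencode(path), IN_MODIFY)
    if wd < 0:
        err = ctypes.get_errno()
        os.close(ifd)
        raise OSError(err, os.strerror(err), path)
    with open(f"/proc/self/fdinfo/{ifd}") as f:
        info = f.read()
    return ifd, wd, info


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "watched.conf")  # hypothetical file name
    open(target, "w").close()
    ifd, wd, info = watch_and_describe(target)
    path_visible = target in info  # the path we passed in is not reported
    print(info)
    os.close(ifd)
```

On a typical kernel the printed text identifies the watched object by inode and device numbers, not by name.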
Chances to get the name back
There is a chance to get the name back. To understand when, let's dive a little deeper into how Linux manages dentries and inodes.
Inodes and dentries
Every file on disk is represented by an inode object. An inode has an ID (inode number), access rights, an owner, a link count and some more data. Names are stored only in special files called directories: a directory holds a set of name-to-inode mappings. When accessing a file by its name, the Linux kernel sequentially reads these mapping tables from disk and creates a dentry object in memory for every name found in them. It's important to know that a dentry is created not only for existing files, but also for non-existing ones, to speed up the ENOENT report on a second lookup of the same name. IOW dentries form a cache that contains records for both present and absent objects on disk.
Since the tree of dentries can grow without bound, Linux sometimes shrinks it by freeing unused dentries. A dentry is unused if no other object references it; a dentry can be referenced by child dentries and by files (as described in another article).
Having said that, at the time the fsnotify object is created we have a full dentry chain and the inode sitting in memory. Then the event generator is put on the inode and that's it. Neither the inode nor the fsnotify object references the dentry, so eventually the whole dentry chain can be shrunk from memory.
So, back to the "can I get the name back" question. The answer is: if the dentries are still in the cache, yes, you can. But CRIU cannot rely on this, since it must also support situations where the dentry cache is gone.
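The relationship described in this section can be pictured with a toy model. This is not kernel code: the class `ToyDcache`, its methods and the path `/etc/app.conf` are invented for illustration, and the real dentry cache is far more involved. The sketch only shows that names live in dentries while watches live on inodes, so shrinking the cache loses the name but not the watch.

```python
class ToyDcache:
    """Toy model (hypothetical): names live in dentries, watches on inodes."""

    def __init__(self, disk):
        self.disk = dict(disk)        # path -> inode number ("on-disk" directories)
        self.dentries = {}            # path -> inode number, None = negative entry
        self.watched_inodes = set()   # inodes that carry an event generator

    def lookup(self, path):
        if path not in self.dentries:                 # miss: "read the directory"
            self.dentries[path] = self.disk.get(path)  # absent names are cached too
        return self.dentries[path]

    def add_watch(self, path):
        ino = self.lookup(path)
        if ino is None:
            raise FileNotFoundError(path)
        self.watched_inodes.add(ino)                  # only the inode is remembered
        return ino

    def shrink(self):
        self.dentries.clear()                         # nothing pins the dentries

    def name_of(self, ino):
        for path, cached in self.dentries.items():
            if cached == ino:
                return path
        return None                                   # name is gone from memory


dc = ToyDcache({"/etc/app.conf": 42})
dc.lookup("/etc/missing")              # creates a negative dentry
negative_cached = "/etc/missing" in dc.dentries
ino = dc.add_watch("/etc/app.conf")
name_before = dc.name_of(ino)          # still resolvable from the cache
dc.shrink()
name_after = dc.name_of(ino)           # cache shrunk: the name is lost
still_watched = ino in dc.watched_inodes
```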
Tmpfs
One filesystem, however, is friendly to this problem. Tmpfs pins its dentries in memory, since it has no other medium on which to store information about its files. So for tmpfs the name is always at hand.
Opening a file without open()
Linux provides a way to do this: the open_by_handle_at system call. Introduced to make user-space NFS servers work, this call opens an inode using a blob called a handle. The handle is an (almost) meaningless sequence of bytes by which the filesystem promises to find the inode and open it. The handle itself can be generated by the kernel from the inode object. Since the fsnotify object references the inode, we can ask the kernel to generate that inode's handle. We did exactly that and patched the kernel to show this handle in the /proc/$pid/fdinfo/$fd file for fsnotify descriptors.
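To make the dump side concrete, here is a minimal sketch of pulling the handle out of an fdinfo line. It assumes the inotify line format described in the kernel's proc documentation, with all numbers in hexadecimal; the sample line and its values are illustrative, and `parse_inotify_fdinfo` is a hypothetical helper, not CRIU's parser.

```python
def parse_inotify_fdinfo(line):
    """Parse one 'inotify wd:... f_handle:...' line from /proc/<pid>/fdinfo/<fd>.

    Assumption: every value is printed as hexadecimal `key:value` tokens.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "inotify":
        raise ValueError("not an inotify fdinfo line")
    fields = {}
    for token in tokens[1:]:
        key, _, value = token.partition(":")
        fields[key] = value
    return {
        "wd": int(fields["wd"], 16),
        "ino": int(fields["ino"], 16),
        "sdev": int(fields["sdev"], 16),
        "mask": int(fields["mask"], 16),
        "handle_bytes": int(fields["fhandle-bytes"], 16),
        "handle_type": int(fields["fhandle-type"], 16),
        "handle": bytes.fromhex(fields["f_handle"]),  # the opaque blob itself
    }


# Illustrative sample line; real values differ per system.
sample = ("inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 "
          "fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d")
parsed = parse_inotify_fdinfo(sample)
```

In this sample the first four handle bytes happen to spell the inode number in little-endian order, which hints at why the handle is only "almost" meaningless. That is an observation about this one sample, not something to rely on.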
So when dumping an fsnotify object we read the handle out of proc and save it in the images, and at restore time we call open_by_handle_at with the handle value and get the inode back as an open file descriptor. Then we need to ask the kernel to put the fsnotify on this inode. To do this CRIU calls the path-taking fsnotify call (inotify_add_watch or fanotify_mark) on the /proc/self/fd/$fd path. While resolving the path the kernel finds the inode opened previously and puts the event generator back in the proper place. Thus we fool the kernel and put fsnotify on an inode without even knowing its path.
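The second half of this trick needs no privileges and can be tried directly. The sketch below assumes a Linux system whose libc is reachable through ctypes. Because open_by_handle_at normally requires CAP_DAC_READ_SEARCH, an ordinary open() stands in for it here; what matters is that the watch is added through /proc/self/fd/$fd and the real path is never passed to inotify. The helper `watch_open_fd` is hypothetical.

```python
import ctypes
import os
import struct
import tempfile

IN_MODIFY = 0x00000002  # value from <sys/inotify.h>
libc = ctypes.CDLL(None, use_errno=True)  # assumption: libc exports inotify calls


def watch_open_fd(ifd, fd, mask=IN_MODIFY):
    """Attach an inotify watch to the inode behind `fd` via its /proc link."""
    wd = libc.inotify_add_watch(ifd, f"/proc/self/fd/{fd}".encode(), mask)
    if wd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return wd


with tempfile.TemporaryDirectory() as tmp:
    real_path = os.path.join(tmp, "data")
    open(real_path, "w").close()

    # Stand-in for open_by_handle_at(): the restorer only holds an open fd.
    fd = os.open(real_path, os.O_RDONLY)
    ifd = libc.inotify_init1(os.O_NONBLOCK)
    if ifd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    wd = watch_open_fd(ifd, fd)

    # Modify the file through its real name and check that the watch fires.
    with open(real_path, "a") as f:
        f.write("x")
    event = os.read(ifd, 4096)
    # struct inotify_event header: int wd; uint32 mask, cookie, len
    event_wd, event_mask, _cookie, _name_len = struct.unpack("iIII", event[:16])

    os.close(ifd)
    os.close(fd)
```

The event arriving with the same watch descriptor shows that the watch landed on the inode behind the descriptor.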
Irmap
But the problems are not over yet. Not all filesystems provide handles, so we cannot always get a handle out of an inode and an inode out of a handle. This is a very nasty situation, since the Linux kernel provides no other API for getting an inode: only open by path and open by handle. With both ways closed we have to make a detour.
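Whether a given filesystem can produce handles can be probed from user space with name_to_handle_at, the counterpart of open_by_handle_at. The following is a sketch under assumptions: a Linux libc that exports name_to_handle_at, the 128-byte MAX_HANDLE_SZ limit from the kernel headers, and a hypothetical helper `try_get_handle` that treats EOPNOTSUPP (or ENOSYS) as "no handles here".

```python
import ctypes
import errno
import os

MAX_HANDLE_SZ = 128  # upper bound on handle size in the kernel headers
AT_FDCWD = -100      # "relative to the current directory"


class FileHandle(ctypes.Structure):
    """Mirror of struct file_handle with room for the largest handle."""
    _fields_ = [
        ("handle_bytes", ctypes.c_uint),
        ("handle_type", ctypes.c_int),
        ("f_handle", ctypes.c_ubyte * MAX_HANDLE_SZ),
    ]


libc = ctypes.CDLL(None, use_errno=True)  # assumption: exports name_to_handle_at


def try_get_handle(path):
    """Return (handle_type, handle, mount_id), or None if unsupported."""
    fh = FileHandle()
    fh.handle_bytes = MAX_HANDLE_SZ
    mount_id = ctypes.c_int(0)
    rc = libc.name_to_handle_at(AT_FDCWD, os.fsencode(path),
                                ctypes.byref(fh), ctypes.byref(mount_id), 0)
    if rc == 0:
        handle = bytes(fh.f_handle[:fh.handle_bytes])
        return fh.handle_type, handle, mount_id.value
    err = ctypes.get_errno()
    if err in (errno.EOPNOTSUPP, errno.ENOSYS):
        return None  # this filesystem (or kernel) cannot encode handles
    raise OSError(err, os.strerror(err), path)
```

A call such as `try_get_handle("/etc/hostname")` returns the opaque bytes on filesystems that support handles and `None` on those that do not, which is the case where the detour described next is needed.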
CRIU uses empirical knowledge of where programs typically put fsnotify-s (config files and the like) and scans the filesystem tree to find the name by inode number. The engine is called irmap, which stands for Inode Reverse MAP. Irmap recursively scans the tree starting from "known" locations and remembers all the name-inode pairs it meets. If we later try to irmap an inode that was met during the first scan, no additional FS access occurs; irmap just reports the name back.
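The idea of the reverse map can be sketched in a few lines. This is not CRIU's implementation: `build_irmap` and `irmap_lookup` are hypothetical names, the map is keyed by (device, inode) so that inode numbers from different filesystems do not collide, and the scan roots in the demo are a temporary directory standing in for the "known" locations.

```python
import os
import tempfile


def build_irmap(roots):
    """Walk each root and remember a path for every (device, inode) seen."""
    irmap = {}
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            candidates = [dirpath] + [os.path.join(dirpath, n) for n in filenames]
            for path in candidates:
                try:
                    st = os.lstat(path)
                except OSError:
                    continue  # entry vanished during the scan
                # Keep the first name found; hard links share one inode.
                irmap.setdefault((st.st_dev, st.st_ino), path)
    return irmap


def irmap_lookup(irmap, dev, ino):
    """Answer from memory only; no filesystem access happens here."""
    return irmap.get((dev, ino))


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "conf.d"))
    conf = os.path.join(root, "conf.d", "app.conf")  # hypothetical config file
    open(conf, "w").close()
    st = os.stat(conf)

    irmap = build_irmap([root])
    found = irmap_lookup(irmap, st.st_dev, st.st_ino)
    missing = irmap_lookup(irmap, st.st_dev, -1)
```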
Caching the irmap cache
Since this FS scan can take quite long, it is best done while tasks are not frozen. So the irmap cache fill is also started on the pre-dump operation, when tasks are still running. After the scan the cache is stored in the working dir under the irmap-cache.img name. When CRIU's next pre-dump or final dump is performed, the irmap cache is read back and, when required, the cached entries are re-validated individually, without a full FS re-scan.
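Persisting the map and re-validating entries one at a time could look roughly like this. The sketch makes several assumptions: JSON is used as a stand-in for CRIU's image format, the file name `irmap-cache.json` is made up, and an entry counts as valid if a single lstat of its path still yields the same device and inode. CRIU's actual checks may differ.

```python
import json
import os
import tempfile


def save_cache(irmap, cache_path):
    """Store {(dev, ino): path} as a list of records."""
    entries = [{"dev": d, "ino": i, "path": p} for (d, i), p in irmap.items()]
    with open(cache_path, "w") as f:
        json.dump(entries, f)


def load_cache(cache_path):
    with open(cache_path) as f:
        return {(e["dev"], e["ino"]): e["path"] for e in json.load(f)}


def lookup_revalidated(cache, dev, ino):
    """Re-check one cached entry with a single lstat, no tree re-scan."""
    path = cache.get((dev, ino))
    if path is None:
        return None
    try:
        st = os.lstat(path)
    except OSError:
        st = None
    if st is None or (st.st_dev, st.st_ino) != (dev, ino):
        del cache[(dev, ino)]  # stale: the name no longer leads to this inode
        return None
    return path


with tempfile.TemporaryDirectory() as work:
    kept = os.path.join(work, "kept.conf")
    gone = os.path.join(work, "gone.conf")
    for p in (kept, gone):
        open(p, "w").close()
    kept_key = (os.stat(kept).st_dev, os.stat(kept).st_ino)
    gone_key = (os.stat(gone).st_dev, os.stat(gone).st_ino)

    cache_file = os.path.join(work, "irmap-cache.json")  # hypothetical name
    save_cache({kept_key: kept, gone_key: gone}, cache_file)

    os.unlink(gone)                      # the tree changes between dumps
    cache = load_cache(cache_file)
    kept_result = lookup_revalidated(cache, *kept_key)
    gone_result = lookup_revalidated(cache, *gone_key)
    gone_evicted = gone_key not in cache
```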