Seamless kernel upgrade

Revision as of 00:01, 9 September 2016 by Xemul

A kernel on a machine can be replaced without stopping critical activity: checkpoint the services, replace the kernel (e.g. using kexec), then restore the services. Ideally, the applications' memory should not be written to an on-disk image at all, but kept in RAM.

Description of the concept

To upgrade the kernel on a running system one may use live kernel patching, but that technology has some limitations. The alternative is to upgrade the kernel by actually rebooting into the new one. The sequence of steps is:

  1. Suspend the processes or containers you need to keep
  2. Reboot into the new kernel using kexec
  3. Restore the suspended processes and containers from images
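
The three steps above can be sketched as a plan of command lines. This is a minimal sketch, not a tested procedure: it assumes the `criu` and `kexec` command-line tools are used for the steps, and the PID, image directory and kernel paths are hypothetical placeholders. The function only builds the commands and does not run them.

```python
from typing import List


def build_upgrade_plan(pid: int, images_dir: str,
                       kernel: str, initrd: str) -> List[List[str]]:
    """Return the command lines for the suspend / kexec / restore sequence.

    All arguments are hypothetical placeholders; nothing is executed here.
    """
    return [
        # Step 1: checkpoint the process tree rooted at `pid` into images_dir.
        ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir],
        # Step 2: load the new kernel, then jump into it.
        ["kexec", "--load", kernel, f"--initrd={initrd}", "--reuse-cmdline"],
        ["kexec", "--exec"],
        # Step 3: runs under the new kernel, e.g. from a boot-time service.
        ["criu", "restore", "--images-dir", images_dir],
    ]


if __name__ == "__main__":
    for cmd in build_upgrade_plan(1234, "/var/lib/criu-images",
                                  "/boot/vmlinuz-new", "/boot/initrd-new.img"):
        print(" ".join(cmd))
```

Keeping the plan as data makes the ordering explicit: the last command cannot run in the same session as the first three, because `kexec --exec` replaces the running kernel, so the images directory has to survive the reboot.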

To be fast enough in practice, this approach requires several optimizations.

Keep memory images in memory

Reading the memory contents and writing them to disk in step 1, then reading them back from disk into memory in step 3, is time-consuming. Since the machine already has enough RAM to hold the memory images, it would be better to keep the images in memory and make them survive the kexec.
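
To get a feeling for the cost being avoided, here is a rough back-of-the-envelope estimate. It is only a sketch: it assumes purely sequential I/O at a constant bandwidth, and the memory size and disk speed used below are hypothetical figures, not measurements.

```python
def disk_roundtrip_seconds(mem_bytes: int, write_bps: float,
                           read_bps: float) -> float:
    """Time to write memory images to disk (step 1) and read them back (step 3)."""
    if write_bps <= 0 or read_bps <= 0:
        raise ValueError("bandwidth must be positive")
    return mem_bytes / write_bps + mem_bytes / read_bps


GIB = 1024 ** 3
MIB = 1024 ** 2

if __name__ == "__main__":
    # Hypothetical figures: 64 GiB of application memory, 500 MiB/s disk.
    seconds = disk_roundtrip_seconds(64 * GIB, 500 * MIB, 500 * MIB)
    print(f"disk round trip: about {seconds:.0f} s")
```

With these assumed numbers the round trip through the disk alone takes about 262 seconds, which is the downtime the in-memory images are meant to eliminate.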

This requires a kernel patch. One approach was to implement a PMEM filesystem. (FIXME: link to the LKML patch here).

Don't flush the page cache

The same applies to disk cache pages: dropping these pages and re-reading them from disk after the reboot slows things down. As with the previous optimization, it would be good to keep the page cache pages in PMEM.

Don't flush dirty page cache

Disk cache pages holding dirty data would normally be flushed to disk before the reboot, but, again, this makes things slower. It is better to keep the dirty pages in memory and flush them later. However, the dirty metadata of the filesystem should still be written to disk; otherwise the filesystem will appear dirty after the reboot and may want to replay its journal.
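
How much data is at stake on a given machine can be checked from `/proc/meminfo`. The sketch below is only an illustration: it reads the `Cached`, `Dirty` and `Writeback` fields, which count pages in general and do not separate file data from filesystem metadata. The helper names are made up for this example.

```python
from typing import Dict


def parse_meminfo_kb(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in kB}.

    Lines without a "kB" unit (plain counters) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def pending_writeback_kb(meminfo: Dict[str, int]) -> int:
    """Dirty plus in-flight writeback data, in kB."""
    return meminfo.get("Dirty", 0) + meminfo.get("Writeback", 0)


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        info = parse_meminfo_kb(f.read())
    print("page cache (Cached):", info.get("Cached", 0), "kB")
    print("dirty + writeback:", pending_writeback_kb(info), "kB")
```

The first number approximates what would have to be re-read from disk if the page cache were dropped, the second what would have to be flushed before a conventional reboot.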

Issues

The problems with this approach are:

  • Dirty metadata slows things down
  • Kexec doesn't work on some hardware
  • Kernel patching is needed
  • The kernel boots too slowly on many-core nodes
  • Accessing on-disk files during restore doesn't hit the dentry cache and is thus slow

See also