https://criu.org/api.php?action=feedcontributions&user=Stenbom&feedformat=atomCRIU - User contributions [en]2024-03-29T05:24:23ZUser contributionsMediaWiki 1.35.6https://criu.org/index.php?title=TCP_connection&diff=4847TCP connection2019-03-11T18:39:11Z<p>Stenbom: Grammar fixes</p>
<hr />
<div>This page describes how we handle established TCP connections.<br />
<br />
== TCP repair mode in kernel ==<br />
<br />
The <code>TCP_REPAIR</code> socket option was added in kernel 3.5 to help with C/R of TCP sockets.<br />
<br />
When this option is used, the socket is switched into a special mode in which an action performed on it<br />
does not trigger the usual protocol actions, but instead puts the socket directly<br />
into the state it would be in after that operation had finished successfully.<br />
<br />
For example, calling <code>connect()</code> on a repaired socket just changes its state to <code>ESTABLISHED</code>,<br />
with the peer address set as requested.<br />
The <code>bind()</code> call forcibly binds the socket to a given address (ignoring any potential conflicts).<br />
The <code>close()</code> call closes the socket without any transient <code>FIN_WAIT</code>/<code>TIME_WAIT</code>/etc. states;<br />
the socket is silently killed.<br />
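<br />
A minimal sketch of switching a socket into and out of repair mode. The helper name <code>tcp_repair_set()</code> is hypothetical, and the fallback value 19 is taken from the Linux UAPI headers for systems whose libc does not define <code>TCP_REPAIR</code>. On a real socket the call needs <code>CAP_NET_ADMIN</code>.<br />

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_REPAIR
#define TCP_REPAIR 19 /* value from the Linux UAPI headers */
#endif

/*
 * Hypothetical helper: switch repair mode on (1) or off (0).
 * Returns 0 on success or a negative errno value.
 */
int tcp_repair_set(int fd, int on)
{
	if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)) < 0)
		return -errno;
	return 0;
}
```

With repair mode on, the <code>bind()</code>, <code>connect()</code> and <code>close()</code> calls behave as described above.<br />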
<br />
=== Sequences ===<br />
<br />
To restore the connection properly, bind() and connect() are not enough. One also needs to restore the<br />
TCP sequence numbers. To do so, the <code>TCP_REPAIR_QUEUE</code> and <code>TCP_QUEUE_SEQ</code> options were introduced.<br />
<br />
The former selects which queue (input or output) will be repaired, and the latter gets or sets the sequence number. Note that<br />
setting the sequence number is only possible on a socket in the CLOSE state.<br />
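<br />
The two options could be combined as in the sketch below. The helper name and the <code>REPAIR_*</code> constants are hypothetical; they mirror the kernel's queue identifiers (1 for the receive queue, 2 for the send queue), and the fallback option numbers come from the Linux UAPI headers. The socket is assumed to be in repair mode and not yet connected.<br />

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <sys/socket.h>

#ifndef TCP_REPAIR_QUEUE
#define TCP_REPAIR_QUEUE 20 /* value from the Linux UAPI headers */
#endif
#ifndef TCP_QUEUE_SEQ
#define TCP_QUEUE_SEQ 21 /* value from the Linux UAPI headers */
#endif

/* Local names for the kernel's queue identifiers. */
enum { REPAIR_RECV_QUEUE = 1, REPAIR_SEND_QUEUE = 2 };

/*
 * Hypothetical helper: select a queue, then set its sequence number.
 * Returns 0 on success or a negative errno value.
 */
int tcp_restore_seq(int fd, int queue, uint32_t seq)
{
	if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue, sizeof(queue)) < 0)
		return -errno;
	if (setsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq)) < 0)
		return -errno;
	return 0;
}
```

A restorer would call this once per queue, before <code>connect()</code> moves the socket out of the CLOSE state.<br />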
<br />
=== Packets in queue ===<br />
<br />
Once the queue to repair is selected as described above, one can call the recv or send syscalls on the repaired socket. These calls<br />
peek data from or poke data into the respective queue. This sounds funny, but yes, for a repaired socket one<br />
can receive the outgoing queue and send the incoming one. Using the <code>MSG_PEEK</code> flag for <code>recv()</code> is required.<br />
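<br />
Reading a queue at dump time might look like the sketch below. The helper name is hypothetical, the queue numbers (1 for receive, 2 for send) follow the kernel's identifiers, and adding <code>MSG_DONTWAIT</code> is an assumption made here so that an empty queue does not block the caller.<br />

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef TCP_REPAIR_QUEUE
#define TCP_REPAIR_QUEUE 20 /* value from the Linux UAPI headers */
#endif

/*
 * Hypothetical helper: copy up to len bytes out of the selected queue
 * (1 = receive queue, 2 = send queue) without consuming them.
 * Returns the number of bytes copied or a negative errno value.
 */
ssize_t tcp_peek_queue(int fd, int queue, void *buf, size_t len)
{
	ssize_t n;

	if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue, sizeof(queue)) < 0)
		return -errno;

	/* MSG_PEEK is mandatory in repair mode; the data stays in the queue. */
	n = recv(fd, buf, len, MSG_PEEK | MSG_DONTWAIT);
	return n < 0 ? -errno : n;
}
```

On restore the same queue selection is followed by <code>send()</code>, which puts the saved bytes back into the chosen queue.<br />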
<br />
=== Options ===<br />
<br />
There are four options that are negotiated by the socket at the connecting stage. These are:<br />
<br />
* mss_clamp -- the maximum segment size the peer is ready to accept<br />
* snd_wscale -- the scale factor for a window<br />
* sack -- whether selective acks are permitted or not<br />
* tstamp -- whether timestamps on packets are supported<br />
<br />
All four can be read with <code>getsockopt()</code> calls on the socket; to restore them, the <code>TCP_REPAIR_OPTIONS</code> socket option was introduced.<br />
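<br />
The sketch below shows one way the option array could be built and handed to the kernel. All names are hypothetical. The structure mirrors the layout of the kernel's <code>struct tcp_repair_opt</code> (a 32-bit option code followed by a 32-bit value), the codes are the standard TCP option kinds, and packing the receive scale into the upper 16 bits of the window value reflects the kernel's behaviour as far as I know; the page itself only mentions the send scale.<br />

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <sys/socket.h>

#ifndef TCP_REPAIR_OPTIONS
#define TCP_REPAIR_OPTIONS 22 /* value from the Linux UAPI headers */
#endif

/* Standard TCP option kinds, under local names. */
enum { OPT_MSS = 2, OPT_WSCALE = 3, OPT_SACK_PERM = 4, OPT_TSTAMP = 8 };

/* Same layout as the kernel's struct tcp_repair_opt. */
struct repair_opt {
	uint32_t code;
	uint32_t val;
};

/* Fill out[] (room for 4 entries) and return the number of entries used. */
int build_repair_opts(struct repair_opt out[4], uint32_t mss_clamp,
		      int has_wscale, uint32_t snd_wscale, uint32_t rcv_wscale,
		      int sack, int tstamp)
{
	int n = 0;

	if (sack)
		out[n++] = (struct repair_opt){ OPT_SACK_PERM, 0 };
	if (has_wscale)
		out[n++] = (struct repair_opt){ OPT_WSCALE, snd_wscale | (rcv_wscale << 16) };
	if (tstamp)
		out[n++] = (struct repair_opt){ OPT_TSTAMP, 0 };
	out[n++] = (struct repair_opt){ OPT_MSS, mss_clamp };
	return n;
}

/* Pass the array to the kernel; returns 0 or a negative errno value. */
int apply_repair_opts(int fd, const struct repair_opt *opts, int n)
{
	socklen_t len = (socklen_t)((size_t)n * sizeof(*opts));

	if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_OPTIONS, opts, len) < 0)
		return -errno;
	return 0;
}
```

Keeping the array construction separate from the <code>setsockopt()</code> call makes the packing easy to check without a privileged socket.<br />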
<br />
== Timestamp ==<br />
"The sender's timestamp clock is used as a source of monotonic non-decreasing values to stamp the segments" (RFC 7323). The Linux kernel uses the jiffies counter as the TCP timestamp.<br />
<br />
<code>#define tcp_time_stamp ((__u32)(jiffies))</code><br />
<br />
We added the <code>TCP_TIMESTAMP</code> option to compensate for the difference between jiffies counters when a connection is migrated to another host. When a connection is dumped, criu calls <code>getsockopt(TCP_TIMESTAMP)</code> to get the current timestamp; on restore it calls <code>setsockopt(TCP_TIMESTAMP)</code> to set this timestamp as the starting point.<br />
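<br />
The dump and restore halves can be sketched as two small wrappers. The function names are hypothetical and the fallback value 24 comes from the Linux UAPI headers. Setting the value is assumed to happen on the restore side while the socket is still in repair mode.<br />

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <sys/socket.h>

#ifndef TCP_TIMESTAMP
#define TCP_TIMESTAMP 24 /* value from the Linux UAPI headers */
#endif

/* Dump side: read the socket's current timestamp value. */
int tcp_tstamp_get(int fd, uint32_t *val)
{
	socklen_t len = sizeof(*val);

	if (getsockopt(fd, IPPROTO_TCP, TCP_TIMESTAMP, val, &len) < 0)
		return -errno;
	return 0;
}

/* Restore side: use the saved value as the new starting point. */
int tcp_tstamp_set(int fd, uint32_t val)
{
	if (setsockopt(fd, IPPROTO_TCP, TCP_TIMESTAMP, &val, sizeof(val)) < 0)
		return -errno;
	return 0;
}
```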
<br />
== Checkpoint and restore TCP connection ==<br />
<br />
With the above socket options, dumping and restoring a TCP connection becomes possible. Criu just reads the socket<br />
state and restores it, letting the protocol resurrect the data sequence.<br />
<br />
One thing to note here: while the socket is closed between dump and restore, the connection should be "locked", i.e.<br />
no packets from the peer should enter the stack, otherwise the kernel will send an RST. To achieve this, a simple<br />
netfilter rule is configured that drops all packets from the peer to the socket we're dealing with. This rule stays<br />
in the host netfilter tables after the criu dump command finishes, and it should still be there when you issue the<br />
criu restore command.<br />
<br />
Another thing to note: the IP address that was used by the connection must be available on restore.<br />
This is automatically the case if restore happens on the same box as dump. In case of hand-made live migration, the<br />
IP address should be copied too.<br />
<br />
That said, the command line option <code>--tcp-established</code> should be used when calling criu to explicitly state that the<br />
caller is aware of this "transitional" state of the netfilter.<br />
<br />
If the target process lives in a net namespace, connection locking works differently. Instead of<br />
per-connection iptables rules, the "network-lock"/"network-unlock" [[action scripts]] are called so that the user<br />
can isolate the whole netns from the network. Typically this is done by downing the respective veth pair end.<br />
<br />
== States ==<br />
=== TCP_SYN_SENT ===<br />
There is only one difference from TCP_ESTABLISHED: we have to restore the socket and disable the repair mode before calling <code>connect()</code>. The kernel will then send a single SYN packet with the same initial sequence number and set the TCP_SYN_SENT state for the socket.<br />
<br />
=== Half-closed sockets ===<br />
A socket is half-closed when it has sent or received a FIN packet. Such sockets are in one of these states: TCP_FIN_WAIT1, TCP_FIN_WAIT2, TCP_CLOSING, TCP_LAST_ACK, TCP_CLOSE_WAIT. To restore these states, we restore the socket into the TCP_ESTABLISHED state, then call shutdown(SHUT_WR) if the socket had sent a FIN packet, and send a fake FIN packet if the socket had received one. For example, to restore the TCP_FIN_WAIT1 state we call shutdown(SHUT_WR); to then reach the TCP_FIN_WAIT2 state, we send a fake ACK for the FIN packet.<br />
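<br />
The rule above can be written down as a small lookup from the dumped state to the steps needed after the socket is back in TCP_ESTABLISHED. This is an illustrative sketch only: the function, the <code>ST_*</code> names (local copies of the kernel's state numbers) and the <code>DO_*</code> flags are hypothetical, the mapping is derived from the usual TCP state diagram, and the order in which the steps are applied is not modelled.<br />

```c
/* Kernel TCP state numbers, under local names. */
enum {
	ST_ESTABLISHED = 1,
	ST_FIN_WAIT1 = 4,
	ST_FIN_WAIT2 = 5,
	ST_CLOSE_WAIT = 8,
	ST_LAST_ACK = 9,
	ST_CLOSING = 11,
};

/* Steps to apply once the socket is restored into ESTABLISHED. */
enum {
	DO_SHUT_WR = 1 << 0,  /* we had sent a FIN: call shutdown(SHUT_WR) */
	DO_FAKE_FIN = 1 << 1, /* we had received a FIN: inject a fake FIN  */
	DO_FAKE_ACK = 1 << 2, /* our FIN was acked: inject a fake ACK      */
};

/* Return the set of steps needed to reach the given half-closed state. */
int half_closed_steps(int state)
{
	switch (state) {
	case ST_FIN_WAIT1:
		return DO_SHUT_WR;
	case ST_FIN_WAIT2:
		return DO_SHUT_WR | DO_FAKE_ACK;
	case ST_CLOSING:
		return DO_SHUT_WR | DO_FAKE_FIN;
	case ST_LAST_ACK:
		return DO_FAKE_FIN | DO_SHUT_WR;
	case ST_CLOSE_WAIT:
		return DO_FAKE_FIN;
	default:
		return 0; /* not a half-closed state */
	}
}
```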
<br />
== See also ==<br />
* [[Simple TCP pair]]<br />
* [[TCP repair TODO]]<br />
* [[CLI/opt/--tcp-close|Dropping the connection]]<br />
<br />
== External links ==<br />
* http://lwn.net/Articles/495304/<br />
<br />
[[Category:Under the hood]]<br />
[[Category:Sockets]]<br />
[[Category: Editor help needed]]</div>Stenbomhttps://criu.org/index.php?title=Memory_changes_tracking&diff=4773Memory changes tracking2018-12-29T15:51:03Z<p>Stenbom: Spelling and grammar fixes</p>
<hr />
<div>CRIU can detect what memory pages a task (or tasks) has changed since some moment of time. This page describes why this is required, how it works and how to use it.<br />
<br />
== Why do we need to track memory changes ==<br />
<br />
There are several scenarios where detecting which parts of memory have changed is required:<br />
<br />
; [[Incremental dumps]]<br />
: When you take a series of dumps from a process tree, it is a very good optimization not to dump ''all'' the memory every time, but only those memory pages that have changed since the previous dump.<br />
<br />
; Smaller freeze time for big applications<br />
: When a task uses a LOT of memory, dumping it may take time, and during all this time the task must be frozen. To reduce the freeze time we can<br />
:* get memory from the task and start writing it into images<br />
:* freeze the task and get only the changed memory from it<br />
<br />
; [[Live migration]]<br />
: When doing live migration, a lot of time is used by the procedure of copying tasks' memory to the destination host. Note that the processes are frozen during that time. Acting like in the previous example also reduces the freeze time, i.e. the live migration becomes more live.<br />
<br />
== How we track memory changes ==<br />
<br />
In order to find out which memory pages have changed, we [http://lwn.net/Articles/546966/ patched] the kernel. Tracking the memory changes consists of two steps:<br />
<br />
* ask the kernel to keep track of memory changes (by writing 4 into <code>/proc/$pid/clear_refs</code> file for each $pid we are interested in).<br />
<br />
and, after a while,<br />
<br />
* get the list of modified pages of a process by reading its <code>/proc/$pid/pagemap</code> file and looking at the so-called ''soft-dirty'' bit in the pagemap entries.<br />
<br />
During the first step, the kernel re-maps all of the task's mappings read-only. If the task then tries to write into any of its pages, a page fault occurs and the kernel notes which page is being written to. Reading the <code>pagemap</code> file reveals this information.<br />
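<br />
The two steps can be sketched in C as follows. The function names are hypothetical. The sketch assumes the documented pagemap layout, that is, one 64-bit entry per virtual page with the soft-dirty flag in bit 55, and it keeps error handling to a minimum.<br />

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define PME_SOFT_DIRTY (1ULL << 55) /* soft-dirty flag in a pagemap entry */

/* Does this pagemap entry have the soft-dirty bit set? */
int pme_soft_dirty(uint64_t entry)
{
	return (entry & PME_SOFT_DIRTY) != 0;
}

/* Byte offset of the entry for vaddr: one 64-bit entry per page. */
off_t pagemap_offset(uintptr_t vaddr, long page_size)
{
	return (off_t)(vaddr / (uintptr_t)page_size) * (off_t)sizeof(uint64_t);
}

/* Step 1: ask the kernel to start tracking by writing 4 to clear_refs. */
int soft_dirty_reset(pid_t pid)
{
	char path[64];
	int fd, ok;

	snprintf(path, sizeof(path), "/proc/%d/clear_refs", (int)pid);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ok = write(fd, "4", 1) == 1;
	close(fd);
	return ok ? 0 : -1;
}

/* Step 2: returns 1 if the page at vaddr was written to, 0 if not, -1 on error. */
int page_soft_dirty(pid_t pid, uintptr_t vaddr)
{
	char path[64];
	uint64_t entry;
	ssize_t n;
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/pagemap", (int)pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = pread(fd, &entry, sizeof(entry),
		  pagemap_offset(vaddr, sysconf(_SC_PAGESIZE)));
	close(fd);
	if (n != (ssize_t)sizeof(entry))
		return -1;
	return pme_soft_dirty(entry);
}
```

A dumper would call <code>soft_dirty_reset()</code> once, let the task run, and then walk the task's mappings with <code>page_soft_dirty()</code> to pick the pages to write out.<br />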
<br />
== How to use this with CRIU ==<br />
<br />
First of all, the<br />
<br />
# criu check --feature mem_dirty_track<br />
<br />
command should say the feature is supported. The memory changes tracking was initially merged into Linux kernel v3.11, and was further polished until v3.18.<br />
<br />
There are several command line options to use the functionality:<br />
<br />
;<code>--prev-images-dir</code> option<br />
:This option is used to provide the path where images from a previous <code>dump</code> or <code>pre-dump</code> (see below) action reside. If possible, CRIU will dump only the memory pages that have been modified since that time.<br />
<br />
;<code>--track-mem</code> option<br />
:This option makes CRIU reset the memory changes tracker. Once this is done, the next dump with <code>--prev-images-dir</code> has a chance to find unchanged pages.<br />
<br />
;<code>pre-dump</code> action<br />
:This action dumps only part of the information about processes and does so while keeping tasks frozen for the shortest possible time. The images generated by pre-dump cannot and should not be used for restore. After this action, a proper <code>dump</code> should be performed with a properly configured <code>--prev-images-dir</code> path.<br />
<br />
== See also ==<br />
* [[Live migration]]<br />
* [[Incremental dumps]]<br />
* [[Memory dumping and restoring#Advanced C/R]]<br />
* [[Directories]]<br />
<br />
== External links ==<br />
* http://lwn.net/Articles/546966/<br />
<br />
[[Category: Under the hood]]<br />
[[Category: Memory]]<br />
[[Category: Live migration]]</div>Stenbomhttps://criu.org/index.php?title=Fdinfo_engine&diff=4772Fdinfo engine2018-12-28T12:55:40Z<p>Stenbom: Spelling and grammar fixes</p>
<hr />
<div>= Masters and slaves =<br />
# A file may be referred to by several file descriptors. The descriptors may belong to a single process or to several processes.<br />
# A group of descriptors referring to the same file is called shared. One of the descriptors is named master, others are slaves.<br />
# Every descriptor is described via struct fdinfo_list_entry (fle).<br />
# One process opens a master fle of a file, while other processes, sharing the file, obtain it using scm_rights. See send_fds() and receive_fds() for details.<br />
<br />
= Per-process file restore =<br />
Every file type is described via the file_desc structure. We sequentially call the file_desc::ops::open(struct file_desc *d, int *new_fd) method for every master file of a process until all masters are restored. The open method may return one of three values:<br />
* 0 -- restore of the master file has finished successfully;<br />
* 1 -- restore is in progress or cannot be started yet because it depends on other files, so the method should be called once again;<br />
* -1 -- restore failed.<br />
<br />
Right after a file is opened for the first time, the open method must return the fd value in the new_fd argument. This allows the common code to send this master to other processes to reopen the master as a slave as soon as possible. At the same time, returning a non-negative new_fd does not mean that the master is restored. The open() callback may return a non-negative new_fd and "1" as return value at the same time.<br />
<br />
Example: restoring a connected unix socket in the open() method.<br />
*1) Open a socket, write its file descriptor to new_fd and return 1.<br />
*2) Check if the peer socket is open and bound. If it is not, return 1 and repeat step "2" next time.<br />
*3) Connect to the peer and return 0.<br />
<br />
Note: it's also possible to go to step "2" right after new_fd is written.<br />
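<br />
The calling convention described above can be illustrated with a toy model. Everything in it is hypothetical: the structures are heavily simplified stand-ins for the real ones, <code>toy_open()</code> fakes the three steps of the unix socket example, and the driver loop stands in for the common code, which in reality also sends the fd to the slaves.<br />

```c
#include <stddef.h>

struct file_desc;

struct file_desc_ops {
	/* Returns 0 (restored), 1 (call again) or -1 (failed). */
	int (*open)(struct file_desc *d, int *new_fd);
};

/* Simplified stand-in for the real structure. */
struct file_desc {
	const struct file_desc_ops *ops;
	int stage;    /* toy progress counter */
	int restored; /* set once open() has returned 0 */
	int sent_fd;  /* fd reported through new_fd, or -1 */
};

/* Toy open method following the three steps of the example. */
int toy_open(struct file_desc *d, int *new_fd)
{
	switch (d->stage++) {
	case 0:
		*new_fd = 100; /* pretend socket() returned fd 100 */
		return 1;
	case 1:
		return 1; /* peer is not bound yet, try again later */
	case 2:
		return 0; /* connected, restore is finished */
	default:
		return -1;
	}
}

/* Call open() on every unrestored master until all are done.
 * Returns the number of passes needed, or -1 on failure. */
int restore_masters(struct file_desc **masters, size_t n, int max_passes)
{
	for (int pass = 0; pass < max_passes; pass++) {
		size_t left = 0;

		for (size_t i = 0; i < n; i++) {
			struct file_desc *d = masters[i];
			int new_fd = -1, ret;

			if (d->restored)
				continue;
			ret = d->ops->open(d, &new_fd);
			if (ret < 0)
				return -1;
			if (new_fd >= 0)
				d->sent_fd = new_fd; /* real code would send it to slaves here */
			if (ret == 0)
				d->restored = 1;
			else
				left++;
		}
		if (left == 0)
			return pass + 1;
	}
	return -1;
}
```

Note how the fd becomes available to the common code after the first call, two passes before the master itself counts as restored.<br />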
<br />
The peer whose bind() the socket waits for in step "2" must notify the socket once it is bound:<br />
*1)bind(<peer name>);<br />
*2)set_fds_event(<socket pid>);<br />
<br />
= Dependencies =<br />
# A slave TTY can only be created after the respective master peer is restored. Currently we wait even longer, until all masters are restored.<br />
# CTTY must be created after all other TTYs are restored. For all tty dependencies, see tty_deps_restored().<br />
# An epoll can be created at any time, but it can add an fd to its polling list only after the corresponding fle is completely restored. The only exception is an epoll listening to another epoll. In this case we wait only until the listened fle is created (not restored). See epoll_not_ready_tfd().<br />
# A unix socket must wait for its peer before connecting to it. See peer_is_not_prepared() for the details.<br />
# TCP sockets have a counter on address use.<br />
# When implementing a new relationship between fle stages, check that you are not introducing a circular dependency with the existing ones.<br />
<br />
= Notes =<br />
# Pipes (and fifos), unix sockets and TTYs generate two fds in their ->open callbacks; the second one can conflict with some other fd the task restores, and (!) this second fd may need to be sent to some other task.<br />
<br />
[[Category:Under the hood]]<br />
[[Category:Files]]</div>Stenbomhttps://criu.org/index.php?title=Code_blobs&diff=4771Code blobs2018-12-28T12:48:08Z<p>Stenbom: Spelling and grammar fixes</p>
<hr />
<div>== Summary ==<br />
<br />
There are two moments in time where criu runs in a somewhat strange environment:<br />
<br />
* [[Parasite code]] execution<br />
* [[Restorer context|Restore]] of dumped page contents, followed by rt-sigreturn to continue execution of the original program<br />
<br />
== Building PIE code blobs for criu ==<br />
<br />
Parasite code executes in the dumpee process context, thus it needs to be compiled as [http://en.wikipedia.org/wiki/Position-independent_code PIE]<br />
and to have its own stack. The same applies to the restorer code, which runs at the very end of the restore procedure.<br />
<br />
Thus we need to reserve a stack, place it statically somewhere in criu and use it at the dump/checkpoint stages. To achieve<br />
this (and still have some human way to edit source code) we use the following tricks:<br />
<br />
* Parasite code has its own bootstrap code written in a pure assembler file (parasite_head.S)<br />
* Restorer bootstrap code is done in a simpler way in restorer.c.<br />
<br />
For both cases we generate header files which consist of<br />
<br />
* Function offsets for export<br />
* A C array of binary data, for example<br />
<br />
#define parasite_blob_offset____export_parasite_args 0x000000000000002c<br />
#define parasite_blob_offset____export_parasite_cmd 0x0000000000000028<br />
#define parasite_blob_offset____export_parasite_head_start 0x0000000000000000<br />
#define parasite_blob_offset____export_parasite_stack 0x0000000000006034<br />
<br />
static char parasite_blob[] = {<br />
0x48, 0x8d, 0x25, 0x2d, 0x60, 0x00, 0x00, 0x48,<br />
0x83, 0xec, 0x10, 0x48, 0x83, 0xe4, 0xf0, 0x6a,<br />
0x00, 0x48, 0x89, 0xe5, 0x8b, 0x3d, 0x0e, 0x00,<br />
0x00, 0x00, 0x48, 0x8d, 0x35, 0x0b, 0x00, 0x00,<br />
...<br />
};<br />
<br />
We include these headers when compiling criu and then use them for checkpoint/restore.<br />
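<br />
How such a header might be consumed can be shown with a toy blob. The names <code>toy_blob</code> and <code>toy_blob_offset__cmd</code>, the blob contents and both helpers are hypothetical; they only demonstrate the pattern of copying the array somewhere and then addressing an exported symbol through its generated offset.<br />

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for what a generated -blob.h header would provide. */
#define toy_blob_offset__cmd 0x4
static const unsigned char toy_blob[8] = {
	0x90, 0x90, 0x90, 0x90, /* "code" bytes                 */
	0x00, 0x00, 0x00, 0x00, /* 32-bit command slot at 0x4   */
};

/* Copy the blob into dst (a stand-in for memory mapped in the target)
 * and store a command in the exported slot. Returns bytes copied, 0 if
 * dst is too small. */
size_t toy_blob_load(unsigned char *dst, size_t dst_len, uint32_t cmd)
{
	if (dst_len < sizeof(toy_blob))
		return 0;
	memcpy(dst, toy_blob, sizeof(toy_blob));
	memcpy(dst + toy_blob_offset__cmd, &cmd, sizeof(cmd));
	return sizeof(toy_blob);
}

/* Read the command back from a loaded image. */
uint32_t toy_blob_cmd(const unsigned char *image)
{
	uint32_t cmd;

	memcpy(&cmd, image + toy_blob_offset__cmd, sizeof(cmd));
	return cmd;
}
```

Because the blob is position-independent, the offsets stay valid wherever the copy ends up.<br />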
<br />
Generation of these files is done in several steps:<br />
<br />
* All the needed object files are linked into 'built-in.o'<br />
* With the help of an ld script we move code and data into a special layout, i.e. into sections with predefined names and addresses<br />
* With the help of objcopy we move the section(s) we need into one binary file<br />
* With the help of hexdump we generate a C-style array of data and put it into the -blob.h header<br />
<br />
== Example of building procedure ==<br />
<br />
LINK pie/parasite.built-in.o<br />
GEN pie/parasite.built-in.bin.o<br />
GEN pie/parasite.built-in.bin<br />
GEN pie/parasite-blob.h<br />
<br />
LINK pie/restorer.built-in.o<br />
GEN pie/restorer.built-in.bin.o<br />
GEN pie/restorer.built-in.bin<br />
GEN pie/restorer-blob.h<br />
<br />
[[Category:Under the hood]]<br />
[[Category:Editor help needed]]</div>Stenbom