Difference between revisions of "Task-diag"

Latest revision as of 22:03, 24 September 2018

This article describes a newly proposed interface for getting information about running processes (roughly the same information that is now available from /proc/PID/* files).

Limitations of /proc/PID interface

The current interface is a set of files in /proc/PID. While this appears to be simple, there are a number of problems with it.

Lots of syscalls

At least three syscalls are required per PID: open(), read(), and close().

For example, a mere ps ax command performs these 3 syscalls for each of 3 files (stat, status, cmdline) for each process in the system. This results in thousands of syscalls and therefore thousands of user/kernel context switches.
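The per-file cost described above can be sketched as follows. This is a minimal illustration, not how ps is implemented: the function name `read_proc_files` and its `proc_root` parameter are assumptions made for the example, and the count is a lower bound (directory listing syscalls are not counted, and a single `read()` per file is assumed to be enough).

```python
import os

def read_proc_files(proc_root="/proc", names=("stat", "status", "cmdline")):
    """Read `names` for every PID directory; return (data, syscall_count)."""
    data = {}
    syscalls = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():              # only numeric directories are PIDs
            continue
        for name in names:
            path = os.path.join(proc_root, entry, name)
            try:
                fd = os.open(path, os.O_RDONLY)                 # open()
            except OSError:
                continue                     # the process may have exited
            try:
                data[(int(entry), name)] = os.read(fd, 65536)   # read()
            except OSError:
                pass                         # the process exited mid-read
            finally:
                os.close(fd)                                    # close()
            syscalls += 3
    return data, syscalls

if __name__ == "__main__":
    _, count = read_proc_files()
    print("at least", count, "syscalls")
```

With three files per process, the count grows as 9 syscalls per process, which is where the "thousands of syscalls" figure comes from on a busy system.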

Variety of formats

The files in the /proc/PID/ hierarchy use many different formats, so a separate parser has to be written for each of them.
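To illustrate the point, here is a minimal sketch of parsers for the three files that ps reads. The function names are made up for this example, and only a few fields are extracted; each file needs its own parsing logic because stat is a single space-separated line, status is a list of "Key: value" lines, and cmdline is a NUL-separated byte string.

```python
def parse_stat(text):
    """/proc/PID/stat: one space-separated line; comm is wrapped in
    parentheses and may itself contain spaces or parentheses."""
    left, right = text.index("("), text.rindex(")")
    rest = text[right + 1:].split()
    return {
        "pid": int(text[:left]),
        "comm": text[left + 1:right],
        "state": rest[0],
        "ppid": int(rest[1]),
    }

def parse_status(text):
    """/proc/PID/status: one 'Key:<tab>value' pair per line."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key] = value.strip()
    return result

def parse_cmdline(raw):
    """/proc/PID/cmdline: NUL-terminated arguments, as bytes."""
    if not raw:
        return []                    # kernel threads have an empty cmdline
    if raw.endswith(b"\0"):
        raw = raw[:-1]
    return [arg.decode(errors="replace") for arg in raw.split(b"\0")]
```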

Not enough information

Example: /proc/PID/fd/ doesn't contain file open flags or current position, so we had to introduce /proc/PID/fdinfo/.

Non-extendable formats

Some formats in /proc/PID are non-extendable. For example, the last column of /proc/PID/maps (the file name) is optional, so there is no way to add more columns without breaking the format.
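A short sketch shows why the optional last column blocks extension. The parser below (the name `parse_maps_line` is an assumption for this example) has to treat everything after the fifth field as the file name, because file names may contain spaces; any column appended after it would be indistinguishable from part of the name.

```python
def parse_maps_line(line):
    """Parse one line of /proc/PID/maps.

    The first five fields are fixed; the sixth (path) is optional and is
    taken as "the rest of the line", since it may contain spaces.
    """
    fields = line.rstrip("\n").split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    start, end = (int(x, 16) for x in addr.split("-"))
    return {
        "start": start,
        "end": end,
        "perms": perms,
        "offset": int(offset, 16),
        "dev": dev,
        "inode": int(inode),
        "path": fields[5] if len(fields) > 5 else None,
    }

if __name__ == "__main__":
    # File-backed mapping: six fields.
    print(parse_maps_line("00400000-0040b000 r-xp 00000000 08:01 1234   /bin/cat\n"))
    # Anonymous mapping: the path column is simply absent.
    print(parse_maps_line("7ffc1000-7ffc2000 rw-p 00000000 00:00 0 \n"))
```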

Slow read due to extra info

Sometimes getting information is slow because of extra attributes that are not always needed. For example, /proc/PID/smaps contains the VmFlags field (which can't be added to /proc/PID/maps, see the previous item), but it also contains page statistics that take a long time to generate.

$ time cat /proc/*/maps > /dev/null
real	0m0.061s
user	0m0.002s
sys	0m0.059s


$ time cat /proc/*/smaps > /dev/null
real	0m0.253s
user	0m0.004s
sys	0m0.247s

Proposed solution

The proposal is a /proc/task_diag file that operates on the following principles:

Transactional: write request, read response

Netlink message format (same as used by sock_diag; binary and extendable)

Ability to specify a set of processes to get info about

TASK_DIAG_DUMP_ALL	dump all processes
TASK_DIAG_DUMP_ALL_THREAD	dump all threads
TASK_DIAG_DUMP_CHILDREN	dump children of a specified task
TASK_DIAG_DUMP_THREAD	dump threads of a specified task
TASK_DIAG_DUMP_ONE	dump one task

Optimal grouping of attributes
- No attribute in a group should noticeably affect the response time

The following groups are proposed:

TASK_DIAG_BASE	PID, PPID, PGID, SID, TID, state, comm
TASK_DIAG_CRED	UID, GID, groups, capabilities
TASK_DIAG_STAT	per-task and per-process statistics (same as taskstats, not available in /proc)
TASK_DIAG_VMA	mapped memory regions and their access permissions (same as maps)
TASK_DIAG_VMA_STAT	memory consumption for each mapping (same as smaps)
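The write-request, read-response transaction with netlink-formatted messages could look roughly like the sketch below. Only the 16-byte `struct nlmsghdr` layout and the `NLM_F_*` flag values are standard netlink; the message type, the numeric values of the `TASK_DIAG_*` constants, and the request payload layout (`show_flags`, `dump_strategy`, `pid`) are assumptions made for illustration and are not taken from this article.

```python
import struct

# struct nlmsghdr: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSG_HDR = struct.Struct("=IHHII")
NLM_F_REQUEST = 0x001
NLM_F_DUMP = 0x300                    # NLM_F_ROOT | NLM_F_MATCH

# ASSUMPTIONS: hypothetical numeric values and payload layout.
HYPOTHETICAL_MSG_TYPE = 0x10
TASK_DIAG_DUMP_ALL = 0                # hypothetical value
TASK_DIAG_SHOW_BASE = 1 << 0          # hypothetical bit for the TASK_DIAG_BASE group
REQUEST = struct.Struct("=QQI")       # show_flags, dump_strategy, pid (hypothetical)

def align4(n):
    """Netlink messages are padded to 4-byte boundaries."""
    return (n + 3) & ~3

def build_request(show_flags, dump_strategy, pid=0, seq=1):
    payload = REQUEST.pack(show_flags, dump_strategy, pid)
    header = NLMSG_HDR.pack(NLMSG_HDR.size + len(payload),
                            HYPOTHETICAL_MSG_TYPE,
                            NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return header + payload

def iter_messages(buf):
    """Yield (type, payload) for each netlink message in a response buffer."""
    offset = 0
    while offset + NLMSG_HDR.size <= len(buf):
        length, mtype, _flags, _seq, _pid = NLMSG_HDR.unpack_from(buf, offset)
        if length < NLMSG_HDR.size or offset + length > len(buf):
            break                     # truncated or malformed message
        yield mtype, buf[offset + NLMSG_HDR.size:offset + length]
        offset += align4(length)

def transact(request, path="/proc/task_diag"):
    """One transaction: write the request, then read the response.
    Assumes the response fits in a single read, which may not hold."""
    with open(path, "r+b", buffering=0) as f:
        f.write(request)
        return f.read(65536)
```

A caller would combine these as `iter_messages(transact(build_request(TASK_DIAG_SHOW_BASE, TASK_DIAG_DUMP_ALL)))` on a kernel that carries the task_diag patches; on other kernels the file does not exist and `transact` raises `FileNotFoundError`.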

Performance measurements

ps (emulation)

Get pid, tid, pgid and comm for 50000 processes. The code used is available from https://github.com/avagin/linux-task-diag/tree/task-diag-proc/tools/testing/selftests/task_diag.

Existing interface:

$ time ./task_proc_all a
real    0m0.279s
user    0m0.013s
sys     0m0.255s

New interface:

$ time ./task_diag_all a
real    0m0.051s
user    0m0.001s
sys     0m0.049s

Using perf tool

The following is a quote from an email by David Ahern:

> Using the fork test command:
>    10,000 processes; 10k proc with 5 threads = 50,000 tasks
>    reading /proc: 11.3 sec
>    task_diag:      2.2 sec
>
> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
>
> 128 instances of sepcjbb, 80,000+ tasks:
>     reading /proc: 32.1 sec
>     task_diag:      3.9 sec
>
> So overall much snappier startup times.
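The measurements quoted in this section can be reduced to speedup ratios with a few lines of arithmetic. The numbers below are copied from the timings above (wall-clock seconds); the labels are only descriptive.

```python
# (seconds reading /proc, seconds using task_diag)
measurements = {
    "ps emulation, 50000 processes": (0.279, 0.051),
    "perf, 50,000 tasks": (11.3, 2.2),
    "perf, 7,440 tasks": (0.77, 0.096),
    "perf, 80,000+ tasks": (32.1, 3.9),
}

def speedups(data):
    """Return how many times faster task_diag was in each measurement."""
    return {name: proc / diag for name, (proc, diag) in data.items()}

if __name__ == "__main__":
    for name, ratio in speedups(measurements).items():
        print(f"{name}: {ratio:.1f}x faster")
```

In all four measurements task_diag comes out between roughly 5 and 8 times faster than reading /proc.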

Alternative (bad) solutions

The following information is only interesting in a historical context.

task_diag netlink socket

This was the original proposal: create something very similar to sock_diag (aka tcp_diag aka inet_diag).

It turned out to be a bad idea because:

It's not obvious where to get pid and user namespaces
It's impossible to restrict netlink sockets:
- Credentials are saved when a socket is created
- Process can drop privileges, but netlink doesn't care
- The same socket can be used to get process attributes and to set IP addresses

@@ Line 1: / Line 1: @@
+This article describes a newly proposed interface for getting information about running processes (roughly the same information that is now available from <code>/proc/''PID''/*</code> files).
+== Limitations of /proc/PID interface ==
+The current interface is a set of files in /proc/PID. While this appears to be simple, there are a number of problems with it.
+=== Lots of syscalls ===
+At least three syscalls are required per PID:
+<code>open()</code>, <code>read()</code>, and <code>close()</code>.
+For example, a mere <code>ps ax</code> command performs these 3 syscalls
+for each of 3 files (<code>stat</code>, <code>status</code>, <code>cmdline</code>)
+for each process in the system. This results in thousands of syscalls
+and therefore thousands of user/kernel context switches.
+=== Variety of formats ===
+The files in the <code>/proc/''PID''/</code> hierarchy use many different formats, so a separate parser has to be written for each of them.
+=== Not enough information ===
+Example: <code>/proc/''PID''/fd/</code> doesn't contain file open flags or current position,
+so we had to introduce <code>/proc/''PID''/fdinfo/</code>.
+=== Non-extendable formats ===
+Some formats in /proc/PID are non-extendable. For example,
+the last column of <code>/proc/''PID''/maps</code> (the file name) is optional,
+so there is no way to add more columns without breaking the format.
+=== Slow read due to extra info ===
+Sometimes getting information is slow because of extra attributes
+that are not always needed. For example, <code>/proc/''PID''/smaps</code>
+contains the <code>VmFlags</code> field (which can't be added
+to <code>/proc/''PID''/maps</code>, see the previous item),
+but it also contains page statistics that take a long time to generate.
+<pre>
+$ time cat /proc/*/maps > /dev/null
+real	0m0.061s
+user	0m0.002s
+sys	0m0.059s
+$ time cat /proc/*/smaps > /dev/null
+real	0m0.253s
+user	0m0.004s
+sys	0m0.247s
+</pre>
+== Proposed solution ==
+The proposal is a <code>/proc/task_diag</code> file that operates on the following principles:
+* Transactional: write request, read response
+* Netlink message format (same as used by sock_diag; binary and extendable)
+* Ability to specify a set of processes to get info about
+{| class="wikitable"
+|| TASK_DIAG_DUMP_ALL || dump all processes
+|-
+|| TASK_DIAG_DUMP_ALL_THREAD || dump all threads
+|-
+|| TASK_DIAG_DUMP_CHILDREN || dump children of a specified task
+|-
+|| TASK_DIAG_DUMP_THREAD || dump threads of a specified task
+|-
+|| TASK_DIAG_DUMP_ONE || dump one task
+|}
+* Optimal grouping of attributes
+** No attribute in a group should noticeably affect the response time
+The following groups are proposed:
+{| class="wikitable"
+|| TASK_DIAG_BASE || PID, PPID, PGID, SID, TID, state, comm
+|-
+|| TASK_DIAG_CRED || UID, GID, groups, capabilities
+|-
+|| TASK_DIAG_STAT || per-task and per-process statistics (same as taskstats, not available in /proc)
+|-
+|| TASK_DIAG_VMA || mapped memory regions and their access permissions (same as maps)
+|-
+|| TASK_DIAG_VMA_STAT || memory consumption for each mapping (same as smaps)
+|-
+|}
+=== Performance measurements ===
+==== ps (emulation) ====
+Get pid, tid, pgid and comm for 50000 processes. The code used is available [https://github.com/avagin/linux-task-diag/tree/task-diag-proc/tools/testing/selftests/task_diag from here].
+Existing interface:
+<pre>
+$ time ./task_proc_all a
+real    0m0.279s
+user    0m0.013s
+sys     0m0.255s
+</pre>
+New interface:
+<pre>
+$ time ./task_diag_all a
+real    0m0.051s
+user    0m0.001s
+sys     0m0.049s
+</pre>
+==== Using perf tool ====
+The following is a quote from an email by David Ahern:
+<pre>
+> Using the fork test command:
+>    10,000 processes; 10k proc with 5 threads = 50,000 tasks
+>    reading /proc: 11.3 sec
+>    task_diag:      2.2 sec
+>
+> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
+>
+> 128 instances of sepcjbb, 80,000+ tasks:
+>     reading /proc: 32.1 sec
+>     task_diag:      3.9 sec
+>
+> So overall much snappier startup times.
+</pre>
+== Alternative (bad) solutions ==
+The following information is only interesting in a historical context.
+=== task_diag netlink socket ===
+This was the original proposal -- create something very similar to sock_diag (aka tcp_diag aka inet_diag).
+It turned out to be a bad idea because:
+* It's not obvious where to get pid and user namespaces
+* It's impossible to restrict netlink sockets:
+** Credentials are saved when a socket is created
+** Process can drop privileges, but netlink doesn't care
+** The same socket can be used to get process attributes and to set IP addresses
 == See also ==
-Pending work on the "[[upstream kernel commits]]" pages.
+* [[Upstream kernel commits]]
 [[Category:Development]]
-[[Category:Empty articles]]