Line 1:
+
This article describes a proposed new interface for getting information about running processes (roughly the same information that is now available from <code>/proc/''PID''/*</code> files).
+
+
== Limitations of /proc/PID interface ==
+
+
The current interface is a set of files in /proc/PID. While this appears to be simple, there are a number of problems with it.
+
+
=== Lots of syscalls ===
+
+
At least three syscalls are required for each file read from each PID:
+
<code>open()</code>, <code>read()</code>, and <code>close()</code>.
+
+
For example, a mere <code>ps ax</code> command performs these 3 syscalls
+
for each of 3 files (<code>stat</code>, <code>status</code>, <code>cmdline</code>)
+
for each process in the system. This results in thousands of syscalls
+
and therefore thousands of user/kernel context switches.
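
The cost described above can be made visible with a small sketch. It is not how <code>ps</code> is implemented; it only mimics the access pattern (open, read, close for each of the three files of each PID) and counts the calls it issues. The function name <code>read_task_files</code> and the <code>proc_root</code> parameter are made up for this illustration, and the directory listing itself needs additional syscalls that are not counted.

```python
import os

# The three files that the text says `ps ax` reads for every process.
PROC_FILES = ("stat", "status", "cmdline")


def read_task_files(proc_root="/proc"):
    """Read stat, status and cmdline for every PID under proc_root.

    Returns (data, syscalls), where syscalls counts the open/read/close
    calls issued for files that could be opened.
    """
    data = {}
    syscalls = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():      # only numeric directory names are PIDs
            continue
        files = {}
        for name in PROC_FILES:
            path = os.path.join(proc_root, entry, name)
            try:
                fd = os.open(path, os.O_RDONLY)       # syscall 1: open
            except OSError:
                continue             # task exited or access was denied
            try:
                files[name] = os.read(fd, 65536)      # syscall 2: read
            except OSError:
                pass                 # task exited between open and read
            finally:
                os.close(fd)                          # syscall 3: close
            syscalls += 3
        data[int(entry)] = files
    return data, syscalls


if __name__ == "__main__" and os.path.isdir("/proc"):
    tasks, count = read_task_files()
    print(f"{len(tasks)} tasks, at least {count} syscalls")
```

With N processes the count approaches 9 × N, which is where the "thousands of syscalls" figure comes from on a busy system.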
+
+
=== Variety of formats ===
+
+
Files in the <code>/proc/''PID''/</code> hierarchy use many different formats, so a separate parser has to be written for each of them.
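
To illustrate, here is a minimal sketch of the three parsers needed just for the files mentioned in the previous section. The function names are invented for this example, the sample inputs in the usage note are abbreviated, and only a few fields are extracted.

```python
def parse_stat(text):
    """/proc/PID/stat: space-separated fields, but comm is wrapped in
    parentheses and may itself contain spaces or ')', so the line is
    split around the last ')'."""
    left, right = text.rsplit(")", 1)
    pid, comm = left.split(" (", 1)
    rest = right.split()
    return {"pid": int(pid), "comm": comm, "state": rest[0], "ppid": int(rest[1])}


def parse_status(text):
    """/proc/PID/status: one 'Key:<tab>value' pair per line."""
    out = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            out[key] = value.strip()
    return out


def parse_cmdline(raw):
    """/proc/PID/cmdline: arguments separated by NUL bytes."""
    if raw.endswith(b"\0"):
        raw = raw[:-1]               # drop the terminator of the last argument
    if not raw:
        return []
    return [arg.decode(errors="replace") for arg in raw.split(b"\0")]
```

For example, <code>parse_stat("1234 (my prog) S 1 1234 1234 0 -1 4194304")</code> has to cope with the space inside the command name, a problem that does not exist for <code>status</code> or <code>cmdline</code>, which in turn have quirks of their own.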
+
+
=== Not enough information ===
+
Example: <code>/proc/''PID''/fd/</code> doesn't contain file open flags or current position,
+
so we had to introduce <code>/proc/''PID''/fdinfo/</code>.
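
As a rough illustration of what the extra file provides, the sketch below parses the text of one <code>fdinfo</code> entry. The helper name <code>parse_fdinfo</code> is made up for this example; the sample input in the usage note is hand-written, not captured from a real system.

```python
import os


def parse_fdinfo(text):
    """Parse the 'key:<tab>value' lines of a /proc/PID/fdinfo/N file.

    pos is converted to an integer; flags is printed in octal by the
    kernel, so it is converted with base 8. Other keys stay strings.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    if "pos" in info:
        info["pos"] = int(info["pos"])
    if "flags" in info:
        info["flags"] = int(info["flags"], 8)
    return info


def access_mode(flags):
    """Return the access mode encoded in the open flags."""
    mode = flags & os.O_ACCMODE
    return {os.O_RDONLY: "read-only",
            os.O_WRONLY: "write-only",
            os.O_RDWR: "read-write"}.get(mode, "unknown")
```

Usage: <code>parse_fdinfo("pos:\t1024\nflags:\t0100002\nmnt_id:\t25\n")</code> yields the position and the flags, neither of which can be obtained from the symlinks in <code>/proc/''PID''/fd/</code>.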
+
+
=== Non-extendable formats ===
+
+
Some formats in /proc/PID are non-extendable. For example,
+
the last column (file name) of <code>/proc/''PID''/maps</code> is optional,
+
therefore there is no way to add more columns without breaking the format.
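
The problem can be sketched with a small parser for one <code>maps</code> line. The function name is invented for this example and the sample lines in the usage note are hand-written. Because the file name is optional and may contain spaces, a parser has to treat everything after the fifth field as the name.

```python
def parse_maps_line(line):
    """Parse one line of /proc/PID/maps.

    Fields: address range, permissions, offset, device, inode and an
    optional path. Everything after the inode is taken as the path.
    """
    parts = line.rstrip("\n").split(None, 5)
    start, end = (int(x, 16) for x in parts[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": parts[1],
        "offset": int(parts[2], 16),
        "dev": parts[3],
        "inode": int(parts[4]),
        "path": parts[5] if len(parts) > 5 else None,
    }
```

Given an anonymous mapping such as <code>7f2c4a000000-7f2c4a021000 rw-p 00000000 00:00 0</code>, the path is absent. If a hypothetical seventh column were appended to that line, this parser (like any existing one) would report it as the file name, which is why new columns cannot be added safely.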
+
+
=== Slow read due to extra info ===
+
+
Sometimes getting information is slow due to extra attributes
+
that are not always needed. For example, <code>/proc/''PID''/smaps</code>
+
contains the <code>VmFlags</code> field (which can't be added
+
to <code>/proc/''PID''/maps</code>, see previous item),
+
but it also contains page statistics that take a long time to generate.
+
+
<pre>
+
$ time cat /proc/*/maps > /dev/null
+
real 0m0.061s
+
user 0m0.002s
+
sys 0m0.059s
+
+
+
$ time cat /proc/*/smaps > /dev/null
+
real 0m0.253s
+
user 0m0.004s
+
sys 0m0.247s
+
</pre>
+
+
== Proposed solution ==
+
+
The proposal is a <code>/proc/task_diag</code> file that operates on the following principles:
+
+
* Transactional: write request, read response
+
+
* Netlink message format (same as used by sock_diag; binary and extendable)
+
+
* Ability to specify a set of processes to get info about
+
** TASK_DIAG_DUMP_ALL: dump all processes
+
** TASK_DIAG_DUMP_ALL_THREAD: dump all threads
+
** TASK_DIAG_DUMP_CHILDREN: dump children of a specified task
+
** TASK_DIAG_DUMP_THREAD: dump threads of a specified task
+
** TASK_DIAG_DUMP_ONE: dump one task
+
+
* Optimal grouping of attributes
+
** No attribute in a group may noticeably slow down the response for that group
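
The "binary and extendable" property of the netlink format comes from its type-length-value attributes. The sketch below packs and parses such attributes using the standard <code>struct nlattr</code> layout (a 16-bit length that includes the 4-byte header, a 16-bit type, payload padded to 4 bytes). The attribute type numbers used in the usage note are hypothetical; the article does not give the real task_diag values.

```python
import struct

NLA_HDR = struct.Struct("=HH")   # struct nlattr: nla_len, nla_type (host byte order)


def nla_align(n):
    """Round n up to the 4-byte netlink attribute alignment."""
    return (n + 3) & ~3


def pack_attr(attr_type, payload):
    """Build one attribute: header, payload, then padding to 4 bytes."""
    length = NLA_HDR.size + len(payload)
    return NLA_HDR.pack(length, attr_type) + payload + b"\0" * (nla_align(length) - length)


def parse_attrs(buf, known):
    """Walk a buffer of attributes.

    known maps attribute type numbers to names. Attributes with unknown
    types are skipped using their length, so an old parser keeps working
    when new attribute types are added.
    """
    out, skipped, offset = {}, [], 0
    while offset + NLA_HDR.size <= len(buf):
        length, attr_type = NLA_HDR.unpack_from(buf, offset)
        if length < NLA_HDR.size or offset + length > len(buf):
            raise ValueError("truncated attribute")
        payload = buf[offset + NLA_HDR.size:offset + length]
        if attr_type in known:
            out[known[attr_type]] = payload
        else:
            skipped.append(attr_type)
        offset += nla_align(length)
    return out, skipped
```

Usage, with made-up type numbers 1 (comm), 2 (uid) and 99 (a type the parser does not know): <code>parse_attrs(pack_attr(1, b"bash\0") + pack_attr(99, b"\x01\x02\x03") + pack_attr(2, struct.pack("=I", 1000)), {1: "comm", 2: "uid"})</code> returns the two known attributes and reports type 99 as skipped, in contrast to the column problem of <code>/proc/''PID''/maps</code>.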
+
+
The following groups are proposed:
+
* TASK_DIAG_BASE
+
: PID, PGID, SID, TID, comm
+
* TASK_DIAG_CRED
+
: UID, GID, groups, capabilities
+
* TASK_DIAG_STAT
+
: per-task and per-process statistics (same as taskstats, not available in /proc)
+
* TASK_DIAG_VMA
+
: mapped memory regions and their access permissions (same as maps)
+
* TASK_DIAG_VMA_STAT
+
: memory consumption for each mapping (same as smaps)
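
The principles and groups above could be combined into a request roughly as sketched below. Only the <code>struct nlmsghdr</code> layout and <code>NLM_F_REQUEST</code> are standard netlink; the message type, the numeric values of the groups and dump strategies, and the request body layout (group bitmask, dump strategy, PID) are assumptions made for this illustration and are not taken from the article. The <code>transact</code> helper can only work on a kernel that carries the task_diag patches.

```python
import os
import struct

# Standard netlink header: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSGHDR = struct.Struct("=IHHII")
NLM_F_REQUEST = 0x01

# Everything below is hypothetical: values and layout are assumed.
TASK_DIAG_MSG_TYPE = 0x10
GROUPS = {"BASE": 0, "CRED": 1, "STAT": 2, "VMA": 3, "VMA_STAT": 4}
STRATEGIES = {"DUMP_ALL": 0, "DUMP_ALL_THREAD": 1, "DUMP_CHILDREN": 2,
              "DUMP_THREAD": 3, "DUMP_ONE": 4}
REQ_BODY = struct.Struct("=QQI4x")   # group bitmask, dump strategy, pid, padding


def build_request(groups, strategy, pid=0, seq=1):
    """Build one request: netlink header followed by the assumed body."""
    show = 0
    for group in groups:
        show |= 1 << GROUPS[group]   # ask only for the groups that are needed
    body = REQ_BODY.pack(show, STRATEGIES[strategy], pid)
    header = NLMSGHDR.pack(NLMSGHDR.size + len(body), TASK_DIAG_MSG_TYPE,
                           NLM_F_REQUEST, seq, 0)
    return header + body


def transact(request, path="/proc/task_diag", bufsize=16384):
    """Transactional use of the file: write the request, read the response."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.write(fd, request)
        return os.read(fd, bufsize)
    finally:
        os.close(fd)
```

A call such as <code>build_request(["BASE", "CRED"], "DUMP_ALL")</code> describes every process in a single request, and leaving out <code>VMA_STAT</code> means the slow per-mapping statistics are never computed.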
+
+
=== Performance measurements ===
+
+
==== Get pid, tid, pgid and comm for 50000 processes ====
+
+
Existing interface:
+
<pre>
+
$ time ./task_proc_all a
+
real 0m0.279s
+
user 0m0.013s
+
sys 0m0.255s
+
</pre>
+
+
New interface:
+
<pre>
+
$ time ./task_diag_all a
+
real 0m0.051s
+
user 0m0.001s
+
sys 0m0.049s
+
</pre>
+
+
==== Using perf tool ====
+
+
The following is quoted from an email by David Ahern:
+
+
<pre>
+
> Using the fork test command:
+
> 10,000 processes; 10k proc with 5 threads = 50,000 tasks
+
> reading /proc: 11.3 sec
+
> task_diag: 2.2 sec
+
>
+
> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
+
>
+
> 128 instances of sepcjbb, 80,000+ tasks:
+
> reading /proc: 32.1 sec
+
> task_diag: 3.9 sec
+
>
+
> So overall much snappier startup times.
+
</pre>
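
The speedups implied by the measurements in this section can be worked out directly. The snippet only divides the numbers quoted above (old time by new time); the labels are descriptive names chosen for this example.

```python
# (existing interface, task_diag) timings in seconds, copied from this section
measurements = {
    "task_proc_all vs task_diag_all (real)": (0.279, 0.051),
    "perf, 50,000 tasks": (11.3, 2.2),
    "perf, 7,440 tasks": (0.77, 0.096),
    "perf, 80,000+ tasks": (32.1, 3.9),
}

# Speedup is the old time divided by the new time.
speedups = {name: old / new for name, (old, new) in measurements.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x faster")
```

The ratios fall between roughly 5x and 8x.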
+
+
== Alternative (bad) solutions ==
+
+
The following information is only interesting in a historical context.
+
+
=== task_diag netlink socket ===
+
+
This was the original proposal -- create something very similar to sock_diag (aka tcp_diag aka inet_diag).
+
+
It turned out to be a bad one because:
+
+
* It's not obvious where to get the pid and user namespaces from
+
* It's impossible to restrict netlink sockets:
+
** Credentials are saved when a socket is created
+
** A process can drop privileges, but netlink keeps using the saved credentials
+
** The same socket can be used to get process attributes and to set IP addresses
+
== See also ==
−
Pending work on the "[[upstream kernel commits]]" pages.
+
* [[Upstream kernel commits]]
[[Category:Development]]
−
[[Category:Empty articles]]