Task-diag
This article describes a proposed new interface for getting information about running processes (roughly the same information that is now available from the /proc/PID/* files).
Limitations of /proc/PID interface
The current interface is a set of files in /proc/PID. While this appears to be simple, there are a number of problems with it.
Lots of syscalls
At least three syscalls are required to read each file of each PID: open(), read(), and close(). For example, a mere ps ax command performs these 3 syscalls for each of 3 files (stat, status, cmdline) for each process in the system. This results in thousands of syscalls and therefore thousands of user/kernel context switches.
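To make the cost concrete, the following is a minimal Python sketch (not the actual ps implementation) that reads the same three files for every PID and counts the syscalls it attempts. The names `read_all` and `expected_syscalls` are illustrative, and the sketch assumes one read() per file is enough, so the count is a lower bound.

```python
import os

# The three per-process files that "ps ax" is described as reading.
PER_PID_FILES = ("stat", "status", "cmdline")


def expected_syscalls(n_procs, files=len(PER_PID_FILES), calls_per_file=3):
    """Lower bound: open() + read() + close() for each file of each process."""
    return n_procs * files * calls_per_file


def read_all(proc_root="/proc"):
    """Read the per-PID files and count the open/read/close calls attempted."""
    syscalls = 0
    tasks = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        info = {}
        for name in PER_PID_FILES:
            path = os.path.join(proc_root, entry, name)
            syscalls += 1  # open()
            try:
                fd = os.open(path, os.O_RDONLY)
            except OSError:
                continue  # the process may have exited in the meantime
            try:
                syscalls += 1  # read()
                info[name] = os.read(fd, 65536)
            except OSError:
                pass
            finally:
                syscalls += 1  # close()
                os.close(fd)
        tasks[int(entry)] = info
    return tasks, syscalls
```

Calling `read_all()` on a Linux system returns the raw file contents per PID together with the number of syscalls attempted, which grows linearly with the number of processes.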
Variety of formats
There are many different formats used by files in the /proc/PID/ hierarchy. Therefore, a separate parser has to be written for each format.
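As an illustration, the three files read by ps ax already need three different parsers. The sketch below is a simplified Python example (the function names are illustrative and only a few fields are extracted): stat is a single space-separated line whose comm field is wrapped in parentheses, status is a list of "Key: value" lines, and cmdline is a NUL-separated byte string.

```python
def parse_stat(text):
    """Parse the start of /proc/PID/stat: 'pid (comm) state ppid ...'."""
    # comm may contain spaces and parentheses, so locate the outermost pair.
    lparen = text.index("(")
    rparen = text.rindex(")")
    rest = text[rparen + 1:].split()
    return {
        "pid": int(text[:lparen]),
        "comm": text[lparen + 1:rparen],
        "state": rest[0],
        "ppid": int(rest[1]),
    }


def parse_status(text):
    """Parse /proc/PID/status: one 'Key:<tab>value' pair per line."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()
    return fields


def parse_cmdline(raw):
    """Parse /proc/PID/cmdline: arguments separated by NUL bytes."""
    if not raw:
        return []  # e.g. kernel threads have an empty cmdline
    return [arg.decode(errors="replace") for arg in raw.rstrip(b"\0").split(b"\0")]
```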
Not enough information
Example: /proc/PID/fd/ doesn't contain file open flags or current position, so we had to introduce /proc/PID/fdinfo/.
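For reference, an fdinfo entry is yet another "key: value" text file, with the flags printed in octal. A minimal Python sketch of reading the two fields mentioned above follows; `parse_fdinfo` is an illustrative name and the sample text in the usage note is made up for the example.

```python
import os


def parse_fdinfo(text):
    """Extract the file position and open flags from /proc/PID/fdinfo/FD text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    flags = int(fields["flags"], 8)  # flags are printed in octal
    return {
        "pos": int(fields["pos"]),
        "flags": flags,
        "access_mode": flags & os.O_ACCMODE,  # O_RDONLY, O_WRONLY or O_RDWR
    }
```

For example, `parse_fdinfo("pos:\t0\nflags:\t02004002\nmnt_id:\t9\n")` yields a position of 0 and an access mode equal to `os.O_RDWR`.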
Non-extendable formats
Some formats in /proc/PID are non-extendable. For example, the last column of /proc/PID/maps (the file name) is optional, therefore there is no way to add more columns without breaking the format.
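The problem can be seen from the parser's side. The following Python sketch (an illustrative parser, not taken from any particular tool) treats the first five whitespace-separated fields as fixed and everything after them as the optional file name, which is why a newly added column would be indistinguishable from a path.

```python
def parse_maps_line(line):
    """Parse one /proc/PID/maps line: 'start-end perms offset dev inode [path]'."""
    parts = line.split(None, 5)
    addr, perms, offset, dev, inode = parts[:5]
    # Anything after the inode is taken as the file name; it may be absent.
    path = parts[5].rstrip() if len(parts) > 5 else ""
    start, end = (int(x, 16) for x in addr.split("-"))
    return {
        "start": start,
        "end": end,
        "perms": perms,
        "offset": int(offset, 16),
        "dev": dev,
        "inode": int(inode),
        "path": path,
    }
```

With this kind of parser, an anonymous mapping line extended by a hypothetical extra column would simply be reported as a mapping of a file named after that column's value.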
Slow read due to extra info
Sometimes getting information is slow because of extra attributes that are not always needed. For example, /proc/PID/smaps contains the VmFlags field (which can't be added to /proc/PID/maps, see the previous item), but it also contains page statistics that take a long time to generate.
    $ time cat /proc/*/maps > /dev/null
    real 0m0.061s
    user 0m0.002s
    sys  0m0.059s

    $ time cat /proc/*/smaps > /dev/null
    real 0m0.253s
    user 0m0.004s
    sys  0m0.247s
Proposed solution
The proposed solution is the /proc/task_diag file, which operates on the following principles:
- Transactional: write request, read response
- Netlink message format (same as used by sock_diag; binary and extendable)
- Ability to specify a set of processes to get info about
TASK_DIAG_DUMP_ALL | dump all processes |
TASK_DIAG_DUMP_ALL_THREAD | dump all threads |
TASK_DIAG_DUMP_CHILDREN | dump children of a specified task |
TASK_DIAG_DUMP_THREAD | dump threads of a specified task |
TASK_DIAG_DUMP_ONE | dump one task |
- Optimal grouping of attributes
- No attribute in a group should noticeably affect the response time
The following groups are proposed:
TASK_DIAG_BASE | PID, PPID, PGID, SID, TID, state, comm |
TASK_DIAG_CRED | UID, GID, groups, capabilities |
TASK_DIAG_STAT | per-task and per-process statistics (same as taskstats, not available in /proc) |
TASK_DIAG_VMA | mapped memory regions and their access permissions (same as maps) |
TASK_DIAG_VMA_STAT | memory consumption for each mapping (same as smaps) |
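The reason a netlink-style format is extendable can be sketched in a few lines of Python. The header layouts below are the standard netlink ones (a 16-byte nlmsghdr followed by length/type-prefixed attributes padded to 4 bytes), but the message type and attribute numbers are hypothetical placeholders, not the actual task_diag values, and the sketch does not talk to the kernel.

```python
import struct

# Standard netlink layouts: nlmsghdr (len, type, flags, seq, pid), nlattr (len, type).
NLMSG_HDR = struct.Struct("=IHHII")
NLA_HDR = struct.Struct("=HH")


def align4(n):
    """Netlink attributes are padded to a 4-byte boundary."""
    return (n + 3) & ~3


def pack_attr(attr_type, payload):
    length = NLA_HDR.size + len(payload)  # nla_len excludes the padding
    return NLA_HDR.pack(length, attr_type) + payload + b"\0" * (align4(length) - length)


def pack_message(msg_type, attrs, seq=1):
    body = b"".join(pack_attr(t, p) for t, p in attrs)
    return NLMSG_HDR.pack(NLMSG_HDR.size + len(body), msg_type, 0, seq, 0) + body


def parse_attrs(message, known):
    """Return payloads of known attribute types, skipping unknown ones."""
    length = NLMSG_HDR.unpack_from(message, 0)[0]
    offset = NLMSG_HDR.size
    found = {}
    while offset + NLA_HDR.size <= length:
        nla_len, nla_type = NLA_HDR.unpack_from(message, offset)
        if nla_len < NLA_HDR.size:
            break  # malformed attribute; stop parsing
        if nla_type in known:
            found[nla_type] = message[offset + NLA_HDR.size:offset + nla_len]
        offset += align4(nla_len)  # the length prefix lets a reader skip anything
    return found
```

Because every attribute carries its own length, a reader that only knows attribute types 1 and 2 can still parse a message that also contains a newer type 99, which is exactly what a text format with an optional last column cannot offer.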
Performance measurements
ps (emulation)
Get pid, tid, pgid and comm for 50000 processes. The code used is available from here.
Existing interface:
    $ time ./task_proc_all a
    real 0m0.279s
    user 0m0.013s
    sys  0m0.255s
New interface:
    $ time ./task_diag_all a
    real 0m0.051s
    user 0m0.001s
    sys  0m0.049s
Using perf tool
The following is a quote from an email by David Ahern:
> Using the fork test command:
> 10,000 processes; 10k proc with 5 threads = 50,000 tasks
> reading /proc: 11.3 sec
> task_diag: 2.2 sec
>
> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
>
> 128 instances of sepcjbb, 80,000+ tasks:
> reading /proc: 32.1 sec
> task_diag: 3.9 sec
>
> So overall much snappier startup times.
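The relative gain implied by the quoted timings can be worked out with a few lines of Python; the numbers are copied from the quote, while the labels and the `speedup` helper are only for illustration.

```python
# (seconds reading /proc, seconds using task_diag), taken from the quote above.
MEASUREMENTS = {
    "fork test, 50,000 tasks": (11.3, 2.2),
    "7,440 tasks": (0.77, 0.096),
    "128 benchmark instances, 80,000+ tasks": (32.1, 3.9),
}


def speedup(proc_seconds, task_diag_seconds):
    """How many times faster task_diag was than reading /proc."""
    return proc_seconds / task_diag_seconds


for label, (proc_s, diag_s) in MEASUREMENTS.items():
    print(f"{label}: {speedup(proc_s, diag_s):.1f}x")
```

This gives roughly 5.1x, 8.0x and 8.2x for the three cases.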
Alternative (bad) solutions
The following information is only interesting in a historical context.
task_diag netlink socket
This was the original proposal: create something very similar to sock_diag (aka tcp_diag aka inet_diag).
It turned out to be a bad one because:
- It's not obvious where to get pid and user namespaces
- It's impossible to restrict netlink sockets:
  - Credentials are saved when a socket is created
  - Process can drop privileges, but netlink doesn't care
  - The same socket can be used to get process attributes and to set ip addresses