Task-diag
This article describes a proposed new interface for getting information about running processes (roughly the same information that is now available from /proc/PID/* files).
Limitations of /proc/PID interface
The current interface is a set of files in /proc/PID. While this appears simple, there are a number of problems with it.
Lots of syscalls
At least three syscalls are required for each PID: open(), read(), and close().
For example, a mere ps ax command performs these three syscalls for each of three files (stat, status, cmdline) for each process in the system. This results in thousands of syscalls and therefore thousands of user/kernel context switches.
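The access pattern described above can be sketched in Python as follows. This is only an illustration, not how ps is actually implemented: the names scan_proc and proc_root are made up for this example, and it assumes that each file fits into a single read().

```python
import os

# The three per-process files that `ps ax` reads, per the text above.
FILES = ("stat", "status", "cmdline")


def scan_proc(proc_root="/proc"):
    """Read FILES for every PID directory; return (data, syscall_count).

    Only attempted open/read/close calls are counted; the directory
    listing itself is ignored.
    """
    syscalls = 0
    data = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        info = {}
        for name in FILES:
            path = os.path.join(proc_root, entry, name)
            syscalls += 1                        # syscall 1: open()
            try:
                fd = os.open(path, os.O_RDONLY)
            except OSError:
                continue  # the process may have exited in the meantime
            try:
                syscalls += 1                    # syscall 2: read()
                info[name] = os.read(fd, 65536)
            except OSError:
                pass  # the task can disappear between open() and read()
            finally:
                syscalls += 1                    # syscall 3: close()
                os.close(fd)
        data[int(entry)] = info
    return data, syscalls
```

Calling scan_proc() on a system with N live processes reports roughly 9 × N syscalls, which is the cost the rest of this article tries to avoid.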
Variety of formats
Files in the /proc/PID/ hierarchy use many different formats, so a separate parser has to be written for each of them.
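To make the point concrete, here is a minimal sketch of the three unrelated parsers needed just for the stat, status, and cmdline files. The function names are invented for this example, and only a few fields of each file are extracted.

```python
def parse_stat(text):
    """/proc/PID/stat: space-separated fields, comm wrapped in parentheses."""
    # comm may itself contain spaces and ')' characters,
    # so the split has to happen around the LAST ')'.
    lparen = text.index("(")
    rparen = text.rindex(")")
    rest = text[rparen + 1:].split()
    return {
        "pid": int(text[:lparen]),
        "comm": text[lparen + 1:rparen],
        "state": rest[0],
        "ppid": int(rest[1]),
    }


def parse_status(text):
    """/proc/PID/status: one 'Key:<tab>value' pair per line."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()
    return fields


def parse_cmdline(data):
    """/proc/PID/cmdline: NUL-separated arguments, read as bytes."""
    if not data:
        return []  # kernel threads have an empty cmdline
    return [arg.decode(errors="replace")
            for arg in data.rstrip(b"\0").split(b"\0")]
```

Each function handles a different quoting and separator convention, and none of the code can be shared between them.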
Not enough information
Example: /proc/PID/fd/ doesn't contain file open flags or the current position, so we had to introduce /proc/PID/fdinfo/.
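As an illustration of what this second directory provides, the sketch below parses the contents of one /proc/PID/fdinfo/N file. The function name parse_fdinfo is made up for this example. It relies on the documented fact that the flags field is printed in octal; the access-mode mask is the Linux value.

```python
O_ACCMODE = 0o3  # Linux value of the access-mode mask


def parse_fdinfo(text):
    """Extract position and open flags from a /proc/PID/fdinfo/N file."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value.strip()
    flags = int(info["flags"], 8)  # the kernel prints flags in octal
    return {
        "pos": int(info["pos"]),
        "flags": flags,
        "access_mode": flags & O_ACCMODE,
    }
```

A tool that wants both the target of a descriptor and its state therefore has to combine a readlink() on /proc/PID/fd/N with a read of /proc/PID/fdinfo/N.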
Non-extendable formats
Some formats in /proc/PID are non-extendable. For example, the last column of /proc/PID/maps (file name) is optional, therefore there is no way to add more columns without breaking the format.
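The following sketch shows why the optional last column blocks any extension. Because the file name may be absent and may contain spaces, a parser has to treat everything after the fifth field as the name. The function name parse_maps_line is made up for this example.

```python
def parse_maps_line(line):
    """Parse one /proc/PID/maps line.

    Layout: address perms offset dev inode [pathname]
    """
    # At most 5 splits: whatever follows the inode is the path, verbatim.
    parts = line.rstrip("\n").split(None, 5)
    addr, perms, offset, dev, inode = parts[:5]
    start, end = (int(x, 16) for x in addr.split("-"))
    return {
        "start": start,
        "end": end,
        "perms": perms,
        "offset": int(offset, 16),
        "dev": dev,
        "inode": int(inode),
        "path": parts[5] if len(parts) > 5 else "",
    }
```

If the kernel appended a new column after the file name, existing parsers written this way would silently read it as part of the path.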
Slow read due to extra info
Sometimes getting information is slow due to extra attributes that are not always needed. For example, /proc/PID/smaps contains the VmFlags field (which can't be added to /proc/PID/maps, see the previous item), but it also contains page stats that take a long time to generate.
$ time cat /proc/*/maps > /dev/null
real 0m0.061s
user 0m0.002s
sys 0m0.059s

$ time cat /proc/*/smaps > /dev/null
real 0m0.253s
user 0m0.004s
sys 0m0.247s
Proposed solution
The proposal is a /proc/task_diag file, which operates on the following principles:
- Transactional: write request, read response
- Netlink message format (same as used by sock_diag; binary and extendable)
- Ability to specify a set of processes to get info about
TASK_DIAG_DUMP_ALL | dump all processes |
TASK_DIAG_DUMP_ALL_THREAD | dump all threads |
TASK_DIAG_DUMP_CHILDREN | dump children of a specified task |
TASK_DIAG_DUMP_THREAD | dump threads of a specified task |
TASK_DIAG_DUMP_ONE | dump one task |
- Optimal grouping of attributes
- No single attribute in a group should noticeably increase the response time
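The "binary and extendable" property of the netlink format comes from its attributes being type-length-value records: a 16-bit length, a 16-bit type, then the payload padded to a 4-byte boundary. The sketch below packs and parses such attributes. The helper names and the attribute type numbers used with them are made up for this example and are not task_diag constants.

```python
import struct

NLA_HDR = struct.Struct("=HH")  # nla_len, nla_type in host byte order
NLA_ALIGNTO = 4                 # attributes are padded to 4 bytes


def nla_align(n):
    """Round n up to the next multiple of NLA_ALIGNTO."""
    return (n + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def pack_attr(attr_type, payload):
    """Build one attribute: header, payload, zero padding."""
    length = NLA_HDR.size + len(payload)  # nla_len excludes the padding
    return (NLA_HDR.pack(length, attr_type) + payload
            + b"\0" * (nla_align(length) - length))


def parse_attrs(buf):
    """Return {type: payload} for a buffer of consecutive attributes."""
    attrs = {}
    off = 0
    while off + NLA_HDR.size <= len(buf):
        length, attr_type = NLA_HDR.unpack_from(buf, off)
        if length < NLA_HDR.size or off + length > len(buf):
            break  # malformed or truncated attribute
        attrs[attr_type] = buf[off + NLA_HDR.size:off + length]
        off += nla_align(length)  # skip the padding as well
    return attrs
```

A reader that does not know a given type can skip it using the length alone, so new attributes can be added without breaking old tools, unlike the text formats discussed earlier.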
The following groups are proposed:
TASK_DIAG_BASE | PID, PGID, SID, TID, comm |
TASK_DIAG_CRED | UID, GID, groups, capabilities |
TASK_DIAG_STAT | per-task and per-process statistics (same as taskstats, not available in /proc) |
TASK_DIAG_VMA | mapped memory regions and their access permissions (same as maps) |
TASK_DIAG_VMA_STAT | memory consumption for each mapping (same as smaps) |
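One way a client could select groups is with a bitmask, so that a ps-like tool asks only for the cheap groups and never pays for per-mapping statistics. The sketch below is purely hypothetical: the text above does not specify how groups are requested, and the class name, bit values, and helper are assumptions made for illustration.

```python
from enum import IntFlag


class Show(IntFlag):
    """Hypothetical selection bits, one per proposed attribute group."""
    BASE = 1 << 0      # TASK_DIAG_BASE
    CRED = 1 << 1      # TASK_DIAG_CRED
    STAT = 1 << 2      # TASK_DIAG_STAT
    VMA = 1 << 3       # TASK_DIAG_VMA
    VMA_STAT = 1 << 4  # TASK_DIAG_VMA_STAT


def wants(mask, group):
    """True if the request mask includes the given group."""
    return bool(mask & group)


# A ps-like tool needs identity and credentials only.
PS_GROUPS = Show.BASE | Show.CRED
```

With such a mask, the expensive VMA_STAT group is generated only for requests that explicitly ask for it.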
Performance measurements
ps (emulation)
Get pid, tid, pgid and comm for 50000 processes.
Existing interface:
$ time ./task_proc_all a
real 0m0.279s
user 0m0.013s
sys 0m0.255s
New interface:
$ time ./task_diag_all a
real 0m0.051s
user 0m0.001s
sys 0m0.049s
Using perf tool
The following is a quote from an email by David Ahern:
> Using the fork test command:
>    10,000 processes; 10k proc with 5 threads = 50,000 tasks
>    reading /proc: 11.3 sec
>    task_diag:      2.2 sec
>
> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
>
> 128 instances of sepcjbb, 80,000+ tasks:
>    reading /proc: 32.1 sec
>    task_diag:      3.9 sec
>
> So overall much snappier startup times.
Alternative (bad) solutions
The following information is only interesting in a historical context.
task_diag netlink socket
This was the original proposal -- create something very similar to sock_diag (aka tcp_diag aka inet_diag).
It turned out to be a bad idea because:
- It's not obvious where to get pid and user namespaces
- It's impossible to restrict netlink sockets:
  - Credentials are saved when a socket is created
  - A process can drop privileges, but netlink doesn't care
  - The same socket can be used to get process attributes and to set IP addresses