This article describes a newly proposed interface for getting information about running processes (roughly the same information that is now available from
- 1 Limitations of /proc/PID interface
- 2 Proposed solution
- 3 Alternative (bad) solutions
- 4 See also
Limitations of /proc/PID interface
The current interface is a set of files in /proc/PID. While this appears to be simple, there are a number of problems with it.
Lots of syscalls
At least three syscalls per each PID are required —
For example, a mere
ps ax command performs these 3 syscalls
for each of 3 files (
for each process in the system. This results in thousands of syscalls
and therefore thousands of user/kernel context switches.
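To make the cost concrete, here is a minimal Python sketch that reads three files per process and counts the explicit open/read/close calls it issues. The file names in `FILES` are an assumption made for illustration (the text above only says "3 files"), and `read_proc_files` and its `proc_root` parameter are hypothetical names, not part of any existing tool.

```python
import os

# Assumed file names; the text only states that 3 files are read per process.
FILES = ("stat", "status", "cmdline")


def read_proc_files(proc_root="/proc"):
    """Read FILES for every numeric directory under proc_root.

    Returns (data, syscalls), where syscalls counts the explicit
    open/read/close calls attempted. Listing the directory itself
    is not counted.
    """
    data = {}
    syscalls = 0
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # not a PID directory
        for fname in FILES:
            path = os.path.join(proc_root, name, fname)
            syscalls += 1  # open()
            try:
                fd = os.open(path, os.O_RDONLY)
            except OSError:
                continue  # the process may have exited in the meantime
            try:
                data[(int(name), fname)] = os.read(fd, 65536)
            except OSError:
                pass  # reading can also fail for an exiting process
            finally:
                os.close(fd)
            syscalls += 2  # read() + close()
    return data, syscalls
```

Calling `read_proc_files()` on a live system returns a count of roughly nine calls per process, which is how a few thousand processes turn into tens of thousands of user/kernel transitions.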
Variety of formats
Files in the /proc/PID/ hierarchy use many different formats,
so a separate parser has to be written for each of them.
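As an illustration, the sketch below parses two of these files, and the two parsers share almost nothing. It assumes the usual layouts: /proc/PID/stat as a single space-separated line with the command name in parentheses, and /proc/PID/status as "Key: value" lines. The function names are hypothetical and only a few fields are extracted.

```python
def parse_stat(text):
    """Parse the start of /proc/PID/stat: 'pid (comm) state ppid ...'.

    comm may contain spaces and parentheses, so the line is split
    around the first '(' and the last ')'.
    """
    left = text.index("(")
    right = text.rindex(")")
    rest = text[right + 1:].split()
    return {
        "pid": int(text[:left]),
        "comm": text[left + 1:right],
        "state": rest[0],
        "ppid": int(rest[1]),
    }


def parse_status(text):
    """Parse /proc/PID/status: one 'Key:<tab>value' pair per line."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()
    return fields
```

Both functions return the parent PID, yet one has to work around an unescaped command name while the other has to split on a colon and keep every value as an untyped string.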
Not enough information
/proc/PID/fd/ doesn't contain file open flags or current position,
so we had to introduce
Some formats in /proc/PID are not extensible. For example,
the last column of /proc/PID/maps (the file name) is optional,
so there is no way to add more columns without breaking the format.
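The problem can be seen in a small parser sketch. It assumes the usual maps layout (address range, permissions, offset, device, inode, optional path), and `parse_maps_line` is a hypothetical helper. Because the path is optional and may contain spaces, everything after the inode has to be treated as the path.

```python
def parse_maps_line(line):
    """Parse one line of /proc/PID/maps into a dict.

    Everything after the fifth field is taken as the path, since the
    path is optional and may itself contain spaces.
    """
    parts = line.split(None, 5)
    start, end = (int(x, 16) for x in parts[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": parts[1],
        "offset": int(parts[2], 16),
        "dev": parts[3],
        "inode": int(parts[4]),
        "path": parts[5].rstrip("\n") if len(parts) > 5 else None,
    }
```

If a new column were ever appended to an anonymous mapping's line, this parser (and any existing one like it) would report that value as the mapped file name.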
Slow read due to extra info
Sometimes getting information is slow due to extra attributes
that are not always needed. For example,
VmFlags field (which can't be added to
/proc/PID/maps, see the previous item),
but it also contains page stats that take a long time to generate.
$ time cat /proc/*/maps > /dev/null
real 0m0.061s
user 0m0.002s
sys 0m0.059s

$ time cat /proc/*/smaps > /dev/null
real 0m0.253s
user 0m0.004s
sys 0m0.247s
Proposed solution
The proposed interface is the
/proc/task_diag file, which operates on the following principles:
- Transactional: write request, read response
- Netlink message format (same as used by sock_diag; binary and extendable)
- Ability to specify a set of processes to get info about
|TASK_DIAG_DUMP_ALL||dump all processes|
|TASK_DIAG_DUMP_ALL_THREAD||dump all threads|
|TASK_DIAG_DUMP_CHILDREN||dump children of a specified task|
|TASK_DIAG_DUMP_THREAD||dump threads of a specified task|
|TASK_DIAG_DUMP_ONE||dump one task|
- Optimal grouping of attributes
- No attribute in a group may affect the response time
The following groups are proposed:
|TASK_DIAG_BASE||PID, PPID, PGID, SID, TID, state, comm|
|TASK_DIAG_CRED||UID, GID, groups, capabilities|
|TASK_DIAG_STAT||per-task and per-process statistics (same as taskstats, not available in /proc)|
|TASK_DIAG_VMA||mapped memory regions and their access permissions (same as maps)|
|TASK_DIAG_VMA_STAT||memory consumption for each mapping (same as smaps)|
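The "binary and extendable" property comes from the netlink attribute layout: each attribute is a small header (16-bit length including the header, 16-bit type, host byte order) followed by a payload padded to 4 bytes. The Python sketch below shows how a reader can skip attribute types it does not know. The attribute type numbers and the names "comm" and "pid" in the usage are hypothetical placeholders, not the values used by task_diag.

```python
import struct

# struct nlattr: nla_len (includes this header), nla_type; host byte order.
NLA_HDR = struct.Struct("=HH")


def nla_align(n):
    """Round n up to the 4-byte netlink attribute alignment."""
    return (n + 3) & ~3


def pack_attr(attr_type, payload):
    """Build one attribute: header, payload, zero padding."""
    length = NLA_HDR.size + len(payload)
    padding = b"\0" * (nla_align(length) - length)
    return NLA_HDR.pack(length, attr_type) + payload + padding


def parse_attrs(buf, known):
    """Return {name: payload} for attribute types listed in `known`.

    Unknown types are skipped using their length field, which is what
    lets new attributes be added without breaking old readers.
    """
    out = {}
    off = 0
    while off + NLA_HDR.size <= len(buf):
        length, attr_type = NLA_HDR.unpack_from(buf, off)
        if length < NLA_HDR.size or off + length > len(buf):
            raise ValueError("truncated attribute")
        if attr_type in known:
            out[known[attr_type]] = buf[off + NLA_HDR.size:off + length]
        off += nla_align(length)
    return out
```

A reader written against types 1 and 2 keeps working when a response also carries, say, type 99: `parse_attrs(buf, {1: "comm", 2: "pid"})` simply steps over the extra attribute.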
The following test gets the pid, tid, pgid and comm for 50000 processes. The code used is available from here.
$ time ./task_proc_all a
real 0m0.279s
user 0m0.013s
sys 0m0.255s

$ time ./task_diag_all a
real 0m0.051s
user 0m0.001s
sys 0m0.049s
Using perf tool
The following is a quote from an email by David Ahern:
> Using the fork test command:
> 10,000 processes; 10k proc with 5 threads = 50,000 tasks
> reading /proc: 11.3 sec
> task_diag: 2.2 sec
>
> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
>
> 128 instances of sepcjbb, 80,000+ tasks:
> reading /proc: 32.1 sec
> task_diag: 3.9 sec
>
> So overall much snappier startup times.
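For a quick comparison, the figures quoted above can be turned into speedup ratios. This is plain arithmetic on the numbers from the email; the labels and the `speedup` helper are only illustrative.

```python
def speedup(proc_seconds, task_diag_seconds):
    """How many times faster task_diag was than reading /proc."""
    return proc_seconds / task_diag_seconds


# (reading /proc, task_diag) in seconds, taken from the quote above.
measurements = {
    "50,000 tasks": (11.3, 2.2),
    "7,440 tasks": (0.77, 0.096),
    "80,000+ tasks": (32.1, 3.9),
}

for label, (proc_s, diag_s) in measurements.items():
    print(f"{label}: {speedup(proc_s, diag_s):.1f}x")
```

The ratios come out at roughly 5x to 8x in favour of task_diag.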
Alternative (bad) solutions
The following information is only interesting in a historical context.
The original proposal was to create a netlink socket interface very similar to sock_diag (aka tcp_diag aka inet_diag).
It turned out to be a bad idea because:
- It's not obvious where to get pid and user namespaces
- It's impossible to restrict netlink sockets:
- Credentials are saved when a socket is created
- Process can drop privileges, but netlink doesn't care
- The same socket can be used to get process attributes and to set IP addresses