Task-diag

This article describes a newly proposed interface for getting information about running processes (roughly the same information that is now available from the /proc/PID/* files).

Limitations of /proc/PID interface

The current interface is a set of files in /proc/PID. While this appears to be simple, there are a number of problems with it.

Lots of syscalls

Reading a single file for a PID takes at least three syscalls: open(), read(), and close().

For example, a mere ps ax command performs these 3 syscalls for each of 3 files (stat, status, cmdline) for each process in the system. This results in thousands of syscalls and therefore thousands of user/kernel context switches.
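
To make the cost concrete, here is a minimal Python sketch that walks a /proc-like directory the way the text describes and counts the explicit open(), read() and close() calls it issues. The function name scan_proc and its root parameter are made up for this illustration, and the count is a lower bound: the directory listing itself and any files larger than one read buffer cost extra syscalls.

```python
import os

# The three per-process files that the text says `ps ax` reads.
FILES = ("stat", "status", "cmdline")


def scan_proc(root="/proc"):
    """Read stat, status and cmdline for every numeric entry under root.

    Returns (processes seen, explicit open/read/close calls issued).
    """
    processes = 0
    syscalls = 0
    for entry in os.listdir(root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as meminfo
        processes += 1
        for name in FILES:
            path = os.path.join(root, entry, name)
            syscalls += 1  # open()
            try:
                fd = os.open(path, os.O_RDONLY)
            except OSError:
                continue  # the process exited or access was denied
            try:
                syscalls += 1  # read()
                os.read(fd, 65536)
            except OSError:
                pass  # the process may exit between open() and read()
            finally:
                syscalls += 1  # close()
                os.close(fd)
    return processes, syscalls
```

On a live system, scan_proc() returns roughly nine calls per process, so a machine with a few thousand processes ends up well into the tens of thousands of syscalls for one listing.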

Variety of formats

Files in the /proc/PID/ hierarchy use many different formats, so a separate parser has to be written for each of them.
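
As an illustration, the three files mentioned earlier (stat, status, cmdline) already need three unrelated parsers. The sketch below is a simplified example, not a complete implementation: it extracts only a few fields, and the function names are chosen for this illustration.

```python
def parse_stat(text):
    """/proc/PID/stat: space-separated fields, except that comm is wrapped
    in parentheses and may itself contain spaces or parentheses."""
    left = text.index("(")
    right = text.rindex(")")  # the last ')' ends comm
    rest = text[right + 1:].split()
    return {
        "pid": int(text[:left]),
        "comm": text[left + 1:right],
        "state": rest[0],
        "ppid": int(rest[1]),
    }


def parse_status(text):
    """/proc/PID/status: one 'Key:<tab>value' pair per line."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key] = value.strip()
    return result


def parse_cmdline(data):
    """/proc/PID/cmdline: arguments terminated by NUL bytes."""
    if data.endswith(b"\0"):
        data = data[:-1]  # drop the terminator of the last argument
    if not data:
        return []
    return [arg.decode(errors="replace") for arg in data.split(b"\0")]
```

Note that none of the three can reuse code from the others: one format is positional, one is key/value text, and one is a binary NUL-separated list.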

Not enough information

Some files do not expose everything that is needed. For example, /proc/PID/fd/ doesn't contain the file open flags or the current position, so we had to introduce /proc/PID/fdinfo/.
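
For reference, this is roughly what a consumer of /proc/PID/fdinfo/N has to do to get the two missing values. It is a sketch that handles only the pos and flags lines (the flags value is printed in octal); the helper names and the ACCESS_MODE_MASK constant are introduced here for illustration.

```python
import os

ACCESS_MODE_MASK = 0o3  # low two bits hold the access mode on Linux


def parse_fdinfo(text):
    """Extract the file position and open flags from fdinfo text."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        if key == "pos":
            info["pos"] = int(value.strip())
        elif key == "flags":
            info["flags"] = int(value.strip(), 8)  # octal in fdinfo
    return info


def access_mode(flags):
    """Name the access mode encoded in the open flags."""
    names = {os.O_RDONLY: "O_RDONLY", os.O_WRONLY: "O_WRONLY", os.O_RDWR: "O_RDWR"}
    return names.get(flags & ACCESS_MODE_MASK, "unknown")
```

So a tool that wants both the target of a descriptor and its state has to read the symlink in fd/ and then open, read and parse a second file in fdinfo/.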

Non-extendable formats

Some formats in /proc/PID are non-extendable. For example, the last column of /proc/PID/maps (the file name) is optional, so there is no way to add more columns without breaking the format.
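
The problem is easy to see in a parser. The sketch below (the function name is illustrative) has to treat everything after the fifth field as the file name, because the name is optional and may contain spaces. Any new column appended to the line would be silently read as part of the name.

```python
def parse_maps_line(line):
    """Parse one line of /proc/PID/maps.

    Layout: start-end perms offset dev inode [pathname]
    """
    # Split at most 5 times: whatever is left over is the path.
    fields = line.rstrip("\n").split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": fields[1],
        "offset": int(fields[2], 16),
        "dev": fields[3],
        "inode": int(fields[4]),
        # Anonymous mappings have no sixth field at all.
        "path": fields[5] if len(fields) > 5 else None,
    }
```

A binary, attribute-based format avoids this, since each value carries its own type and length.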

Slow read due to extra info

Sometimes getting information is slow because of extra attributes that are not always needed. For example, /proc/PID/smaps contains the VmFlags field (which can't be added to /proc/PID/maps, see the previous item), but it also contains page statistics that take a long time to generate.

$ time cat /proc/*/maps > /dev/null
real	0m0.061s
user	0m0.002s
sys	0m0.059s


$ time cat /proc/*/smaps > /dev/null
real	0m0.253s
user	0m0.004s
sys	0m0.247s
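
The same comparison can be repeated from a script. This is a minimal sketch, assuming a helper named time_read (not part of any existing tool) that reads every file matching a glob pattern and reports the elapsed wall-clock time and the number of bytes read; absolute numbers will differ from the shell measurement above.

```python
import glob
import time


def time_read(pattern):
    """Read every file matching pattern; return (seconds, bytes read)."""
    start = time.perf_counter()
    total = 0
    for path in glob.glob(pattern):
        try:
            with open(path, "rb") as f:
                total += len(f.read())
        except OSError:
            pass  # the process exited or access was denied
    return time.perf_counter() - start, total


# Example, on a Linux system:
#   for name in ("maps", "smaps"):
#       print(name, time_read("/proc/[0-9]*/" + name))
```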

Proposed solution

The proposal is a new file, /proc/task_diag, which operates on the following principles:

Transactional: write request, read response

Netlink message format (same as used by sock_diag; binary and extendable)

Ability to specify a set of processes to get info about

TASK_DIAG_DUMP_ALL	dump all processes
TASK_DIAG_DUMP_ALL_THREAD	dump all threads
TASK_DIAG_DUMP_CHILDREN	dump children of a specified task
TASK_DIAG_DUMP_THREAD	dump threads of a specified task
TASK_DIAG_DUMP_ONE	dump one task

Optimal grouping of attributes
- No single attribute in a group should affect the response time

The following groups are proposed:

TASK_DIAG_BASE	PID, PGID, SID, TID, comm
TASK_DIAG_CRED	UID, GID, groups, capabilities
TASK_DIAG_STAT	per-task and per-process statistics (same as taskstats, not available in /proc)
TASK_DIAG_VMA	mapped memory regions and their access permissions (same as maps)
TASK_DIAG_VMA_STAT	memory consumption for each mapping (same as smaps)
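
The "binary and extendable" property comes from the netlink framing: a message is a fixed header (struct nlmsghdr) followed by type-length-value attributes (struct nlattr) padded to 4 bytes. The sketch below shows only that generic framing. The article does not give the task_diag request layout or attribute numbers, so the message type and the ATTR_* constants here are hypothetical values chosen for the example.

```python
import struct

NLMSG_HDR = struct.Struct("=IHHII")  # struct nlmsghdr: len, type, flags, seq, pid
NLA_HDR = struct.Struct("=HH")       # struct nlattr: len, type

# Hypothetical attribute numbers, made up for this sketch.
ATTR_PID, ATTR_COMM, ATTR_FUTURE = 1, 2, 99


def align4(n):
    """Round n up to the 4-byte netlink alignment."""
    return (n + 3) & ~3


def pack_attr(attr_type, payload):
    """Encode one attribute; nla_len covers header + payload, not padding."""
    length = NLA_HDR.size + len(payload)
    padding = b"\0" * (align4(length) - length)
    return NLA_HDR.pack(length, attr_type) + payload + padding


def pack_message(msg_type, seq, attrs):
    """Encode a netlink header followed by the given (type, payload) pairs."""
    body = b"".join(pack_attr(t, p) for t, p in attrs)
    return NLMSG_HDR.pack(NLMSG_HDR.size + len(body), msg_type, 0, seq, 0) + body


def parse_message(data, known):
    """Return {type: payload} for known attribute types, skipping the rest."""
    length = NLMSG_HDR.unpack_from(data, 0)[0]
    offset, found = NLMSG_HDR.size, {}
    while offset + NLA_HDR.size <= length:
        nla_len, nla_type = NLA_HDR.unpack_from(data, offset)
        if nla_len < NLA_HDR.size:
            break  # malformed attribute, stop parsing
        if nla_type in known:
            found[nla_type] = data[offset + NLA_HDR.size:offset + nla_len]
        offset += align4(nla_len)  # unknown types are stepped over
    return found
```

Because every attribute carries its own length, a parser that knows only ATTR_PID and ATTR_COMM steps over ATTR_FUTURE without breaking. This is what allows new attributes to be added later, unlike the text formats described above.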

Performance measurements

ps (emulation)

Get pid, tid, pgid and comm for 50000 processes. The code used is available from here.

Existing interface:

$ time ./task_proc_all a
real    0m0.279s
user    0m0.013s
sys     0m0.255s

New interface:

$ time ./task_diag_all a
real    0m0.051s
user    0m0.001s
sys     0m0.049s

Using the perf tool

The following is a quote from an email by David Ahern:

> Using the fork test command:
>    10,000 processes; 10k proc with 5 threads = 50,000 tasks
>    reading /proc: 11.3 sec
>    task_diag:      2.2 sec
>
> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
>
> 128 instances of specjbb, 80,000+ tasks:
>     reading /proc: 32.1 sec
>     task_diag:      3.9 sec
>
> So overall much snappier startup times.
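
Putting the timings from this section side by side gives a rough idea of the speedup. The snippet below only divides the numbers quoted above (the "real" time for the ps emulation, and the figures from the email); the labels are shorthand for this illustration.

```python
# (seconds reading /proc, seconds using task_diag), as quoted in the text
measurements = {
    "ps emulation, 50000 processes": (0.279, 0.051),
    "perf, 50,000 tasks": (11.3, 2.2),
    "perf, 7,440 tasks": (0.77, 0.096),
    "perf, 80,000+ tasks": (32.1, 3.9),
}


def speedup(proc_seconds, diag_seconds):
    """How many times faster task_diag was than reading /proc."""
    return proc_seconds / diag_seconds


for label, (proc_s, diag_s) in measurements.items():
    print(f"{label}: {speedup(proc_s, diag_s):.1f}x")
```

In these measurements task_diag comes out roughly 5 to 8 times faster than reading /proc.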

Alternative (bad) solutions

The following information is of historical interest only.

task_diag netlink socket

This was the original proposal -- create something very similar to sock_diag (aka tcp_diag aka inet_diag).

It turned out to be a bad idea because:

It's not obvious where to get the pid and user namespaces from
It's impossible to restrict netlink sockets:
- Credentials are saved when a socket is created
- A process can drop privileges, but netlink doesn't care
- The same socket can be used to get process attributes and to set IP addresses