Difference between revisions of "Task-diag"
This article describes a proposed interface for getting information about running processes (roughly the same information that is currently available from the <code>/proc/''PID''/*</code> files).

== Limitations of /proc/PID interface ==

The current interface is a set of files in <code>/proc/''PID''</code>. While this appears simple, there are a number of problems with it.

=== Lots of syscalls ===

At least three syscalls are required for each PID: <code>open()</code>, <code>read()</code>, and <code>close()</code>.

For example, a mere <code>ps ax</code> command performs these three syscalls
for each of three files (<code>stat</code>, <code>status</code>, <code>cmdline</code>)
for each process in the system. This results in thousands of syscalls
and therefore thousands of user/kernel context switches.

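The cost described above can be made concrete with a minimal Python sketch. It is not the code <code>ps</code> uses; the names <code>read_task_files</code>, <code>scan_all</code> and <code>proc_root</code> are illustrative, and each file is assumed to fit in a single <code>read()</code>, so the count is a lower bound.

```python
import os

TASK_FILES = ("stat", "status", "cmdline")  # the three files read per process


def read_task_files(proc_root, pid, bufsize=65536):
    """Read the three files for one PID and count the syscalls issued.

    Assumption: each file fits in one read() of bufsize bytes, so the
    count returned here is a lower bound.
    """
    contents = {}
    syscalls = 0
    for name in TASK_FILES:
        path = os.path.join(proc_root, str(pid), name)
        fd = os.open(path, os.O_RDONLY)            # syscall 1: open()
        try:
            contents[name] = os.read(fd, bufsize)  # syscall 2: read()
        finally:
            os.close(fd)                           # syscall 3: close()
        syscalls += 3
    return contents, syscalls


def scan_all(proc_root="/proc"):
    """Read the files of every numeric directory and sum the syscall counts.

    Listing the directory costs further syscalls that are not counted here.
    """
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            _, count = read_task_files(proc_root, entry)
        except OSError:
            continue  # the process exited, or access was denied, mid-scan
        total += count
    return total
```

Under these assumptions each process costs nine syscalls, so a system with a thousand processes needs at least nine thousand of them for a single listing.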
=== Variety of formats ===

The files in the <code>/proc/''PID''/</code> hierarchy use many different formats, so a separate parser has to be written for each one.

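To illustrate, here is a minimal sketch of three parsers for the three files mentioned earlier. The function names are illustrative and only a few fields are extracted; the point is that none of the parsing logic can be shared between the files.

```python
def parse_stat(text):
    """stat: space-separated fields, with comm wrapped in parentheses.

    comm may itself contain spaces or parentheses, so split around the
    last ')' instead of splitting on whitespace.
    """
    left, right = text.rsplit(")", 1)
    pid, comm = left.split(" (", 1)
    rest = right.split()
    return {"pid": int(pid), "comm": comm, "state": rest[0], "ppid": int(rest[1])}


def parse_status(text):
    """status: one 'Key:<tab>value' pair per line."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key] = value.strip()
    return result


def parse_cmdline(raw):
    """cmdline: raw bytes, with arguments terminated by NUL bytes."""
    if raw.endswith(b"\0"):
        raw = raw[:-1]
    if not raw:
        return []
    return [arg.decode(errors="replace") for arg in raw.split(b"\0")]
```

Even for these three files the parsers differ in tokenization, in escaping rules, and in whether the input is text or binary.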
=== Not enough information ===

For example, <code>/proc/''PID''/fd/</code> doesn't expose file open flags or the current file position,
so <code>/proc/''PID''/fdinfo/</code> had to be introduced.

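As a rough sketch of what a consumer then has to do, the following parses the text of one <code>fdinfo</code> entry to recover the position and open flags. The helper names are illustrative, and the sample input in the usage note is made up; the <code>flags</code> value is printed in octal.

```python
import os


def parse_fdinfo(text):
    """Extract the file position and open flags from one fdinfo entry."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return {
        "pos": int(info["pos"]),
        "flags": int(info["flags"], 8),  # the flags field is octal
    }


def access_mode(flags):
    """Translate the access-mode bits (mask 0o3 on Linux) to a label."""
    names = {
        os.O_RDONLY: "read-only",
        os.O_WRONLY: "write-only",
        os.O_RDWR: "read-write",
    }
    return names[flags & 0o3]
```

For instance, <code>parse_fdinfo("pos:\t1024\nflags:\t0100002\n")</code> yields a position of 1024 and flags whose access mode is read-write. Getting both the target of a descriptor and these attributes therefore means reading two different places per descriptor.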
=== Non-extendable formats ===

Some formats in <code>/proc/''PID''</code> are non-extendable. For example,
the last column (file name) of <code>/proc/''PID''/maps</code> is optional,
so there is no way to add more columns without breaking the format.

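The ambiguity can be sketched with a small parser for one <code>maps</code> line. This is an illustration, not code from any existing tool; <code>parse_maps_line</code> is a made-up name, and the extra column in the usage note is hypothetical.

```python
def parse_maps_line(line):
    """Parse one maps line: address range, perms, offset, dev, inode, [path].

    The path is optional and may contain spaces, so everything after the
    fifth field is treated as the path.
    """
    parts = line.rstrip("\n").split(None, 5)
    start, end = (int(value, 16) for value in parts[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": parts[1],
        "offset": int(parts[2], 16),
        "dev": parts[3],
        "inode": int(parts[4]),
        "path": parts[5] if len(parts) > 5 else "",
    }
```

If a hypothetical new column were appended to an anonymous mapping, for example <code>7ffd1000-7ffd3000 rw-p 00000000 00:00 0 newfield</code>, a parser like this one would report <code>newfield</code> as the file name, which is why existing consumers would break.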
=== Slow read due to extra info ===

Sometimes getting information is slow because of extra attributes
that are not always needed. For example, <code>/proc/''PID''/smaps</code>
contains the <code>VmFlags</code> field (which can't be added
to <code>/proc/''PID''/maps</code>, see the previous item),
but it also contains page statistics that take a long time to generate.

<pre>
$ time cat /proc/*/maps > /dev/null
real 0m0.061s
user 0m0.002s
sys 0m0.059s

$ time cat /proc/*/smaps > /dev/null
real 0m0.253s
user 0m0.004s
sys 0m0.247s
</pre>

== Proposed solution ==

The proposal is a <code>/proc/task_diag</code> file, which operates on the following principles:

* Transactional: write a request, read the response

* Netlink message format (same as used by sock_diag; binary and extendable)

* Ability to specify the set of processes to get information about
{| class="wikitable"
| TASK_DIAG_DUMP_ALL || dump all processes
|-
| TASK_DIAG_DUMP_ALL_THREAD || dump all threads
|-
| TASK_DIAG_DUMP_CHILDREN || dump children of a specified task
|-
| TASK_DIAG_DUMP_THREAD || dump threads of a specified task
|-
| TASK_DIAG_DUMP_ONE || dump one task
|}

* Optimal grouping of attributes
** Attributes are grouped so that no single attribute in a group noticeably increases the response time

The following groups are proposed:

{| class="wikitable"
| TASK_DIAG_BASE || PID, PPID, PGID, SID, TID, state, comm
|-
| TASK_DIAG_CRED || UID, GID, groups, capabilities
|-
| TASK_DIAG_STAT || per-task and per-process statistics (same as taskstats, not available in /proc)
|-
| TASK_DIAG_VMA || mapped memory regions and their access permissions (same as maps)
|-
| TASK_DIAG_VMA_STAT || memory consumption for each mapping (same as smaps)
|}

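To show why the netlink format is both binary and extendable, here is a minimal Python sketch of netlink framing: a 16-byte <code>nlmsghdr</code> followed by type-length-value attributes padded to 4 bytes. The header layouts are the generic netlink ones. Everything specific to task_diag is an assumption: the message type, the attribute numbers and the payloads used in the usage note are made up, and the real request and response layout is not described in this article.

```python
import struct

NLMSG_HDR = struct.Struct("=IHHII")  # nlmsghdr: len, type, flags, seq, pid
NLA_HDR = struct.Struct("=HH")       # nlattr: nla_len, nla_type


def align4(n):
    """Round n up to the 4-byte netlink alignment."""
    return (n + 3) & ~3


def pack_attr(attr_type, value):
    """Encode one attribute; nla_len covers header and value, not padding."""
    length = NLA_HDR.size + len(value)
    return NLA_HDR.pack(length, attr_type) + value + b"\0" * (align4(length) - length)


def pack_message(msg_type, flags, seq, payload):
    """Prefix a payload with a netlink message header."""
    length = NLMSG_HDR.size + len(payload)
    header = NLMSG_HDR.pack(length, msg_type, flags, seq, 0)
    return header + payload + b"\0" * (align4(length) - length)


def parse_message(buf):
    """Split one message into (type, flags, seq, payload)."""
    length, msg_type, flags, seq, _pid = NLMSG_HDR.unpack_from(buf, 0)
    return msg_type, flags, seq, buf[NLMSG_HDR.size:length]


def parse_attrs(buf):
    """Walk the attributes in a payload and return {type: value bytes}."""
    attrs = {}
    offset = 0
    while offset + NLA_HDR.size <= len(buf):
        length, attr_type = NLA_HDR.unpack_from(buf, offset)
        if length < NLA_HDR.size or offset + length > len(buf):
            break  # malformed attribute, stop parsing
        attrs[attr_type] = buf[offset + NLA_HDR.size:offset + length]
        offset += align4(length)
    return attrs
```

Because every attribute carries its own type and length, a reader that only knows attribute types 1 and 2 can step over an attribute of type 99 added later, which is the property the text formats in <code>/proc/''PID''</code> lack. In the transactional model, a buffer like the one returned by <code>pack_message()</code> would be written to the file and the response read back from the same descriptor.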
=== Performance measurements ===

==== ps (emulation) ====

Get the pid, tid, pgid and comm for 50000 processes. The code used is available [https://github.com/avagin/linux-task-diag/tree/task-diag-proc/tools/testing/selftests/task_diag here].

Existing interface:
<pre>
$ time ./task_proc_all a
real 0m0.279s
user 0m0.013s
sys 0m0.255s
</pre>

New interface:
<pre>
$ time ./task_diag_all a
real 0m0.051s
user 0m0.001s
sys 0m0.049s
</pre>

==== Using perf tool ====

The following is a quote from an email by David Ahern:

<pre>
> Using the fork test command:
> 10,000 processes; 10k proc with 5 threads = 50,000 tasks
> reading /proc: 11.3 sec
> task_diag: 2.2 sec
>
> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
>
> 128 instances of sepcjbb, 80,000+ tasks:
> reading /proc: 32.1 sec
> task_diag: 3.9 sec
>
> So overall much snappier startup times.
</pre>

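The timings in this section can be turned into speedup factors with a few lines of Python. The numbers are copied from the measurements quoted here; the labels and the <code>speedup</code> helper are only illustrative.

```python
# (seconds reading /proc, seconds using task_diag), as quoted in this section
MEASUREMENTS = {
    "ps emulation, 50000 processes": (0.279, 0.051),
    "perf, fork test, 50,000 tasks": (11.3, 2.2),
    "perf, 7,440 tasks": (0.77, 0.096),
    "perf, 128 instances, 80,000+ tasks": (32.1, 3.9),
}


def speedup(proc_seconds, task_diag_seconds):
    """How many times faster task_diag was than reading /proc."""
    return proc_seconds / task_diag_seconds


for label, (proc_seconds, task_diag_seconds) in MEASUREMENTS.items():
    print(f"{label}: {speedup(proc_seconds, task_diag_seconds):.1f}x")
```

In these runs task_diag comes out roughly five to eight times faster than reading <code>/proc</code>.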
== Alternative (bad) solutions ==

The following information is of historical interest only.

=== task_diag netlink socket ===

This was the original proposal: create something very similar to sock_diag (aka tcp_diag aka inet_diag).

It turned out to be a bad idea because:

* It's not obvious where to get the pid and user namespaces from
* It's impossible to restrict netlink sockets:
** Credentials are saved when a socket is created
** A process can drop privileges, but netlink doesn't care
** The same socket can be used to get process attributes and to set IP addresses

== See also ==
* [[Upstream kernel commits]]

[[Category:Development]]
Latest revision as of 22:03, 24 September 2018