Task-diag
Latest revision as of 22:03, 24 September 2018
This article describes a newly proposed interface for getting information about running processes (roughly the same information that is now available from /proc/PID/* files).
Limitations of the /proc/PID interface
The current interface is a set of files in /proc/PID. While this appears to be simple, there are a number of problems with it.
Lots of syscalls
At least three syscalls are required to read each file of each PID: open(), read(), and close().
For example, a mere ps ax command performs these 3 syscalls for each of 3 files (stat, status, cmdline) for each process in the system. This results in thousands of syscalls and therefore thousands of user/kernel context switches.
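To make the cost concrete, here is a minimal sketch that reads the same three files for every PID and counts the open/read/close calls it issues. The function name read_all and the single 64 KiB read per file are assumptions made for illustration; a real tool may need several read() calls per file.

```python
import os

FILES = ("stat", "status", "cmdline")  # the three files named in the text


def read_all(proc_root):
    """Read FILES for every numeric directory under proc_root.

    Returns (data, syscalls). Only successful open/read/close calls per file
    are counted; the directory listing and failed opens are not.
    """
    data, syscalls = {}, 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        for name in FILES:
            path = os.path.join(proc_root, entry, name)
            try:
                fd = os.open(path, os.O_RDONLY)
            except OSError:
                continue  # the process may have exited in the meantime
            syscalls += 1
            try:
                data[(int(entry), name)] = os.read(fd, 65536)
                syscalls += 1
            finally:
                os.close(fd)
                syscalls += 1
    return data, syscalls
```

Calling read_all("/proc") on a Linux system with N processes issues about 9 * N counted syscalls, which is where the "thousands of syscalls" figure comes from.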
Variety of formats
Files in the /proc/PID/ hierarchy use many different formats, so a separate parser has to be written for each of them.
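As an illustration, the following sketch parses two of these files, and each needs its own logic: stat is a single space-separated line whose second field is wrapped in parentheses, while status is a list of "Key: value" lines. The function names are made up for this example, and only a few fields are extracted.

```python
def parse_stat(text):
    """Parse /proc/PID/stat, which looks like 'pid (comm) state ppid ...'.

    comm may itself contain spaces and parentheses, so the split point is
    the LAST ')' in the line, not the first.
    """
    lparen = text.index("(")
    rparen = text.rindex(")")
    rest = text[rparen + 1:].split()
    return {
        "pid": int(text[:lparen]),
        "comm": text[lparen + 1:rparen],
        "state": rest[0],
        "ppid": int(rest[1]),
    }


def parse_status(text):
    """Parse /proc/PID/status: one 'Key:<tab>value' pair per line."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()
    return fields
```

Neither parser can be reused for the other file, and both return strings that the caller still has to convert field by field.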
Not enough information
Example: /proc/PID/fd/ doesn't contain the file open flags or the current position, so we had to introduce /proc/PID/fdinfo/.
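A short sketch of what a consumer then has to do: read a second file per descriptor and parse yet another format. It assumes the usual "pos:" and "flags:" lines of an fdinfo entry, with flags printed in octal; the function name and the sample text are illustrative only.

```python
def parse_fdinfo(text):
    """Extract the file position and open flags from an fdinfo entry."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        "pos": int(fields["pos"]),
        "flags": int(fields["flags"], 8),  # the kernel prints flags in octal
    }


# Illustrative sample, not captured from a real process.
SAMPLE = "pos:\t0\nflags:\t0100002\nmnt_id:\t21\n"
info = parse_fdinfo(SAMPLE)
access_mode = info["flags"] & 0o3  # low two bits: 0 = read, 1 = write, 2 = read/write
```

This is one more open/read/close round trip per file descriptor, on top of the readlink() needed for the /proc/PID/fd/ entry itself.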
Non-extendable formats
Some formats in /proc/PID are non-extendable. For example, the last column (file name) of /proc/PID/maps is optional, so there is no way to add more columns without breaking the format.
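The problem can be sketched with a small parser for one maps line. Because the file name is optional and may contain spaces, everything after the inode has to be treated as the name, so any column appended later would be indistinguishable from it. The function name and the sample lines are made up for illustration.

```python
def parse_maps_line(line):
    """Parse 'start-end perms offset dev inode [pathname]'."""
    parts = line.split(None, 5)  # at most 6 fields; the 6th is the rest of the line
    start, end = (int(x, 16) for x in parts[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": parts[1],
        "offset": int(parts[2], 16),
        "dev": parts[3],
        "inode": int(parts[4]),
        "pathname": parts[5].rstrip("\n") if len(parts) > 5 else None,
    }


named = parse_maps_line("00400000-0040b000 r-xp 00000000 08:01 1234 /opt/my app/bin\n")
anon = parse_maps_line("7f0000000000-7f0000021000 rw-p 00000000 00:00 0\n")
# A hypothetical extra column ("42") added after the inode of an anonymous
# mapping is read by existing parsers as the file name.
extended = parse_maps_line("7f0000000000-7f0000021000 rw-p 00000000 00:00 0 42\n")
```

Here named["pathname"] is "/opt/my app/bin", anon["pathname"] is None, and extended["pathname"] is "42", which is exactly the ambiguity that rules out new columns.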
Slow read due to extra info
Sometimes getting information is slow because of extra attributes that are not always needed. For example, /proc/PID/smaps contains the VmFlags field (which can't be added to /proc/PID/maps, see the previous item), but it also contains page statistics that take a long time to generate.
    $ time cat /proc/*/maps > /dev/null
    real 0m0.061s
    user 0m0.002s
    sys 0m0.059s

    $ time cat /proc/*/smaps > /dev/null
    real 0m0.253s
    user 0m0.004s
    sys 0m0.247s
Proposed solution
The proposal is a /proc/task_diag file that operates on the following principles:
- Transactional: write request, read response
- Netlink message format (same as used by sock_diag; binary and extendable)
- Ability to select the set of processes to report on:
| TASK_DIAG_DUMP_ALL | dump all processes |
| TASK_DIAG_DUMP_ALL_THREAD | dump all threads |
| TASK_DIAG_DUMP_CHILDREN | dump children of a specified task |
| TASK_DIAG_DUMP_THREAD | dump threads of a specified task |
| TASK_DIAG_DUMP_ONE | dump one task |
- Optimal grouping of attributes
  - No attribute in a group should noticeably slow down the response for that group
The following groups are proposed:
| TASK_DIAG_BASE | PID, PPID, PGID, SID, TID, state, comm |
| TASK_DIAG_CRED | UID, GID, groups, capabilities |
| TASK_DIAG_STAT | per-task and per-process statistics (same as taskstats, not available in /proc) |
| TASK_DIAG_VMA | mapped memory regions and their access permissions (same as maps) |
| TASK_DIAG_VMA_STAT | memory consumption for each mapping (same as smaps) |
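The netlink message format mentioned above is a fixed header followed by type-length-value attributes, which is what makes it binary and extendable: a reader skips attribute types it does not know. The sketch below packs and parses that generic framing only. The header layouts follow the standard struct nlmsghdr and struct nlattr; the numeric attribute types and payloads used with it are placeholders, since the actual task_diag values are not given here.

```python
import struct

NLMSG_HDR = struct.Struct("=IHHII")  # struct nlmsghdr: len, type, flags, seq, pid
NLA_HDR = struct.Struct("=HH")       # struct nlattr: len, type


def align4(n):
    """Round n up to the 4-byte netlink alignment."""
    return (n + 3) & ~3


def pack_attr(attr_type, payload):
    """Encode one attribute; the length field excludes the trailing padding."""
    length = NLA_HDR.size + len(payload)
    padding = b"\0" * (align4(length) - length)
    return NLA_HDR.pack(length, attr_type) + payload + padding


def pack_msg(msg_type, flags, seq, payload):
    """Prefix payload with a netlink message header (port id left as 0)."""
    length = NLMSG_HDR.size + len(payload)
    return NLMSG_HDR.pack(length, msg_type, flags, seq, 0) + payload


def parse_attrs(buf):
    """Decode a run of attributes into {type: payload bytes}."""
    attrs, off = {}, 0
    while off + NLA_HDR.size <= len(buf):
        length, attr_type = NLA_HDR.unpack_from(buf, off)
        if length < NLA_HDR.size or off + length > len(buf):
            break  # malformed attribute, stop parsing
        attrs[attr_type] = buf[off + NLA_HDR.size:off + length]
        off += align4(length)
    return attrs
```

In the proposed interface a client would write a request built this way to /proc/task_diag and then read back messages whose attributes correspond to the requested groups. Adding a new group later only adds a new attribute type, so old parsers keep working.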
Performance measurements
ps (emulation)
The test gets the pid, tid, pgid and comm of 50000 processes. The code used is available at https://github.com/avagin/linux-task-diag/tree/task-diag-proc/tools/testing/selftests/task_diag.
Existing interface:
    $ time ./task_proc_all a
    real 0m0.279s
    user 0m0.013s
    sys 0m0.255s
New interface:
    $ time ./task_diag_all a
    real 0m0.051s
    user 0m0.001s
    sys 0m0.049s
Using the perf tool
The following is a quote from an email by David Ahern:
> Using the fork test command:
> 10,000 processes; 10k proc with 5 threads = 50,000 tasks
> reading /proc: 11.3 sec
> task_diag: 2.2 sec
>
> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
>
> 128 instances of sepcjbb, 80,000+ tasks:
> reading /proc: 32.1 sec
> task_diag: 3.9 sec
>
> So overall much snappier startup times.
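For a quick comparison, the measurements above can be reduced to speedup factors. The snippet only divides the wall-clock times already quoted in this section; the labels are shorthand chosen for this example.

```python
# (seconds reading /proc, seconds using task_diag), taken from the figures above
measurements = {
    "ps emulation, 50000 processes": (0.279, 0.051),
    "perf, 50,000 tasks": (11.3, 2.2),
    "perf, 7,440 tasks": (0.77, 0.096),
    "perf, 80,000+ tasks": (32.1, 3.9),
}


def speedup(proc_seconds, task_diag_seconds):
    """How many times faster task_diag was than reading /proc."""
    return proc_seconds / task_diag_seconds


for name, (old, new) in measurements.items():
    print(f"{name}: {speedup(old, new):.1f}x")
```

Across these four measurements task_diag is roughly 5 to 8 times faster than reading /proc.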
Alternative (bad) solutions
The following information is only interesting in a historical context.
task_diag netlink socket
This was the original proposal: create something very similar to sock_diag (aka tcp_diag aka inet_diag).
It turned out to be a bad one because:
- It's not obvious where to get the pid and user namespaces from
- It's impossible to restrict netlink sockets:
  - Credentials are saved when a socket is created
  - A process can drop privileges, but netlink doesn't care
  - The same socket can be used to get process attributes and to set IP addresses