Line 1: |
Line 1: |
| + | This article describes a proposed interface for getting information about running processes (roughly the same information that is currently available from <code>/proc/''PID''/*</code> files). |
| + | |
| + | == Limitations of /proc/PID interface == |
| + | |
| + | The current interface is a set of files in /proc/PID. While this appears to be simple, there are a number of problems with it. |
| + | |
| + | === Lots of syscalls === |
| + | |
| + | At least three syscalls are required for each file read from a PID's directory: |
| + | <code>open()</code>, <code>read()</code>, and <code>close()</code>. |
| + | |
| + | For example, a mere <code>ps ax</code> command performs these 3 syscalls |
| + | for each of 3 files (<code>stat</code>, <code>status</code>, <code>cmdline</code>) |
| + | for each process in the system. This results in thousands of syscalls |
| + | and therefore thousands of user/kernel context switches. |
| + | |
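The cost described above can be made concrete with a minimal sketch that reads the same three files for every process and counts the calls it makes. The function name <code>scan_proc</code> and the <code>root</code> parameter are illustrative assumptions (the parameter exists only so the function can be pointed at a test directory), and the directory listing itself costs further syscalls that are not counted here.

```python
import os

# The three files that, according to the text above, `ps ax` reads per process.
FILES = ("stat", "status", "cmdline")


def scan_proc(root="/proc"):
    """Read FILES for every numeric directory under root.

    Returns (data, syscalls), where syscalls counts only the
    open/read/close calls made on the success path.
    """
    data = {}
    syscalls = 0
    for entry in os.listdir(root):
        if not entry.isdigit():
            continue  # not a PID directory
        for name in FILES:
            path = os.path.join(root, entry, name)
            try:
                fd = os.open(path, os.O_RDONLY)
            except OSError:
                continue  # process exited or file not accessible
            syscalls += 1  # open()
            try:
                chunk = os.read(fd, 65536)
            except OSError:
                chunk = b""  # process may exit between open() and read()
            syscalls += 1  # read()
            os.close(fd)
            syscalls += 1  # close()
            data[(int(entry), name)] = chunk
    return data, syscalls
```

Calling <code>scan_proc()</code> on a live system returns a count of roughly nine times the number of processes, which is the per-process overhead the section describes.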
| + | === Variety of formats === |
| + | |
| + | Files in the <code>/proc/''PID''/</code> hierarchy use many different formats, so a separate parser has to be written for each of them. |
| + | |
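As an illustration, the three files read by <code>ps ax</code> already need three unrelated parsers. The sketch below is not taken from any existing tool. The function names are made up for this example, and only a few leading fields of <code>stat</code> are decoded.

```python
def parse_stat(text):
    """/proc/PID/stat: space-separated fields, comm wrapped in parentheses."""
    # comm may contain spaces and ')' itself, so split at the LAST ')'.
    left, _, rest = text.rpartition(")")
    pid, _, comm = left.partition(" (")
    fields = rest.split()
    return {
        "pid": int(pid),
        "comm": comm,
        "state": fields[0],
        "ppid": int(fields[1]),
        "pgrp": int(fields[2]),
        "sid": int(fields[3]),
    }


def parse_status(text):
    """/proc/PID/status: one 'Key:<tab>value' pair per line."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key] = value.strip()
    return result


def parse_cmdline(raw):
    """/proc/PID/cmdline: NUL-separated arguments, as bytes."""
    if not raw:
        return []  # e.g. kernel threads have an empty cmdline
    return raw.rstrip(b"\0").split(b"\0")
```

Each function handles a quirk that the others do not share (parentheses around the command name, tab-separated key/value lines, NUL separators), which is the maintenance burden this section refers to.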
| + | === Not enough information === |
| + | Example: <code>/proc/''PID''/fd/</code> doesn't contain file open flags or current position, |
| + | so we had to introduce <code>/proc/''PID''/fdinfo/</code>. |
| + | |
| + | === Non-extendable formats === |
| + | |
| + | Some formats in /proc/PID are non-extendable. For example, |
| + | the last column (file name) of <code>/proc/''PID''/maps</code> is optional, |
| + | so there is no way to add more columns without breaking the format. |
| + | |
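The problem is easier to see from the parser's side. In the following sketch (the function name is an assumption made for this example), everything after the fifth column is treated as the file name, because that is the only way to support names containing spaces. A hypothetical sixth column added to a line without a file name would therefore be read as a file name.

```python
def parse_maps_line(line):
    """Split one /proc/PID/maps line into its fields."""
    # The first five columns never contain spaces; whatever is left is the
    # optional file name, which may itself contain spaces.
    parts = line.split(None, 5)
    address, perms, offset, dev, inode = parts[:5]
    start, end = (int(x, 16) for x in address.split("-"))
    return {
        "start": start,
        "end": end,
        "perms": perms,
        "offset": int(offset, 16),
        "dev": dev,
        "inode": int(inode),
        "path": parts[5].strip() if len(parts) > 5 else "",
    }


# A file-backed mapping and an anonymous one (no last column).
file_line = "00400000-0040b000 r-xp 00000000 08:01 1234        /bin/cat\n"
anon_line = "7ffd1c000000-7ffd1c021000 rw-p 00000000 00:00 0\n"

# Hypothetical: the same anonymous mapping with a new column appended.
extended_line = "7ffd1c000000-7ffd1c021000 rw-p 00000000 00:00 0 newflag\n"
```

With these inputs, <code>parse_maps_line(extended_line)["path"]</code> is <code>"newflag"</code>: an existing parser cannot tell a new column from a file name.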
| + | === Slow read due to extra info === |
| + | |
| + | Sometimes getting information is slow because of extra attributes |
| + | that are not always needed. For example, <code>/proc/''PID''/smaps</code> |
| + | contains the <code>VmFlags</code> field (which can't be added |
| + | to <code>/proc/''PID''/maps</code>, see the previous item), |
| + | but it also contains page statistics that take a long time to generate. |
| + | |
| + | <pre> |
| + | $ time cat /proc/*/maps > /dev/null |
| + | real 0m0.061s |
| + | user 0m0.002s |
| + | sys 0m0.059s |
| + | |
| + | |
| + | $ time cat /proc/*/smaps > /dev/null |
| + | real 0m0.253s |
| + | user 0m0.004s |
| + | sys 0m0.247s |
| + | </pre> |
| + | |
| + | == Proposed solution == |
| + | |
| + | The proposal is a <code>/proc/task_diag</code> file that operates on the following principles: |
| + | |
| + | * Transactional: write request, read response |
| + | |
| + | * Netlink message format (same as used by sock_diag; binary and extendable) |
| + | |
| + | * Ability to specify a set of processes to get info about |
| + | ** TASK_DIAG_DUMP_ALL: dump all processes |
| + | ** TASK_DIAG_DUMP_ALL_THREAD: dump all threads |
| + | ** TASK_DIAG_DUMP_CHILDREN: dump children of a specified task |
| + | ** TASK_DIAG_DUMP_THREAD: dump threads of a specified task |
| + | ** TASK_DIAG_DUMP_ONE: dump one task |
| + | |
| + | * Optimal grouping of attributes |
| + | ** No attribute in a group should noticeably affect the response time |
| + | |
| + | The following groups are proposed: |
| + | * TASK_DIAG_BASE |
| + | : PID, PGID, SID, TID, comm |
| + | * TASK_DIAG_CRED |
| + | : UID, GID, groups, capabilities |
| + | * TASK_DIAG_STAT |
| + | : per-task and per-process statistics (same as taskstats, not available in /proc) |
| + | * TASK_DIAG_VMA |
| + | : mapped memory regions and their access permissions (same as maps) |
| + | * TASK_DIAG_VMA_STAT |
| + | : memory consumption for each mapping (same as smaps) |
| + | |
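To show why a netlink-style message format is extendable, here is a minimal sketch of netlink attribute (TLV) packing and parsing. The header layout (16-bit length, 16-bit type, payload padded to 4 bytes) is the standard netlink attribute layout. The attribute type numbers and the payload contents below are assumptions made for this example and are not taken from the task_diag patches.

```python
import struct

NLA_HDR = struct.Struct("=HH")  # nla_len, nla_type in host byte order
NLA_ALIGNTO = 4

# Hypothetical type numbers, for illustration only.
TASK_DIAG_BASE = 0
TASK_DIAG_CRED = 1


def nla_align(n):
    """Round n up to the netlink attribute alignment."""
    return (n + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def pack_attr(attr_type, payload):
    """Build one attribute: header, payload, zero padding."""
    length = NLA_HDR.size + len(payload)  # nla_len excludes the padding
    return NLA_HDR.pack(length, attr_type) + payload + b"\0" * (nla_align(length) - length)


def parse_attrs(buf):
    """Return {type: payload} for every well-formed attribute in buf."""
    attrs = {}
    off = 0
    while off + NLA_HDR.size <= len(buf):
        length, attr_type = NLA_HDR.unpack_from(buf, off)
        if length < NLA_HDR.size or off + length > len(buf):
            break  # malformed or truncated attribute
        attrs[attr_type] = buf[off + NLA_HDR.size:off + length]
        off += nla_align(length)
    return attrs
```

Because every attribute carries its own type and length, a reader that only knows <code>TASK_DIAG_BASE</code> and <code>TASK_DIAG_CRED</code> can step over attributes added later instead of breaking, which is not possible with the column-based text formats described earlier.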
| + | === Performance measurements === |
| + | |
| + | ==== Get pid, tid, pgid and comm for 50000 processes ==== |
| + | |
| + | Existing interface: |
| + | <pre> |
| + | $ time ./task_proc_all a |
| + | real 0m0.279s |
| + | user 0m0.013s |
| + | sys 0m0.255s |
| + | </pre> |
| + | |
| + | New interface: |
| + | <pre> |
| + | $ time ./task_diag_all a |
| + | real 0m0.051s |
| + | user 0m0.001s |
| + | sys 0m0.049s |
| + | </pre> |
| + | |
| + | ==== Using perf tool ==== |
| + | |
| + | The following is a quote from an email by David Ahern: |
| + | |
| + | <pre> |
| + | > Using the fork test command: |
| + | > 10,000 processes; 10k proc with 5 threads = 50,000 tasks |
| + | > reading /proc: 11.3 sec |
| + | > task_diag: 2.2 sec |
| + | > |
| + | > @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096 |
| + | > |
| + | > 128 instances of sepcjbb, 80,000+ tasks: |
| + | > reading /proc: 32.1 sec |
| + | > task_diag: 3.9 sec |
| + | > |
| + | > So overall much snappier startup times. |
| + | </pre> |
| + | |
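The measurements above are easier to compare as ratios. The short calculation below uses only the wall-clock numbers quoted in this section; the dictionary keys are descriptive labels invented for this example.

```python
# (old interface seconds, task_diag seconds), as quoted above
measurements = {
    "50000 processes, task_proc_all vs task_diag_all": (0.279, 0.051),
    "perf, fork test, 50,000 tasks": (11.3, 2.2),
    "perf, 7,440 tasks": (0.77, 0.096),
    "perf, specjbb, 80,000+ tasks": (32.1, 3.9),
}


def speedup(old, new):
    """How many times faster the new interface is."""
    return old / new


for label, (old, new) in measurements.items():
    print(f"{label}: {speedup(old, new):.1f}x")
```

The resulting ratios fall roughly between 5x and 8x in favor of task_diag.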
| + | == Alternative (bad) solutions == |
| + | |
| + | The following information is only interesting in a historical context. |
| + | |
| + | === task_diag netlink socket === |
| + | |
| + | This was the original proposal -- create something very similar to sock_diag (aka tcp_diag aka inet_diag). |
| + | |
| + | It turned out to be a bad one because: |
| + | |
| + | * It's not obvious where to get pid and user namespaces |
| + | * It's impossible to restrict netlink sockets: |
| + | ** Credentials are saved when a socket is created |
| + | ** Process can drop privileges, but netlink doesn't care |
| + | ** The same socket can be used to get process attributes and to set IP addresses |
| + | |
| == See also == | | == See also == |
| | | |
− | Pending work on the "[[upstream kernel commits]]" pages.
| + | * [[Upstream kernel commits]] |
| | | |
| [[Category:Development]] | | [[Category:Development]] |
− | [[Category:Empty articles]]
| |