aboutsummaryrefslogtreecommitdiff
path: root/Documentation/filesystems/proc.txt
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems/proc.txt')
-rw-r--r--Documentation/filesystems/proc.txt468
1 files changed, 390 insertions, 78 deletions
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index a4f30faa4f1..ddc531a74d0 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -33,13 +33,18 @@ Table of Contents
2 Modifying System Parameters
3 Per-Process Parameters
- 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
+ 3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
+ score
3.2 /proc/<pid>/oom_score - Display current oom-killer score
3.3 /proc/<pid>/io - Display the IO accounting fields
3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
3.5 /proc/<pid>/mountinfo - Information about mounts
3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
+ 3.7 /proc/<pid>/task/<tid>/children - Information about task children
+ 3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file
+ 4 Configuring procfs
+ 4.1 Mount options
------------------------------------------------------------------------------
Preface
@@ -73,9 +78,9 @@ contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this
document.
The latest version of this document is available online at
-http://skaro.nightcrawler.com/~bb/Docs/Proc as HTML version.
+http://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html
-If the above direction does not works for you, ypu could try the kernel
+If the above direction does not works for you, you could try the kernel
mailing list at linux-kernel@vger.kernel.org and/or try to reach me at
comandante@zaralinux.com.
@@ -135,9 +140,10 @@ Table 1-1: Process specific entries in /proc
statm Process memory status information
status Process status in human readable form
wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan
+ pagemap Page table
stack Report full stack trace, enable via CONFIG_STACKTRACE
smaps a extension based on maps, showing the memory consumption of
- each mapping
+ each mapping and flags associated with it
..............................................................................
For example, to get the status information of a process, all you have to do is
@@ -176,6 +182,7 @@ read the file /proc/PID/status:
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
+ Seccomp: 0
voluntary_ctxt_switches: 0
nonvoluntary_ctxt_switches: 1
@@ -227,11 +234,12 @@ Table 1-2: Contents of the status files (as of 2.6.30-rc7)
ShdPnd bitmap of shared pending signals for the process
SigBlk bitmap of blocked signals
SigIgn bitmap of ignored signals
- SigCgt bitmap of catched signals
+ SigCgt bitmap of caught signals
CapInh bitmap of inheritable capabilities
CapPrm bitmap of permitted capabilities
CapEff bitmap of effective capabilities
CapBnd bitmap of capabilities bounding set
+ Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...)
Cpus_allowed mask of CPUs on which this process may run
Cpus_allowed_list Same as previous, but in "list format"
Mems_allowed mask of memory nodes allowed to this process
@@ -286,13 +294,13 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
rsslim current limit in bytes on the rss
start_code address above which program text can run
end_code address below which program text can run
- start_stack address of the start of the stack
+ start_stack address of the start of the main process stack
esp current value of ESP
eip current value of EIP
pending bitmap of pending signals
blocked bitmap of blocked signals
sigign bitmap of ignored signals
- sigcatch bitmap of catched signals
+ sigcatch bitmap of caught signals
wchan address where process went to sleep
0 (place holder)
0 (place holder)
@@ -303,9 +311,17 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
blkio_ticks time spent waiting for block IO
gtime guest time of the task in jiffies
cgtime guest time of the task children in jiffies
+ start_data address above which program data+bss is placed
+ end_data address below which program data+bss is placed
+ start_brk address above which program heap can be expanded with brk()
+ arg_start address above which program command line is placed
+ arg_end address below which program command line is placed
+ env_start address above which program environment is placed
+ env_end address below which program environment is placed
+ exit_code the thread's exit_code in the form reported by the waitpid system call
..............................................................................
-The /proc/PID/map file containing the currently mapped memory regions and
+The /proc/PID/maps file containing the currently mapped memory regions and
their access permissions.
The format is:
@@ -316,9 +332,9 @@ address perms offset dev inode pathname
08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test
0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
a7cb1000-a7cb2000 ---p 00000000 00:00 0
-a7cb2000-a7eb2000 rw-p 00000000 00:00 0 [threadstack:001ff4b4]
+a7cb2000-a7eb2000 rw-p 00000000 00:00 0
a7eb2000-a7eb3000 ---p 00000000 00:00 0
-a7eb3000-a7ed5000 rw-p 00000000 00:00 0
+a7eb3000-a7ed5000 rw-p 00000000 00:00 0 [stack:1001]
a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6
a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6
a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6
@@ -350,12 +366,39 @@ is not associated with a file:
[heap] = the heap of the program
[stack] = the stack of the main process
+ [stack:1001] = the stack of the thread with tid 1001
[vdso] = the "virtual dynamic shared object",
the kernel system call handler
- [threadstack:xxxxxxxx] = the stack of the thread, xxxxxxxx is the stack size
or if empty, the mapping is anonymous.
+The /proc/PID/task/TID/maps is a view of the virtual memory from the viewpoint
+of the individual tasks of a process. In this file you will see a mapping marked
+as [stack] if that task sees it as a stack. This is a key difference from the
+content of /proc/PID/maps, where you will see all mappings that are being used
+as stack by all of those tasks. Hence, for the example above, the task-level
+map, i.e. /proc/PID/task/TID/maps for thread 1001 will look like this:
+
+08048000-08049000 r-xp 00000000 03:00 8312 /opt/test
+08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test
+0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
+a7cb1000-a7cb2000 ---p 00000000 00:00 0
+a7cb2000-a7eb2000 rw-p 00000000 00:00 0
+a7eb2000-a7eb3000 ---p 00000000 00:00 0
+a7eb3000-a7ed5000 rw-p 00000000 00:00 0 [stack]
+a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6
+a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6
+a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6
+a800b000-a800e000 rw-p 00000000 00:00 0
+a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
+a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
+a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
+a8024000-a8027000 rw-p 00000000 00:00 0
+a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
+a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
+a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
+aff35000-aff4a000 rw-p 00000000 00:00 0
+ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]
The /proc/PID/smaps is an extension based on maps, showing the memory
consumption for each of the process's mappings. For each of mappings there
@@ -370,23 +413,69 @@ Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 892 kB
+Anonymous: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
-
-The first of these lines shows the same information as is displayed for the
-mapping in /proc/PID/maps. The remaining lines show the size of the mapping,
-the amount of the mapping that is currently resident in RAM, the "proportional
-set size” (divide each shared page by the number of processes sharing it), the
-number of clean and dirty shared pages in the mapping, and the number of clean
-and dirty private pages in the mapping. The "Referenced" indicates the amount
-of memory currently marked as referenced or accessed.
+Locked: 374 kB
+VmFlags: rd ex mr mw me de
+
+the first of these lines shows the same information as is displayed for the
+mapping in /proc/PID/maps. The remaining lines show the size of the mapping
+(size), the amount of the mapping that is currently resident in RAM (RSS), the
+process' proportional share of this mapping (PSS), the number of clean and
+dirty private pages in the mapping. Note that even a page which is part of a
+MAP_SHARED mapping, but has only a single pte mapped, i.e. is currently used
+by only one process, is accounted as private and not as shared. "Referenced"
+indicates the amount of memory currently marked as referenced or accessed.
+"Anonymous" shows the amount of memory that does not belong to any file. Even
+a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
+and a page is modified, the file page is replaced by a private anonymous copy.
+"Swap" shows how much would-be-anonymous memory is also used, but out on
+swap.
+
+"VmFlags" field deserves a separate description. This member represents the kernel
+flags associated with the particular virtual memory area in two letter encoded
+manner. The codes are the following:
+ rd - readable
+ wr - writeable
+ ex - executable
+ sh - shared
+ mr - may read
+ mw - may write
+ me - may execute
+ ms - may share
+ gd - stack segment growns down
+ pf - pure PFN range
+ dw - disabled write to the mapped file
+ lo - pages are locked in memory
+ io - memory mapped I/O area
+ sr - sequential read advise provided
+ rr - random read advise provided
+ dc - do not copy area on fork
+ de - do not expand area on remapping
+ ac - area is accountable
+ nr - swap space is not reserved for the area
+ ht - area uses huge tlb pages
+ nl - non-linear mapping
+ ar - architecture specific flag
+ dd - do not include area into core dump
+ sd - soft-dirty flag
+ mm - mixed map area
+ hg - huge page advise flag
+ nh - no-huge page advise flag
+ mg - mergable advise flag
+
+Note that there is no guarantee that every flag and associated mnemonic will
+be present in all further kernel releases. Things get changed, the flags may
+be vanished or the reverse -- new added.
This file is only present if the CONFIG_MMU kernel configuration option is
enabled.
The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG
-bits on both physical and virtual pages associated with a process.
+bits on both physical and virtual pages associated with a process, and the
+soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details).
To clear the bits for all the pages associated with the process
> echo 1 > /proc/PID/clear_refs
@@ -395,8 +484,15 @@ To clear the bits for the anonymous pages associated with the process
To clear the bits for the file mapped pages associated with the process
> echo 3 > /proc/PID/clear_refs
+
+To clear the soft-dirty bit
+ > echo 4 > /proc/PID/clear_refs
+
Any other value written to /proc/PID/clear_refs will have no effect.
+The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags
+using /proc/kpageflags and number of times a page is mapped using
+/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt.
1.2 Kernel data
---------------
@@ -451,7 +547,7 @@ Table 1-5: Kernel info in /proc
sys See chapter 2
sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4)
tty Info of tty drivers
- uptime System uptime
+ uptime Wall clock since boot, combined idle time of all cpus
version Kernel version
video bttv info of video resources (2.4)
vmallocinfo Show vmalloced areas
@@ -531,7 +627,7 @@ just those considered 'most important'. The new vectors are:
their statistics are used by kernel developers and interested users to
determine the occurrence of interrupts of the given type.
-The above IRQ vectors are displayed only when relevent. For example,
+The above IRQ vectors are displayed only when relevant. For example,
the threshold vector does not exist on x86_64 platforms. Others are
suppressed when the system is a uniprocessor. As of this writing, only
i386 and x86_64 platforms support the new IRQ vector displays.
@@ -562,17 +658,28 @@ The contents of each smp_affinity file is the same by default:
> cat /proc/irq/0/smp_affinity
ffffffff
+There is an alternate interface, smp_affinity_list which allows specifying
+a cpu range instead of a bitmask:
+
+ > cat /proc/irq/0/smp_affinity_list
+ 1024-1031
+
The default_smp_affinity mask applies to all non-active IRQs, which are the
IRQs which have not yet been allocated/activated, and hence which lack a
/proc/irq/[0-9]* directory.
+The node file on an SMP system shows the node to which the device using the IRQ
+reports itself as being attached. This hardware locality information does not
+include information about any possible driver locality preference.
+
prof_cpu_mask specifies which CPUs are to be profiled by the system wide
-profiler. Default value is ffffffff (all cpus).
+profiler. Default value is ffffffff (all cpus if there are only 32 of them).
The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
between all the CPUs which are allowed to handle it. As usual the kernel has
more info than you and does a better job than you, so the defaults are the
-best choice for almost everyone.
+best choice for almost everyone. [Note this applies only to those IO-APIC's
+that support "Round Robin" interrupt distribution.]
There are three more important subdirectories in /proc: net, scsi, and sys.
The general rule is that the contents, or even the existence of these
@@ -655,9 +762,12 @@ varies by architecture and compile options. The following is from a
> cat /proc/meminfo
+The "Locked" indicates whether the mapping is locked in memory or not.
+
MemTotal: 16344972 kB
MemFree: 13634064 kB
+MemAvailable: 14836172 kB
Buffers: 3656 kB
Cached: 1195708 kB
SwapCached: 0 kB
@@ -685,10 +795,19 @@ Committed_AS: 100056 kB
VmallocTotal: 112216 kB
VmallocUsed: 428 kB
VmallocChunk: 111088 kB
+AnonHugePages: 49152 kB
MemTotal: Total usable ram (i.e. physical ram minus a few reserved
bits and the kernel binary code)
MemFree: The sum of LowFree+HighFree
+MemAvailable: An estimate of how much memory is available for starting new
+ applications, without swapping. Calculated from MemFree,
+ SReclaimable, the size of the file LRU lists, and the low
+ watermarks in each zone.
+ The estimate takes into account that the system needs some
+ page cache to function well, and that not all reclaimable
+ slab will be reclaimable, due to items being in use. The
+ impact of those factors will vary from system to system.
Buffers: Relatively temporary storage for raw disk blocks
shouldn't get tremendously large (20MB or so)
Cached: in-memory cache for files read from the disk (the
@@ -718,6 +837,7 @@ VmallocChunk: 111088 kB
Dirty: Memory which is waiting to get written back to the disk
Writeback: Memory which is actively being written back to the disk
AnonPages: Non-file backed pages mapped into userspace page tables
+AnonHugePages: Non-file backed huge pages mapped into userspace page tables
Mapped: files which have been mmaped, such as libraries
Slab: in-kernel data structures cache
SReclaimable: Part of Slab, that might be reclaimed, such as caches
@@ -734,7 +854,8 @@ WritebackTmp: Memory used by FUSE for temporary writeback buffers
if strict overcommit accounting is enabled (mode 2 in
'vm.overcommit_memory').
The CommitLimit is calculated with the following formula:
- CommitLimit = ('vm.overcommit_ratio' * Physical RAM) + Swap
+ CommitLimit = ([total RAM pages] - [total huge TLB pages]) *
+ overcommit_ratio / 100 + [total swap pages]
For example, on a system with 1G of physical RAM and 7G
of swap with a `vm.overcommit_ratio` of 30 it would
yield a CommitLimit of 7.3G.
@@ -744,16 +865,15 @@ Committed_AS: The amount of memory presently allocated on the system.
The committed memory is a sum of all of the memory which
has been allocated by processes, even if it has not been
"used" by them as of yet. A process which malloc()'s 1G
- of memory, but only touches 300M of it will only show up
- as using 300M of memory even if it has the address space
- allocated for the entire 1G. This 1G is memory which has
- been "committed" to by the VM and can be used at any time
- by the allocating application. With strict overcommit
- enabled on the system (mode 2 in 'vm.overcommit_memory'),
- allocations which would exceed the CommitLimit (detailed
- above) will not be permitted. This is useful if one needs
- to guarantee that processes will not fail due to lack of
- memory once that memory has been successfully allocated.
+ of memory, but only touches 300M of it will show up as
+ using 1G. This 1G is memory which has been "committed" to
+ by the VM and can be used at any time by the allocating
+ application. With strict overcommit enabled on the system
+ (mode 2 in 'vm.overcommit_memory'),allocations which would
+ exceed the CommitLimit (detailed above) will not be permitted.
+ This is useful if one needs to guarantee that processes will
+ not fail due to lack of memory once that memory has been
+ successfully allocated.
VmallocTotal: total size of vmalloc memory area
VmallocUsed: amount of vmalloc area which is used
VmallocChunk: largest contiguous block of vmalloc area which is free
@@ -938,7 +1058,6 @@ Table 1-9: Network info in /proc/net
snmp SNMP data
sockstat Socket statistics
tcp TCP sockets
- tr_rif Token ring RIF routing table
udp UDP sockets
unix UNIX domain sockets
wireless Wireless interface data (Wavelan etc)
@@ -965,7 +1084,7 @@ your system and how much traffic was routed over those devices:
...] 1375103 17405 0 0 0 0 0 0
...] 1703981 5535 0 0 0 3 0 0
-In addition, each Channel Bond interface has it's own directory. For
+In addition, each Channel Bond interface has its own directory. For
example, the bond0 device will have a directory called /proc/net/bond0/.
It will contain information that is specific to that bond, such as the
current slaves of the bond, the link status of the slaves, and how
@@ -1127,8 +1246,9 @@ second). The meanings of the columns are as follows, from left to right:
The "intr" line gives counts of interrupts serviced since boot time, for each
of the possible system interrupts. The first column is the total of all
-interrupts serviced; each subsequent column is the total for that particular
-interrupt.
+interrupts serviced including unnumbered architecture specific interrupts;
+each subsequent column is the total for that particular numbered interrupt.
+Unnumbered interrupts are not shown, only summed into the total.
The "ctxt" line gives the total number of context switches across all CPUs.
@@ -1166,6 +1286,30 @@ Table 1-12: Files in /proc/fs/ext4/<devname>
mb_groups details of multiblock allocator buddy cache of free blocks
..............................................................................
+2.0 /proc/consoles
+------------------
+Shows registered system console lines.
+
+To see which character device lines are currently used for the system console
+/dev/console, you may simply look into the file /proc/consoles:
+
+ > cat /proc/consoles
+ tty0 -WU (ECp) 4:7
+ ttyS0 -W- (Ep) 4:64
+
+The columns are:
+
+ device name of the device
+ operations R = can do read operations
+ W = can do write operations
+ U = can do unblank
+ flags E = it is enabled
+ C = it is preferred console
+ B = it is primary boot console
+ p = it is used for printk buffer
+ b = it is not a TTY but a Braille device
+ a = it is safe to use when cpu is offline
+ major:minor major and minor number of the device separated by a colon
------------------------------------------------------------------------------
Summary
@@ -1214,7 +1358,7 @@ review the kernel documentation in the directory /usr/src/linux/Documentation.
This chapter is heavily based on the documentation included in the pre 2.2
kernels, and became part of it in version 2.2.1 of the Linux kernel.
-Please see: Documentation/sysctls/ directory for descriptions of these
+Please see: Documentation/sysctl/ directory for descriptions of these
entries.
------------------------------------------------------------------------------
@@ -1231,48 +1375,68 @@ of the kernel.
CHAPTER 3: PER-PROCESS PARAMETERS
------------------------------------------------------------------------------
-3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
-------------------------------------------------------
-
-This file can be used to adjust the score used to select which processes
-should be killed in an out-of-memory situation. Giving it a high score will
-increase the likelihood of this process being killed by the oom-killer. Valid
-values are in the range -16 to +15, plus the special value -17, which disables
-oom-killing altogether for this process.
-
-The process to be killed in an out-of-memory situation is selected among all others
-based on its badness score. This value equals the original memory size of the process
-and is then updated according to its CPU time (utime + stime) and the
-run time (uptime - start time). The longer it runs the smaller is the score.
-Badness score is divided by the square root of the CPU time and then by
-the double square root of the run time.
-
-Swapped out tasks are killed first. Half of each child's memory size is added to
-the parent's score if they do not share the same memory. Thus forking servers
-are the prime candidates to be killed. Having only one 'hungry' child will make
-parent less preferable than the child.
-
-/proc/<pid>/oom_score shows process' current badness score.
-
-The following heuristics are then applied:
- * if the task was reniced, its score doubles
- * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
- or CAP_SYS_RAWIO) have their score divided by 4
- * if oom condition happened in one cpuset and checked process does not belong
- to it, its score is divided by 8
- * the resulting score is multiplied by two to the power of oom_adj, i.e.
- points <<= oom_adj when it is positive and
- points >>= -(oom_adj) otherwise
-
-The task with the highest badness score is then selected and its children
-are killed, process itself will be killed in an OOM situation when it does
-not have children or some of them disabled oom like described above.
+3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score
+--------------------------------------------------------------------------------
+
+These file can be used to adjust the badness heuristic used to select which
+process gets killed in out of memory conditions.
+
+The badness heuristic assigns a value to each candidate task ranging from 0
+(never kill) to 1000 (always kill) to determine which process is targeted. The
+units are roughly a proportion along that range of allowed memory the process
+may allocate from based on an estimation of its current memory and swap use.
+For example, if a task is using all allowed memory, its badness score will be
+1000. If it is using half of its allowed memory, its score will be 500.
+
+There is an additional factor included in the badness score: the current memory
+and swap usage is discounted by 3% for root processes.
+
+The amount of "allowed" memory depends on the context in which the oom killer
+was called. If it is due to the memory assigned to the allocating task's cpuset
+being exhausted, the allowed memory represents the set of mems assigned to that
+cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed
+memory represents the set of mempolicy nodes. If it is due to a memory
+limit (or swap limit) being reached, the allowed memory is that configured
+limit. Finally, if it is due to the entire system being out of memory, the
+allowed memory represents all allocatable resources.
+
+The value of /proc/<pid>/oom_score_adj is added to the badness score before it
+is used to determine which task to kill. Acceptable values range from -1000
+(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows userspace to
+polarize the preference for oom killing either by always preferring a certain
+task or completely disabling it. The lowest possible value, -1000, is
+equivalent to disabling oom killing entirely for that task since it will always
+report a badness score of 0.
+
+Consequently, it is very simple for userspace to define the amount of memory to
+consider for each task. Setting a /proc/<pid>/oom_score_adj value of +500, for
+example, is roughly equivalent to allowing the remainder of tasks sharing the
+same system, cpuset, mempolicy, or memory controller resources to use at least
+50% more memory. A value of -500, on the other hand, would be roughly
+equivalent to discounting 50% of the task's allowed memory from being considered
+as scoring against the task.
+
+For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
+be used to tune the badness score. Its acceptable values range from -16
+(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
+(OOM_DISABLE) to disable oom killing entirely for that task. Its value is
+scaled linearly with /proc/<pid>/oom_score_adj.
+
+The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last
+value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
+requires CAP_SYS_RESOURCE.
+
+Caveat: when a parent task is selected, the oom killer will sacrifice any first
+generation children with separate address spaces instead, if possible. This
+avoids servers and important system daemons from being killed and loses the
+minimal amount of work.
+
3.2 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------
This file can be used to check the current score used by the oom-killer is for
-any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
+any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which
process should be killed in an out-of-memory situation.
@@ -1362,7 +1526,7 @@ been accounted as having caused 1MB of write.
In other words: The number of bytes which this process caused to not happen,
by truncating pagecache. A task can cause "negative" IO too. If this task
truncates some dirty pagecache, some IO which another task has been accounted
-for (in it's write_bytes) will not be happening. We _could_ just subtract that
+for (in its write_bytes) will not be happening. We _could_ just subtract that
from the truncating task's write_bytes, but there is information loss in doing
that.
@@ -1467,3 +1631,151 @@ a task to set its own or one of its thread siblings comm value. The comm value
is limited in size compared to the cmdline value, so writing anything longer
then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated
comm value.
+
+
+3.7 /proc/<pid>/task/<tid>/children - Information about task children
+-------------------------------------------------------------------------
+This file provides a fast way to retrieve first level children pids
+of a task pointed by <pid>/<tid> pair. The format is a space separated
+stream of pids.
+
+Note the "first level" here -- if a child has own children they will
+not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children
+to obtain the descendants.
+
+Since this interface is intended to be fast and cheap it doesn't
+guarantee to provide precise results and some children might be
+skipped, especially if they've exited right after we printed their
+pids, so one need to either stop or freeze processes being inspected
+if precise results are needed.
+
+
+3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file
+---------------------------------------------------------------
+This file provides information associated with an opened file. The regular
+files have at least three fields -- 'pos', 'flags' and mnt_id. The 'pos'
+represents the current offset of the opened file in decimal form [see lseek(2)
+for details], 'flags' denotes the octal O_xxx mask the file has been
+created with [see open(2) for details] and 'mnt_id' represents mount ID of
+the file system containing the opened file [see 3.5 /proc/<pid>/mountinfo
+for details].
+
+A typical output is
+
+ pos: 0
+ flags: 0100002
+ mnt_id: 19
+
+The files such as eventfd, fsnotify, signalfd, epoll among the regular pos/flags
+pair provide additional information particular to the objects they represent.
+
+ Eventfd files
+ ~~~~~~~~~~~~~
+ pos: 0
+ flags: 04002
+ mnt_id: 9
+ eventfd-count: 5a
+
+ where 'eventfd-count' is hex value of a counter.
+
+ Signalfd files
+ ~~~~~~~~~~~~~~
+ pos: 0
+ flags: 04002
+ mnt_id: 9
+ sigmask: 0000000000000200
+
+ where 'sigmask' is hex value of the signal mask associated
+ with a file.
+
+ Epoll files
+ ~~~~~~~~~~~
+ pos: 0
+ flags: 02
+ mnt_id: 9
+ tfd: 5 events: 1d data: ffffffffffffffff
+
+ where 'tfd' is a target file descriptor number in decimal form,
+ 'events' is events mask being watched and the 'data' is data
+ associated with a target [see epoll(7) for more details].
+
+ Fsnotify files
+ ~~~~~~~~~~~~~~
+ For inotify files the format is the following
+
+ pos: 0
+ flags: 02000000
+ inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d
+
+ where 'wd' is a watch descriptor in decimal form, ie a target file
+ descriptor number, 'ino' and 'sdev' are inode and device where the
+ target file resides and the 'mask' is the mask of events, all in hex
+ form [see inotify(7) for more details].
+
+ If the kernel was built with exportfs support, the path to the target
+ file is encoded as a file handle. The file handle is provided by three
+ fields 'fhandle-bytes', 'fhandle-type' and 'f_handle', all in hex
+ format.
+
+ If the kernel is built without exportfs support the file handle won't be
+ printed out.
+
+ If there is no inotify mark attached yet the 'inotify' line will be omitted.
+
+ For fanotify files the format is
+
+ pos: 0
+ flags: 02
+ mnt_id: 9
+ fanotify flags:10 event-flags:0
+ fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003
+ fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4
+
+ where fanotify 'flags' and 'event-flags' are values used in fanotify_init
+ call, 'mnt_id' is the mount point identifier, 'mflags' is the value of
+ flags associated with mark which are tracked separately from events
+ mask. 'ino', 'sdev' are target inode and device, 'mask' is the events
+ mask and 'ignored_mask' is the mask of events which are to be ignored.
+ All in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask'
+ does provide information about flags and mask used in fanotify_mark
+ call [see fsnotify manpage for details].
+
+ While the first three lines are mandatory and always printed, the rest is
+ optional and may be omitted if no marks created yet.
+
+
+------------------------------------------------------------------------------
+Configuring procfs
+------------------------------------------------------------------------------
+
+4.1 Mount options
+---------------------
+
+The following mount options are supported:
+
+ hidepid= Set /proc/<pid>/ access mode.
+ gid= Set the group authorized to learn processes information.
+
+hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories
+(default).
+
+hidepid=1 means users may not access any /proc/<pid>/ directories but their
+own. Sensitive files like cmdline, sched*, status are now protected against
+other users. This makes it impossible to learn whether any user runs
+specific program (given the program doesn't reveal itself by its behaviour).
+As an additional bonus, as /proc/<pid>/cmdline is unaccessible for other users,
+poorly written programs passing sensitive information via program arguments are
+now protected against local eavesdroppers.
+
+hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be fully invisible to other
+users. It doesn't mean that it hides a fact whether a process with a specific
+pid value exists (it can be learned by other means, e.g. by "kill -0 $PID"),
+but it hides process' uid and gid, which may be learned by stat()'ing
+/proc/<pid>/ otherwise. It greatly complicates an intruder's task of gathering
+information about running processes, whether some daemon runs with elevated
+privileges, whether other user runs some sensitive program, whether other users
+run any program at all, etc.
+
+gid= defines a group authorized to learn processes information otherwise
+prohibited by hidepid=. If you use some daemon like identd which needs to learn
+information about processes information, just add identd to this group.