Diffstat (limited to 'Documentation/filesystems/proc.txt')
| -rw-r--r-- | Documentation/filesystems/proc.txt | 345 |
1 files changed, 310 insertions, 35 deletions
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index e73df2722ff..ddc531a74d0 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -40,7 +40,11 @@ Table of Contents 3.4 /proc/<pid>/coredump_filter - Core dump filtering settings 3.5 /proc/<pid>/mountinfo - Information about mounts 3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm + 3.7 /proc/<pid>/task/<tid>/children - Information about task children + 3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file + 4 Configuring procfs + 4.1 Mount options ------------------------------------------------------------------------------ Preface @@ -139,7 +143,7 @@ Table 1-1: Process specific entries in /proc pagemap Page table stack Report full stack trace, enable via CONFIG_STACKTRACE smaps a extension based on maps, showing the memory consumption of - each mapping + each mapping and flags associated with it .............................................................................. For example, to get the status information of a process, all you have to do is @@ -178,6 +182,7 @@ read the file /proc/PID/status: CapPrm: 0000000000000000 CapEff: 0000000000000000 CapBnd: ffffffffffffffff + Seccomp: 0 voluntary_ctxt_switches: 0 nonvoluntary_ctxt_switches: 1 @@ -229,11 +234,12 @@ Table 1-2: Contents of the status files (as of 2.6.30-rc7) ShdPnd bitmap of shared pending signals for the process SigBlk bitmap of blocked signals SigIgn bitmap of ignored signals - SigCgt bitmap of catched signals + SigCgt bitmap of caught signals CapInh bitmap of inheritable capabilities CapPrm bitmap of permitted capabilities CapEff bitmap of effective capabilities CapBnd bitmap of capabilities bounding set + Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...) 
Cpus_allowed mask of CPUs on which this process may run Cpus_allowed_list Same as previous, but in "list format" Mems_allowed mask of memory nodes allowed to this process @@ -288,13 +294,13 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7) rsslim current limit in bytes on the rss start_code address above which program text can run end_code address below which program text can run - start_stack address of the start of the stack + start_stack address of the start of the main process stack esp current value of ESP eip current value of EIP pending bitmap of pending signals blocked bitmap of blocked signals sigign bitmap of ignored signals - sigcatch bitmap of catched signals + sigcatch bitmap of caught signals wchan address where process went to sleep 0 (place holder) 0 (place holder) @@ -305,6 +311,14 @@ Table 1-4: Contents of the stat files (as of 2.6.30-rc7) blkio_ticks time spent waiting for block IO gtime guest time of the task in jiffies cgtime guest time of the task children in jiffies + start_data address above which program data+bss is placed + end_data address below which program data+bss is placed + start_brk address above which program heap can be expanded with brk() + arg_start address above which program command line is placed + arg_end address below which program command line is placed + env_start address above which program environment is placed + env_end address below which program environment is placed + exit_code the thread's exit_code in the form reported by the waitpid system call .............................................................................. 
The /proc/PID/maps file containing the currently mapped memory regions and @@ -320,7 +334,7 @@ address perms offset dev inode pathname a7cb1000-a7cb2000 ---p 00000000 00:00 0 a7cb2000-a7eb2000 rw-p 00000000 00:00 0 a7eb2000-a7eb3000 ---p 00000000 00:00 0 -a7eb3000-a7ed5000 rw-p 00000000 00:00 0 +a7eb3000-a7ed5000 rw-p 00000000 00:00 0 [stack:1001] a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 @@ -352,11 +366,39 @@ is not associated with a file: [heap] = the heap of the program [stack] = the stack of the main process + [stack:1001] = the stack of the thread with tid 1001 [vdso] = the "virtual dynamic shared object", the kernel system call handler or if empty, the mapping is anonymous. +The /proc/PID/task/TID/maps is a view of the virtual memory from the viewpoint +of the individual tasks of a process. In this file you will see a mapping marked +as [stack] if that task sees it as a stack. This is a key difference from the +content of /proc/PID/maps, where you will see all mappings that are being used +as stack by all of those tasks. Hence, for the example above, the task-level +map, i.e. 
/proc/PID/task/TID/maps for thread 1001 will look like this: + +08048000-08049000 r-xp 00000000 03:00 8312 /opt/test +08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test +0804a000-0806b000 rw-p 00000000 00:00 0 [heap] +a7cb1000-a7cb2000 ---p 00000000 00:00 0 +a7cb2000-a7eb2000 rw-p 00000000 00:00 0 +a7eb2000-a7eb3000 ---p 00000000 00:00 0 +a7eb3000-a7ed5000 rw-p 00000000 00:00 0 [stack] +a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 +a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 +a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 +a800b000-a800e000 rw-p 00000000 00:00 0 +a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 +a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 +a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 +a8024000-a8027000 rw-p 00000000 00:00 0 +a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 +a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 +a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 +aff35000-aff4a000 rw-p 00000000 00:00 0 +ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] The /proc/PID/smaps is an extension based on maps, showing the memory consumption for each of the process's mappings. For each of mappings there @@ -375,8 +417,10 @@ Anonymous: 0 kB Swap: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB +Locked: 374 kB +VmFlags: rd ex mr mw me de -The first of these lines shows the same information as is displayed for the +the first of these lines shows the same information as is displayed for the mapping in /proc/PID/maps. The remaining lines show the size of the mapping (size), the amount of the mapping that is currently resident in RAM (RSS), the process' proportional share of this mapping (PSS), the number of clean and @@ -390,11 +434,48 @@ and a page is modified, the file page is replaced by a private anonymous copy. "Swap" shows how much would-be-anonymous memory is also used, but out on swap. 
+"VmFlags" field deserves a separate description. This member represents the kernel +flags associated with the particular virtual memory area in two letter encoded +manner. The codes are the following: + rd - readable + wr - writeable + ex - executable + sh - shared + mr - may read + mw - may write + me - may execute + ms - may share + gd - stack segment growns down + pf - pure PFN range + dw - disabled write to the mapped file + lo - pages are locked in memory + io - memory mapped I/O area + sr - sequential read advise provided + rr - random read advise provided + dc - do not copy area on fork + de - do not expand area on remapping + ac - area is accountable + nr - swap space is not reserved for the area + ht - area uses huge tlb pages + nl - non-linear mapping + ar - architecture specific flag + dd - do not include area into core dump + sd - soft-dirty flag + mm - mixed map area + hg - huge page advise flag + nh - no-huge page advise flag + mg - mergable advise flag + +Note that there is no guarantee that every flag and associated mnemonic will +be present in all further kernel releases. Things get changed, the flags may +be vanished or the reverse -- new added. + This file is only present if the CONFIG_MMU kernel configuration option is enabled. The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG -bits on both physical and virtual pages associated with a process. +bits on both physical and virtual pages associated with a process, and the +soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details). To clear the bits for all the pages associated with the process > echo 1 > /proc/PID/clear_refs @@ -403,6 +484,10 @@ To clear the bits for the anonymous pages associated with the process To clear the bits for the file mapped pages associated with the process > echo 3 > /proc/PID/clear_refs + +To clear the soft-dirty bit + > echo 4 > /proc/PID/clear_refs + Any other value written to /proc/PID/clear_refs will have no effect. 
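Editor's note (not part of the patch): the two-letter "VmFlags" encoding added by this hunk can be decoded with a small lookup table. The following is a minimal sketch only. The names VMFLAGS and decode_vmflags are hypothetical helpers, the table is copied from the list in the patch, and codes missing from the table are reported as "unknown" because the patch itself warns that mnemonics may change between kernel releases.

```python
# Mnemonics copied from the VmFlags list in the patch above.
# The table and function names are illustrative, not a kernel or libc API.
VMFLAGS = {
    "rd": "readable", "wr": "writeable", "ex": "executable", "sh": "shared",
    "mr": "may read", "mw": "may write", "me": "may execute", "ms": "may share",
    "gd": "stack segment grows down", "pf": "pure PFN range",
    "dw": "disabled write to the mapped file", "lo": "pages are locked in memory",
    "io": "memory mapped I/O area", "sr": "sequential read advise provided",
    "rr": "random read advise provided", "dc": "do not copy area on fork",
    "de": "do not expand area on remapping", "ac": "area is accountable",
    "nr": "swap space is not reserved for the area",
    "ht": "area uses huge tlb pages", "nl": "non-linear mapping",
    "ar": "architecture specific flag",
    "dd": "do not include area into core dump", "sd": "soft-dirty flag",
    "mm": "mixed map area", "hg": "huge page advise flag",
    "nh": "no-huge page advise flag", "mg": "mergable advise flag",
}


def decode_vmflags(line):
    """Turn one 'VmFlags: ...' line from smaps into (code, meaning) pairs."""
    name, _, rest = line.partition(":")
    if name.strip() != "VmFlags":
        raise ValueError("not a VmFlags line: %r" % line)
    # Codes not in the table are kept and marked "unknown" rather than dropped.
    return [(code, VMFLAGS.get(code, "unknown")) for code in rest.split()]


if __name__ == "__main__":
    # Sample line taken from the smaps example in the patch.
    for code, meaning in decode_vmflags("VmFlags: rd ex mr mw me de"):
        print(code, "-", meaning)
```

The sketch works on a line of text rather than opening /proc/PID/smaps directly, so it can be tried on any system.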
The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags @@ -462,7 +547,7 @@ Table 1-5: Kernel info in /proc sys See chapter 2 sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4) tty Info of tty drivers - uptime System uptime + uptime Wall clock since boot, combined idle time of all cpus version Kernel version video bttv info of video resources (2.4) vmallocinfo Show vmalloced areas @@ -542,7 +627,7 @@ just those considered 'most important'. The new vectors are: their statistics are used by kernel developers and interested users to determine the occurrence of interrupts of the given type. -The above IRQ vectors are displayed only when relevent. For example, +The above IRQ vectors are displayed only when relevant. For example, the threshold vector does not exist on x86_64 platforms. Others are suppressed when the system is a uniprocessor. As of this writing, only i386 and x86_64 platforms support the new IRQ vector displays. @@ -573,6 +658,12 @@ The contents of each smp_affinity file is the same by default: > cat /proc/irq/0/smp_affinity ffffffff +There is an alternate interface, smp_affinity_list which allows specifying +a cpu range instead of a bitmask: + + > cat /proc/irq/0/smp_affinity_list + 1024-1031 + The default_smp_affinity mask applies to all non-active IRQs, which are the IRQs which have not yet been allocated/activated, and hence which lack a /proc/irq/[0-9]* directory. @@ -582,12 +673,13 @@ reports itself as being attached. This hardware locality information does not include information about any possible driver locality preference. prof_cpu_mask specifies which CPUs are to be profiled by the system wide -profiler. Default value is ffffffff (all cpus). +profiler. Default value is ffffffff (all cpus if there are only 32 of them). The way IRQs are routed is handled by the IO-APIC, and it's Round Robin between all the CPUs which are allowed to handle it. 
As usual the kernel has more info than you and does a better job than you, so the defaults are the -best choice for almost everyone. +best choice for almost everyone. [Note this applies only to those IO-APIC's +that support "Round Robin" interrupt distribution.] There are three more important subdirectories in /proc: net, scsi, and sys. The general rule is that the contents, or even the existence of these @@ -670,9 +762,12 @@ varies by architecture and compile options. The following is from a > cat /proc/meminfo +The "Locked" indicates whether the mapping is locked in memory or not. + MemTotal: 16344972 kB MemFree: 13634064 kB +MemAvailable: 14836172 kB Buffers: 3656 kB Cached: 1195708 kB SwapCached: 0 kB @@ -700,10 +795,19 @@ Committed_AS: 100056 kB VmallocTotal: 112216 kB VmallocUsed: 428 kB VmallocChunk: 111088 kB +AnonHugePages: 49152 kB MemTotal: Total usable ram (i.e. physical ram minus a few reserved bits and the kernel binary code) MemFree: The sum of LowFree+HighFree +MemAvailable: An estimate of how much memory is available for starting new + applications, without swapping. Calculated from MemFree, + SReclaimable, the size of the file LRU lists, and the low + watermarks in each zone. + The estimate takes into account that the system needs some + page cache to function well, and that not all reclaimable + slab will be reclaimable, due to items being in use. The + impact of those factors will vary from system to system. 
Buffers: Relatively temporary storage for raw disk blocks shouldn't get tremendously large (20MB or so) Cached: in-memory cache for files read from the disk (the @@ -733,6 +837,7 @@ VmallocChunk: 111088 kB Dirty: Memory which is waiting to get written back to the disk Writeback: Memory which is actively being written back to the disk AnonPages: Non-file backed pages mapped into userspace page tables +AnonHugePages: Non-file backed huge pages mapped into userspace page tables Mapped: files which have been mmaped, such as libraries Slab: in-kernel data structures cache SReclaimable: Part of Slab, that might be reclaimed, such as caches @@ -749,7 +854,8 @@ WritebackTmp: Memory used by FUSE for temporary writeback buffers if strict overcommit accounting is enabled (mode 2 in 'vm.overcommit_memory'). The CommitLimit is calculated with the following formula: - CommitLimit = ('vm.overcommit_ratio' * Physical RAM) + Swap + CommitLimit = ([total RAM pages] - [total huge TLB pages]) * + overcommit_ratio / 100 + [total swap pages] For example, on a system with 1G of physical RAM and 7G of swap with a `vm.overcommit_ratio` of 30 it would yield a CommitLimit of 7.3G. @@ -759,16 +865,15 @@ Committed_AS: The amount of memory presently allocated on the system. The committed memory is a sum of all of the memory which has been allocated by processes, even if it has not been "used" by them as of yet. A process which malloc()'s 1G - of memory, but only touches 300M of it will only show up - as using 300M of memory even if it has the address space - allocated for the entire 1G. This 1G is memory which has - been "committed" to by the VM and can be used at any time - by the allocating application. With strict overcommit - enabled on the system (mode 2 in 'vm.overcommit_memory'), - allocations which would exceed the CommitLimit (detailed - above) will not be permitted. 
This is useful if one needs - to guarantee that processes will not fail due to lack of - memory once that memory has been successfully allocated. + of memory, but only touches 300M of it will show up as + using 1G. This 1G is memory which has been "committed" to + by the VM and can be used at any time by the allocating + application. With strict overcommit enabled on the system + (mode 2 in 'vm.overcommit_memory'),allocations which would + exceed the CommitLimit (detailed above) will not be permitted. + This is useful if one needs to guarantee that processes will + not fail due to lack of memory once that memory has been + successfully allocated. VmallocTotal: total size of vmalloc memory area VmallocUsed: amount of vmalloc area which is used VmallocChunk: largest contiguous block of vmalloc area which is free @@ -953,7 +1058,6 @@ Table 1-9: Network info in /proc/net snmp SNMP data sockstat Socket statistics tcp TCP sockets - tr_rif Token ring RIF routing table udp UDP sockets unix UNIX domain sockets wireless Wireless interface data (Wavelan etc) @@ -1142,8 +1246,9 @@ second). The meanings of the columns are as follows, from left to right: The "intr" line gives counts of interrupts serviced since boot time, for each of the possible system interrupts. The first column is the total of all -interrupts serviced; each subsequent column is the total for that particular -interrupt. +interrupts serviced including unnumbered architecture specific interrupts; +each subsequent column is the total for that particular numbered interrupt. +Unnumbered interrupts are not shown, only summed into the total. The "ctxt" line gives the total number of context switches across all CPUs. @@ -1181,6 +1286,30 @@ Table 1-12: Files in /proc/fs/ext4/<devname> mb_groups details of multiblock allocator buddy cache of free blocks .............................................................................. +2.0 /proc/consoles +------------------ +Shows registered system console lines. 
+ +To see which character device lines are currently used for the system console +/dev/console, you may simply look into the file /proc/consoles: + + > cat /proc/consoles + tty0 -WU (ECp) 4:7 + ttyS0 -W- (Ep) 4:64 + +The columns are: + + device name of the device + operations R = can do read operations + W = can do write operations + U = can do unblank + flags E = it is enabled + C = it is preferred console + B = it is primary boot console + p = it is used for printk buffer + b = it is not a TTY but a Braille device + a = it is safe to use when cpu is offline + major:minor major and minor number of the device separated by a colon ------------------------------------------------------------------------------ Summary @@ -1229,7 +1358,7 @@ review the kernel documentation in the directory /usr/src/linux/Documentation. This chapter is heavily based on the documentation included in the pre 2.2 kernels, and became part of it in version 2.2.1 of the Linux kernel. -Please see: Documentation/sysctls/ directory for descriptions of these +Please see: Documentation/sysctl/ directory for descriptions of these entries. ------------------------------------------------------------------------------ @@ -1259,8 +1388,8 @@ may allocate from based on an estimation of its current memory and swap use. For example, if a task is using all allowed memory, its badness score will be 1000. If it is using half of its allowed memory, its score will be 500. -There is an additional factor included in the badness score: root -processes are given 3% extra memory over other tasks. +There is an additional factor included in the badness score: the current memory +and swap usage is discounted by 3% for root processes. The amount of "allowed" memory depends on the context in which the oom killer was called. If it is due to the memory assigned to the allocating task's cpuset @@ -1293,14 +1422,12 @@ be used to tune the badness score. 
Its acceptable values range from -16 (OOM_DISABLE) to disable oom killing entirely for that task. Its value is scaled linearly with /proc/<pid>/oom_score_adj. -Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the -other with its scaled value. - -NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see -Documentation/feature-removal-schedule.txt. +The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last +value set by a CAP_SYS_RESOURCE process. To reduce the value any lower +requires CAP_SYS_RESOURCE. Caveat: when a parent task is selected, the oom killer will sacrifice any first -generation children with seperate address spaces instead, if possible. This +generation children with separate address spaces instead, if possible. This avoids servers and important system daemons from being killed and loses the minimal amount of work. @@ -1309,7 +1436,7 @@ minimal amount of work. ------------------------------------------------------------- This file can be used to check the current score used by the oom-killer is for -any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which +any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which process should be killed in an out-of-memory situation. @@ -1504,3 +1631,151 @@ a task to set its own or one of its thread siblings comm value. The comm value is limited in size compared to the cmdline value, so writing anything longer then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated comm value. + + +3.7 /proc/<pid>/task/<tid>/children - Information about task children +------------------------------------------------------------------------- +This file provides a fast way to retrieve first level children pids +of a task pointed by <pid>/<tid> pair. The format is a space separated +stream of pids. 
+
+Note the "first level" here -- if a child has own children they will
+not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children
+to obtain the descendants.
+
+Since this interface is intended to be fast and cheap it doesn't
+guarantee to provide precise results and some children might be
+skipped, especially if they've exited right after we printed their
+pids, so one need to either stop or freeze processes being inspected
+if precise results are needed.
+
+
+3.8	/proc/<pid>/fdinfo/<fd> - Information about opened file
+---------------------------------------------------------------
+This file provides information associated with an opened file. The regular
+files have at least three fields -- 'pos', 'flags' and mnt_id. The 'pos'
+represents the current offset of the opened file in decimal form [see lseek(2)
+for details], 'flags' denotes the octal O_xxx mask the file has been
+created with [see open(2) for details] and 'mnt_id' represents mount ID of
+the file system containing the opened file [see 3.5 /proc/<pid>/mountinfo
+for details].
+
+A typical output is
+
+	pos:	0
+	flags:	0100002
+	mnt_id:	19
+
+The files such as eventfd, fsnotify, signalfd, epoll among the regular pos/flags
+pair provide additional information particular to the objects they represent.
+
+	Eventfd files
+	~~~~~~~~~~~~~
+	pos:	0
+	flags:	04002
+	mnt_id:	9
+	eventfd-count:	5a
+
+	where 'eventfd-count' is hex value of a counter.
+
+	Signalfd files
+	~~~~~~~~~~~~~~
+	pos:	0
+	flags:	04002
+	mnt_id:	9
+	sigmask:	0000000000000200
+
+	where 'sigmask' is hex value of the signal mask associated
+	with a file.
+
+	Epoll files
+	~~~~~~~~~~~
+	pos:	0
+	flags:	02
+	mnt_id:	9
+	tfd: 5 events: 1d data: ffffffffffffffff
+
+	where 'tfd' is a target file descriptor number in decimal form,
+	'events' is events mask being watched and the 'data' is data
+	associated with a target [see epoll(7) for more details].
+
+	Fsnotify files
+	~~~~~~~~~~~~~~
+	For inotify files the format is the following
+
+	pos:	0
+	flags:	02000000
+	inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d
+
+	where 'wd' is a watch descriptor in decimal form, ie a target file
+	descriptor number, 'ino' and 'sdev' are inode and device where the
+	target file resides and the 'mask' is the mask of events, all in hex
+	form [see inotify(7) for more details].
+
+	If the kernel was built with exportfs support, the path to the target
+	file is encoded as a file handle. The file handle is provided by three
+	fields 'fhandle-bytes', 'fhandle-type' and 'f_handle', all in hex
+	format.
+
+	If the kernel is built without exportfs support the file handle won't be
+	printed out.
+
+	If there is no inotify mark attached yet the 'inotify' line will be omitted.
+
+	For fanotify files the format is
+
+	pos:	0
+	flags:	02
+	mnt_id:	9
+	fanotify flags:10 event-flags:0
+	fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003
+	fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4
+
+	where fanotify 'flags' and 'event-flags' are values used in fanotify_init
+	call, 'mnt_id' is the mount point identifier, 'mflags' is the value of
+	flags associated with mark which are tracked separately from events
+	mask. 'ino', 'sdev' are target inode and device, 'mask' is the events
+	mask and 'ignored_mask' is the mask of events which are to be ignored.
+	All in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask'
+	does provide information about flags and mask used in fanotify_mark
+	call [see fsnotify manpage for details].
+
+	While the first three lines are mandatory and always printed, the rest is
+	optional and may be omitted if no marks created yet.
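Editor's note (not part of the patch): the number bases described in section 3.8 ('pos' and 'mnt_id' decimal, 'flags' octal, 'eventfd-count' hex) are easy to mix up when reading fdinfo by hand. The sketch below shows one way to parse the common fields. The function name parse_fdinfo is a hypothetical helper, and the sketch deliberately ignores the object-specific lines (tfd, inotify, fanotify) instead of guessing at their layout.

```python
def parse_fdinfo(text):
    """Parse the common fdinfo fields using the bases stated in section 3.8.

    Illustrative helper only; lines it does not recognise are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no "key: value" shape on this line
        key, value = key.strip(), value.strip()
        if key == "pos":
            fields["pos"] = int(value)                 # decimal file offset
        elif key == "flags":
            fields["flags"] = int(value, 8)            # octal O_xxx mask
        elif key == "mnt_id":
            fields["mnt_id"] = int(value)              # decimal mount ID
        elif key == "eventfd-count":
            fields["eventfd-count"] = int(value, 16)   # hex counter
    return fields


if __name__ == "__main__":
    # Sample text copied from the eventfd example in the patch.
    sample = "pos:\t0\nflags:\t04002\nmnt_id:\t9\neventfd-count:\t5a\n"
    print(parse_fdinfo(sample))
```

On a live system the same function could be fed the contents of /proc/self/fdinfo/0, assuming the running kernel provides the file.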
+
+
+------------------------------------------------------------------------------
+Configuring procfs
+------------------------------------------------------------------------------
+
+4.1	Mount options
+---------------------
+
+The following mount options are supported:
+
+	hidepid=	Set /proc/<pid>/ access mode.
+	gid=		Set the group authorized to learn processes information.
+
+hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories
+(default).
+
+hidepid=1 means users may not access any /proc/<pid>/ directories but their
+own. Sensitive files like cmdline, sched*, status are now protected against
+other users. This makes it impossible to learn whether any user runs
+specific program (given the program doesn't reveal itself by its behaviour).
+As an additional bonus, as /proc/<pid>/cmdline is unaccessible for other users,
+poorly written programs passing sensitive information via program arguments are
+now protected against local eavesdroppers.
+
+hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be fully invisible to other
+users. It doesn't mean that it hides a fact whether a process with a specific
+pid value exists (it can be learned by other means, e.g. by "kill -0 $PID"),
+but it hides process' uid and gid, which may be learned by stat()'ing
+/proc/<pid>/ otherwise. It greatly complicates an intruder's task of gathering
+information about running processes, whether some daemon runs with elevated
+privileges, whether other user runs some sensitive program, whether other users
+run any program at all, etc.
+
+gid= defines a group authorized to learn processes information otherwise
+prohibited by hidepid=. If you use some daemon like identd which needs to learn
+information about processes information, just add identd to this group.
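Editor's note (not part of the patch): to check which hidepid= mode a mounted procfs is using, one option is to read the option string of the proc entry in the mount table and interpret it as section 4.1 describes. The sketch below is an illustration under assumptions: the sample option string is made up, the names parse_proc_mount_options and HIDEPID_MEANING are hypothetical, and the numeric spelling of the options is the one documented in the patch.

```python
# Short summaries of the three modes described in section 4.1 of the patch.
HIDEPID_MEANING = {
    0: "everybody may access all /proc/<pid>/ directories (default)",
    1: "users may not access any /proc/<pid>/ directories but their own",
    2: "as hidepid=1, and other users' /proc/<pid>/ are fully invisible",
}


def parse_proc_mount_options(options):
    """Extract hidepid= and gid= from a comma-separated mount option string."""
    result = {"hidepid": 0, "gid": None}  # hidepid=0 is the documented default
    for opt in options.split(","):
        key, sep, value = opt.partition("=")
        # Only the two options from section 4.1 are interpreted here.
        if sep and key in result and value.isdigit():
            result[key] = int(value)
    return result


if __name__ == "__main__":
    # Made-up option string for illustration; not taken from a real system.
    sample = "rw,nosuid,nodev,noexec,relatime,hidepid=2,gid=1001"
    opts = parse_proc_mount_options(sample)
    print(opts)
    print(HIDEPID_MEANING[opts["hidepid"]])
```

A mount that carries neither option is reported as hidepid=0 with no gid, matching the "classic mode" default stated in the patch.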
