diff options
Diffstat (limited to 'Documentation/sysctl')
| -rw-r--r-- | Documentation/sysctl/00-INDEX | 4 | ||||
| -rw-r--r-- | Documentation/sysctl/ctl_unnumbered.txt | 22 | ||||
| -rw-r--r-- | Documentation/sysctl/fs.txt | 156 | ||||
| -rw-r--r-- | Documentation/sysctl/kernel.txt | 585 | ||||
| -rw-r--r-- | Documentation/sysctl/net.txt | 254 | ||||
| -rw-r--r-- | Documentation/sysctl/vm.txt | 805 |
6 files changed, 1557 insertions, 269 deletions
diff --git a/Documentation/sysctl/00-INDEX b/Documentation/sysctl/00-INDEX index a20a9066dc4..8cf5d493fd0 100644 --- a/Documentation/sysctl/00-INDEX +++ b/Documentation/sysctl/00-INDEX @@ -4,12 +4,12 @@ README - general information about /proc/sys/ sysctl files. abi.txt - documentation for /proc/sys/abi/*. -ctl_unnumbered.txt - - explanation of why one should not add new binary sysctl numbers. fs.txt - documentation for /proc/sys/fs/*. kernel.txt - documentation for /proc/sys/kernel/*. +net.txt + - documentation for /proc/sys/net/*. sunrpc.txt - documentation for /proc/sys/sunrpc/*. vm.txt diff --git a/Documentation/sysctl/ctl_unnumbered.txt b/Documentation/sysctl/ctl_unnumbered.txt deleted file mode 100644 index 23003a8ea3e..00000000000 --- a/Documentation/sysctl/ctl_unnumbered.txt +++ /dev/null @@ -1,22 +0,0 @@ - -Except for a few extremely rare exceptions user space applications do not use -the binary sysctl interface. Instead everyone uses /proc/sys/... with -readable ascii names. - -Recently the kernel has started supporting setting the binary sysctl value to -CTL_UNNUMBERED so we no longer need to assign a binary sysctl path to allow -sysctls to show up in /proc/sys. - -Assigning binary sysctl numbers is an endless source of conflicts in sysctl.h, -breaking of the user space ABI (because of those conflicts), and maintenance -problems. A complete pass through all of the sysctl users revealed multiple -instances where the sysctl binary interface was broken and had gone undetected -for years. - -So please do not add new binary sysctl numbers. They are unneeded and -problematic. - -If you really need a new binary sysctl number please first merge your sysctl -into the kernel and then as a separate patch allocate a binary sysctl number. - -(ebiederm@xmission.com, June 2007) diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt index f99254327ae..88152f214f4 100644 --- a/Documentation/sysctl/fs.txt +++ b/Documentation/sysctl/fs.txt @@ -1,5 +1,6 @@ Documentation for /proc/sys/fs/* kernel version 2.2.10 (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> + (c) 2009, Shen Feng<shen@cn.fujitsu.com> For general info and legal blurb, please look in README. @@ -14,7 +15,12 @@ kernel. Since some of the files _can_ be used to screw up your system, it is advisable to read both documentation and source before actually making adjustments. +1. /proc/sys/fs +---------------------------------------------------------- + Currently, these files are in /proc/sys/fs: +- aio-max-nr +- aio-nr - dentry-state - dquot-max - dquot-nr @@ -26,12 +32,21 @@ Currently, these files are in /proc/sys/fs: - nr_open - overflowuid - overflowgid +- protected_hardlinks +- protected_symlinks - suid_dumpable - super-max - super-nr -Documentation for the files in /proc/sys/fs/binfmt_misc is -in Documentation/binfmt_misc.txt. +============================================================== + +aio-nr & aio-max-nr: + +aio-nr is the running total of the number of events specified on the +io_setup system call for all currently active aio contexts. If aio-nr +reaches aio-max-nr then io_setup will fail with EAGAIN. Note that +raising aio-max-nr does not result in the pre-allocation or re-sizing +of any kernel data structures. ============================================================== @@ -75,21 +90,23 @@ you might want to raise the limit. file-max & file-nr: -The kernel allocates file handles dynamically, but as yet it -doesn't free them again. - The value in file-max denotes the maximum number of file- handles that the Linux kernel will allocate. When you get lots of error messages about running out of file handles, you might want to increase this limit. -The three values in file-nr denote the number of allocated -file handles, the number of unused file handles and the maximum -number of file handles. When the allocated file handles come -close to the maximum, but the number of unused file handles is -significantly greater than 0, you've encountered a peak in your -usage of file handles and you don't need to increase the maximum. - +Historically,the kernel was able to allocate file handles +dynamically, but not to free them again. The three values in +file-nr denote the number of allocated file handles, the number +of allocated but unused file handles, and the maximum number of +file handles. Linux 2.6 always reports 0 as the number of free +file handles -- this is not an error, it just means that the +number of allocated file handles exactly matches the number of +used file handles. + +Attempts to allocate more file descriptors than file-max are +reported with printk, look for "VFS: file-max limit <number> +reached". ============================================================== nr_open: @@ -142,22 +159,68 @@ The default is 65534. ============================================================== +protected_hardlinks: + +A long-standing class of security issues is the hardlink-based +time-of-check-time-of-use race, most commonly seen in world-writable +directories like /tmp. The common method of exploitation of this flaw +is to cross privilege boundaries when following a given hardlink (i.e. a +root process follows a hardlink created by another user). Additionally, +on systems without separated partitions, this stops unauthorized users +from "pinning" vulnerable setuid/setgid files against being upgraded by +the administrator, or linking to special files. + +When set to "0", hardlink creation behavior is unrestricted. + +When set to "1" hardlinks cannot be created by users if they do not +already own the source file, or do not have read/write access to it. + +This protection is based on the restrictions in Openwall and grsecurity. + +============================================================== + +protected_symlinks: + +A long-standing class of security issues is the symlink-based +time-of-check-time-of-use race, most commonly seen in world-writable +directories like /tmp. The common method of exploitation of this flaw +is to cross privilege boundaries when following a given symlink (i.e. a +root process follows a symlink belonging to another user). For a likely +incomplete list of hundreds of examples across the years, please see: +http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp + +When set to "0", symlink following behavior is unrestricted. + +When set to "1" symlinks are permitted to be followed only when outside +a sticky world-writable directory, or when the uid of the symlink and +follower match, or when the directory owner matches the symlink's owner. + +This protection is based on the restrictions in Openwall and grsecurity. + +============================================================== + suid_dumpable: This value can be used to query and set the core dump mode for setuid or otherwise protected/tainted binaries. The modes are 0 - (default) - traditional behaviour. Any process which has changed - privilege levels or is execute only will not be dumped + privilege levels or is execute only will not be dumped. 1 - (debug) - all processes dump core when possible. The core dump is owned by the current user and no security is applied. This is intended for system debugging situations only. Ptrace is unchecked. + This is insecure as it allows regular users to examine the memory + contents of privileged processes. 2 - (suidsafe) - any binary which normally would not be dumped is dumped - readable by root only. This allows the end user to remove - such a dump but not access it directly. For security reasons - core dumps in this mode will not overwrite one another or - other files. This mode is appropriate when administrators are - attempting to debug problems in a normal environment. + anyway, but only if the "core_pattern" kernel sysctl is set to + either a pipe handler or a fully qualified path. (For more details + on this limitation, see CVE-2006-2451.) This mode is appropriate + when administrators are attempting to debug problems in a normal + environment, and either have a core dump pipe handler that knows + to treat privileged core dumps with care, or specific directory + defined for catching core dumps. If a core dump happens without + a pipe handler or fully qualifid path, a message will be emitted + to syslog warning about the lack of a correct setting. ============================================================== @@ -178,3 +241,60 @@ requests. aio-max-nr allows you to change the maximum value aio-nr can grow to. ============================================================== + + +2. /proc/sys/fs/binfmt_misc +---------------------------------------------------------- + +Documentation for the files in /proc/sys/fs/binfmt_misc is +in Documentation/binfmt_misc.txt. + + +3. /proc/sys/fs/mqueue - POSIX message queues filesystem +---------------------------------------------------------- + +The "mqueue" filesystem provides the necessary kernel features to enable the +creation of a user space library that implements the POSIX message queues +API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System +Interfaces specification.) + +The "mqueue" filesystem contains values for determining/setting the amount of +resources used by the file system. + +/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the +maximum number of message queues allowed on the system. + +/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the +maximum number of messages in a queue value. In fact it is the limiting value +for another (user) limit which is set in mq_open invocation. This attribute of +a queue must be less or equal then msg_max. + +/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the +maximum message size value (it is every message queue's attribute set during +its creation). + +/proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the +default number of messages in a queue value if attr parameter of mq_open(2) is +NULL. If it exceed msg_max, the default value is initialized msg_max. + +/proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting +the default message size value if attr parameter of mq_open(2) is NULL. If it +exceed msgsize_max, the default value is initialized msgsize_max. + +4. /proc/sys/fs/epoll - Configuration options for the epoll interface +-------------------------------------------------------- + +This directory contains configuration options for the epoll(7) interface. + +max_user_watches +---------------- + +Every epoll file descriptor can store a number of files to be monitored +for event readiness. Each one of these monitored files constitutes a "watch". +This configuration option sets the maximum number of "watches" that are +allowed for each user. +Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes +on a 64bit one. +The current default value for max_user_watches is the 1/32 of the available +low memory, divided for the "watch" cost in bytes. + diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 276a7e63782..c14374e7177 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -1,5 +1,6 @@ Documentation for /proc/sys/kernel/* kernel version 2.2.10 (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> + (c) 2009, Shen Feng<shen@cn.fujitsu.com> For general info and legal blurb, please look in README. @@ -16,58 +17,76 @@ before actually making adjustments. Currently, these files might (depending on your configuration) show up in /proc/sys/kernel: -- acpi_video_flags + - acct +- acpi_video_flags +- auto_msgmni +- bootloader_type [ X86 only ] +- bootloader_version [ X86 only ] +- callhome [ S390 only ] +- cap_last_cap - core_pattern +- core_pipe_limit - core_uses_pid - ctrl-alt-del -- dentry-state +- dmesg_restrict - domainname - hostname - hotplug -- java-appletviewer [ binfmt_java, obsolete ] -- java-interpreter [ binfmt_java, obsolete ] +- hung_task_panic +- hung_task_check_count +- hung_task_timeout_secs +- hung_task_warnings +- kexec_load_disabled +- kptr_restrict - kstack_depth_to_print [ X86 only ] - l2cr [ PPC only ] - modprobe ==> Documentation/debugging-modules.txt +- modules_disabled +- msg_next_id [ sysv ipc ] - msgmax - msgmnb - msgmni +- nmi_watchdog - osrelease - ostype - overflowgid - overflowuid - panic +- panic_on_oops +- panic_on_unrecovered_nmi +- panic_on_stackoverflow - pid_max - powersave-nap [ PPC only ] - printk +- printk_delay +- printk_ratelimit +- printk_ratelimit_burst - randomize_va_space - real-root-dev ==> Documentation/initrd.txt - reboot-cmd [ SPARC only ] - rtsig-max - rtsig-nr - sem +- sem_next_id [ sysv ipc ] - sg-big-buff [ generic SCSI device (sg) ] +- shm_next_id [ sysv ipc ] +- shm_rmid_forced - shmall - shmmax [ sysv ipc ] - shmmni +- softlockup_all_cpu_backtrace - stop-a [ SPARC only ] - sysrq ==> Documentation/sysrq.txt +- sysctl_writes_strict - tainted - threads-max +- unknown_nmi_panic +- watchdog_thresh - version ============================================================== -acpi_video_flags: - -flags - -See Doc*/kernel/power/video.txt, it allows mode of video boot to be -set during run time. - -============================================================== - acct: highwater lowwater frequency @@ -85,6 +104,76 @@ valid for 30 seconds. ============================================================== +acpi_video_flags: + +flags + +See Doc*/kernel/power/video.txt, it allows mode of video boot to be +set during run time. + +============================================================== + +auto_msgmni: + +Enables/Disables automatic recomputing of msgmni upon memory add/remove +or upon ipc namespace creation/removal (see the msgmni description +above). Echoing "1" into this file enables msgmni automatic recomputing. +Echoing "0" turns it off. auto_msgmni default value is 1. + + +============================================================== + +bootloader_type: + +x86 bootloader identification + +This gives the bootloader type number as indicated by the bootloader, +shifted left by 4, and OR'd with the low four bits of the bootloader +version. The reason for this encoding is that this used to match the +type_of_loader field in the kernel header; the encoding is kept for +backwards compatibility. That is, if the full bootloader type number +is 0x15 and the full version number is 0x234, this file will contain +the value 340 = 0x154. + +See the type_of_loader and ext_loader_type fields in +Documentation/x86/boot.txt for additional information. + +============================================================== + +bootloader_version: + +x86 bootloader version + +The complete bootloader version number. In the example above, this +file will contain the value 564 = 0x234. + +See the type_of_loader and ext_loader_ver fields in +Documentation/x86/boot.txt for additional information. + +============================================================== + +callhome: + +Controls the kernel's callhome behavior in case of a kernel panic. + +The s390 hardware allows an operating system to send a notification +to a service organization (callhome) in case of an operating system panic. + +When the value in this file is 0 (which is the default behavior) +nothing happens in case of a kernel panic. If this value is set to "1" +the complete kernel oops message is send to the IBM customer service +organization in case the mainframe the Linux operating system is running +on has a service contract with IBM. + +============================================================== + +cap_last_cap + +Highest valid capability of the running kernel. Exports +CAP_LAST_CAP from the kernel. + +============================================================== + core_pattern: core_pattern is used to specify a core dumpfile pattern name. @@ -100,12 +189,16 @@ core_pattern is used to specify a core dumpfile pattern name. %<NUL> '%' is dropped %% output one '%' %p pid + %P global pid (init PID namespace) %u uid %g gid + %d dump mode, matches PR_SET_DUMPABLE and + /proc/sys/fs/suid_dumpable %s signal number %t UNIX time of dump %h hostname - %e executable filename + %e executable filename (may be shortened) + %E executable path %<OTHER> both are dropped . If the first character of the pattern is a '|', the kernel will treat the rest of the pattern as a command to run. The core dump will be @@ -113,6 +206,29 @@ core_pattern is used to specify a core dumpfile pattern name. ============================================================== +core_pipe_limit: + +This sysctl is only applicable when core_pattern is configured to pipe +core files to a user space helper (when the first character of +core_pattern is a '|', see above). When collecting cores via a pipe +to an application, it is occasionally useful for the collecting +application to gather data about the crashing process from its +/proc/pid directory. In order to do this safely, the kernel must wait +for the collecting process to exit, so as not to remove the crashing +processes proc files prematurely. This in turn creates the +possibility that a misbehaving userspace collecting process can block +the reaping of a crashed process simply by never exiting. This sysctl +defends against that. It defines how many concurrent crashing +processes may be piped to user space applications in parallel. If +this value is exceeded, then those crashing processes above that value +are noted via the kernel log and their cores are skipped. 0 is a +special value, indicating that unlimited processes may be captured in +parallel, but that no waiting will take place (i.e. the collecting +process is not guaranteed access to /proc/<crashing pid>/). This +value defaults to 0. + +============================================================== + core_uses_pid: The default coredump filename is "core". By setting @@ -138,6 +254,19 @@ to decide what to do with it. ============================================================== +dmesg_restrict: + +This toggle indicates whether unprivileged users are prevented +from using dmesg(8) to view messages from the kernel's log buffer. +When dmesg_restrict is set to (0) there are no restrictions. When +dmesg_restrict is set set to (1), users must have CAP_SYSLOG to use +dmesg(8). + +The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the +default value of dmesg_restrict. + +============================================================== + domainname & hostname: These files can be used to set the NIS/YP domainname and the @@ -165,10 +294,79 @@ Default value is "/sbin/hotplug". ============================================================== -l2cr: (PPC only) +hung_task_panic: -This flag controls the L2 cache of G3 processor boards. If -0, the cache is disabled. Enabled if nonzero. +Controls the kernel's behavior when a hung task is detected. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +0: continue operation. This is the default behavior. + +1: panic immediately. + +============================================================== + +hung_task_check_count: + +The upper bound on the number of tasks that are checked. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +============================================================== + +hung_task_timeout_secs: + +Check interval. When a task in D state did not get scheduled +for more than this value report a warning. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +0: means infinite timeout - no checking done. +Possible values to set are in range {0..LONG_MAX/HZ}. + +============================================================== + +hung_task_warnings: + +The maximum number of warnings to report. During a check interval +if a hung task is detected, this value is decreased by 1. +When this value reaches 0, no more warnings will be reported. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +-1: report an infinite number of warnings. + +============================================================== + +kexec_load_disabled: + +A toggle indicating if the kexec_load syscall has been disabled. This +value defaults to 0 (false: kexec_load enabled), but can be set to 1 +(true: kexec_load disabled). Once true, kexec can no longer be used, and +the toggle cannot be set back to false. This allows a kexec image to be +loaded before disabling the syscall, allowing a system to set up (and +later use) an image without it being altered. Generally used together +with the "modules_disabled" sysctl. + +============================================================== + +kptr_restrict: + +This toggle indicates whether restrictions are placed on +exposing kernel addresses via /proc and other interfaces. + +When kptr_restrict is set to (0), the default, there are no restrictions. + +When kptr_restrict is set to (1), kernel pointers printed using the %pK +format specifier will be replaced with 0's unless the user has CAP_SYSLOG +and effective user and group ids are equal to the real ids. This is +because %pK checks are done at read() time rather than open() time, so +if permissions are elevated between the open() and the read() (e.g via +a setuid binary) then %pK will not leak kernel pointers to unprivileged +users. Note, this is a temporary solution only. The correct long-term +solution is to do the permission checks at open() time. Consider removing +world read permissions from files that use %pK, and using dmesg_restrict +to protect against uses of %pK in dmesg(8) if leaking kernel pointer +values to unprivileged users is a concern. + +When kptr_restrict is set to (2), kernel pointers printed using +%pK will be replaced with 0's regardless of privileges. ============================================================== @@ -179,6 +377,117 @@ kernel stack. ============================================================== +l2cr: (PPC only) + +This flag controls the L2 cache of G3 processor boards. If +0, the cache is disabled. Enabled if nonzero. + +============================================================== + +modules_disabled: + +A toggle value indicating if modules are allowed to be loaded +in an otherwise modular kernel. This toggle defaults to off +(0), but can be set true (1). Once true, modules can be +neither loaded nor unloaded, and the toggle cannot be set back +to false. Generally used with the "kexec_load_disabled" toggle. + +============================================================== + +msg_next_id, sem_next_id, and shm_next_id: + +These three toggles allows to specify desired id for next allocated IPC +object: message, semaphore or shared memory respectively. + +By default they are equal to -1, which means generic allocation logic. +Possible values to set are in range {0..INT_MAX}. + +Notes: +1) kernel doesn't guarantee, that new object will have desired id. So, +it's up to userspace, how to handle an object with "wrong" id. +2) Toggle with non-default value will be set back to -1 by kernel after +successful IPC object allocation. + +============================================================== + +nmi_watchdog: + +Enables/Disables the NMI watchdog on x86 systems. When the value is +non-zero the NMI watchdog is enabled and will continuously test all +online cpus to determine whether or not they are still functioning +properly. Currently, passing "nmi_watchdog=" parameter at boot time is +required for this function to work. + +If LAPIC NMI watchdog method is in use (nmi_watchdog=2 kernel +parameter), the NMI watchdog shares registers with oprofile. By +disabling the NMI watchdog, oprofile may have more registers to +utilize. + +============================================================== + +numa_balancing + +Enables/disables automatic page fault based NUMA memory +balancing. Memory is moved automatically to nodes +that access it often. + +Enables/disables automatic NUMA memory balancing. On NUMA machines, there +is a performance penalty if remote memory is accessed by a CPU. When this +feature is enabled the kernel samples what task thread is accessing memory +by periodically unmapping pages and later trapping a page fault. At the +time of the page fault, it is determined if the data being accessed should +be migrated to a local memory node. + +The unmapping of pages and trapping faults incur additional overhead that +ideally is offset by improved memory locality but there is no universal +guarantee. If the target workload is already bound to NUMA nodes then this +feature should be disabled. Otherwise, if the system overhead from the +feature is too high then the rate the kernel samples for NUMA hinting +faults may be controlled by the numa_balancing_scan_period_min_ms, +numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, +numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls. + +============================================================== + +numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, +numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb + +Automatic NUMA balancing scans tasks address space and unmaps pages to +detect if pages are properly placed or if the data should be migrated to a +memory node local to where the task is running. Every "scan delay" the task +scans the next "scan size" number of pages in its address space. When the +end of the address space is reached the scanner restarts from the beginning. + +In combination, the "scan delay" and "scan size" determine the scan rate. +When "scan delay" decreases, the scan rate increases. The scan delay and +hence the scan rate of every task is adaptive and depends on historical +behaviour. If pages are properly placed then the scan delay increases, +otherwise the scan delay decreases. The "scan size" is not adaptive but +the higher the "scan size", the higher the scan rate. + +Higher scan rates incur higher system overhead as page faults must be +trapped and potentially data must be migrated. However, the higher the scan +rate, the more quickly a tasks memory is migrated to a local node if the +workload pattern changes and minimises performance impact due to remote +memory accesses. These sysctls control the thresholds for scan delays and +the number of pages scanned. + +numa_balancing_scan_period_min_ms is the minimum time in milliseconds to +scan a tasks virtual memory. It effectively controls the maximum scanning +rate for each task. + +numa_balancing_scan_delay_ms is the starting "scan delay" used for a task +when it initially forks. + +numa_balancing_scan_period_max_ms is the maximum time in milliseconds to +scan a tasks virtual memory. It effectively controls the minimum scanning +rate for each task. + +numa_balancing_scan_size_mb is how many megabytes worth of pages are +scanned for a given scan. + +============================================================== + osrelease, ostype & version: # cat osrelease @@ -198,10 +507,10 @@ The only way to tune these values is to rebuild the kernel :-) overflowgid & overflowuid: -if your architecture did not always support 32-bit UIDs (i.e. arm, i386, -m68k, sh, and sparc32), a fixed UID and GID will be returned to -applications that use the old 16-bit UID/GID system calls, if the actual -UID or GID would exceed 65535. +if your architecture did not always support 32-bit UIDs (i.e. arm, +i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to +applications that use the old 16-bit UID/GID system calls, if the +actual UID or GID would exceed 65535. These sysctls allow you to change the value of the fixed UID and GID. The default is 65534. @@ -210,9 +519,22 @@ The default is 65534. panic: -The value in this file represents the number of seconds the -kernel waits before rebooting on a panic. When you use the -software watchdog, the recommended setting is 60. +The value in this file represents the number of seconds the kernel +waits before rebooting on a panic. When you use the software watchdog, +the recommended setting is 60. + +============================================================== + +panic_on_unrecovered_nmi: + +The default Linux behaviour on an NMI of either memory or unknown is +to continue operation. For many environments such as scientific +computing it is preferable that the box is taken out and the error +dealt with than an uncorrected parity/ECC error get propagated. + +A small number of systems do generate NMI's for bizarre random reasons +such as power management so the default is off. That sysctl works like +the existing panic controls already in that directory. ============================================================== @@ -227,6 +549,45 @@ Controls the kernel's behaviour when an oops or BUG is encountered. ============================================================== +panic_on_stackoverflow: + +Controls the kernel's behavior when detecting the overflows of +kernel, IRQ and exception stacks except a user stack. +This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled. + +0: try to continue operation. + +1: panic immediately. + +============================================================== + +perf_cpu_time_max_percent: + +Hints to the kernel how much CPU time it should be allowed to +use to handle perf sampling events. If the perf subsystem +is informed that its samples are exceeding this limit, it +will drop its sampling frequency to attempt to reduce its CPU +usage. + +Some perf sampling happens in NMIs. If these samples +unexpectedly take too long to execute, the NMIs can become +stacked up next to each other so much that nothing else is +allowed to execute. + +0: disable the mechanism. Do not monitor or correct perf's + sampling rate no matter how CPU time it takes. + +1-100: attempt to throttle perf's sample rate to this + percentage of CPU. Note: the kernel calculates an + "expected" length of each sample event. 100 here means + 100% of that expected length. Even if this is set to + 100, you may still see sample throttling if this + length is exceeded. Set to 0 if you truly do not care + how much CPU is consumed. + +============================================================== + + pid_max: PID allocation wrap value. When the kernel's next PID value @@ -235,6 +596,14 @@ PIDs of value pid_max or larger are not allocated. ============================================================== +ns_last_pid: + +The last pid allocated in the current (the one task using this sysctl +lives in) pid namespace. When selecting a pid for a next task on fork +kernel tries to allocate a number starting from this one. + +============================================================== + powersave-nap: (PPC only) If set, Linux-PPC will use the 'nap' mode of powersaving, @@ -254,7 +623,7 @@ the different loglevels. - console_loglevel: messages with a higher priority than this will be printed to the console -- default_message_level: messages without an explicit priority +- default_message_loglevel: messages without an explicit priority will be printed with this priority - minimum_console_loglevel: minimum (highest) value to which console_loglevel can be set @@ -262,6 +631,14 @@ the different loglevels. ============================================================== +printk_delay: + +Delay each printk message in printk_delay milliseconds + +Value from 0 - 10000 is allowed. + +============================================================== + printk_ratelimit: Some warning messages are rate limited. printk_ratelimit specifies @@ -281,31 +658,35 @@ send before ratelimiting kicks in. ============================================================== -randomize-va-space: +randomize_va_space: This option can be used to select the type of process address space randomization that is used in the system, for architectures that support this feature. -0 - Turn the process address space randomization off by default. +0 - Turn the process address space randomization off. This is the + default for architectures that do not support this feature anyways, + and kernels that are booted with the "norandmaps" parameter. 1 - Make the addresses of mmap base, stack and VDSO page randomized. This, among other things, implies that shared libraries will be - loaded to random addresses. Also for PIE-linked binaries, the location - of code start is randomized. + loaded to random addresses. Also for PIE-linked binaries, the + location of code start is randomized. This is the default if the + CONFIG_COMPAT_BRK option is enabled. - With heap randomization, the situation is a little bit more - complicated. - There a few legacy applications out there (such as some ancient +2 - Additionally enable heap randomization. This is the default if + CONFIG_COMPAT_BRK is disabled. + + There are a few legacy applications out there (such as some ancient versions of libc.so.5 from 1996) that assume that brk area starts - just after the end of the code+bss. These applications break when - start of the brk area is randomized. There are however no known + just after the end of the code+bss. These applications break when + start of the brk area is randomized. There are however no known non-legacy applications that would be broken this way, so for most - systems it is safe to choose full randomization. However there is - a CONFIG_COMPAT_BRK option for systems with ancient and/or broken - binaries, that makes heap non-randomized, but keeps all other - parts of process address space randomized if randomize_va_space - sysctl is turned on. + systems it is safe to choose full randomization. + + Systems with ancient and/or broken binaries should be configured + with CONFIG_COMPAT_BRK enabled, which excludes the heap from process + address space randomization. ============================================================== @@ -340,32 +721,132 @@ are doing anyway :) ============================================================== -shmmax: +shmall: + +This parameter sets the total amount of shared memory pages that +can be used system wide. Hence, SHMALL should always be at least +ceil(shmmax/PAGE_SIZE). + +If you are not sure what the default PAGE_SIZE is on your Linux +system, you can run the following command: + +# getconf PAGE_SIZE + +============================================================== + +shmmax: This value can be used to query and set the run time limit on the maximum shared memory segment size that can be created. -Shared memory segments up to 1Gb are now supported in the +Shared memory segments up to 1Gb are now supported in the kernel. This value defaults to SHMMAX. ============================================================== -softlockup_thresh: +shm_rmid_forced: + +Linux lets you set resource limits, including how much memory one +process can consume, via setrlimit(2). Unfortunately, shared memory +segments are allowed to exist without association with any process, and +thus might not be counted against any resource limits. If enabled, +shared memory segments are automatically destroyed when their attach +count becomes zero after a detach or a process termination. It will +also destroy segments that were created, but never attached to, on exit +from the process. The only use left for IPC_RMID is to immediately +destroy an unattached segment. Of course, this breaks the way things are +defined, so some applications might stop working. Note that this +feature will do you no good unless you also configure your resource +limits (in particular, RLIMIT_AS and RLIMIT_NPROC). Most systems don't +need this. + +Note that if you change this from 0 to 1, already created segments +without users and with a dead originative process will be destroyed. -This value can be used to lower the softlockup tolerance -threshold. The default threshold is 10s. If a cpu is locked up -for 10s, the kernel complains. Valid values are 1-60s. +============================================================== + +sysctl_writes_strict: + +Control how file position affects the behavior of updating sysctl values +via the /proc/sys interface: + + -1 - Legacy per-write sysctl value handling, with no printk warnings. + Each write syscall must fully contain the sysctl value to be + written, and multiple writes on the same sysctl file descriptor + will rewrite the sysctl value, regardless of file position. + 0 - (default) Same behavior as above, but warn about processes that + perform writes to a sysctl file descriptor when the file position + is not 0. + 1 - Respect file position when writing sysctl strings. Multiple writes + will append to the sysctl value buffer. Anything past the max length + of the sysctl value buffer will be ignored. Writes to numeric sysctl + entries must always be at file position 0 and the value must be + fully contained in the buffer sent in the write syscall. + +============================================================== + +softlockup_all_cpu_backtrace: + +This value controls the soft lockup detector thread's behavior +when a soft lockup condition is detected as to whether or not +to gather further debug information. If enabled, each cpu will +be issued an NMI and instructed to capture stack trace. + +This feature is only applicable for architectures which support +NMI. + +0: do nothing. This is the default behavior. + +1: on detection capture more debug information. ============================================================== -tainted: +tainted: Non-zero if the kernel has been tainted. Numeric values, which can be ORed together: - 1 - A module with a non-GPL license has been loaded, this - includes modules with no license. - Set by modutils >= 2.4.9 and module-init-tools. - 2 - A module was force loaded by insmod -f. - Set by modutils >= 2.4.9 and module-init-tools. - 4 - Unsafe SMP processors: SMP with CPUs not designed for SMP. + 1 - A module with a non-GPL license has been loaded, this + includes modules with no license. + Set by modutils >= 2.4.9 and module-init-tools. + 2 - A module was force loaded by insmod -f. + Set by modutils >= 2.4.9 and module-init-tools. + 4 - Unsafe SMP processors: SMP with CPUs not designed for SMP. + 8 - A module was forcibly unloaded from the system by rmmod -f. + 16 - A hardware machine check error occurred on the system. + 32 - A bad page was discovered on the system. + 64 - The user has asked that the system be marked "tainted". This + could be because they are running software that directly modifies + the hardware, or for other reasons. + 128 - The system has died. + 256 - The ACPI DSDT has been overridden with one supplied by the user + instead of using the one provided by the hardware. + 512 - A kernel warning has occurred. +1024 - A module from drivers/staging was loaded. +2048 - The system is working around a severe firmware bug. +4096 - An out-of-tree module has been loaded. +8192 - An unsigned module has been loaded in a kernel supporting module + signature. +============================================================== + +unknown_nmi_panic: + +The value in this file affects behavior of handling NMI. When the +value is non-zero, unknown NMI is trapped and then panic occurs. At +that time, kernel debugging information is displayed on console. + +NMI switch that most IA32 servers have fires unknown NMI up, for +example. If a system hangs up, try pressing the NMI switch. + +============================================================== + +watchdog_thresh: + +This value can be used to control the frequency of hrtimer and NMI +events and the soft and hard lockup thresholds. The default threshold +is 10 seconds. + +The softlockup threshold is (2 * watchdog_thresh). Setting this +tunable to zero will disable lockup detection altogether. + +============================================================== diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt new file mode 100644 index 00000000000..9a0319a8247 --- /dev/null +++ b/Documentation/sysctl/net.txt @@ -0,0 +1,254 @@ +Documentation for /proc/sys/net/* + (c) 1999 Terrehon Bowden <terrehon@pacbell.net> + Bodo Bauer <bb@ricochet.net> + (c) 2000 Jorge Nerin <comandante@zaralinux.com> + (c) 2009 Shen Feng <shen@cn.fujitsu.com> + +For general info and legal blurb, please look in README. + +============================================================== + +This file contains the documentation for the sysctl files in +/proc/sys/net + +The interface to the networking parts of the kernel is located in +/proc/sys/net. The following table shows all possible subdirectories. You may +see only some of them, depending on your kernel's configuration. + + +Table : Subdirectories in /proc/sys/net +.............................................................................. + Directory Content Directory Content + core General parameter appletalk Appletalk protocol + unix Unix domain sockets netrom NET/ROM + 802 E802 protocol ax25 AX25 + ethernet Ethernet protocol rose X.25 PLP layer + ipv4 IP version 4 x25 X.25 protocol + ipx IPX token-ring IBM token ring + bridge Bridging decnet DEC net + ipv6 IP version 6 tipc TIPC +.............................................................................. + +1. /proc/sys/net/core - Network core options +------------------------------------------------------- + +bpf_jit_enable +-------------- + +This enables Berkeley Packet Filter Just in Time compiler. +Currently supported on x86_64 architecture, bpf_jit provides a framework +to speed packet filtering, the one used by tcpdump/libpcap for example. +Values : + 0 - disable the JIT (default value) + 1 - enable the JIT + 2 - enable the JIT and ask the compiler to emit traces on kernel log. + +dev_weight +-------------- + +The maximum number of packets that kernel can handle on a NAPI interrupt, +it's a Per-CPU variable. +Default: 64 + +default_qdisc +-------------- + +The default queuing discipline to use for network devices. This allows +overriding the default queue discipline of pfifo_fast with an +alternative. Since the default queuing discipline is created with the +no additional parameters so is best suited to queuing disciplines that +work well without configuration like stochastic fair queue (sfq), +CoDel (codel) or fair queue CoDel (fq_codel). Don't use queuing disciplines +like Hierarchical Token Bucket or Deficit Round Robin which require setting +up classes and bandwidths. +Default: pfifo_fast + +busy_read +---------------- +Low latency busy poll timeout for socket reads. (needs CONFIG_NET_RX_BUSY_POLL) +Approximate time in us to busy loop waiting for packets on the device queue. +This sets the default value of the SO_BUSY_POLL socket option. +Can be set or overridden per socket by setting socket option SO_BUSY_POLL, +which is the preferred method of enabling. If you need to enable the feature +globally via sysctl, a value of 50 is recommended. +Will increase power usage. +Default: 0 (off) + +busy_poll +---------------- +Low latency busy poll timeout for poll and select. (needs CONFIG_NET_RX_BUSY_POLL) +Approximate time in us to busy loop waiting for events. +Recommended value depends on the number of sockets you poll on. +For several sockets 50, for several hundreds 100. +For more than that you probably want to use epoll. +Note that only sockets with SO_BUSY_POLL set will be busy polled, +so you want to either selectively set SO_BUSY_POLL on those sockets or set +sysctl.net.busy_read globally. +Will increase power usage. +Default: 0 (off) + +rmem_default +------------ + +The default setting of the socket receive buffer in bytes. + +rmem_max +-------- + +The maximum receive socket buffer size in bytes. + +wmem_default +------------ + +The default setting (in bytes) of the socket send buffer. + +wmem_max +-------- + +The maximum send socket buffer size in bytes. + +message_burst and message_cost +------------------------------ + +These parameters are used to limit the warning messages written to the kernel +log from the networking code. They enforce a rate limit to make a +denial-of-service attack impossible. A higher message_cost factor, results in +fewer messages that will be written. Message_burst controls when messages will +be dropped. The default settings limit warning messages to one every five +seconds. + +warnings +-------- + +This controls console messages from the networking stack that can occur because +of problems on the network like duplicate address or bad checksums. Normally, +this should be enabled, but if the problem persists the messages can be +disabled. + +netdev_budget +------------- + +Maximum number of packets taken from all interfaces in one polling cycle (NAPI +poll). In one polling cycle interfaces which are registered to polling are +probed in a round-robin manner. + +netdev_max_backlog +------------------ + +Maximum number of packets, queued on the INPUT side, when the interface +receives packets faster than kernel can process them. + +netdev_tstamp_prequeue +---------------------- + +If set to 0, RX packet timestamps can be sampled after RPS processing, when +the target CPU processes packets. It might give some delay on timestamps, but +permit to distribute the load on several cpus. + +If set to 1 (default), timestamps are sampled as soon as possible, before +queueing. + +optmem_max +---------- + +Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence +of struct cmsghdr structures with appended data. + +2. /proc/sys/net/unix - Parameters for Unix domain sockets +------------------------------------------------------- + +There is only one file in this directory. +unix_dgram_qlen limits the max number of datagrams queued in Unix domain +socket's buffer. It will not take effect unless PF_UNIX flag is specified. + + +3. /proc/sys/net/ipv4 - IPV4 settings +------------------------------------------------------- +Please see: Documentation/networking/ip-sysctl.txt and ipvs-sysctl.txt for +descriptions of these entries. + + +4. Appletalk +------------------------------------------------------- + +The /proc/sys/net/appletalk directory holds the Appletalk configuration data +when Appletalk is loaded. The configurable parameters are: + +aarp-expiry-time +---------------- + +The amount of time we keep an ARP entry before expiring it. Used to age out +old hosts. + +aarp-resolve-time +----------------- + +The amount of time we will spend trying to resolve an Appletalk address. + +aarp-retransmit-limit +--------------------- + +The number of times we will retransmit a query before giving up. + +aarp-tick-time +-------------- + +Controls the rate at which expires are checked. + +The directory /proc/net/appletalk holds the list of active Appletalk sockets +on a machine. + +The fields indicate the DDP type, the local address (in network:node format) +the remote address, the size of the transmit pending queue, the size of the +received queue (bytes waiting for applications to read) the state and the uid +owning the socket. + +/proc/net/atalk_iface lists all the interfaces configured for appletalk.It +shows the name of the interface, its Appletalk address, the network range on +that address (or network number for phase 1 networks), and the status of the +interface. + +/proc/net/atalk_route lists each known network route. It lists the target +(network) that the route leads to, the router (may be directly connected), the +route flags, and the device the route is using. + + +5. IPX +------------------------------------------------------- + +The IPX protocol has no tunable values in proc/sys/net. + +The IPX protocol does, however, provide proc/net/ipx. This lists each IPX +socket giving the local and remote addresses in Novell format (that is +network:node:port). In accordance with the strange Novell tradition, +everything but the port is in hex. Not_Connected is displayed for sockets that +are not tied to a specific remote address. The Tx and Rx queue sizes indicate +the number of bytes pending for transmission and reception. The state +indicates the state the socket is in and the uid is the owning uid of the +socket. + +The /proc/net/ipx_interface file lists all IPX interfaces. For each interface +it gives the network number, the node number, and indicates if the network is +the primary network. It also indicates which device it is bound to (or +Internal for internal networks) and the Frame Type if appropriate. Linux +supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for +IPX. + +The /proc/net/ipx_route table holds a list of IPX routes. For each route it +gives the destination network, the router node (or Directly) and the network +address of the router (or Connected) for internal networks. + +6. TIPC +------------------------------------------------------- + +The TIPC protocol now has a tunable for the receive memory, similar to the +tcp_rmem - i.e. a vector of 3 INTEGERs: (min, default, max) + + # cat /proc/sys/net/tipc/tipc_rmem + 4252725 34021800 68043600 + # + +The max value is set to CONN_OVERLOAD_LIMIT, and the default and min values +are scaled (shifted) versions of that same value. Note that the min value +is not at this point in time used in any meaningful way, but the triplet is +preserved in order to be consistent with things like tcp_rmem. diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 8a4863c4edd..4415aa91568 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -1,12 +1,13 @@ -Documentation for /proc/sys/vm/* kernel version 2.2.10 +Documentation for /proc/sys/vm/* kernel version 2.6.29 (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> + (c) 2008 Peter W. Morreale <pmorreale@novell.com> For general info and legal blurb, please look in README. ============================================================== This file contains the documentation for the sysctl files in -/proc/sys/vm and is valid for Linux kernel version 2.2. +/proc/sys/vm and is valid for Linux kernel version 2.6.29. The files in this directory can be used to tune the operation of the virtual memory (VM) subsystem of the Linux kernel and @@ -16,178 +17,412 @@ Default values and initialization routines for most of these files can be found in mm/swap.c. Currently, these files are in /proc/sys/vm: -- overcommit_memory -- page-cluster -- dirty_ratio + +- admin_reserve_kbytes +- block_dump +- compact_memory +- dirty_background_bytes - dirty_background_ratio +- dirty_bytes - dirty_expire_centisecs +- dirty_ratio - dirty_writeback_centisecs -- highmem_is_dirtyable (only if CONFIG_HIGHMEM set) +- drop_caches +- extfrag_threshold +- hugepages_treat_as_movable +- hugetlb_shm_group +- laptop_mode +- legacy_va_layout +- lowmem_reserve_ratio - max_map_count +- memory_failure_early_kill +- memory_failure_recovery - min_free_kbytes -- laptop_mode -- block_dump -- drop-caches -- zone_reclaim_mode -- min_unmapped_ratio - min_slab_ratio -- panic_on_oom -- oom_dump_tasks -- oom_kill_allocating_task -- mmap_min_address -- numa_zonelist_order +- min_unmapped_ratio +- mmap_min_addr - nr_hugepages - nr_overcommit_hugepages +- nr_trim_pages (only if CONFIG_MMU=n) +- numa_zonelist_order +- oom_dump_tasks +- oom_kill_allocating_task +- overcommit_kbytes +- overcommit_memory +- overcommit_ratio +- page-cluster +- panic_on_oom +- percpu_pagelist_fraction +- stat_interval +- swappiness +- user_reserve_kbytes +- vfs_cache_pressure +- zone_reclaim_mode ============================================================== -dirty_ratio, dirty_background_ratio, dirty_expire_centisecs, -dirty_writeback_centisecs, highmem_is_dirtyable, -vfs_cache_pressure, laptop_mode, block_dump, swap_token_timeout, -drop-caches, hugepages_treat_as_movable: +admin_reserve_kbytes + +The amount of free memory in the system that should be reserved for users +with the capability cap_sys_admin. + +admin_reserve_kbytes defaults to min(3% of free pages, 8MB) -See Documentation/filesystems/proc.txt +That should provide enough for the admin to log in and kill a process, +if necessary, under the default overcommit 'guess' mode. + +Systems running under overcommit 'never' should increase this to account +for the full Virtual Memory Size of programs used to recover. Otherwise, +root may not be able to log in to recover the system. + +How do you calculate a minimum useful reserve? + +sshd or login + bash (or some other shell) + top (or ps, kill, etc.) + +For overcommit 'guess', we can sum resident set sizes (RSS). +On x86_64 this is about 8MB. + +For overcommit 'never', we can take the max of their virtual sizes (VSZ) +and add the sum of their RSS. +On x86_64 this is about 128MB. + +Changing this takes effect whenever an application requests memory. ============================================================== -overcommit_memory: +block_dump -This value contains a flag that enables memory overcommitment. +block_dump enables block I/O debugging when set to a nonzero value. More +information on block I/O debugging is in Documentation/laptops/laptop-mode.txt. -When this flag is 0, the kernel attempts to estimate the amount -of free memory left when userspace requests more memory. +============================================================== -When this flag is 1, the kernel pretends there is always enough -memory until it actually runs out. +compact_memory -When this flag is 2, the kernel uses a "never overcommit" -policy that attempts to prevent any overcommit of memory. +Available only when CONFIG_COMPACTION is set. When 1 is written to the file, +all zones are compacted such that free memory is available in contiguous +blocks where possible. This can be important for example in the allocation of +huge pages although processes will also directly compact memory as required. -This feature can be very useful because there are a lot of -programs that malloc() huge amounts of memory "just-in-case" -and don't use much of it. +============================================================== -The default value is 0. +dirty_background_bytes -See Documentation/vm/overcommit-accounting and -security/commoncap.c::cap_vm_enough_memory() for more information. +Contains the amount of dirty memory at which the background kernel +flusher threads will start writeback. + +Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only +one of them may be specified at a time. When one sysctl is written it is +immediately taken into account to evaluate the dirty memory limits and the +other appears as 0 when read. ============================================================== -overcommit_ratio: +dirty_background_ratio -When overcommit_memory is set to 2, the committed address -space is not permitted to exceed swap plus this percentage -of physical RAM. See above. +Contains, as a percentage of total available memory that contains free pages +and reclaimable pages, the number of pages at which the background kernel +flusher threads will start writing out dirty data. + +The total avaiable memory is not equal to total system memory. ============================================================== -page-cluster: +dirty_bytes -The Linux VM subsystem avoids excessive disk seeks by reading -multiple pages on a page fault. The number of pages it reads -is dependent on the amount of memory in your machine. +Contains the amount of dirty memory at which a process generating disk writes +will itself start writeback. -The number of pages the kernel reads in at once is equal to -2 ^ page-cluster. Values above 2 ^ 5 don't make much sense -for swap because we only cluster swap data in 32-page groups. +Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be +specified at a time. When one sysctl is written it is immediately taken into +account to evaluate the dirty memory limits and the other appears as 0 when +read. + +Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any +value lower than this limit will be ignored and the old configuration will be +retained. ============================================================== -max_map_count: +dirty_expire_centisecs -This file contains the maximum number of memory map areas a process -may have. Memory map areas are used as a side-effect of calling -malloc, directly by mmap and mprotect, and also when loading shared -libraries. +This tunable is used to define when dirty data is old enough to be eligible +for writeout by the kernel flusher threads. It is expressed in 100'ths +of a second. Data which has been dirty in-memory for longer than this +interval will be written out next time a flusher thread wakes up. -While most applications need less than a thousand maps, certain -programs, particularly malloc debuggers, may consume lots of them, -e.g., up to one or two maps per allocation. +============================================================== -The default value is 65536. +dirty_ratio + +Contains, as a percentage of total available memory that contains free pages +and reclaimable pages, the number of pages at which a process which is +generating disk writes will itself start writing out dirty data. + +The total avaiable memory is not equal to total system memory. ============================================================== -min_free_kbytes: +dirty_writeback_centisecs -This is used to force the Linux VM to keep a minimum number -of kilobytes free. The VM uses this number to compute a pages_min -value for each lowmem zone in the system. Each lowmem zone gets -a number of reserved free pages based proportionally on its size. +The kernel flusher threads will periodically wake up and write `old' data +out to disk. This tunable expresses the interval between those wakeups, in +100'ths of a second. -Some minimal ammount of memory is needed to satisfy PF_MEMALLOC -allocations; if you set this to lower than 1024KB, your system will -become subtly broken, and prone to deadlock under high loads. +Setting this to zero disables periodic writeback altogether. -Setting this too high will OOM your machine instantly. +============================================================== + +drop_caches + +Writing to this will cause the kernel to drop clean caches, as well as +reclaimable slab objects like dentries and inodes. Once dropped, their +memory becomes free. + +To free pagecache: + echo 1 > /proc/sys/vm/drop_caches +To free reclaimable slab objects (includes dentries and inodes): + echo 2 > /proc/sys/vm/drop_caches +To free slab objects and pagecache: + echo 3 > /proc/sys/vm/drop_caches + +This is a non-destructive operation and will not free any dirty objects. +To increase the number of objects freed by this operation, the user may run +`sync' prior to writing to /proc/sys/vm/drop_caches. This will minimize the +number of dirty objects on the system and create more candidates to be +dropped. + +This file is not a means to control the growth of the various kernel caches +(inodes, dentries, pagecache, etc...) These objects are automatically +reclaimed by the kernel when memory is needed elsewhere on the system. + +Use of this file can cause performance problems. Since it discards cached +objects, it may cost a significant amount of I/O and CPU to recreate the +dropped objects, especially if they were under heavy use. Because of this, +use outside of a testing or debugging environment is not recommended. + +You may see informational messages in your kernel log when this file is +used: + + cat (1234): drop_caches: 3 + +These are informational only. They do not mean that anything is wrong +with your system. To disable them, echo 4 (bit 3) into drop_caches. ============================================================== -percpu_pagelist_fraction +extfrag_threshold -This is the fraction of pages at most (high mark pcp->high) in each zone that -are allocated for each per cpu page list. The min value for this is 8. It -means that we don't allow more than 1/8th of pages in each zone to be -allocated in any single per_cpu_pagelist. This entry only changes the value -of hot per cpu pagelists. User can specify a number like 100 to allocate -1/100th of each zone to each per cpu page list. +This parameter affects whether the kernel will compact memory or direct +reclaim to satisfy a high-order allocation. /proc/extfrag_index shows what +the fragmentation index for each order is in each zone in the system. Values +tending towards 0 imply allocations would fail due to lack of memory, +values towards 1000 imply failures are due to fragmentation and -1 implies +that the allocation will succeed as long as watermarks are met. -The batch value of each per cpu pagelist is also updated as a result. It is -set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) +The kernel will not compact memory in a zone if the +fragmentation index is <= extfrag_threshold. The default value is 500. -The initial value is zero. Kernel does not use this value at boot time to set -the high water marks for each per cpu page list. +============================================================== -=============================================================== +hugepages_treat_as_movable -zone_reclaim_mode: +This parameter controls whether we can allocate hugepages from ZONE_MOVABLE +or not. If set to non-zero, hugepages can be allocated from ZONE_MOVABLE. +ZONE_MOVABLE is created when kernel boot parameter kernelcore= is specified, +so this parameter has no effect if used without kernelcore=. -Zone_reclaim_mode allows someone to set more or less aggressive approaches to -reclaim memory when a zone runs out of memory. If it is set to zero then no -zone reclaim occurs. Allocations will be satisfied from other zones / nodes -in the system. +Hugepage migration is now available in some situations which depend on the +architecture and/or the hugepage size. If a hugepage supports migration, +allocation from ZONE_MOVABLE is always enabled for the hugepage regardless +of the value of this parameter. +IOW, this parameter affects only non-migratable hugepages. -This is value ORed together of +Assuming that hugepages are not migratable in your system, one usecase of +this parameter is that users can make hugepage pool more extensible by +enabling the allocation from ZONE_MOVABLE. This is because on ZONE_MOVABLE +page reclaim/migration/compaction work more and you can get contiguous +memory more likely. Note that using ZONE_MOVABLE for non-migratable +hugepages can do harm to other features like memory hotremove (because +memory hotremove expects that memory blocks on ZONE_MOVABLE are always +removable,) so it's a trade-off responsible for the users. -1 = Zone reclaim on -2 = Zone reclaim writes dirty pages out -4 = Zone reclaim swaps pages +============================================================== -zone_reclaim_mode is set during bootup to 1 if it is determined that pages -from remote zones will cause a measurable performance reduction. The -page allocator will then reclaim easily reusable pages (those page -cache pages that are currently not used) before allocating off node pages. +hugetlb_shm_group -It may be beneficial to switch off zone reclaim if the system is -used for a file server and all of memory should be used for caching files -from disk. In that case the caching effect is more important than -data locality. +hugetlb_shm_group contains group id that is allowed to create SysV +shared memory segment using hugetlb page. -Allowing zone reclaim to write out pages stops processes that are -writing large amounts of data from dirtying pages on other nodes. Zone -reclaim will write out dirty pages if a zone fills up and so effectively -throttle the process. This may decrease the performance of a single process -since it cannot use all of system memory to buffer the outgoing writes -anymore but it preserve the memory on other nodes so that the performance -of other processes running on other nodes will not be affected. +============================================================== -Allowing regular swap effectively restricts allocations to the local -node unless explicitly overridden by memory policies or cpuset -configurations. +laptop_mode + +laptop_mode is a knob that controls "laptop mode". All the things that are +controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt. + +============================================================== + +legacy_va_layout + +If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel +will use the legacy (2.4) layout for all processes. + +============================================================== + +lowmem_reserve_ratio + +For some specialised workloads on highmem machines it is dangerous for +the kernel to allow process memory to be allocated from the "lowmem" +zone. This is because that memory could then be pinned via the mlock() +system call, or by unavailability of swapspace. + +And on large highmem machines this lack of reclaimable lowmem memory +can be fatal. + +So the Linux page allocator has a mechanism which prevents allocations +which _could_ use highmem from using too much lowmem. This means that +a certain amount of lowmem is defended from the possibility of being +captured into pinned user memory. + +(The same argument applies to the old 16 megabyte ISA DMA region. This +mechanism will also defend that region from allocations which could use +highmem or lowmem). + +The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is +in defending these lower zones. + +If you have a machine which uses highmem or ISA DMA and your +applications are using mlock(), or if you are running with no swap then +you probably should change the lowmem_reserve_ratio setting. + +The lowmem_reserve_ratio is an array. You can see them by reading this file. +- +% cat /proc/sys/vm/lowmem_reserve_ratio +256 256 32 +- +Note: # of this elements is one fewer than number of zones. Because the highest + zone's value is not necessary for following calculation. + +But, these values are not used directly. The kernel calculates # of protection +pages for each zones from them. These are shown as array of protection pages +in /proc/zoneinfo like followings. (This is an example of x86-64 box). +Each zone has an array of protection pages like this. + +- +Node 0, zone DMA + pages free 1355 + min 3 + low 3 + high 4 + : + : + numa_other 0 + protection: (0, 2004, 2004, 2004) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + pagesets + cpu: 0 pcp: 0 + : +- +These protections are added to score to judge whether this zone should be used +for page allocation or should be reclaimed. + +In this example, if normal pages (index=2) are required to this DMA zone and +watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should +not be used because pages_free(1355) is smaller than watermark + protection[2] +(4 + 2004 = 2008). If this protection value is 0, this zone would be used for +normal page requirement. If requirement is DMA zone(index=0), protection[0] +(=0) is used. + +zone[i]'s protection[j] is calculated by following expression. + +(i < j): + zone[i]->protection[j] + = (total sums of present_pages from zone[i+1] to zone[j] on the node) + / lowmem_reserve_ratio[i]; +(i = j): + (should not be protected. = 0; +(i > j): + (not necessary, but looks 0) + +The default values of lowmem_reserve_ratio[i] are + 256 (if zone[i] means DMA or DMA32 zone) + 32 (others). +As above expression, they are reciprocal number of ratio. +256 means 1/256. # of protection pages becomes about "0.39%" of total present +pages of higher zones on the node. + +If you would like to protect more pages, smaller values are effective. +The minimum value is 1 (1/1 -> 100%). + +============================================================== + +max_map_count: + +This file contains the maximum number of memory map areas a process +may have. Memory map areas are used as a side-effect of calling +malloc, directly by mmap and mprotect, and also when loading shared +libraries. + +While most applications need less than a thousand maps, certain +programs, particularly malloc debuggers, may consume lots of them, +e.g., up to one or two maps per allocation. + +The default value is 65536. ============================================================= -min_unmapped_ratio: +memory_failure_early_kill: -This is available only on NUMA kernels. +Control how to kill processes when uncorrected memory error (typically +a 2bit error in a memory module) is detected in the background by hardware +that cannot be handled by the kernel. In some cases (like the page +still having a valid copy on disk) the kernel will handle the failure +transparently without affecting any applications. But if there is +no other uptodate copy of the data it will kill to prevent any data +corruptions from propagating. -A percentage of the total pages in each zone. Zone reclaim will only -occur if more than this percentage of pages are file backed and unmapped. -This is to insure that a minimal amount of local pages is still available for -file I/O even if the node is overallocated. +1: Kill all processes that have the corrupted and not reloadable page mapped +as soon as the corruption is detected. Note this is not supported +for a few types of pages, like kernel internally allocated data or +the swap cache, but works for the majority of user pages. -The default is 1 percent. +0: Only unmap the corrupted page from all processes and only kill a process +who tries to access it. + +The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can +handle this if they want to. + +This is only active on architectures/platforms with advanced machine +check handling and depends on the hardware capabilities. + +Applications can override this setting individually with the PR_MCE_KILL prctl + +============================================================== + +memory_failure_recovery + +Enable memory failure recovery (when supported by the platform) + +1: Attempt recovery. + +0: Always panic on a memory failure. + +============================================================== + +min_free_kbytes: + +This is used to force the Linux VM to keep a minimum number +of kilobytes free. The VM uses this number to compute a +watermark[WMARK_MIN] value for each lowmem zone in the system. +Each lowmem zone gets a number of reserved free pages based +proportionally on its size. + +Some minimal amount of memory is needed to satisfy PF_MEMALLOC +allocations; if you set this to lower than 1024KB, your system will +become subtly broken, and prone to deadlock under high loads. + +Setting this too high will OOM your machine instantly. ============================================================= @@ -209,82 +444,67 @@ and may not be fast. ============================================================= -panic_on_oom +min_unmapped_ratio: -This enables or disables panic on out-of-memory feature. +This is available only on NUMA kernels. -If this is set to 0, the kernel will kill some rogue process, -called oom_killer. Usually, oom_killer can kill rogue processes and -system will survive. +This is a percentage of the total pages in each zone. Zone reclaim will +only occur if more than this percentage of pages are in a state that +zone_reclaim_mode allows to be reclaimed. -If this is set to 1, the kernel panics when out-of-memory happens. -However, if a process limits using nodes by mempolicy/cpusets, -and those nodes become memory exhaustion status, one process -may be killed by oom-killer. No panic occurs in this case. -Because other nodes' memory may be free. This means system total status -may be not fatal yet. +If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared +against all file-backed unmapped pages including swapcache pages and tmpfs +files. Otherwise, only unmapped pages backed by normal files but not tmpfs +files and similar are considered. -If this is set to 2, the kernel panics compulsorily even on the -above-mentioned. +The default is 1 percent. -The default value is 0. -1 and 2 are for failover of clustering. Please select either -according to your policy of failover. +============================================================== -============================================================= +mmap_min_addr -oom_dump_tasks +This file indicates the amount of address space which a user process will +be restricted from mmapping. Since kernel null dereference bugs could +accidentally operate based on the information in the first couple of pages +of memory userspace processes should not be allowed to write to them. By +default this value is set to 0 and no protections will be enforced by the +security module. Setting this value to something like 64k will allow the +vast majority of applications to work correctly and provide defense in depth +against future potential kernel bugs. -Enables a system-wide task dump (excluding kernel threads) to be -produced when the kernel performs an OOM-killing and includes such -information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and -name. This is helpful to determine why the OOM killer was invoked -and to identify the rogue task that caused it. +============================================================== -If this is set to zero, this information is suppressed. On very -large systems with thousands of tasks it may not be feasible to dump -the memory state information for each one. Such systems should not -be forced to incur a performance penalty in OOM conditions when the -information may not be desired. +nr_hugepages -If this is set to non-zero, this information is shown whenever the -OOM killer actually kills a memory-hogging task. +Change the minimum size of the hugepage pool. -The default value is 0. +See Documentation/vm/hugetlbpage.txt -============================================================= +============================================================== -oom_kill_allocating_task +nr_overcommit_hugepages -This enables or disables killing the OOM-triggering task in -out-of-memory situations. +Change the maximum size of the hugepage pool. The maximum is +nr_hugepages + nr_overcommit_hugepages. -If this is set to zero, the OOM killer will scan through the entire -tasklist and select a task based on heuristics to kill. This normally -selects a rogue memory-hogging task that frees up a large amount of -memory when killed. +See Documentation/vm/hugetlbpage.txt -If this is set to non-zero, the OOM killer simply kills the task that -triggered the out-of-memory condition. This avoids the expensive -tasklist scan. +============================================================== -If panic_on_oom is selected, it takes precedence over whatever value -is used in oom_kill_allocating_task. +nr_trim_pages -The default value is 0. +This is available only on NOMMU kernels. -============================================================== +This value adjusts the excess page trimming behaviour of power-of-2 aligned +NOMMU mmap allocations. -mmap_min_addr +A value of 0 disables trimming of allocations entirely, while a value of 1 +trims excess pages aggressively. Any value >= 1 acts as the watermark where +trimming of allocations is initiated. -This file indicates the amount of address space which a user process will -be restricted from mmaping. Since kernel null dereference bugs could -accidentally operate based on the information in the first couple of pages -of memory userspace processes should not be allowed to write to them. By -default this value is set to 0 and no protections will be enforced by the -security module. Setting this value to something like 64k will allow the -vast majority of applications to work correctly and provide defense in depth -against future potential kernel bugs. +The default value is 1. + +See Documentation/nommu-mmap.txt for more information. ============================================================== @@ -316,16 +536,16 @@ the DMA zone. Type(A) is called as "Node" order. Type (B) is "Zone" order. "Node order" orders the zonelists by node, then by zone within each node. -Specify "[Nn]ode" for zone order +Specify "[Nn]ode" for node order "Zone Order" orders the zonelists by zone type, then by node within each -zone. Specify "[Zz]one"for zode order. +zone. Specify "[Zz]one" for zone order. Specify "[Dd]efault" to request automatic configuration. Autoconfiguration will select "node" order in following case. (1) if the DMA zone does not exist or (2) if the DMA zone comprises greater than 50% of the available memory or -(3) if any node's DMA zone comprises greater than 60% of its local memory and +(3) if any node's DMA zone comprises greater than 70% of its local memory and the amount of local memory is big enough. Otherwise, "zone" order will be selected. Default order is recommended unless @@ -333,17 +553,252 @@ this is causing problems for your system/application. ============================================================== -nr_hugepages +oom_dump_tasks -Change the minimum size of the hugepage pool. +Enables a system-wide task dump (excluding kernel threads) to be +produced when the kernel performs an OOM-killing and includes such +information as pid, uid, tgid, vm size, rss, nr_ptes, swapents, +oom_score_adj score, and name. This is helpful to determine why the +OOM killer was invoked, to identify the rogue task that caused it, +and to determine why the OOM killer chose the task it did to kill. -See Documentation/vm/hugetlbpage.txt +If this is set to zero, this information is suppressed. On very +large systems with thousands of tasks it may not be feasible to dump +the memory state information for each one. Such systems should not +be forced to incur a performance penalty in OOM conditions when the +information may not be desired. + +If this is set to non-zero, this information is shown whenever the +OOM killer actually kills a memory-hogging task. + +The default value is 1 (enabled). ============================================================== -nr_overcommit_hugepages +oom_kill_allocating_task -Change the maximum size of the hugepage pool. The maximum is -nr_hugepages + nr_overcommit_hugepages. +This enables or disables killing the OOM-triggering task in +out-of-memory situations. -See Documentation/vm/hugetlbpage.txt +If this is set to zero, the OOM killer will scan through the entire +tasklist and select a task based on heuristics to kill. This normally +selects a rogue memory-hogging task that frees up a large amount of +memory when killed. + +If this is set to non-zero, the OOM killer simply kills the task that +triggered the out-of-memory condition. This avoids the expensive +tasklist scan. + +If panic_on_oom is selected, it takes precedence over whatever value +is used in oom_kill_allocating_task. + +The default value is 0. + +============================================================== + +overcommit_kbytes: + +When overcommit_memory is set to 2, the committed address space is not +permitted to exceed swap plus this amount of physical RAM. See below. + +Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one +of them may be specified at a time. Setting one disables the other (which +then appears as 0 when read). + +============================================================== + +overcommit_memory: + +This value contains a flag that enables memory overcommitment. + +When this flag is 0, the kernel attempts to estimate the amount +of free memory left when userspace requests more memory. + +When this flag is 1, the kernel pretends there is always enough +memory until it actually runs out. + +When this flag is 2, the kernel uses a "never overcommit" +policy that attempts to prevent any overcommit of memory. +Note that user_reserve_kbytes affects this policy. + +This feature can be very useful because there are a lot of +programs that malloc() huge amounts of memory "just-in-case" +and don't use much of it. + +The default value is 0. + +See Documentation/vm/overcommit-accounting and +security/commoncap.c::cap_vm_enough_memory() for more information. + +============================================================== + +overcommit_ratio: + +When overcommit_memory is set to 2, the committed address +space is not permitted to exceed swap plus this percentage +of physical RAM. See above. + +============================================================== + +page-cluster + +page-cluster controls the number of pages up to which consecutive pages +are read in from swap in a single attempt. This is the swap counterpart +to page cache readahead. +The mentioned consecutivity is not in terms of virtual/physical addresses, +but consecutive on swap space - that means they were swapped out together. + +It is a logarithmic value - setting it to zero means "1 page", setting +it to 1 means "2 pages", setting it to 2 means "4 pages", etc. +Zero disables swap readahead completely. + +The default value is three (eight pages at a time). There may be some +small benefits in tuning this to a different value if your workload is +swap-intensive. + +Lower values mean lower latencies for initial faults, but at the same time +extra faults and I/O delays for following faults if they would have been part of +that consecutive pages readahead would have brought in. + +============================================================= + +panic_on_oom + +This enables or disables panic on out-of-memory feature. + +If this is set to 0, the kernel will kill some rogue process, +called oom_killer. Usually, oom_killer can kill rogue processes and +system will survive. + +If this is set to 1, the kernel panics when out-of-memory happens. +However, if a process limits using nodes by mempolicy/cpusets, +and those nodes become memory exhaustion status, one process +may be killed by oom-killer. No panic occurs in this case. +Because other nodes' memory may be free. This means system total status +may be not fatal yet. + +If this is set to 2, the kernel panics compulsorily even on the +above-mentioned. Even oom happens under memory cgroup, the whole +system panics. + +The default value is 0. +1 and 2 are for failover of clustering. Please select either +according to your policy of failover. +panic_on_oom=2+kdump gives you very strong tool to investigate +why oom happens. You can get snapshot. + +============================================================= + +percpu_pagelist_fraction + +This is the fraction of pages at most (high mark pcp->high) in each zone that +are allocated for each per cpu page list. The min value for this is 8. It +means that we don't allow more than 1/8th of pages in each zone to be +allocated in any single per_cpu_pagelist. This entry only changes the value +of hot per cpu pagelists. User can specify a number like 100 to allocate +1/100th of each zone to each per cpu page list. + +The batch value of each per cpu pagelist is also updated as a result. It is +set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) + +The initial value is zero. Kernel does not use this value at boot time to set +the high water marks for each per cpu page list. If the user writes '0' to this +sysctl, it will revert to this default behavior. + +============================================================== + +stat_interval + +The time interval between which vm statistics are updated. The default +is 1 second. + +============================================================== + +swappiness + +This control is used to define how aggressive the kernel will swap +memory pages. Higher values will increase agressiveness, lower values +decrease the amount of swap. A value of 0 instructs the kernel not to +initiate swap until the amount of free and file-backed pages is less +than the high water mark in a zone. + +The default value is 60. + +============================================================== + +- user_reserve_kbytes + +When overcommit_memory is set to 2, "never overommit" mode, reserve +min(3% of current process size, user_reserve_kbytes) of free memory. +This is intended to prevent a user from starting a single memory hogging +process, such that they cannot recover (kill the hog). + +user_reserve_kbytes defaults to min(3% of the current process size, 128MB). + +If this is reduced to zero, then the user will be allowed to allocate +all free memory with a single process, minus admin_reserve_kbytes. +Any subsequent attempts to execute a command will result in +"fork: Cannot allocate memory". + +Changing this takes effect whenever an application requests memory. + +============================================================== + +vfs_cache_pressure +------------------ + +This percentage value controls the tendency of the kernel to reclaim +the memory which is used for caching of directory and inode objects. + +At the default value of vfs_cache_pressure=100 the kernel will attempt to +reclaim dentries and inodes at a "fair" rate with respect to pagecache and +swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer +to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will +never reclaim dentries and inodes due to memory pressure and this can easily +lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 +causes the kernel to prefer to reclaim dentries and inodes. + +Increasing vfs_cache_pressure significantly beyond 100 may have negative +performance impact. Reclaim code needs to take various locks to find freeable +directory and inode objects. With vfs_cache_pressure=1000, it will look for +ten times more freeable objects than there are. + +============================================================== + +zone_reclaim_mode: + +Zone_reclaim_mode allows someone to set more or less aggressive approaches to +reclaim memory when a zone runs out of memory. If it is set to zero then no +zone reclaim occurs. Allocations will be satisfied from other zones / nodes +in the system. + +This is value ORed together of + +1 = Zone reclaim on +2 = Zone reclaim writes dirty pages out +4 = Zone reclaim swaps pages + +zone_reclaim_mode is disabled by default. For file servers or workloads +that benefit from having their data cached, zone_reclaim_mode should be +left disabled as the caching effect is likely to be more important than +data locality. + +zone_reclaim may be enabled if it's known that the workload is partitioned +such that each partition fits within a NUMA node and that accessing remote +memory would cause a measurable performance reduction. The page allocator +will then reclaim easily reusable pages (those page cache pages that are +currently not used) before allocating off node pages. + +Allowing zone reclaim to write out pages stops processes that are +writing large amounts of data from dirtying pages on other nodes. Zone +reclaim will write out dirty pages if a zone fills up and so effectively +throttle the process. This may decrease the performance of a single process +since it cannot use all of system memory to buffer the outgoing writes +anymore but it preserve the memory on other nodes so that the performance +of other processes running on other nodes will not be affected. + +Allowing regular swap effectively restricts allocations to the local +node unless explicitly overridden by memory policies or cpuset +configurations. + +============ End of Document ================================= |
