diff options
Diffstat (limited to 'Documentation/sysctl/kernel.txt')
| -rw-r--r-- | Documentation/sysctl/kernel.txt | 585 |
1 files changed, 533 insertions, 52 deletions
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 276a7e63782..c14374e7177 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -1,5 +1,6 @@ Documentation for /proc/sys/kernel/* kernel version 2.2.10 (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> + (c) 2009, Shen Feng<shen@cn.fujitsu.com> For general info and legal blurb, please look in README. @@ -16,58 +17,76 @@ before actually making adjustments. Currently, these files might (depending on your configuration) show up in /proc/sys/kernel: -- acpi_video_flags + - acct +- acpi_video_flags +- auto_msgmni +- bootloader_type [ X86 only ] +- bootloader_version [ X86 only ] +- callhome [ S390 only ] +- cap_last_cap - core_pattern +- core_pipe_limit - core_uses_pid - ctrl-alt-del -- dentry-state +- dmesg_restrict - domainname - hostname - hotplug -- java-appletviewer [ binfmt_java, obsolete ] -- java-interpreter [ binfmt_java, obsolete ] +- hung_task_panic +- hung_task_check_count +- hung_task_timeout_secs +- hung_task_warnings +- kexec_load_disabled +- kptr_restrict - kstack_depth_to_print [ X86 only ] - l2cr [ PPC only ] - modprobe ==> Documentation/debugging-modules.txt +- modules_disabled +- msg_next_id [ sysv ipc ] - msgmax - msgmnb - msgmni +- nmi_watchdog - osrelease - ostype - overflowgid - overflowuid - panic +- panic_on_oops +- panic_on_unrecovered_nmi +- panic_on_stackoverflow - pid_max - powersave-nap [ PPC only ] - printk +- printk_delay +- printk_ratelimit +- printk_ratelimit_burst - randomize_va_space - real-root-dev ==> Documentation/initrd.txt - reboot-cmd [ SPARC only ] - rtsig-max - rtsig-nr - sem +- sem_next_id [ sysv ipc ] - sg-big-buff [ generic SCSI device (sg) ] +- shm_next_id [ sysv ipc ] +- shm_rmid_forced - shmall - shmmax [ sysv ipc ] - shmmni +- softlockup_all_cpu_backtrace - stop-a [ SPARC only ] - sysrq ==> Documentation/sysrq.txt +- sysctl_writes_strict - tainted - threads-max +- unknown_nmi_panic +- watchdog_thresh - version ============================================================== -acpi_video_flags: - -flags - -See Doc*/kernel/power/video.txt, it allows mode of video boot to be -set during run time. - -============================================================== - acct: highwater lowwater frequency @@ -85,6 +104,76 @@ valid for 30 seconds. ============================================================== +acpi_video_flags: + +flags + +See Doc*/kernel/power/video.txt, it allows mode of video boot to be +set during run time. + +============================================================== + +auto_msgmni: + +Enables/Disables automatic recomputing of msgmni upon memory add/remove +or upon ipc namespace creation/removal (see the msgmni description +above). Echoing "1" into this file enables msgmni automatic recomputing. +Echoing "0" turns it off. auto_msgmni default value is 1. + + +============================================================== + +bootloader_type: + +x86 bootloader identification + +This gives the bootloader type number as indicated by the bootloader, +shifted left by 4, and OR'd with the low four bits of the bootloader +version. The reason for this encoding is that this used to match the +type_of_loader field in the kernel header; the encoding is kept for +backwards compatibility. That is, if the full bootloader type number +is 0x15 and the full version number is 0x234, this file will contain +the value 340 = 0x154. + +See the type_of_loader and ext_loader_type fields in +Documentation/x86/boot.txt for additional information. + +============================================================== + +bootloader_version: + +x86 bootloader version + +The complete bootloader version number. In the example above, this +file will contain the value 564 = 0x234. + +See the type_of_loader and ext_loader_ver fields in +Documentation/x86/boot.txt for additional information. + +============================================================== + +callhome: + +Controls the kernel's callhome behavior in case of a kernel panic. + +The s390 hardware allows an operating system to send a notification +to a service organization (callhome) in case of an operating system panic. + +When the value in this file is 0 (which is the default behavior) +nothing happens in case of a kernel panic. If this value is set to "1" +the complete kernel oops message is send to the IBM customer service +organization in case the mainframe the Linux operating system is running +on has a service contract with IBM. + +============================================================== + +cap_last_cap + +Highest valid capability of the running kernel. Exports +CAP_LAST_CAP from the kernel. + +============================================================== + core_pattern: core_pattern is used to specify a core dumpfile pattern name. @@ -100,12 +189,16 @@ core_pattern is used to specify a core dumpfile pattern name. %<NUL> '%' is dropped %% output one '%' %p pid + %P global pid (init PID namespace) %u uid %g gid + %d dump mode, matches PR_SET_DUMPABLE and + /proc/sys/fs/suid_dumpable %s signal number %t UNIX time of dump %h hostname - %e executable filename + %e executable filename (may be shortened) + %E executable path %<OTHER> both are dropped . If the first character of the pattern is a '|', the kernel will treat the rest of the pattern as a command to run. The core dump will be @@ -113,6 +206,29 @@ core_pattern is used to specify a core dumpfile pattern name. ============================================================== +core_pipe_limit: + +This sysctl is only applicable when core_pattern is configured to pipe +core files to a user space helper (when the first character of +core_pattern is a '|', see above). When collecting cores via a pipe +to an application, it is occasionally useful for the collecting +application to gather data about the crashing process from its +/proc/pid directory. In order to do this safely, the kernel must wait +for the collecting process to exit, so as not to remove the crashing +processes proc files prematurely. This in turn creates the +possibility that a misbehaving userspace collecting process can block +the reaping of a crashed process simply by never exiting. This sysctl +defends against that. It defines how many concurrent crashing +processes may be piped to user space applications in parallel. If +this value is exceeded, then those crashing processes above that value +are noted via the kernel log and their cores are skipped. 0 is a +special value, indicating that unlimited processes may be captured in +parallel, but that no waiting will take place (i.e. the collecting +process is not guaranteed access to /proc/<crashing pid>/). This +value defaults to 0. + +============================================================== + core_uses_pid: The default coredump filename is "core". By setting @@ -138,6 +254,19 @@ to decide what to do with it. ============================================================== +dmesg_restrict: + +This toggle indicates whether unprivileged users are prevented +from using dmesg(8) to view messages from the kernel's log buffer. +When dmesg_restrict is set to (0) there are no restrictions. When +dmesg_restrict is set set to (1), users must have CAP_SYSLOG to use +dmesg(8). + +The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the +default value of dmesg_restrict. + +============================================================== + domainname & hostname: These files can be used to set the NIS/YP domainname and the @@ -165,10 +294,79 @@ Default value is "/sbin/hotplug". ============================================================== -l2cr: (PPC only) +hung_task_panic: -This flag controls the L2 cache of G3 processor boards. If -0, the cache is disabled. Enabled if nonzero. +Controls the kernel's behavior when a hung task is detected. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +0: continue operation. This is the default behavior. + +1: panic immediately. + +============================================================== + +hung_task_check_count: + +The upper bound on the number of tasks that are checked. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +============================================================== + +hung_task_timeout_secs: + +Check interval. When a task in D state did not get scheduled +for more than this value report a warning. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +0: means infinite timeout - no checking done. +Possible values to set are in range {0..LONG_MAX/HZ}. + +============================================================== + +hung_task_warnings: + +The maximum number of warnings to report. During a check interval +if a hung task is detected, this value is decreased by 1. +When this value reaches 0, no more warnings will be reported. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +-1: report an infinite number of warnings. + +============================================================== + +kexec_load_disabled: + +A toggle indicating if the kexec_load syscall has been disabled. This +value defaults to 0 (false: kexec_load enabled), but can be set to 1 +(true: kexec_load disabled). Once true, kexec can no longer be used, and +the toggle cannot be set back to false. This allows a kexec image to be +loaded before disabling the syscall, allowing a system to set up (and +later use) an image without it being altered. Generally used together +with the "modules_disabled" sysctl. + +============================================================== + +kptr_restrict: + +This toggle indicates whether restrictions are placed on +exposing kernel addresses via /proc and other interfaces. + +When kptr_restrict is set to (0), the default, there are no restrictions. + +When kptr_restrict is set to (1), kernel pointers printed using the %pK +format specifier will be replaced with 0's unless the user has CAP_SYSLOG +and effective user and group ids are equal to the real ids. This is +because %pK checks are done at read() time rather than open() time, so +if permissions are elevated between the open() and the read() (e.g via +a setuid binary) then %pK will not leak kernel pointers to unprivileged +users. Note, this is a temporary solution only. The correct long-term +solution is to do the permission checks at open() time. Consider removing +world read permissions from files that use %pK, and using dmesg_restrict +to protect against uses of %pK in dmesg(8) if leaking kernel pointer +values to unprivileged users is a concern. + +When kptr_restrict is set to (2), kernel pointers printed using +%pK will be replaced with 0's regardless of privileges. ============================================================== @@ -179,6 +377,117 @@ kernel stack. ============================================================== +l2cr: (PPC only) + +This flag controls the L2 cache of G3 processor boards. If +0, the cache is disabled. Enabled if nonzero. + +============================================================== + +modules_disabled: + +A toggle value indicating if modules are allowed to be loaded +in an otherwise modular kernel. This toggle defaults to off +(0), but can be set true (1). Once true, modules can be +neither loaded nor unloaded, and the toggle cannot be set back +to false. Generally used with the "kexec_load_disabled" toggle. + +============================================================== + +msg_next_id, sem_next_id, and shm_next_id: + +These three toggles allows to specify desired id for next allocated IPC +object: message, semaphore or shared memory respectively. + +By default they are equal to -1, which means generic allocation logic. +Possible values to set are in range {0..INT_MAX}. + +Notes: +1) kernel doesn't guarantee, that new object will have desired id. So, +it's up to userspace, how to handle an object with "wrong" id. +2) Toggle with non-default value will be set back to -1 by kernel after +successful IPC object allocation. + +============================================================== + +nmi_watchdog: + +Enables/Disables the NMI watchdog on x86 systems. When the value is +non-zero the NMI watchdog is enabled and will continuously test all +online cpus to determine whether or not they are still functioning +properly. Currently, passing "nmi_watchdog=" parameter at boot time is +required for this function to work. + +If LAPIC NMI watchdog method is in use (nmi_watchdog=2 kernel +parameter), the NMI watchdog shares registers with oprofile. By +disabling the NMI watchdog, oprofile may have more registers to +utilize. + +============================================================== + +numa_balancing + +Enables/disables automatic page fault based NUMA memory +balancing. Memory is moved automatically to nodes +that access it often. + +Enables/disables automatic NUMA memory balancing. On NUMA machines, there +is a performance penalty if remote memory is accessed by a CPU. When this +feature is enabled the kernel samples what task thread is accessing memory +by periodically unmapping pages and later trapping a page fault. At the +time of the page fault, it is determined if the data being accessed should +be migrated to a local memory node. + +The unmapping of pages and trapping faults incur additional overhead that +ideally is offset by improved memory locality but there is no universal +guarantee. If the target workload is already bound to NUMA nodes then this +feature should be disabled. Otherwise, if the system overhead from the +feature is too high then the rate the kernel samples for NUMA hinting +faults may be controlled by the numa_balancing_scan_period_min_ms, +numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, +numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls. + +============================================================== + +numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, +numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb + +Automatic NUMA balancing scans tasks address space and unmaps pages to +detect if pages are properly placed or if the data should be migrated to a +memory node local to where the task is running. Every "scan delay" the task +scans the next "scan size" number of pages in its address space. When the +end of the address space is reached the scanner restarts from the beginning. + +In combination, the "scan delay" and "scan size" determine the scan rate. +When "scan delay" decreases, the scan rate increases. The scan delay and +hence the scan rate of every task is adaptive and depends on historical +behaviour. If pages are properly placed then the scan delay increases, +otherwise the scan delay decreases. The "scan size" is not adaptive but +the higher the "scan size", the higher the scan rate. + +Higher scan rates incur higher system overhead as page faults must be +trapped and potentially data must be migrated. However, the higher the scan +rate, the more quickly a tasks memory is migrated to a local node if the +workload pattern changes and minimises performance impact due to remote +memory accesses. These sysctls control the thresholds for scan delays and +the number of pages scanned. + +numa_balancing_scan_period_min_ms is the minimum time in milliseconds to +scan a tasks virtual memory. It effectively controls the maximum scanning +rate for each task. + +numa_balancing_scan_delay_ms is the starting "scan delay" used for a task +when it initially forks. + +numa_balancing_scan_period_max_ms is the maximum time in milliseconds to +scan a tasks virtual memory. It effectively controls the minimum scanning +rate for each task. + +numa_balancing_scan_size_mb is how many megabytes worth of pages are +scanned for a given scan. + +============================================================== + osrelease, ostype & version: # cat osrelease @@ -198,10 +507,10 @@ The only way to tune these values is to rebuild the kernel :-) overflowgid & overflowuid: -if your architecture did not always support 32-bit UIDs (i.e. arm, i386, -m68k, sh, and sparc32), a fixed UID and GID will be returned to -applications that use the old 16-bit UID/GID system calls, if the actual -UID or GID would exceed 65535. +if your architecture did not always support 32-bit UIDs (i.e. arm, +i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to +applications that use the old 16-bit UID/GID system calls, if the +actual UID or GID would exceed 65535. These sysctls allow you to change the value of the fixed UID and GID. The default is 65534. @@ -210,9 +519,22 @@ The default is 65534. panic: -The value in this file represents the number of seconds the -kernel waits before rebooting on a panic. When you use the -software watchdog, the recommended setting is 60. +The value in this file represents the number of seconds the kernel +waits before rebooting on a panic. When you use the software watchdog, +the recommended setting is 60. + +============================================================== + +panic_on_unrecovered_nmi: + +The default Linux behaviour on an NMI of either memory or unknown is +to continue operation. For many environments such as scientific +computing it is preferable that the box is taken out and the error +dealt with than an uncorrected parity/ECC error get propagated. + +A small number of systems do generate NMI's for bizarre random reasons +such as power management so the default is off. That sysctl works like +the existing panic controls already in that directory. ============================================================== @@ -227,6 +549,45 @@ Controls the kernel's behaviour when an oops or BUG is encountered. ============================================================== +panic_on_stackoverflow: + +Controls the kernel's behavior when detecting the overflows of +kernel, IRQ and exception stacks except a user stack. +This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled. + +0: try to continue operation. + +1: panic immediately. + +============================================================== + +perf_cpu_time_max_percent: + +Hints to the kernel how much CPU time it should be allowed to +use to handle perf sampling events. If the perf subsystem +is informed that its samples are exceeding this limit, it +will drop its sampling frequency to attempt to reduce its CPU +usage. + +Some perf sampling happens in NMIs. If these samples +unexpectedly take too long to execute, the NMIs can become +stacked up next to each other so much that nothing else is +allowed to execute. + +0: disable the mechanism. Do not monitor or correct perf's + sampling rate no matter how CPU time it takes. + +1-100: attempt to throttle perf's sample rate to this + percentage of CPU. Note: the kernel calculates an + "expected" length of each sample event. 100 here means + 100% of that expected length. Even if this is set to + 100, you may still see sample throttling if this + length is exceeded. Set to 0 if you truly do not care + how much CPU is consumed. + +============================================================== + + pid_max: PID allocation wrap value. When the kernel's next PID value @@ -235,6 +596,14 @@ PIDs of value pid_max or larger are not allocated. ============================================================== +ns_last_pid: + +The last pid allocated in the current (the one task using this sysctl +lives in) pid namespace. When selecting a pid for a next task on fork +kernel tries to allocate a number starting from this one. + +============================================================== + powersave-nap: (PPC only) If set, Linux-PPC will use the 'nap' mode of powersaving, @@ -254,7 +623,7 @@ the different loglevels. - console_loglevel: messages with a higher priority than this will be printed to the console -- default_message_level: messages without an explicit priority +- default_message_loglevel: messages without an explicit priority will be printed with this priority - minimum_console_loglevel: minimum (highest) value to which console_loglevel can be set @@ -262,6 +631,14 @@ the different loglevels. ============================================================== +printk_delay: + +Delay each printk message in printk_delay milliseconds + +Value from 0 - 10000 is allowed. + +============================================================== + printk_ratelimit: Some warning messages are rate limited. printk_ratelimit specifies @@ -281,31 +658,35 @@ send before ratelimiting kicks in. ============================================================== -randomize-va-space: +randomize_va_space: This option can be used to select the type of process address space randomization that is used in the system, for architectures that support this feature. -0 - Turn the process address space randomization off by default. +0 - Turn the process address space randomization off. This is the + default for architectures that do not support this feature anyways, + and kernels that are booted with the "norandmaps" parameter. 1 - Make the addresses of mmap base, stack and VDSO page randomized. This, among other things, implies that shared libraries will be - loaded to random addresses. Also for PIE-linked binaries, the location - of code start is randomized. + loaded to random addresses. Also for PIE-linked binaries, the + location of code start is randomized. This is the default if the + CONFIG_COMPAT_BRK option is enabled. - With heap randomization, the situation is a little bit more - complicated. - There a few legacy applications out there (such as some ancient +2 - Additionally enable heap randomization. This is the default if + CONFIG_COMPAT_BRK is disabled. + + There are a few legacy applications out there (such as some ancient versions of libc.so.5 from 1996) that assume that brk area starts - just after the end of the code+bss. These applications break when - start of the brk area is randomized. There are however no known + just after the end of the code+bss. These applications break when + start of the brk area is randomized. There are however no known non-legacy applications that would be broken this way, so for most - systems it is safe to choose full randomization. However there is - a CONFIG_COMPAT_BRK option for systems with ancient and/or broken - binaries, that makes heap non-randomized, but keeps all other - parts of process address space randomized if randomize_va_space - sysctl is turned on. + systems it is safe to choose full randomization. + + Systems with ancient and/or broken binaries should be configured + with CONFIG_COMPAT_BRK enabled, which excludes the heap from process + address space randomization. ============================================================== @@ -340,32 +721,132 @@ are doing anyway :) ============================================================== -shmmax: +shmall: + +This parameter sets the total amount of shared memory pages that +can be used system wide. Hence, SHMALL should always be at least +ceil(shmmax/PAGE_SIZE). + +If you are not sure what the default PAGE_SIZE is on your Linux +system, you can run the following command: + +# getconf PAGE_SIZE + +============================================================== + +shmmax: This value can be used to query and set the run time limit on the maximum shared memory segment size that can be created. -Shared memory segments up to 1Gb are now supported in the +Shared memory segments up to 1Gb are now supported in the kernel. This value defaults to SHMMAX. ============================================================== -softlockup_thresh: +shm_rmid_forced: + +Linux lets you set resource limits, including how much memory one +process can consume, via setrlimit(2). Unfortunately, shared memory +segments are allowed to exist without association with any process, and +thus might not be counted against any resource limits. If enabled, +shared memory segments are automatically destroyed when their attach +count becomes zero after a detach or a process termination. It will +also destroy segments that were created, but never attached to, on exit +from the process. The only use left for IPC_RMID is to immediately +destroy an unattached segment. Of course, this breaks the way things are +defined, so some applications might stop working. Note that this +feature will do you no good unless you also configure your resource +limits (in particular, RLIMIT_AS and RLIMIT_NPROC). Most systems don't +need this. + +Note that if you change this from 0 to 1, already created segments +without users and with a dead originative process will be destroyed. -This value can be used to lower the softlockup tolerance -threshold. The default threshold is 10s. If a cpu is locked up -for 10s, the kernel complains. Valid values are 1-60s. +============================================================== + +sysctl_writes_strict: + +Control how file position affects the behavior of updating sysctl values +via the /proc/sys interface: + + -1 - Legacy per-write sysctl value handling, with no printk warnings. + Each write syscall must fully contain the sysctl value to be + written, and multiple writes on the same sysctl file descriptor + will rewrite the sysctl value, regardless of file position. + 0 - (default) Same behavior as above, but warn about processes that + perform writes to a sysctl file descriptor when the file position + is not 0. + 1 - Respect file position when writing sysctl strings. Multiple writes + will append to the sysctl value buffer. Anything past the max length + of the sysctl value buffer will be ignored. Writes to numeric sysctl + entries must always be at file position 0 and the value must be + fully contained in the buffer sent in the write syscall. + +============================================================== + +softlockup_all_cpu_backtrace: + +This value controls the soft lockup detector thread's behavior +when a soft lockup condition is detected as to whether or not +to gather further debug information. If enabled, each cpu will +be issued an NMI and instructed to capture stack trace. + +This feature is only applicable for architectures which support +NMI. + +0: do nothing. This is the default behavior. + +1: on detection capture more debug information. ============================================================== -tainted: +tainted: Non-zero if the kernel has been tainted. Numeric values, which can be ORed together: - 1 - A module with a non-GPL license has been loaded, this - includes modules with no license. - Set by modutils >= 2.4.9 and module-init-tools. - 2 - A module was force loaded by insmod -f. - Set by modutils >= 2.4.9 and module-init-tools. - 4 - Unsafe SMP processors: SMP with CPUs not designed for SMP. + 1 - A module with a non-GPL license has been loaded, this + includes modules with no license. + Set by modutils >= 2.4.9 and module-init-tools. + 2 - A module was force loaded by insmod -f. + Set by modutils >= 2.4.9 and module-init-tools. + 4 - Unsafe SMP processors: SMP with CPUs not designed for SMP. + 8 - A module was forcibly unloaded from the system by rmmod -f. + 16 - A hardware machine check error occurred on the system. + 32 - A bad page was discovered on the system. + 64 - The user has asked that the system be marked "tainted". This + could be because they are running software that directly modifies + the hardware, or for other reasons. + 128 - The system has died. + 256 - The ACPI DSDT has been overridden with one supplied by the user + instead of using the one provided by the hardware. + 512 - A kernel warning has occurred. +1024 - A module from drivers/staging was loaded. +2048 - The system is working around a severe firmware bug. +4096 - An out-of-tree module has been loaded. +8192 - An unsigned module has been loaded in a kernel supporting module + signature. +============================================================== + +unknown_nmi_panic: + +The value in this file affects behavior of handling NMI. When the +value is non-zero, unknown NMI is trapped and then panic occurs. At +that time, kernel debugging information is displayed on console. + +NMI switch that most IA32 servers have fires unknown NMI up, for +example. If a system hangs up, try pressing the NMI switch. + +============================================================== + +watchdog_thresh: + +This value can be used to control the frequency of hrtimer and NMI +events and the soft and hard lockup thresholds. The default threshold +is 10 seconds. + +The softlockup threshold is (2 * watchdog_thresh). Setting this +tunable to zero will disable lockup detection altogether. + +============================================================== |
