aboutsummaryrefslogtreecommitdiff
path: root/Documentation/sysctl
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/sysctl')
-rw-r--r--Documentation/sysctl/00-INDEX2
-rw-r--r--Documentation/sysctl/fs.txt91
-rw-r--r--Documentation/sysctl/kernel.txt518
-rw-r--r--Documentation/sysctl/net.txt81
-rw-r--r--Documentation/sysctl/vm.txt219
5 files changed, 719 insertions, 192 deletions
diff --git a/Documentation/sysctl/00-INDEX b/Documentation/sysctl/00-INDEX
index 1286f455992..8cf5d493fd0 100644
--- a/Documentation/sysctl/00-INDEX
+++ b/Documentation/sysctl/00-INDEX
@@ -4,8 +4,6 @@ README
- general information about /proc/sys/ sysctl files.
abi.txt
- documentation for /proc/sys/abi/*.
-ctl_unnumbered.txt
- - explanation of why one should not add new binary sysctl numbers.
fs.txt
- documentation for /proc/sys/fs/*.
kernel.txt
diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt
index 62682500878..88152f214f4 100644
--- a/Documentation/sysctl/fs.txt
+++ b/Documentation/sysctl/fs.txt
@@ -32,6 +32,8 @@ Currently, these files are in /proc/sys/fs:
- nr_open
- overflowuid
- overflowgid
+- protected_hardlinks
+- protected_symlinks
- suid_dumpable
- super-max
- super-nr
@@ -88,20 +90,19 @@ you might want to raise the limit.
file-max & file-nr:
-The kernel allocates file handles dynamically, but as yet it
-doesn't free them again.
-
The value in file-max denotes the maximum number of file-
handles that the Linux kernel will allocate. When you get lots
of error messages about running out of file handles, you might
want to increase this limit.
-Historically, the three values in file-nr denoted the number of
-allocated file handles, the number of allocated but unused file
-handles, and the maximum number of file handles. Linux 2.6 always
-reports 0 as the number of free file handles -- this is not an
-error, it just means that the number of allocated file handles
-exactly matches the number of used file handles.
+Historically,the kernel was able to allocate file handles
+dynamically, but not to free them again. The three values in
+file-nr denote the number of allocated file handles, the number
+of allocated but unused file handles, and the maximum number of
+file handles. Linux 2.6 always reports 0 as the number of free
+file handles -- this is not an error, it just means that the
+number of allocated file handles exactly matches the number of
+used file handles.
Attempts to allocate more file descriptors than file-max are
reported with printk, look for "VFS: file-max limit <number>
@@ -158,22 +159,68 @@ The default is 65534.
==============================================================
+protected_hardlinks:
+
+A long-standing class of security issues is the hardlink-based
+time-of-check-time-of-use race, most commonly seen in world-writable
+directories like /tmp. The common method of exploitation of this flaw
+is to cross privilege boundaries when following a given hardlink (i.e. a
+root process follows a hardlink created by another user). Additionally,
+on systems without separated partitions, this stops unauthorized users
+from "pinning" vulnerable setuid/setgid files against being upgraded by
+the administrator, or linking to special files.
+
+When set to "0", hardlink creation behavior is unrestricted.
+
+When set to "1" hardlinks cannot be created by users if they do not
+already own the source file, or do not have read/write access to it.
+
+This protection is based on the restrictions in Openwall and grsecurity.
+
+==============================================================
+
+protected_symlinks:
+
+A long-standing class of security issues is the symlink-based
+time-of-check-time-of-use race, most commonly seen in world-writable
+directories like /tmp. The common method of exploitation of this flaw
+is to cross privilege boundaries when following a given symlink (i.e. a
+root process follows a symlink belonging to another user). For a likely
+incomplete list of hundreds of examples across the years, please see:
+http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
+
+When set to "0", symlink following behavior is unrestricted.
+
+When set to "1" symlinks are permitted to be followed only when outside
+a sticky world-writable directory, or when the uid of the symlink and
+follower match, or when the directory owner matches the symlink's owner.
+
+This protection is based on the restrictions in Openwall and grsecurity.
+
+==============================================================
+
suid_dumpable:
This value can be used to query and set the core dump mode for setuid
or otherwise protected/tainted binaries. The modes are
0 - (default) - traditional behaviour. Any process which has changed
- privilege levels or is execute only will not be dumped
+ privilege levels or is execute only will not be dumped.
1 - (debug) - all processes dump core when possible. The core dump is
owned by the current user and no security is applied. This is
intended for system debugging situations only. Ptrace is unchecked.
+ This is insecure as it allows regular users to examine the memory
+ contents of privileged processes.
2 - (suidsafe) - any binary which normally would not be dumped is dumped
- readable by root only. This allows the end user to remove
- such a dump but not access it directly. For security reasons
- core dumps in this mode will not overwrite one another or
- other files. This mode is appropriate when administrators are
- attempting to debug problems in a normal environment.
+ anyway, but only if the "core_pattern" kernel sysctl is set to
+ either a pipe handler or a fully qualified path. (For more details
+ on this limitation, see CVE-2006-2451.) This mode is appropriate
+ when administrators are attempting to debug problems in a normal
+ environment, and either have a core dump pipe handler that knows
+ to treat privileged core dumps with care, or specific directory
+ defined for catching core dumps. If a core dump happens without
+ a pipe handler or fully qualifid path, a message will be emitted
+ to syslog warning about the lack of a correct setting.
==============================================================
@@ -226,19 +273,19 @@ a queue must be less or equal then msg_max.
maximum message size value (it is every message queue's attribute set during
its creation).
+/proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the
+default number of messages in a queue value if attr parameter of mq_open(2) is
+NULL. If it exceed msg_max, the default value is initialized msg_max.
+
+/proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting
+the default message size value if attr parameter of mq_open(2) is NULL. If it
+exceed msgsize_max, the default value is initialized msgsize_max.
4. /proc/sys/fs/epoll - Configuration options for the epoll interface
--------------------------------------------------------
This directory contains configuration options for the epoll(7) interface.
-max_user_instances
-------------------
-
-This is the maximum number of epoll file descriptors that a single user can
-have open at a given time. The default value is 128, and should be enough
-for normal users.
-
max_user_watches
----------------
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 209e1584c3d..c14374e7177 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -17,27 +17,33 @@ before actually making adjustments.
Currently, these files might (depending on your configuration)
show up in /proc/sys/kernel:
-- acpi_video_flags
+
- acct
+- acpi_video_flags
+- auto_msgmni
- bootloader_type [ X86 only ]
- bootloader_version [ X86 only ]
- callhome [ S390 only ]
-- auto_msgmni
+- cap_last_cap
- core_pattern
- core_pipe_limit
- core_uses_pid
- ctrl-alt-del
-- dentry-state
- dmesg_restrict
- domainname
- hostname
- hotplug
-- java-appletviewer [ binfmt_java, obsolete ]
-- java-interpreter [ binfmt_java, obsolete ]
+- hung_task_panic
+- hung_task_check_count
+- hung_task_timeout_secs
+- hung_task_warnings
+- kexec_load_disabled
+- kptr_restrict
- kstack_depth_to_print [ X86 only ]
- l2cr [ PPC only ]
- modprobe ==> Documentation/debugging-modules.txt
- modules_disabled
+- msg_next_id [ sysv ipc ]
- msgmax
- msgmnb
- msgmni
@@ -47,38 +53,40 @@ show up in /proc/sys/kernel:
- overflowgid
- overflowuid
- panic
+- panic_on_oops
+- panic_on_unrecovered_nmi
+- panic_on_stackoverflow
- pid_max
- powersave-nap [ PPC only ]
-- panic_on_unrecovered_nmi
- printk
+- printk_delay
+- printk_ratelimit
+- printk_ratelimit_burst
- randomize_va_space
- real-root-dev ==> Documentation/initrd.txt
- reboot-cmd [ SPARC only ]
- rtsig-max
- rtsig-nr
- sem
+- sem_next_id [ sysv ipc ]
- sg-big-buff [ generic SCSI device (sg) ]
+- shm_next_id [ sysv ipc ]
+- shm_rmid_forced
- shmall
- shmmax [ sysv ipc ]
- shmmni
+- softlockup_all_cpu_backtrace
- stop-a [ SPARC only ]
- sysrq ==> Documentation/sysrq.txt
+- sysctl_writes_strict
- tainted
- threads-max
- unknown_nmi_panic
+- watchdog_thresh
- version
==============================================================
-acpi_video_flags:
-
-flags
-
-See Doc*/kernel/power/video.txt, it allows mode of video boot to be
-set during run time.
-
-==============================================================
-
acct:
highwater lowwater frequency
@@ -96,6 +104,25 @@ valid for 30 seconds.
==============================================================
+acpi_video_flags:
+
+flags
+
+See Doc*/kernel/power/video.txt, it allows mode of video boot to be
+set during run time.
+
+==============================================================
+
+auto_msgmni:
+
+Enables/Disables automatic recomputing of msgmni upon memory add/remove
+or upon ipc namespace creation/removal (see the msgmni description
+above). Echoing "1" into this file enables msgmni automatic recomputing.
+Echoing "0" turns it off. auto_msgmni default value is 1.
+
+
+==============================================================
+
bootloader_type:
x86 bootloader identification
@@ -140,6 +167,13 @@ on has a service contract with IBM.
==============================================================
+cap_last_cap
+
+Highest valid capability of the running kernel. Exports
+CAP_LAST_CAP from the kernel.
+
+==============================================================
+
core_pattern:
core_pattern is used to specify a core dumpfile pattern name.
@@ -155,12 +189,16 @@ core_pattern is used to specify a core dumpfile pattern name.
%<NUL> '%' is dropped
%% output one '%'
%p pid
+ %P global pid (init PID namespace)
%u uid
%g gid
+ %d dump mode, matches PR_SET_DUMPABLE and
+ /proc/sys/fs/suid_dumpable
%s signal number
%t UNIX time of dump
%h hostname
- %e executable filename
+ %e executable filename (may be shortened)
+ %E executable path
%<OTHER> both are dropped
. If the first character of the pattern is a '|', the kernel will treat
the rest of the pattern as a command to run. The core dump will be
@@ -170,22 +208,24 @@ core_pattern is used to specify a core dumpfile pattern name.
core_pipe_limit:
-This sysctl is only applicable when core_pattern is configured to pipe core
-files to a user space helper (when the first character of core_pattern is a '|',
-see above). When collecting cores via a pipe to an application, it is
-occasionally useful for the collecting application to gather data about the
-crashing process from its /proc/pid directory. In order to do this safely, the
-kernel must wait for the collecting process to exit, so as not to remove the
-crashing processes proc files prematurely. This in turn creates the possibility
-that a misbehaving userspace collecting process can block the reaping of a
-crashed process simply by never exiting. This sysctl defends against that. It
-defines how many concurrent crashing processes may be piped to user space
-applications in parallel. If this value is exceeded, then those crashing
-processes above that value are noted via the kernel log and their cores are
-skipped. 0 is a special value, indicating that unlimited processes may be
-captured in parallel, but that no waiting will take place (i.e. the collecting
-process is not guaranteed access to /proc/<crashing pid>/). This value defaults
-to 0.
+This sysctl is only applicable when core_pattern is configured to pipe
+core files to a user space helper (when the first character of
+core_pattern is a '|', see above). When collecting cores via a pipe
+to an application, it is occasionally useful for the collecting
+application to gather data about the crashing process from its
+/proc/pid directory. In order to do this safely, the kernel must wait
+for the collecting process to exit, so as not to remove the crashing
+processes proc files prematurely. This in turn creates the
+possibility that a misbehaving userspace collecting process can block
+the reaping of a crashed process simply by never exiting. This sysctl
+defends against that. It defines how many concurrent crashing
+processes may be piped to user space applications in parallel. If
+this value is exceeded, then those crashing processes above that value
+are noted via the kernel log and their cores are skipped. 0 is a
+special value, indicating that unlimited processes may be captured in
+parallel, but that no waiting will take place (i.e. the collecting
+process is not guaranteed access to /proc/<crashing pid>/). This
+value defaults to 0.
==============================================================
@@ -216,14 +256,14 @@ to decide what to do with it.
dmesg_restrict:
-This toggle indicates whether unprivileged users are prevented from using
-dmesg(8) to view messages from the kernel's log buffer. When
-dmesg_restrict is set to (0) there are no restrictions. When
-dmesg_restrict is set set to (1), users must have CAP_SYS_ADMIN to use
+This toggle indicates whether unprivileged users are prevented
+from using dmesg(8) to view messages from the kernel's log buffer.
+When dmesg_restrict is set to (0) there are no restrictions. When
+dmesg_restrict is set set to (1), users must have CAP_SYSLOG to use
dmesg(8).
-The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the default
-value of dmesg_restrict.
+The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the
+default value of dmesg_restrict.
==============================================================
@@ -254,10 +294,79 @@ Default value is "/sbin/hotplug".
==============================================================
-l2cr: (PPC only)
+hung_task_panic:
-This flag controls the L2 cache of G3 processor boards. If
-0, the cache is disabled. Enabled if nonzero.
+Controls the kernel's behavior when a hung task is detected.
+This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
+
+0: continue operation. This is the default behavior.
+
+1: panic immediately.
+
+==============================================================
+
+hung_task_check_count:
+
+The upper bound on the number of tasks that are checked.
+This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
+
+==============================================================
+
+hung_task_timeout_secs:
+
+Check interval. When a task in D state did not get scheduled
+for more than this value report a warning.
+This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
+
+0: means infinite timeout - no checking done.
+Possible values to set are in range {0..LONG_MAX/HZ}.
+
+==============================================================
+
+hung_task_warnings:
+
+The maximum number of warnings to report. During a check interval
+if a hung task is detected, this value is decreased by 1.
+When this value reaches 0, no more warnings will be reported.
+This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
+
+-1: report an infinite number of warnings.
+
+==============================================================
+
+kexec_load_disabled:
+
+A toggle indicating if the kexec_load syscall has been disabled. This
+value defaults to 0 (false: kexec_load enabled), but can be set to 1
+(true: kexec_load disabled). Once true, kexec can no longer be used, and
+the toggle cannot be set back to false. This allows a kexec image to be
+loaded before disabling the syscall, allowing a system to set up (and
+later use) an image without it being altered. Generally used together
+with the "modules_disabled" sysctl.
+
+==============================================================
+
+kptr_restrict:
+
+This toggle indicates whether restrictions are placed on
+exposing kernel addresses via /proc and other interfaces.
+
+When kptr_restrict is set to (0), the default, there are no restrictions.
+
+When kptr_restrict is set to (1), kernel pointers printed using the %pK
+format specifier will be replaced with 0's unless the user has CAP_SYSLOG
+and effective user and group ids are equal to the real ids. This is
+because %pK checks are done at read() time rather than open() time, so
+if permissions are elevated between the open() and the read() (e.g via
+a setuid binary) then %pK will not leak kernel pointers to unprivileged
+users. Note, this is a temporary solution only. The correct long-term
+solution is to do the permission checks at open() time. Consider removing
+world read permissions from files that use %pK, and using dmesg_restrict
+to protect against uses of %pK in dmesg(8) if leaking kernel pointer
+values to unprivileged users is a concern.
+
+When kptr_restrict is set to (2), kernel pointers printed using
+%pK will be replaced with 0's regardless of privileges.
==============================================================
@@ -268,13 +377,114 @@ kernel stack.
==============================================================
+l2cr: (PPC only)
+
+This flag controls the L2 cache of G3 processor boards. If
+0, the cache is disabled. Enabled if nonzero.
+
+==============================================================
+
modules_disabled:
A toggle value indicating if modules are allowed to be loaded
in an otherwise modular kernel. This toggle defaults to off
(0), but can be set true (1). Once true, modules can be
neither loaded nor unloaded, and the toggle cannot be set back
-to false.
+to false. Generally used with the "kexec_load_disabled" toggle.
+
+==============================================================
+
+msg_next_id, sem_next_id, and shm_next_id:
+
+These three toggles allows to specify desired id for next allocated IPC
+object: message, semaphore or shared memory respectively.
+
+By default they are equal to -1, which means generic allocation logic.
+Possible values to set are in range {0..INT_MAX}.
+
+Notes:
+1) kernel doesn't guarantee, that new object will have desired id. So,
+it's up to userspace, how to handle an object with "wrong" id.
+2) Toggle with non-default value will be set back to -1 by kernel after
+successful IPC object allocation.
+
+==============================================================
+
+nmi_watchdog:
+
+Enables/Disables the NMI watchdog on x86 systems. When the value is
+non-zero the NMI watchdog is enabled and will continuously test all
+online cpus to determine whether or not they are still functioning
+properly. Currently, passing "nmi_watchdog=" parameter at boot time is
+required for this function to work.
+
+If LAPIC NMI watchdog method is in use (nmi_watchdog=2 kernel
+parameter), the NMI watchdog shares registers with oprofile. By
+disabling the NMI watchdog, oprofile may have more registers to
+utilize.
+
+==============================================================
+
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans tasks address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running. Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases. The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases. The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a tasks memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
==============================================================
@@ -297,10 +507,10 @@ The only way to tune these values is to rebuild the kernel :-)
overflowgid & overflowuid:
-if your architecture did not always support 32-bit UIDs (i.e. arm, i386,
-m68k, sh, and sparc32), a fixed UID and GID will be returned to
-applications that use the old 16-bit UID/GID system calls, if the actual
-UID or GID would exceed 65535.
+if your architecture did not always support 32-bit UIDs (i.e. arm,
+i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to
+applications that use the old 16-bit UID/GID system calls, if the
+actual UID or GID would exceed 65535.
These sysctls allow you to change the value of the fixed UID and GID.
The default is 65534.
@@ -309,9 +519,22 @@ The default is 65534.
panic:
-The value in this file represents the number of seconds the
-kernel waits before rebooting on a panic. When you use the
-software watchdog, the recommended setting is 60.
+The value in this file represents the number of seconds the kernel
+waits before rebooting on a panic. When you use the software watchdog,
+the recommended setting is 60.
+
+==============================================================
+
+panic_on_unrecovered_nmi:
+
+The default Linux behaviour on an NMI of either memory or unknown is
+to continue operation. For many environments such as scientific
+computing it is preferable that the box is taken out and the error
+dealt with than an uncorrected parity/ECC error get propagated.
+
+A small number of systems do generate NMI's for bizarre random reasons
+such as power management so the default is off. That sysctl works like
+the existing panic controls already in that directory.
==============================================================
@@ -326,6 +549,45 @@ Controls the kernel's behaviour when an oops or BUG is encountered.
==============================================================
+panic_on_stackoverflow:
+
+Controls the kernel's behavior when detecting the overflows of
+kernel, IRQ and exception stacks except a user stack.
+This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled.
+
+0: try to continue operation.
+
+1: panic immediately.
+
+==============================================================
+
+perf_cpu_time_max_percent:
+
+Hints to the kernel how much CPU time it should be allowed to
+use to handle perf sampling events. If the perf subsystem
+is informed that its samples are exceeding this limit, it
+will drop its sampling frequency to attempt to reduce its CPU
+usage.
+
+Some perf sampling happens in NMIs. If these samples
+unexpectedly take too long to execute, the NMIs can become
+stacked up next to each other so much that nothing else is
+allowed to execute.
+
+0: disable the mechanism. Do not monitor or correct perf's
+ sampling rate no matter how CPU time it takes.
+
+1-100: attempt to throttle perf's sample rate to this
+ percentage of CPU. Note: the kernel calculates an
+ "expected" length of each sample event. 100 here means
+ 100% of that expected length. Even if this is set to
+ 100, you may still see sample throttling if this
+ length is exceeded. Set to 0 if you truly do not care
+ how much CPU is consumed.
+
+==============================================================
+
+
pid_max:
PID allocation wrap value. When the kernel's next PID value
@@ -334,6 +596,14 @@ PIDs of value pid_max or larger are not allocated.
==============================================================
+ns_last_pid:
+
+The last pid allocated in the current (the one task using this sysctl
+lives in) pid namespace. When selecting a pid for a next task on fork
+kernel tries to allocate a number starting from this one.
+
+==============================================================
+
powersave-nap: (PPC only)
If set, Linux-PPC will use the 'nap' mode of powersaving,
@@ -353,7 +623,7 @@ the different loglevels.
- console_loglevel: messages with a higher priority than
this will be printed to the console
-- default_message_level: messages without an explicit priority
+- default_message_loglevel: messages without an explicit priority
will be printed with this priority
- minimum_console_loglevel: minimum (highest) value to which
console_loglevel can be set
@@ -361,6 +631,14 @@ the different loglevels.
==============================================================
+printk_delay:
+
+Delay each printk message in printk_delay milliseconds
+
+Value from 0 - 10000 is allowed.
+
+==============================================================
+
printk_ratelimit:
Some warning messages are rate limited. printk_ratelimit specifies
@@ -380,15 +658,7 @@ send before ratelimiting kicks in.
==============================================================
-printk_delay:
-
-Delay each printk message in printk_delay milliseconds
-
-Value from 0 - 10000 is allowed.
-
-==============================================================
-
-randomize-va-space:
+randomize_va_space:
This option can be used to select the type of process address
space randomization that is used in the system, for architectures
@@ -451,25 +721,86 @@ are doing anyway :)
==============================================================
-shmmax:
+shmall:
+
+This parameter sets the total amount of shared memory pages that
+can be used system wide. Hence, SHMALL should always be at least
+ceil(shmmax/PAGE_SIZE).
+
+If you are not sure what the default PAGE_SIZE is on your Linux
+system, you can run the following command:
+
+# getconf PAGE_SIZE
+
+==============================================================
+
+shmmax:
This value can be used to query and set the run time limit
on the maximum shared memory segment size that can be created.
-Shared memory segments up to 1Gb are now supported in the
+Shared memory segments up to 1Gb are now supported in the
kernel. This value defaults to SHMMAX.
==============================================================
-softlockup_thresh:
+shm_rmid_forced:
-This value can be used to lower the softlockup tolerance threshold. The
-default threshold is 60 seconds. If a cpu is locked up for 60 seconds,
-the kernel complains. Valid values are 1-60 seconds. Setting this
-tunable to zero will disable the softlockup detection altogether.
+Linux lets you set resource limits, including how much memory one
+process can consume, via setrlimit(2). Unfortunately, shared memory
+segments are allowed to exist without association with any process, and
+thus might not be counted against any resource limits. If enabled,
+shared memory segments are automatically destroyed when their attach
+count becomes zero after a detach or a process termination. It will
+also destroy segments that were created, but never attached to, on exit
+from the process. The only use left for IPC_RMID is to immediately
+destroy an unattached segment. Of course, this breaks the way things are
+defined, so some applications might stop working. Note that this
+feature will do you no good unless you also configure your resource
+limits (in particular, RLIMIT_AS and RLIMIT_NPROC). Most systems don't
+need this.
+
+Note that if you change this from 0 to 1, already created segments
+without users and with a dead originative process will be destroyed.
==============================================================
-tainted:
+sysctl_writes_strict:
+
+Control how file position affects the behavior of updating sysctl values
+via the /proc/sys interface:
+
+ -1 - Legacy per-write sysctl value handling, with no printk warnings.
+ Each write syscall must fully contain the sysctl value to be
+ written, and multiple writes on the same sysctl file descriptor
+ will rewrite the sysctl value, regardless of file position.
+ 0 - (default) Same behavior as above, but warn about processes that
+ perform writes to a sysctl file descriptor when the file position
+ is not 0.
+ 1 - Respect file position when writing sysctl strings. Multiple writes
+ will append to the sysctl value buffer. Anything past the max length
+ of the sysctl value buffer will be ignored. Writes to numeric sysctl
+ entries must always be at file position 0 and the value must be
+ fully contained in the buffer sent in the write syscall.
+
+==============================================================
+
+softlockup_all_cpu_backtrace:
+
+This value controls the soft lockup detector thread's behavior
+when a soft lockup condition is detected as to whether or not
+to gather further debug information. If enabled, each cpu will
+be issued an NMI and instructed to capture stack trace.
+
+This feature is only applicable for architectures which support
+NMI.
+
+0: do nothing. This is the default behavior.
+
+1: on detection capture more debug information.
+
+==============================================================
+
+tainted:
Non-zero if the kernel has been tainted. Numeric values, which
can be ORed together:
@@ -491,52 +822,31 @@ can be ORed together:
instead of using the one provided by the hardware.
512 - A kernel warning has occurred.
1024 - A module from drivers/staging was loaded.
-
-==============================================================
-
-auto_msgmni:
-
-Enables/Disables automatic recomputing of msgmni upon memory add/remove or
-upon ipc namespace creation/removal (see the msgmni description above).
-Echoing "1" into this file enables msgmni automatic recomputing.
-Echoing "0" turns it off.
-auto_msgmni default value is 1.
-
-==============================================================
-
-nmi_watchdog:
-
-Enables/Disables the NMI watchdog on x86 systems. When the value is non-zero
-the NMI watchdog is enabled and will continuously test all online cpus to
-determine whether or not they are still functioning properly. Currently,
-passing "nmi_watchdog=" parameter at boot time is required for this function
-to work.
-
-If LAPIC NMI watchdog method is in use (nmi_watchdog=2 kernel parameter), the
-NMI watchdog shares registers with oprofile. By disabling the NMI watchdog,
-oprofile may have more registers to utilize.
+2048 - The system is working around a severe firmware bug.
+4096 - An out-of-tree module has been loaded.
+8192 - An unsigned module has been loaded in a kernel supporting module
+ signature.
==============================================================
unknown_nmi_panic:
-The value in this file affects behavior of handling NMI. When the value is
-non-zero, unknown NMI is trapped and then panic occurs. At that time, kernel
-debugging information is displayed on console.
+The value in this file affects behavior of handling NMI. When the
+value is non-zero, unknown NMI is trapped and then panic occurs. At
+that time, kernel debugging information is displayed on console.
-NMI switch that most IA32 servers have fires unknown NMI up, for example.
-If a system hangs up, try pressing the NMI switch.
+NMI switch that most IA32 servers have fires unknown NMI up, for
+example. If a system hangs up, try pressing the NMI switch.
==============================================================
-panic_on_unrecovered_nmi:
+watchdog_thresh:
-The default Linux behaviour on an NMI of either memory or unknown is to continue
-operation. For many environments such as scientific computing it is preferable
-that the box is taken out and the error dealt with than an uncorrected
-parity/ECC error get propogated.
+This value can be used to control the frequency of hrtimer and NMI
+events and the soft and hard lockup thresholds. The default threshold
+is 10 seconds.
-A small number of systems do generate NMI's for bizarre random reasons such as
-power management so the default is off. That sysctl works like the existing
-panic controls already in that directory.
+The softlockup threshold is (2 * watchdog_thresh). Setting this
+tunable to zero will disable lockup detection altogether.
+==============================================================
diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index cbd05ffc606..9a0319a8247 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -1,4 +1,4 @@
-Documentation for /proc/sys/net/* kernel version 2.4.0-test11-pre4
+Documentation for /proc/sys/net/*
(c) 1999 Terrehon Bowden <terrehon@pacbell.net>
Bodo Bauer <bb@ricochet.net>
(c) 2000 Jorge Nerin <comandante@zaralinux.com>
@@ -9,10 +9,10 @@ For general info and legal blurb, please look in README.
==============================================================
This file contains the documentation for the sysctl files in
-/proc/sys/net and is valid for Linux kernel version 2.4.0-test11-pre4.
+/proc/sys/net
The interface to the networking parts of the kernel is located in
-/proc/sys/net. The following table shows all possible subdirectories.You may
+/proc/sys/net. The following table shows all possible subdirectories. You may
see only some of them, depending on your kernel's configuration.
@@ -26,12 +26,67 @@ Table : Subdirectories in /proc/sys/net
ipv4 IP version 4 x25 X.25 protocol
ipx IPX token-ring IBM token ring
bridge Bridging decnet DEC net
- ipv6 IP version 6
+ ipv6 IP version 6 tipc TIPC
..............................................................................
1. /proc/sys/net/core - Network core options
-------------------------------------------------------
+bpf_jit_enable
+--------------
+
+This enables Berkeley Packet Filter Just in Time compiler.
+Currently supported on x86_64 architecture, bpf_jit provides a framework
+to speed packet filtering, the one used by tcpdump/libpcap for example.
+Values :
+ 0 - disable the JIT (default value)
+ 1 - enable the JIT
+ 2 - enable the JIT and ask the compiler to emit traces on kernel log.
+
+dev_weight
+--------------
+
+The maximum number of packets that kernel can handle on a NAPI interrupt,
+it's a Per-CPU variable.
+Default: 64
+
+default_qdisc
+--------------
+
+The default queuing discipline to use for network devices. This allows
+overriding the default queue discipline of pfifo_fast with an
+alternative. Since the default queuing discipline is created with the
+no additional parameters so is best suited to queuing disciplines that
+work well without configuration like stochastic fair queue (sfq),
+CoDel (codel) or fair queue CoDel (fq_codel). Don't use queuing disciplines
+like Hierarchical Token Bucket or Deficit Round Robin which require setting
+up classes and bandwidths.
+Default: pfifo_fast
+
+busy_read
+----------------
+Low latency busy poll timeout for socket reads. (needs CONFIG_NET_RX_BUSY_POLL)
+Approximate time in us to busy loop waiting for packets on the device queue.
+This sets the default value of the SO_BUSY_POLL socket option.
+Can be set or overridden per socket by setting socket option SO_BUSY_POLL,
+which is the preferred method of enabling. If you need to enable the feature
+globally via sysctl, a value of 50 is recommended.
+Will increase power usage.
+Default: 0 (off)
+
+busy_poll
+----------------
+Low latency busy poll timeout for poll and select. (needs CONFIG_NET_RX_BUSY_POLL)
+Approximate time in us to busy loop waiting for events.
+Recommended value depends on the number of sockets you poll on.
+For several sockets 50, for several hundreds 100.
+For more than that you probably want to use epoll.
+Note that only sockets with SO_BUSY_POLL set will be busy polled,
+so you want to either selectively set SO_BUSY_POLL on those sockets or set
+sysctl.net.busy_read globally.
+Will increase power usage.
+Default: 0 (off)
+
rmem_default
------------
@@ -75,8 +130,7 @@ netdev_budget
Maximum number of packets taken from all interfaces in one polling cycle (NAPI
poll). In one polling cycle interfaces which are registered to polling are
-probed in a round-robin manner. The limit of packets in one such probe can be
-set per-device via sysfs class/net/<device>/weight .
+probed in a round-robin manner.
netdev_max_backlog
------------------
@@ -183,3 +237,18 @@ IPX.
The /proc/net/ipx_route table holds a list of IPX routes. For each route it
gives the destination network, the router node (or Directly) and the network
address of the router (or Connected) for internal networks.
+
+6. TIPC
+-------------------------------------------------------
+
+The TIPC protocol now has a tunable for the receive memory, similar to the
+tcp_rmem - i.e. a vector of 3 INTEGERs: (min, default, max)
+
+ # cat /proc/sys/net/tipc/tipc_rmem
+ 4252725 34021800 68043600
+ #
+
+The max value is set to CONN_OVERLOAD_LIMIT, and the default and min values
+are scaled (shifted) versions of that same value. Note that the min value
+is not at this point in time used in any meaningful way, but the triplet is
+preserved in order to be consistent with things like tcp_rmem.
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 30289fab86e..4415aa91568 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -18,6 +18,7 @@ files can be found in mm/swap.c.
Currently, these files are in /proc/sys/vm:
+- admin_reserve_kbytes
- block_dump
- compact_memory
- dirty_background_bytes
@@ -42,11 +43,11 @@ Currently, these files are in /proc/sys/vm:
- mmap_min_addr
- nr_hugepages
- nr_overcommit_hugepages
-- nr_pdflush_threads
- nr_trim_pages (only if CONFIG_MMU=n)
- numa_zonelist_order
- oom_dump_tasks
- oom_kill_allocating_task
+- overcommit_kbytes
- overcommit_memory
- overcommit_ratio
- page-cluster
@@ -54,11 +55,41 @@ Currently, these files are in /proc/sys/vm:
- percpu_pagelist_fraction
- stat_interval
- swappiness
+- user_reserve_kbytes
- vfs_cache_pressure
- zone_reclaim_mode
==============================================================
+admin_reserve_kbytes
+
+The amount of free memory in the system that should be reserved for users
+with the capability cap_sys_admin.
+
+admin_reserve_kbytes defaults to min(3% of free pages, 8MB)
+
+That should provide enough for the admin to log in and kill a process,
+if necessary, under the default overcommit 'guess' mode.
+
+Systems running under overcommit 'never' should increase this to account
+for the full Virtual Memory Size of programs used to recover. Otherwise,
+root may not be able to log in to recover the system.
+
+How do you calculate a minimum useful reserve?
+
+sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
+
+For overcommit 'guess', we can sum resident set sizes (RSS).
+On x86_64 this is about 8MB.
+
+For overcommit 'never', we can take the max of their virtual sizes (VSZ)
+and add the sum of their RSS.
+On x86_64 this is about 128MB.
+
+Changing this takes effect whenever an application requests memory.
+
+==============================================================
+
block_dump
block_dump enables block I/O debugging when set to a nonzero value. More
@@ -77,8 +108,8 @@ huge pages although processes will also directly compact memory as required.
dirty_background_bytes
-Contains the amount of dirty memory at which the pdflush background writeback
-daemon will start writeback.
+Contains the amount of dirty memory at which the background kernel
+flusher threads will start writeback.
Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only
one of them may be specified at a time. When one sysctl is written it is
@@ -89,8 +120,11 @@ other appears as 0 when read.
dirty_background_ratio
-Contains, as a percentage of total system memory, the number of pages at which
-the pdflush background writeback daemon will start writing out dirty data.
+Contains, as a percentage of total available memory that contains free pages
+and reclaimable pages, the number of pages at which the background kernel
+flusher threads will start writing out dirty data.
+
+The total avaiable memory is not equal to total system memory.
==============================================================
@@ -113,23 +147,25 @@ retained.
dirty_expire_centisecs
This tunable is used to define when dirty data is old enough to be eligible
-for writeout by the pdflush daemons. It is expressed in 100'ths of a second.
-Data which has been dirty in-memory for longer than this interval will be
-written out next time a pdflush daemon wakes up.
+for writeout by the kernel flusher threads. It is expressed in 100'ths
+of a second. Data which has been dirty in-memory for longer than this
+interval will be written out next time a flusher thread wakes up.
==============================================================
dirty_ratio
-Contains, as a percentage of total system memory, the number of pages at which
-a process which is generating disk writes will itself start writing out dirty
-data.
+Contains, as a percentage of total available memory that contains free pages
+and reclaimable pages, the number of pages at which a process which is
+generating disk writes will itself start writing out dirty data.
+
+The total avaiable memory is not equal to total system memory.
==============================================================
dirty_writeback_centisecs
-The pdflush writeback daemons will periodically wake up and write `old' data
+The kernel flusher threads will periodically wake up and write `old' data
out to disk. This tunable expresses the interval between those wakeups, in
100'ths of a second.
@@ -139,18 +175,39 @@ Setting this to zero disables periodic writeback altogether.
drop_caches
-Writing to this will cause the kernel to drop clean caches, dentries and
-inodes from memory, causing that memory to become free.
+Writing to this will cause the kernel to drop clean caches, as well as
+reclaimable slab objects like dentries and inodes. Once dropped, their
+memory becomes free.
To free pagecache:
echo 1 > /proc/sys/vm/drop_caches
-To free dentries and inodes:
+To free reclaimable slab objects (includes dentries and inodes):
echo 2 > /proc/sys/vm/drop_caches
-To free pagecache, dentries and inodes:
+To free slab objects and pagecache:
echo 3 > /proc/sys/vm/drop_caches
-As this is a non-destructive operation and dirty objects are not freeable, the
-user should run `sync' first.
+This is a non-destructive operation and will not free any dirty objects.
+To increase the number of objects freed by this operation, the user may run
+`sync' prior to writing to /proc/sys/vm/drop_caches. This will minimize the
+number of dirty objects on the system and create more candidates to be
+dropped.
+
+This file is not a means to control the growth of the various kernel caches
+(inodes, dentries, pagecache, etc...) These objects are automatically
+reclaimed by the kernel when memory is needed elsewhere on the system.
+
+Use of this file can cause performance problems. Since it discards cached
+objects, it may cost a significant amount of I/O and CPU to recreate the
+dropped objects, especially if they were under heavy use. Because of this,
+use outside of a testing or debugging environment is not recommended.
+
+You may see informational messages in your kernel log when this file is
+used:
+
+ cat (1234): drop_caches: 3
+
+These are informational only. They do not mean that anything is wrong
+with your system. To disable them, echo 4 (bit 3) into drop_caches.
==============================================================
@@ -170,17 +227,25 @@ fragmentation index is <= extfrag_threshold. The default value is 500.
hugepages_treat_as_movable
-This parameter is only useful when kernelcore= is specified at boot time to
-create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages
-are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero
-value written to hugepages_treat_as_movable allows huge pages to be allocated
-from ZONE_MOVABLE.
+This parameter controls whether we can allocate hugepages from ZONE_MOVABLE
+or not. If set to non-zero, hugepages can be allocated from ZONE_MOVABLE.
+ZONE_MOVABLE is created when kernel boot parameter kernelcore= is specified,
+so this parameter has no effect if used without kernelcore=.
-Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge
-pages pool can easily grow or shrink within. Assuming that applications are
-not running that mlock() a lot of memory, it is likely the huge pages pool
-can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value
-into nr_hugepages and triggering page reclaim.
+Hugepage migration is now available in some situations which depend on the
+architecture and/or the hugepage size. If a hugepage supports migration,
+allocation from ZONE_MOVABLE is always enabled for the hugepage regardless
+of the value of this parameter.
+IOW, this parameter affects only non-migratable hugepages.
+
+Assuming that hugepages are not migratable in your system, one usecase of
+this parameter is that users can make hugepage pool more extensible by
+enabling the allocation from ZONE_MOVABLE. This is because on ZONE_MOVABLE
+page reclaim/migration/compaction work more and you can get contiguous
+memory more likely. Note that using ZONE_MOVABLE for non-migratable
+hugepages can do harm to other features like memory hotremove (because
+memory hotremove expects that memory blocks on ZONE_MOVABLE are always
+removable,) so it's a trade-off responsible for the users.
==============================================================
@@ -426,16 +491,6 @@ See Documentation/vm/hugetlbpage.txt
==============================================================
-nr_pdflush_threads
-
-The current number of pdflush threads. This value is read-only.
-The value changes according to the number of dirty pages in the system.
-
-When necessary, additional pdflush threads are created, one per second, up to
-nr_pdflush_threads_max.
-
-==============================================================
-
nr_trim_pages
This is available only on NOMMU kernels.
@@ -481,16 +536,16 @@ the DMA zone.
Type(A) is called as "Node" order. Type (B) is "Zone" order.
"Node order" orders the zonelists by node, then by zone within each node.
-Specify "[Nn]ode" for zone order
+Specify "[Nn]ode" for node order
"Zone Order" orders the zonelists by zone type, then by node within each
-zone. Specify "[Zz]one"for zode order.
+zone. Specify "[Zz]one" for zone order.
Specify "[Dd]efault" to request automatic configuration. Autoconfiguration
will select "node" order in following case.
(1) if the DMA zone does not exist or
(2) if the DMA zone comprises greater than 50% of the available memory or
-(3) if any node's DMA zone comprises greater than 60% of its local memory and
+(3) if any node's DMA zone comprises greater than 70% of its local memory and
the amount of local memory is big enough.
Otherwise, "zone" order will be selected. Default order is recommended unless
@@ -502,9 +557,10 @@ oom_dump_tasks
Enables a system-wide task dump (excluding kernel threads) to be
produced when the kernel performs an OOM-killing and includes such
-information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and
-name. This is helpful to determine why the OOM killer was invoked
-and to identify the rogue task that caused it.
+information as pid, uid, tgid, vm size, rss, nr_ptes, swapents,
+oom_score_adj score, and name. This is helpful to determine why the
+OOM killer was invoked, to identify the rogue task that caused it,
+and to determine why the OOM killer chose the task it did to kill.
If this is set to zero, this information is suppressed. On very
large systems with thousands of tasks it may not be feasible to dump
@@ -540,6 +596,17 @@ The default value is 0.
==============================================================
+overcommit_kbytes:
+
+When overcommit_memory is set to 2, the committed address space is not
+permitted to exceed swap plus this amount of physical RAM. See below.
+
+Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
+of them may be specified at a time. Setting one disables the other (which
+then appears as 0 when read).
+
+==============================================================
+
overcommit_memory:
This value contains a flag that enables memory overcommitment.
@@ -552,6 +619,7 @@ memory until it actually runs out.
When this flag is 2, the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
+Note that user_reserve_kbytes affects this policy.
This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
@@ -574,16 +642,24 @@ of physical RAM. See above.
page-cluster
-page-cluster controls the number of pages which are written to swap in
-a single attempt. The swap I/O size.
+page-cluster controls the number of pages up to which consecutive pages
+are read in from swap in a single attempt. This is the swap counterpart
+to page cache readahead.
+The mentioned consecutivity is not in terms of virtual/physical addresses,
+but consecutive on swap space - that means they were swapped out together.
It is a logarithmic value - setting it to zero means "1 page", setting
it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
+Zero disables swap readahead completely.
The default value is three (eight pages at a time). There may be some
small benefits in tuning this to a different value if your workload is
swap-intensive.
+Lower values mean lower latencies for initial faults, but at the same time
+extra faults and I/O delays for following faults if they would have been part of
+that consecutive pages readahead would have brought in.
+
=============================================================
panic_on_oom
@@ -626,7 +702,8 @@ The batch value of each per cpu pagelist is also updated as a result. It is
set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)
The initial value is zero. Kernel does not use this value at boot time to set
-the high water marks for each per cpu page list.
+the high water marks for each per cpu page list. If the user writes '0' to this
+sysctl, it will revert to this default behavior.
==============================================================
@@ -641,17 +718,37 @@ swappiness
This control is used to define how aggressive the kernel will swap
memory pages. Higher values will increase agressiveness, lower values
-decrease the amount of swap.
+decrease the amount of swap. A value of 0 instructs the kernel not to
+initiate swap until the amount of free and file-backed pages is less
+than the high water mark in a zone.
The default value is 60.
==============================================================
+- user_reserve_kbytes
+
+When overcommit_memory is set to 2, "never overommit" mode, reserve
+min(3% of current process size, user_reserve_kbytes) of free memory.
+This is intended to prevent a user from starting a single memory hogging
+process, such that they cannot recover (kill the hog).
+
+user_reserve_kbytes defaults to min(3% of the current process size, 128MB).
+
+If this is reduced to zero, then the user will be allowed to allocate
+all free memory with a single process, minus admin_reserve_kbytes.
+Any subsequent attempts to execute a command will result in
+"fork: Cannot allocate memory".
+
+Changing this takes effect whenever an application requests memory.
+
+==============================================================
+
vfs_cache_pressure
------------------
-Controls the tendency of the kernel to reclaim the memory which is used for
-caching of directory and inode objects.
+This percentage value controls the tendency of the kernel to reclaim
+the memory which is used for caching of directory and inode objects.
At the default value of vfs_cache_pressure=100 the kernel will attempt to
reclaim dentries and inodes at a "fair" rate with respect to pagecache and
@@ -661,6 +758,11 @@ never reclaim dentries and inodes due to memory pressure and this can easily
lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes.
+Increasing vfs_cache_pressure significantly beyond 100 may have negative
+performance impact. Reclaim code needs to take various locks to find freeable
+directory and inode objects. With vfs_cache_pressure=1000, it will look for
+ten times more freeable objects than there are.
+
==============================================================
zone_reclaim_mode:
@@ -676,16 +778,17 @@ This is value ORed together of
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim swaps pages
-zone_reclaim_mode is set during bootup to 1 if it is determined that pages
-from remote zones will cause a measurable performance reduction. The
-page allocator will then reclaim easily reusable pages (those page
-cache pages that are currently not used) before allocating off node pages.
-
-It may be beneficial to switch off zone reclaim if the system is
-used for a file server and all of memory should be used for caching files
-from disk. In that case the caching effect is more important than
+zone_reclaim_mode is disabled by default. For file servers or workloads
+that benefit from having their data cached, zone_reclaim_mode should be
+left disabled as the caching effect is likely to be more important than
data locality.
+zone_reclaim may be enabled if it's known that the workload is partitioned
+such that each partition fits within a NUMA node and that accessing remote
+memory would cause a measurable performance reduction. The page allocator
+will then reclaim easily reusable pages (those page cache pages that are
+currently not used) before allocating off node pages.
+
Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively