diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/ABI/testing/sysfs-firmware-efi | 20 | ||||
-rw-r--r-- | Documentation/ABI/testing/sysfs-firmware-efi-runtime-map | 34 | ||||
-rw-r--r-- | Documentation/ABI/testing/sysfs-kernel-boot_params | 38 | ||||
-rw-r--r-- | Documentation/RCU/trace.txt | 22 | ||||
-rw-r--r-- | Documentation/acpi/apei/einj.txt | 19 | ||||
-rw-r--r-- | Documentation/circular-buffers.txt | 45 | ||||
-rw-r--r-- | Documentation/devicetree/bindings/clock/exynos5250-clock.txt | 2 | ||||
-rw-r--r-- | Documentation/devicetree/bindings/timer/allwinner,sun5i-a13-hstimer.txt | 22 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 36 | ||||
-rw-r--r-- | Documentation/kmsg/s390/zcrypt | 20 | ||||
-rw-r--r-- | Documentation/memory-barriers.txt | 922 | ||||
-rw-r--r-- | Documentation/robust-futex-ABI.txt | 4 | ||||
-rw-r--r-- | Documentation/sysctl/kernel.txt | 5 | ||||
-rw-r--r-- | Documentation/x86/boot.txt | 3 | ||||
-rw-r--r-- | Documentation/x86/x86_64/mm.txt | 7 | ||||
-rw-r--r-- | Documentation/zorro.txt | 5 |
16 files changed, 956 insertions, 248 deletions
diff --git a/Documentation/ABI/testing/sysfs-firmware-efi b/Documentation/ABI/testing/sysfs-firmware-efi new file mode 100644 index 00000000000..05874da7ce8 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-firmware-efi @@ -0,0 +1,20 @@ +What: /sys/firmware/efi/fw_vendor +Date: December 2013 +Contact: Dave Young <dyoung@redhat.com> +Description: It shows the physical address of firmware vendor field in the + EFI system table. +Users: Kexec + +What: /sys/firmware/efi/runtime +Date: December 2013 +Contact: Dave Young <dyoung@redhat.com> +Description: It shows the physical address of runtime service table entry in + the EFI system table. +Users: Kexec + +What: /sys/firmware/efi/config_table +Date: December 2013 +Contact: Dave Young <dyoung@redhat.com> +Description: It shows the physical address of config table entry in the EFI + system table. +Users: Kexec diff --git a/Documentation/ABI/testing/sysfs-firmware-efi-runtime-map b/Documentation/ABI/testing/sysfs-firmware-efi-runtime-map new file mode 100644 index 00000000000..c61b9b348e9 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-firmware-efi-runtime-map @@ -0,0 +1,34 @@ +What: /sys/firmware/efi/runtime-map/ +Date: December 2013 +Contact: Dave Young <dyoung@redhat.com> +Description: Switching efi runtime services to virtual mode requires + that all efi memory ranges which have the runtime attribute + bit set to be mapped to virtual addresses. + + The efi runtime services can only be switched to virtual + mode once without rebooting. The kexec kernel must maintain + the same physical to virtual address mappings as the first + kernel. The mappings are exported to sysfs so userspace tools + can reassemble them and pass them into the kexec kernel. + + /sys/firmware/efi/runtime-map/ is the directory the kernel + exports that information in. + + subdirectories are named with the number of the memory range: + + /sys/firmware/efi/runtime-map/0 + /sys/firmware/efi/runtime-map/1 + /sys/firmware/efi/runtime-map/2 + /sys/firmware/efi/runtime-map/3 + ... + + Each subdirectory contains five files: + + attribute : The attributes of the memory range. + num_pages : The size of the memory range in pages. + phys_addr : The physical address of the memory range. + type : The type of the memory range. + virt_addr : The virtual address of the memory range. + + Above values are all hexadecimal numbers with the '0x' prefix. +Users: Kexec diff --git a/Documentation/ABI/testing/sysfs-kernel-boot_params b/Documentation/ABI/testing/sysfs-kernel-boot_params new file mode 100644 index 00000000000..eca38ce2852 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-boot_params @@ -0,0 +1,38 @@ +What: /sys/kernel/boot_params +Date: December 2013 +Contact: Dave Young <dyoung@redhat.com> +Description: The /sys/kernel/boot_params directory contains two + files: "data" and "version" and one subdirectory "setup_data". + It is used to export the kernel boot parameters of an x86 + platform to userspace for kexec and debugging purpose. + + If there's no setup_data in boot_params the subdirectory will + not be created. + + "data" file is the binary representation of struct boot_params. + + "version" file is the string representation of boot + protocol version. + + "setup_data" subdirectory contains the setup_data data + structure in boot_params. setup_data is maintained in kernel + as a link list. In "setup_data" subdirectory there's one + subdirectory for each link list node named with the number + of the list nodes. The list node subdirectory contains two + files "type" and "data". "type" file is the string + representation of setup_data type. "data" file is the binary + representation of setup_data payload. + + The whole boot_params directory structure is like below: + /sys/kernel/boot_params + |__ data + |__ setup_data + | |__ 0 + | | |__ data + | | |__ type + | |__ 1 + | |__ data + | |__ type + |__ version + +Users: Kexec diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt index f3778f8952d..910870b15ac 100644 --- a/Documentation/RCU/trace.txt +++ b/Documentation/RCU/trace.txt @@ -396,14 +396,14 @@ o Each element of the form "3/3 ..>. 0:7 ^0" represents one rcu_node The output of "cat rcu/rcu_sched/rcu_pending" looks as follows: - 0!np=26111 qsp=29 rpq=5386 cbr=1 cng=570 gpc=3674 gps=577 nn=15903 - 1!np=28913 qsp=35 rpq=6097 cbr=1 cng=448 gpc=3700 gps=554 nn=18113 - 2!np=32740 qsp=37 rpq=6202 cbr=0 cng=476 gpc=4627 gps=546 nn=20889 - 3 np=23679 qsp=22 rpq=5044 cbr=1 cng=415 gpc=3403 gps=347 nn=14469 - 4!np=30714 qsp=4 rpq=5574 cbr=0 cng=528 gpc=3931 gps=639 nn=20042 - 5 np=28910 qsp=2 rpq=5246 cbr=0 cng=428 gpc=4105 gps=709 nn=18422 - 6!np=38648 qsp=5 rpq=7076 cbr=0 cng=840 gpc=4072 gps=961 nn=25699 - 7 np=37275 qsp=2 rpq=6873 cbr=0 cng=868 gpc=3416 gps=971 nn=25147 + 0!np=26111 qsp=29 rpq=5386 cbr=1 cng=570 gpc=3674 gps=577 nn=15903 ndw=0 + 1!np=28913 qsp=35 rpq=6097 cbr=1 cng=448 gpc=3700 gps=554 nn=18113 ndw=0 + 2!np=32740 qsp=37 rpq=6202 cbr=0 cng=476 gpc=4627 gps=546 nn=20889 ndw=0 + 3 np=23679 qsp=22 rpq=5044 cbr=1 cng=415 gpc=3403 gps=347 nn=14469 ndw=0 + 4!np=30714 qsp=4 rpq=5574 cbr=0 cng=528 gpc=3931 gps=639 nn=20042 ndw=0 + 5 np=28910 qsp=2 rpq=5246 cbr=0 cng=428 gpc=4105 gps=709 nn=18422 ndw=0 + 6!np=38648 qsp=5 rpq=7076 cbr=0 cng=840 gpc=4072 gps=961 nn=25699 ndw=0 + 7 np=37275 qsp=2 rpq=6873 cbr=0 cng=868 gpc=3416 gps=971 nn=25147 ndw=0 The fields are as follows: @@ -432,6 +432,10 @@ o "gpc" is the number of times that an old grace period had o "gps" is the number of times that a new grace period had started, but this CPU was not yet aware of it. +o "ndw" is the number of times that a wakeup of an rcuo + callback-offload kthread had to be deferred in order to avoid + deadlock. + o "nn" is the number of times that this CPU needed nothing. @@ -443,7 +447,7 @@ The output of "cat rcu/rcuboost" looks as follows: balk: nt=0 egt=6541 bt=0 nb=0 ny=126 nos=0 This information is output only for rcu_preempt. Each two-line entry -corresponds to a leaf rcu_node strcuture. The fields are as follows: +corresponds to a leaf rcu_node structure. The fields are as follows: o "n:m" is the CPU-number range for the corresponding two-line entry. In the sample output above, the first entry covers diff --git a/Documentation/acpi/apei/einj.txt b/Documentation/acpi/apei/einj.txt index a58b63da1a3..f51861bcb07 100644 --- a/Documentation/acpi/apei/einj.txt +++ b/Documentation/acpi/apei/einj.txt @@ -45,11 +45,22 @@ directory apei/einj. The following files are provided. injection. Before this, please specify all necessary error parameters. +- flags + Present for kernel version 3.13 and above. Used to specify which + of param{1..4} are valid and should be used by BIOS during injection. + Value is a bitmask as specified in ACPI5.0 spec for the + SET_ERROR_TYPE_WITH_ADDRESS data structure: + Bit 0 - Processor APIC field valid (see param3 below) + Bit 1 - Memory address and mask valid (param1 and param2) + Bit 2 - PCIe (seg,bus,dev,fn) valid (param4 below) + If set to zero, legacy behaviour is used where the type of injection + specifies just one bit set, and param1 is multiplexed. + - param1 This file is used to set the first error parameter value. Effect of parameter depends on error_type specified. For example, if error type is memory related type, the param1 should be a valid physical - memory address. + memory address. [Unless "flag" is set - see above] - param2 This file is used to set the second error parameter value. Effect of @@ -58,6 +69,12 @@ directory apei/einj. The following files are provided. address mask. Linux requires page or narrower granularity, say, 0xfffffffffffff000. +- param3 + Used when the 0x1 bit is set in "flag" to specify the APIC id + +- param4 + Used when the 0x4 bit is set in "flag" to specify target PCIe device + - notrigger The EINJ mechanism is a two step process. First inject the error, then perform some actions to trigger it. Setting "notrigger" to 1 skips the diff --git a/Documentation/circular-buffers.txt b/Documentation/circular-buffers.txt index 8117e5bf606..88951b17926 100644 --- a/Documentation/circular-buffers.txt +++ b/Documentation/circular-buffers.txt @@ -160,6 +160,7 @@ The producer will look something like this: spin_lock(&producer_lock); unsigned long head = buffer->head; + /* The spin_unlock() and next spin_lock() provide needed ordering. */ unsigned long tail = ACCESS_ONCE(buffer->tail); if (CIRC_SPACE(head, tail, buffer->size) >= 1) { @@ -168,9 +169,8 @@ The producer will look something like this: produce_item(item); - smp_wmb(); /* commit the item before incrementing the head */ - - buffer->head = (head + 1) & (buffer->size - 1); + smp_store_release(buffer->head, + (head + 1) & (buffer->size - 1)); /* wake_up() will make sure that the head is committed before * waking anyone up */ @@ -183,9 +183,14 @@ This will instruct the CPU that the contents of the new item must be written before the head index makes it available to the consumer and then instructs the CPU that the revised head index must be written before the consumer is woken. -Note that wake_up() doesn't have to be the exact mechanism used, but whatever -is used must guarantee a (write) memory barrier between the update of the head -index and the change of state of the consumer, if a change of state occurs. +Note that wake_up() does not guarantee any sort of barrier unless something +is actually awakened. We therefore cannot rely on it for ordering. However, +there is always one element of the array left empty. Therefore, the +producer must produce two elements before it could possibly corrupt the +element currently being read by the consumer. Therefore, the unlock-lock +pair between consecutive invocations of the consumer provides the necessary +ordering between the read of the index indicating that the consumer has +vacated a given element and the write by the producer to that same element. THE CONSUMER @@ -195,21 +200,20 @@ The consumer will look something like this: spin_lock(&consumer_lock); - unsigned long head = ACCESS_ONCE(buffer->head); + /* Read index before reading contents at that index. */ + unsigned long head = smp_load_acquire(buffer->head); unsigned long tail = buffer->tail; if (CIRC_CNT(head, tail, buffer->size) >= 1) { - /* read index before reading contents at that index */ - smp_read_barrier_depends(); /* extract one item from the buffer */ struct item *item = buffer[tail]; consume_item(item); - smp_mb(); /* finish reading descriptor before incrementing tail */ - - buffer->tail = (tail + 1) & (buffer->size - 1); + /* Finish reading descriptor before incrementing tail. */ + smp_store_release(buffer->tail, + (tail + 1) & (buffer->size - 1)); } spin_unlock(&consumer_lock); @@ -218,12 +222,17 @@ This will instruct the CPU to make sure the index is up to date before reading the new item, and then it shall make sure the CPU has finished reading the item before it writes the new tail pointer, which will erase the item. - -Note the use of ACCESS_ONCE() in both algorithms to read the opposition index. -This prevents the compiler from discarding and reloading its cached value - -which some compilers will do across smp_read_barrier_depends(). This isn't -strictly needed if you can be sure that the opposition index will _only_ be -used the once. +Note the use of ACCESS_ONCE() and smp_load_acquire() to read the +opposition index. This prevents the compiler from discarding and +reloading its cached value - which some compilers will do across +smp_read_barrier_depends(). This isn't strictly needed if you can +be sure that the opposition index will _only_ be used the once. +The smp_load_acquire() additionally forces the CPU to order against +subsequent memory references. Similarly, smp_store_release() is used +in both algorithms to write the thread's index. This documents the +fact that we are writing to something that can be read concurrently, +prevents the compiler from tearing the store, and enforces ordering +against previous accesses. =============== diff --git a/Documentation/devicetree/bindings/clock/exynos5250-clock.txt b/Documentation/devicetree/bindings/clock/exynos5250-clock.txt index 46f5c791ea0..0f2f920e873 100644 --- a/Documentation/devicetree/bindings/clock/exynos5250-clock.txt +++ b/Documentation/devicetree/bindings/clock/exynos5250-clock.txt @@ -159,6 +159,8 @@ clock which they consume. mixer 343 hdmi 344 g2d 345 + mdma0 346 + smmu_mdma0 347 [Clock Muxes] diff --git a/Documentation/devicetree/bindings/timer/allwinner,sun5i-a13-hstimer.txt b/Documentation/devicetree/bindings/timer/allwinner,sun5i-a13-hstimer.txt new file mode 100644 index 00000000000..7c26154b8bb --- /dev/null +++ b/Documentation/devicetree/bindings/timer/allwinner,sun5i-a13-hstimer.txt @@ -0,0 +1,22 @@ +Allwinner SoCs High Speed Timer Controller + +Required properties: + +- compatible : should be "allwinner,sun5i-a13-hstimer" or + "allwinner,sun7i-a20-hstimer" +- reg : Specifies base physical address and size of the registers. +- interrupts : The interrupts of these timers (2 for the sun5i IP, 4 for the sun7i + one) +- clocks: phandle to the source clock (usually the AHB clock) + +Example: + +timer@01c60000 { + compatible = "allwinner,sun7i-a20-hstimer"; + reg = <0x01c60000 0x1000>; + interrupts = <0 51 1>, + <0 52 1>, + <0 53 1>, + <0 54 1>; + clocks = <&ahb1_gates 19>; +}; diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index b9e9bd85429..4252af6ffda 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -774,6 +774,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted. disable= [IPV6] See Documentation/networking/ipv6.txt. + disable_cpu_apicid= [X86,APIC,SMP] + Format: <int> + The number of initial APIC ID for the + corresponding CPU to be disabled at boot, + mostly used for the kdump 2nd kernel to + disable BSP to wake up multiple CPUs without + causing system reset or hang due to sending + INIT from AP to BSP. + disable_ddw [PPC/PSERIES] Disable Dynamic DMA Window support. Use this if to workaround buggy firmware. @@ -881,6 +890,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted. The xen output can only be used by Xen PV guests. + edac_report= [HW,EDAC] Control how to report EDAC event + Format: {"on" | "off" | "force"} + on: enable EDAC to report H/W event. May be overridden + by other higher priority error reporting module. + off: disable H/W event reporting through EDAC. + force: enforce the use of EDAC to report H/W event. + default: on. + ekgdboc= [X86,KGDB] Allow early kernel console debugging ekgdboc=kbd @@ -890,6 +907,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. edd= [EDD] Format: {"off" | "on" | "skip[mbr]"} + efi= [EFI] + Format: { "old_map" } + old_map [X86-64]: switch to the old ioremap-based EFI + runtime services mapping. 32-bit still uses this one by + default. + efi_no_storage_paranoia [EFI; X86] Using this parameter you can use more than 50% of your efi variable storage. Use this parameter only if @@ -1994,6 +2017,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted. noapic [SMP,APIC] Tells the kernel to not make use of any IOAPICs that may be present in the system. + nokaslr [X86] + Disable kernel base offset ASLR (Address Space + Layout Randomization) if built into the kernel. + noautogroup Disable scheduler automatic task group creation. nobats [PPC] Do not use BATs for mapping kernel lowmem @@ -2627,7 +2654,6 @@ bytes respectively. Such letter suffixes can also be entirely omitted. for RCU-preempt, and "s" for RCU-sched, and "N" is the CPU number. This reduces OS jitter on the offloaded CPUs, which can be useful for HPC and - real-time workloads. It can also improve energy efficiency for asymmetric multiprocessors. @@ -2643,8 +2669,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. periodically wake up to do the polling. rcutree.blimit= [KNL] - Set maximum number of finished RCU callbacks to process - in one batch. + Set maximum number of finished RCU callbacks to + process in one batch. rcutree.rcu_fanout_leaf= [KNL] Increase the number of CPUs assigned to each @@ -2663,8 +2689,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. value is one, and maximum value is HZ. rcutree.qhimark= [KNL] - Set threshold of queued - RCU callbacks over which batch limiting is disabled. + Set threshold of queued RCU callbacks beyond which + batch limiting is disabled. rcutree.qlowmark= [KNL] Set threshold of queued RCU callbacks below which diff --git a/Documentation/kmsg/s390/zcrypt b/Documentation/kmsg/s390/zcrypt new file mode 100644 index 00000000000..7fb2087409d --- /dev/null +++ b/Documentation/kmsg/s390/zcrypt @@ -0,0 +1,20 @@ +/*? + * Text: "Cryptographic device %x failed and was set offline\n" + * Severity: Error + * Parameter: + * @1: device index + * Description: + * A cryptographic device failed to process a cryptographic request. + * The cryptographic device driver could not correct the error and + * set the device offline. The application that issued the + * request received an indication that the request has failed. + * User action: + * Use the lszcrypt command to confirm that the cryptographic + * hardware is still configured to your LPAR or z/VM guest virtual + * machine. If the device is available to your Linux instance the + * command output contains a line that begins with 'card<device index>', + * where <device index> is the two-digit decimal number in the message text. + * After ensuring that the device is available, use the chzcrypt command to + * set it online again. + * If the error persists, contact your support organization. + */ diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index c8c42e64e95..102dc19c411 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -194,18 +194,22 @@ There are some minimal guarantees that may be expected of a CPU: (*) On any given CPU, dependent memory accesses will be issued in order, with respect to itself. This means that for: - Q = P; D = *Q; + ACCESS_ONCE(Q) = P; smp_read_barrier_depends(); D = ACCESS_ONCE(*Q); the CPU will issue the following memory operations: Q = LOAD P, D = LOAD *Q - and always in that order. + and always in that order. On most systems, smp_read_barrier_depends() + does nothing, but it is required for DEC Alpha. The ACCESS_ONCE() + is required to prevent compiler mischief. Please note that you + should normally use something like rcu_dereference() instead of + open-coding smp_read_barrier_depends(). (*) Overlapping loads and stores within a particular CPU will appear to be ordered within that CPU. This means that for: - a = *X; *X = b; + a = ACCESS_ONCE(*X); ACCESS_ONCE(*X) = b; the CPU will only issue the following sequence of memory operations: @@ -213,7 +217,7 @@ There are some minimal guarantees that may be expected of a CPU: And for: - *X = c; d = *X; + ACCESS_ONCE(*X) = c; d = ACCESS_ONCE(*X); the CPU will only issue: @@ -224,6 +228,12 @@ There are some minimal guarantees that may be expected of a CPU: And there are a number of things that _must_ or _must_not_ be assumed: + (*) It _must_not_ be assumed that the compiler will do what you want with + memory references that are not protected by ACCESS_ONCE(). Without + ACCESS_ONCE(), the compiler is within its rights to do all sorts + of "creative" transformations, which are covered in the Compiler + Barrier section. + (*) It _must_not_ be assumed that independent loads and stores will be issued in the order given. This means that for: @@ -371,33 +381,44 @@ Memory barriers come in four basic varieties: And a couple of implicit varieties: - (5) LOCK operations. + (5) ACQUIRE operations. This acts as a one-way permeable barrier. It guarantees that all memory - operations after the LOCK operation will appear to happen after the LOCK - operation with respect to the other components of the system. + operations after the ACQUIRE operation will appear to happen after the + ACQUIRE operation with respect to the other components of the system. + ACQUIRE operations include LOCK operations and smp_load_acquire() + operations. - Memory operations that occur before a LOCK operation may appear to happen - after it completes. + Memory operations that occur before an ACQUIRE operation may appear to + happen after it completes. - A LOCK operation should almost always be paired with an UNLOCK operation. + An ACQUIRE operation should almost always be paired with a RELEASE + operation. - (6) UNLOCK operations. + (6) RELEASE operations. This also acts as a one-way permeable barrier. It guarantees that all - memory operations before the UNLOCK operation will appear to happen before - the UNLOCK operation with respect to the other components of the system. + memory operations before the RELEASE operation will appear to happen + before the RELEASE operation with respect to the other components of the + system. RELEASE operations include UNLOCK operations and + smp_store_release() operations. - Memory operations that occur after an UNLOCK operation may appear to + Memory operations that occur after a RELEASE operation may appear to happen before it completes. - LOCK and UNLOCK operations are guaranteed to appear with respect to each - other strictly in the order specified. + The use of ACQUIRE and RELEASE operations generally precludes the need + for other sorts of memory barrier (but note the exceptions mentioned in + the subsection "MMIO write barrier"). In addition, a RELEASE+ACQUIRE + pair is -not- guaranteed to act as a full memory barrier. However, after + an ACQUIRE on a given variable, all memory accesses preceding any prior + RELEASE on that same variable are guaranteed to be visible. In other + words, within a given variable's critical section, all accesses of all + previous critical sections for that variable are guaranteed to have + completed. - The use of LOCK and UNLOCK operations generally precludes the need for - other sorts of memory barrier (but note the exceptions mentioned in the - subsection "MMIO write barrier"). + This means that ACQUIRE acts as a minimal "acquire" operation and + RELEASE acts as a minimal "release" operation. Memory barriers are only required where there's a possibility of interaction @@ -450,14 +471,14 @@ The usage requirements of data dependency barriers are a little subtle, and it's not always obvious that they're needed. To illustrate, consider the following sequence of events: - CPU 1 CPU 2 - =============== =============== + CPU 1 CPU 2 + =============== =============== { A == 1, B == 2, C = 3, P == &A, Q == &C } B = 4; <write barrier> - P = &B - Q = P; - D = *Q; + ACCESS_ONCE(P) = &B + Q = ACCESS_ONCE(P); + D = *Q; There's a clear data dependency here, and it would seem that by the end of the sequence, Q must be either &A or &B, and that: @@ -477,15 +498,15 @@ Alpha). To deal with this, a data dependency barrier or better must be inserted between the address load and the data load: - CPU 1 CPU 2 - =============== =============== + CPU 1 CPU 2 + =============== =============== { A == 1, B == 2, C = 3, P == &A, Q == &C } B = 4; <write barrier> - P = &B - Q = P; - <data dependency barrier> - D = *Q; + ACCESS_ONCE(P) = &B + Q = ACCESS_ONCE(P); + <data dependency barrier> + D = *Q; This enforces the occurrence of one of the two implications, and prevents the third possibility from arising. @@ -500,25 +521,26 @@ odd-numbered bank is idle, one can see the new value of the pointer P (&B), but the old value of the variable B (2). -Another example of where data dependency barriers might by required is where a +Another example of where data dependency barriers might be required is where a number is read from memory and then used to calculate the index for an array access: - CPU 1 CPU 2 - =============== =============== + CPU 1 CPU 2 + =============== =============== { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 } M[1] = 4; <write barrier> - P = 1 - Q = P; - <data dependency barrier> - D = M[Q]; + ACCESS_ONCE(P) = 1 + Q = ACCESS_ONCE(P); + <data dependency barrier> + D = M[Q]; -The data dependency barrier is very important to the RCU system, for example. -See rcu_dereference() in include/linux/rcupdate.h. This permits the current -target of an RCU'd pointer to be replaced with a new modified target, without -the replacement target appearing to be incompletely initialised. +The data dependency barrier is very important to the RCU system, +for example. See rcu_assign_pointer() and rcu_dereference() in +include/linux/rcupdate.h. This permits the current target of an RCU'd +pointer to be replaced with a new modified target, without the replacement +target appearing to be incompletely initialised. See also the subsection on "Cache Coherency" for a more thorough example. @@ -530,24 +552,190 @@ A control dependency requires a full read memory barrier, not simply a data dependency barrier to make it work correctly. Consider the following bit of code: - q = &a; - if (p) { - <data dependency barrier> - q = &b; + q = ACCESS_ONCE(a); + if (q) { + <data dependency barrier> /* BUG: No data dependency!!! */ + p = ACCESS_ONCE(b); } - x = *q; This will not have the desired effect because there is no actual data -dependency, but rather a control dependency that the CPU may short-circuit by -attempting to predict the outcome in advance. In such a case what's actually -required is: +dependency, but rather a control dependency that the CPU may short-circuit +by attempting to predict the outcome in advance, so that other CPUs see +the load from b as having happened before the load from a. In such a +case what's actually required is: - q = &a; - if (p) { + q = ACCESS_ONCE(a); + if (q) { <read barrier> - q = &b; + p = ACCESS_ONCE(b); + } + +However, stores are not speculated. This means that ordering -is- provided +in the following example: + + q = ACCESS_ONCE(a); + if (ACCESS_ONCE(q)) { + ACCESS_ONCE(b) = p; + } + +Please note that ACCESS_ONCE() is not optional! Without the ACCESS_ONCE(), +the compiler is within its rights to transform this example: + + q = a; + if (q) { + b = p; /* BUG: Compiler can reorder!!! */ + do_something(); + } else { + b = p; /* BUG: Compiler can reorder!!! */ + do_something_else(); + } + +into this, which of course defeats the ordering: + + b = p; + q = a; + if (q) + do_something(); + else + do_something_else(); + +Worse yet, if the compiler is able to prove (say) that the value of +variable 'a' is always non-zero, it would be well within its rights +to optimize the original example by eliminating the "if" statement +as follows: + + q = a; + b = p; /* BUG: Compiler can reorder!!! */ + do_something(); + +The solution is again ACCESS_ONCE(), which preserves the ordering between +the load from variable 'a' and the store to variable 'b': + + q = ACCESS_ONCE(a); + if (q) { + ACCESS_ONCE(b) = p; + do_something(); + } else { + ACCESS_ONCE(b) = p; + do_something_else(); + } + +You could also use barrier() to prevent the compiler from moving +the stores to variable 'b', but barrier() would not prevent the +compiler from proving to itself that a==1 always, so ACCESS_ONCE() +is also needed. + +It is important to note that control dependencies absolutely require a +a conditional. For example, the following "optimized" version of +the above example breaks ordering: + + q = ACCESS_ONCE(a); + ACCESS_ONCE(b) = p; /* BUG: No ordering vs. load from a!!! */ + if (q) { + /* ACCESS_ONCE(b) = p; -- moved up, BUG!!! */ + do_something(); + } else { + /* ACCESS_ONCE(b) = p; -- moved up, BUG!!! */ + do_something_else(); } - x = *q; + +It is of course legal for the prior load to be part of the conditional, +for example, as follows: + + if (ACCESS_ONCE(a) > 0) { + ACCESS_ONCE(b) = q / 2; + do_something(); + } else { + ACCESS_ONCE(b) = q / 3; + do_something_else(); + } + +This will again ensure that the load from variable 'a' is ordered before the +stores to variable 'b'. + +In addition, you need to be careful what you do with the local variable 'q', +otherwise the compiler might be able to guess the value and again remove +the needed conditional. For example: + + q = ACCESS_ONCE(a); + if (q % MAX) { + ACCESS_ONCE(b) = p; + do_something(); + } else { + ACCESS_ONCE(b) = p; + do_something_else(); + } + +If MAX is defined to be 1, then the compiler knows that (q % MAX) is +equal to zero, in which case the compiler is within its rights to +transform the above code into the following: + + q = ACCESS_ONCE(a); + ACCESS_ONCE(b) = p; + do_something_else(); + +This transformation loses the ordering between the load from variable 'a' +and the store to variable 'b'. If you are relying on this ordering, you +should do something like the following: + + q = ACCESS_ONCE(a); + BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */ + if (q % MAX) { + ACCESS_ONCE(b) = p; + do_something(); + } else { + ACCESS_ONCE(b) = p; + do_something_else(); + } + +Finally, control dependencies do -not- provide transitivity. This is +demonstrated by two related examples: + + CPU 0 CPU 1 + ===================== ===================== + r1 = ACCESS_ONCE(x); r2 = ACCESS_ONCE(y); + if (r1 >= 0) if (r2 >= 0) + ACCESS_ONCE(y) = 1; ACCESS_ONCE(x) = 1; + + assert(!(r1 == 1 && r2 == 1)); + +The above two-CPU example will never trigger the assert(). However, +if control dependencies guaranteed transitivity (which they do not), +then adding the following two CPUs would guarantee a related assertion: + + CPU 2 CPU 3 + ===================== ===================== + ACCESS_ONCE(x) = 2; ACCESS_ONCE(y) = 2; + + assert(!(r1 == 2 && r2 == 2 && x == 1 && y == 1)); /* FAILS!!! */ + +But because control dependencies do -not- provide transitivity, the +above assertion can fail after the combined four-CPU example completes. +If you need the four-CPU example to provide ordering, you will need +smp_mb() between the loads and stores in the CPU 0 and CPU 1 code fragments. + +In summary: + + (*) Control dependencies can order prior loads against later stores. + However, they do -not- guarantee any other sort of ordering: + Not prior loads against later loads, nor prior stores against + later anything. If you need these other forms of ordering, + use smb_rmb(), smp_wmb(), or, in the case of prior stores and + later loads, smp_mb(). + + (*) Control dependencies require at least one run-time conditional + between the prior load and the subsequent store. If the compiler + is able to optimize the conditional away, it will have also + optimized away the ordering. Careful use of ACCESS_ONCE() can + help to preserve the needed conditional. + + (*) Control dependencies require that the compiler avoid reordering the + dependency into nonexistence. Careful use of ACCESS_ONCE() or + barrier() can help to preserve your control dependency. Please + see the Compiler Barrier section for more information. + + (*) Control dependencies do -not- provide transitivity. If you + need transitivity, use smp_mb(). SMP BARRIER PAIRING @@ -561,23 +749,23 @@ barrier, though a general barrier would also be viable. Similarly a read barrier or a data dependency barrier should always be paired with at least an write barrier, though, again, a general barrier is viable: - CPU 1 CPU 2 - =============== =============== - a = 1; + CPU 1 CPU 2 + =============== |