1 files changed, 1913 insertions, 0 deletions
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644
index 00000000000..f8550310a6d
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,1913 @@
+			 ============================
+			 LINUX KERNEL MEMORY BARRIERS
+			 ============================
+
+By: David Howells <dhowells@redhat.com>
+
+Contents:
+
+ (*) Abstract memory access model.
+
+     - Device operations.
+     - Guarantees.
+
+ (*) What are memory barriers?
+
+     - Varieties of memory barrier.
+     - What may not be assumed about memory barriers?
+     - Data dependency barriers.
+     - Control dependencies.
+     - SMP barrier pairing.
+     - Examples of memory barrier sequences.
+
+ (*) Explicit kernel barriers.
+
+     - Compiler barrier.
+     - The CPU memory barriers.
+     - MMIO write barrier.
+
+ (*) Implicit kernel memory barriers.
+
+     - Locking functions.
+     - Interrupt disabling functions.
+     - Miscellaneous functions.
+
+ (*) Inter-CPU locking barrier effects.
+
+     - Locks vs memory accesses.
+     - Locks vs I/O accesses.
+
+ (*) Where are memory barriers needed?
+
+     - Interprocessor interaction.
+     - Atomic operations.
+     - Accessing devices.
+     - Interrupts.
+
+ (*) Kernel I/O barrier effects.
+
+ (*) Assumed minimum execution ordering model.
+
+ (*) The effects of the cpu cache.
+
+     - Cache coherency.
+     - Cache coherency vs DMA.
+     - Cache coherency vs MMIO.
+
+ (*) The things CPUs get up to.
+
+     - And then there's the Alpha.
+
+ (*) References.
+
+
+============================
+ABSTRACT MEMORY ACCESS MODEL
+============================
+
+Consider the following abstract model of the system:
+
+		            :                :
+		            :                :
+		            :                :
+		+-------+   :   +--------+   :   +-------+
+		|       |   :   |        |   :   |       |
+		|       |   :   |        |   :   |       |
+		| CPU 1 |<----->| Memory |<----->| CPU 2 |
+		|       |   :   |        |   :   |       |
+		|       |   :   |        |   :   |       |
+		+-------+   :   +--------+   :   +-------+
+		    ^       :       ^        :       ^
+		    |       :       |        :       |
+		    |       :       |        :       |
+		    |       :       v        :       |
+		    |       :   +--------+   :       |
+		    |       :   |        |   :       |
+		    |       :   |        |   :       |
+		    +---------->| Device |<----------+
+		            :   |        |   :
+		            :   |        |   :
+		            :   +--------+   :
+		            :                :
+
+Each CPU executes a program that generates memory access operations.  In the
+abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
+perform the memory operations in any order it likes, provided program causality
+appears to be maintained.  Similarly, the compiler may also arrange the
+instructions it emits in any order it likes, provided it doesn't affect the
+apparent operation of the program.
+
+So in the above diagram, the effects of the memory operations performed by a
+CPU are perceived by the rest of the system as the operations cross the
+interface between the CPU and rest of the system (the dotted lines).
+
+
+For example, consider the following sequence of events:
+
+	CPU 1		CPU 2
+	===============	===============
+	{ A == 1; B == 2 }
+	A = 3;		x = A;
+	B = 4;		y = B;
+
+The set of accesses as seen by the memory system in the middle can be arranged
+in 24 different combinations:
+
+	STORE A=3,	STORE B=4,	x=LOAD A->3,	y=LOAD B->4
+	STORE A=3,	STORE B=4,	y=LOAD B->4,	x=LOAD A->3
+	STORE A=3,	x=LOAD A->3,	STORE B=4,	y=LOAD B->4
+	STORE A=3,	x=LOAD A->3,	y=LOAD B->2,	STORE B=4
+	STORE A=3,	y=LOAD B->2,	STORE B=4,	x=LOAD A->3
+	STORE A=3,	y=LOAD B->2,	x=LOAD A->3,	STORE B=4
+	STORE B=4,	STORE A=3,	x=LOAD A->3,	y=LOAD B->4
+	STORE B=4, ...
+	...
+
+and can thus result in four different combinations of values:
+
+	x == 1, y == 2
+	x == 1, y == 4
+	x == 3, y == 2
+	x == 3, y == 4
+
+
+Furthermore, the stores committed by a CPU to the memory system may not be
+perceived by the loads made by another CPU in the same order as the stores were
+committed.
+
+
+As a further example, consider this sequence of events:
+
+	CPU 1		CPU 2
+	===============	===============
+	{ A == 1, B == 2, C = 3, P == &A, Q == &C }
+	B = 4;		Q = P;
+	P = &B		D = *Q;
+
+There is an obvious data dependency here, as the value loaded into D depends on
+the address retrieved from P by CPU 2.  At the end of the sequence, any of the
+following results are possible:
+
+	(Q == &A) and (D == 1)
+	(Q == &B) and (D == 2)
+	(Q == &B) and (D == 4)
+
+Note that CPU 2 will never try and load C into D because the CPU will load P
+into Q before issuing the load of *Q.
+
+
+DEVICE OPERATIONS
+-----------------
+
+Some devices present their control interfaces as collections of memory
+locations, but the order in which the control registers are accessed is very
+important.  For instance, imagine an ethernet card with a set of internal
+registers that are accessed through an address port register (A) and a data
+port register (D).  To read internal register 5, the following code might then
+be used:
+
+	*A = 5;
+	x = *D;
+
+but this might show up as either of the following two sequences:
+
+	STORE *A = 5, x = LOAD *D
+	x = LOAD *D, STORE *A = 5
+
+the second of which will almost certainly result in a malfunction, since it set
+the address _after_ attempting to read the register.
+
+
+GUARANTEES
+----------
+
+There are some minimal guarantees that may be expected of a CPU:
+
+ (*) On any given CPU, dependent memory accesses will be issued in order, with
+     respect to itself.  This means that for:
+
+	Q = P; D = *Q;
+
+     the CPU will issue the following memory operations:
+
+	Q = LOAD P, D = LOAD *Q
+
+     and always in that order.
+
+ (*) Overlapping loads and stores within a particular CPU will appear to be
+     ordered within that CPU.  This means that for:
+
+	a = *X; *X = b;
+
+     the CPU will only issue the following sequence of memory operations:
+
+	a = LOAD *X, STORE *X = b
+
+     And for:
+
+	*X = c; d = *X;
+
+     the CPU will only issue:
+
+	STORE *X = c, d = LOAD *X
+
+     (Loads and stores overlap if they are targetted at overlapping pieces of
+     memory).
+
+And there are a number of things that _must_ or _must_not_ be assumed:
+
+ (*) It _must_not_ be assumed that independent loads and stores will be issued
+     in the order given.  This means that for:
+
+	X = *A; Y = *B; *D = Z;
+
+     we may get any of the following sequences:
+
+	X = LOAD *A,  Y = LOAD *B,  STORE *D = Z
+	X = LOAD *A,  STORE *D = Z, Y = LOAD *B
+	Y = LOAD *B,  X = LOAD *A,  STORE *D = Z
+	Y = LOAD *B,  STORE *D = Z, X = LOAD *A
+	STORE *D = Z, X = LOAD *A,  Y = LOAD *B
+	STORE *D = Z, Y = LOAD *B,  X = LOAD *A
+
+ (*) It _must_ be assumed that overlapping memory accesses may be merged or
+     discarded.  This means that for:
+
+	X = *A; Y = *(A + 4);
+
+     we may get any one of the following sequences:
+
+	X = LOAD *A; Y = LOAD *(A + 4);
+	Y = LOAD *(A + 4); X = LOAD *A;
+	{X, Y} = LOAD {*A, *(A + 4) };
+
+     And for:
+
+	*A = X; Y = *A;
+
+     we may get either of:
+
+	STORE *A = X; Y = LOAD *A;
+	STORE *A = Y;
+
+
+=========================
+WHAT ARE MEMORY BARRIERS?
+=========================
+
+As can be seen above, independent memory operations are effectively performed
+in random order, but this can be a problem for CPU-CPU interaction and for I/O.
+What is required is some way of intervening to instruct the compiler and the
+CPU to restrict the order.
+
+Memory barriers are such interventions.  They impose a perceived partial
+ordering between the memory operations specified on either side of the barrier.
+They request that the sequence of memory events generated appears to other
+parts of the system as if the barrier is effective on that CPU.
+
+
+VARIETIES OF MEMORY BARRIER
+---------------------------
+
+Memory barriers come in four basic varieties:
+
+ (1) Write (or store) memory barriers.
+
+     A write memory barrier gives a guarantee that all the STORE operations
+     specified before the barrier will appear to happen before all the STORE
+     operations specified after the barrier with respect to the other
+     components of the system.
+
+     A write barrier is a partial ordering on stores only; it is not required
+     to have any effect on loads.
+
+     A CPU can be viewed as as commiting a sequence of store operations to the
+     memory system as time progresses.  All stores before a write barrier will
+     occur in the sequence _before_ all the stores after the write barrier.
+
+     [!] Note that write barriers should normally be paired with read or data
+     dependency barriers; see the "SMP barrier pairing" subsection.
+
+
+ (2) Data dependency barriers.
+
+     A data dependency barrier is a weaker form of read barrier.  In the case
+     where two loads are performed such that the second depends on the result
+     of the first (eg: the first load retrieves the address to which the second
+     load will be directed), a data dependency barrier would be required to
+     make sure that the target of the second load is updated before the address
+     obtained by the first load is accessed.
+
+     A data dependency barrier is a partial ordering on interdependent loads
+     only; it is not required to have any effect on stores, independent loads
+     or overlapping loads.
+
+     As mentioned in (1), the other CPUs in the system can be viewed as
+     committing sequences of stores to the memory system that the CPU being
+     considered can then perceive.  A data dependency barrier issued by the CPU
+     under consideration guarantees that for any load preceding it, if that
+     load touches one of a sequence of stores from another CPU, then by the
+     time the barrier completes, the effects of all the stores prior to that
+     touched by the load will be perceptible to any loads issued after the data
+     dependency barrier.
+
+     See the "Examples of memory barrier sequences" subsection for diagrams
+     showing the ordering constraints.
+
+     [!] Note that the first load really has to have a _data_ dependency and
+     not a control dependency.  If the address for the second load is dependent
+     on the first load, but the dependency is through a conditional rather than
+     actually loading the address itself, then it's a _control_ dependency and
+     a full read barrier or better is required.  See the "Control dependencies"
+     subsection for more information.
+
+     [!] Note that data dependency barriers should normally be paired with
+     write barriers; see the "SMP barrier pairing" subsection.
+
+
+ (3) Read (or load) memory barriers.
+
+     A read barrier is a data dependency barrier plus a guarantee that all the
+     LOAD operations specified before the barrier will appear to happen before
+     all the LOAD operations specified after the barrier with respect to the
+     other components of the system.
+
+     A read barrier is a partial ordering on loads only; it is not required to
+     have any effect on stores.
+
+     Read memory barriers imply data dependency barriers, and so can substitute
+     for them.
+
+     [!] Note that read barriers should normally be paired with write barriers;
+     see the "SMP barrier pairing" subsection.
+
+
+ (4) General memory barriers.
+
+     A general memory barrier is a combination of both a read memory barrier
+     and a write memory barrier.  It is a partial ordering over both loads and
+     stores.
+
+     General memory barriers imply both read and write memory barriers, and so
+     can substitute for either.
+
+
+And a couple of implicit varieties:
+
+ (5) LOCK operations.
+
+     This acts as a one-way permeable barrier.  It guarantees that all memory
+     operations after the LOCK operation will appear to happen after the LOCK
+     operation with respect to the other components of the system.
+
+     Memory operations that occur before a LOCK operation may appear to happen
+     after it completes.
+
+     A LOCK operation should almost always be paired with an UNLOCK operation.
+
+
+ (6) UNLOCK operations.
+
+     This also acts as a one-way permeable barrier.  It guarantees that all
+     memory operations before the UNLOCK operation will appear to happen before
+     the UNLOCK operation with respect to the other components of the system.
+
+     Memory operations that occur after an UNLOCK operation may appear to
+     happen before it completes.
+
+     LOCK and UNLOCK operations are guaranteed to appear with respect to each
+     other strictly in the order specified.
+
+     The use of LOCK and UNLOCK operations generally precludes the need for
+     other sorts of memory barrier (but note the exceptions mentioned in the
+     subsection "MMIO write barrier").
+
+
+Memory barriers are only required where there's a possibility of interaction
+between two CPUs or between a CPU and a device.  If it can be guaranteed that
+there won't be any such interaction in any particular piece of code, then
+memory barriers are unnecessary in that piece of code.
+
+
+Note that these are the _minimum_ guarantees.  Different architectures may give
+more substantial guarantees, but they may _not_ be relied upon outside of arch
+specific code.
+
+
+WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?
+----------------------------------------------
+
+There are certain things that the Linux kernel memory barriers do not guarantee:
+
+ (*) There is no guarantee that any of the memory accesses specified before a
+     memory barrier will be _complete_ by the completion of a memory barrier
+     instruction; the barrier can be considered to draw a line in that CPU's
+     access queue that accesses of the appropriate type may not cross.
+
+ (*) There is no guarantee that issuing a memory barrier on one CPU will have
+     any direct effect on another CPU or any other hardware in the system.  The
+     indirect effect will be the order in which the second CPU sees the effects
+     of the first CPU's accesses occur, but see the next point:
+
+ (*) There is no guarantee that the a CPU will see the correct order of effects
+     from a second CPU's accesses, even _if_ the second CPU uses a memory
+     barrier, unless the first CPU _also_ uses a matching memory barrier (see
+     the subsection on "SMP Barrier Pairing").
+
+ (*) There is no guarantee that some intervening piece of off-the-CPU
+     hardware[*] will not reorder the memory accesses.  CPU cache coherency
+     mechanisms should propagate the indirect effects of a memory barrier
+     between CPUs, but might not do so in order.
+
+	[*] For information on bus mastering DMA and coherency please read:
+
+	    Documentation/pci.txt
+	    Documentation/DMA-mapping.txt
+	    Documentation/DMA-API.txt
+
+
+DATA DEPENDENCY BARRIERS
+------------------------
+
+The usage requirements of data dependency barriers are a little subtle, and
+it's not always obvious that they're needed.  To illustrate, consider the
+following sequence of events:
+
+	CPU 1		CPU 2
+	===============	===============
+	{ A == 1, B == 2, C = 3, P == &A, Q == &C }
+	B = 4;
+	<write barrier>
+	P = &B
+			Q = P;
+			D = *Q;
+
+There's a clear data dependency here, and it would seem that by the end of the
+sequence, Q must be either &A or &B, and that:
+
+	(Q == &A) implies (D == 1)
+	(Q == &B) implies (D == 4)
+
+But! CPU 2's perception of P may be updated _before_ its perception of B, thus
+leading to the following situation:
+
+	(Q == &B) and (D == 2) ????
+
+Whilst this may seem like a failure of coherency or causality maintenance, it
+isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
+Alpha).
+
+To deal with this, a data dependency barrier must be inserted between the
+address load and the data load:
+
+	CPU 1		CPU 2
+	===============	===============
+	{ A == 1, B == 2, C = 3, P == &A, Q == &C }
+	B = 4;
+	<write barrier>
+	P = &B
+			Q = P;
+			<data dependency barrier>
+			D = *Q;
+
+This enforces the occurrence of one of the two implications, and prevents the
+third possibility from arising.
+
+[!] Note that this extremely counterintuitive situation arises most easily on
+machines with split caches, so that, for example, one cache bank processes
+even-numbered cache lines and the other bank processes odd-numbered cache
+lines.  The pointer P might be stored in an odd-numbered cache line, and the
+variable B might be stored in an even-numbered cache line.  Then, if the
+even-numbered bank of the reading CPU's cache is extremely busy while the
+odd-numbered bank is idle, one can see the new value of the pointer P (&B),
+but the old value of the variable B (1).
+
+
+Another example of where data dependency barriers might by required is where a
+number is read from memory and then used to calculate the index for an array
+access:
+
+	CPU 1		CPU 2
+	===============	===============
+	{ M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
+	M[1] = 4;
+	<write barrier>
+	P = 1
+			Q = P;
+			<data dependency barrier>
+			D = M[Q];
+
+
+The data dependency barrier is very important to the RCU system, for example.
+See rcu_dereference() in include/linux/rcupdate.h.  This permits the current
+target of an RCU'd pointer to be replaced with a new modified target, without
+the replacement target appearing to be incompletely initialised.
+
+See also the subsection on "Cache Coherency" for a more thorough example.
+
+
+CONTROL DEPENDENCIES
+--------------------
+
+A control dependency requires a full read memory barrier, not simply a data
+dependency barrier to make it work correctly.  Consider the following bit of
+code:
+
+	q = &a;
+	if (p)
+		q = &b;
+	<data dependency barrier>
+	x = *q;
+
+This will not have the desired effect because there is no actual data
+dependency, but rather a control dependency that the CPU may short-circuit by
+attempting to predict the outcome in advance.  In such a case what's actually
+required is:
+
+	q = &a;
+	if (p)
+		q = &b;
+	<read barrier>
+	x = *q;
+
+
+SMP BARRIER PAIRING
+-------------------
+
+When dealing with CPU-CPU interactions, certain types of memory barrier should
+always be paired.  A lack of appropriate pairing is almost certainly an error.
+
+A write barrier should always be paired with a data dependency barrier or read
+barrier, though a general barrier would also be viable.  Similarly a read
+barrier or a data dependency barrier should always be paired with at least an
+write barrier, though, again, a general barrier is viable:
+
+	CPU 1		CPU 2
+	===============	===============
+	a = 1;
+	<write barrier>
+	b = 2;		x = a;
+			<read barrier>
+			y = b;
+
+Or:
+
+	CPU 1		CPU 2
+	===============	===============================
+	a = 1;
+	<write barrier>
+	b = &a;		x = b;
+			<data dependency barrier>
+			y = *x;
+
+Basically, the read barrier always has to be there, even though it can be of
+the "weaker" type.
+
+
+EXAMPLES OF MEMORY BARRIER SEQUENCES
+------------------------------------
+
+Firstly, write barriers act as a partial orderings on store operations.
+Consider the following sequence of events:
+
+	CPU 1
+	=======================
+	STORE A = 1
+	STORE B = 2
+	STORE C = 3
+	<write barrier>
+	STORE D = 4
+	STORE E = 5
+
+This sequence of events is committed to the memory coherence system in an order
+that the rest of the system might perceive as the unordered set of { STORE A,
+STORE B, STORE C } all occuring before the unordered set of { STORE D, STORE E
+}:
+
+	+-------+       :      :
+	|       |       +------+
+	|       |------>| C=3  |     }     /\
+	|       |  :    +------+     }-----  \  -----> Events perceptible
+	|       |  :    | A=1  |     }        \/       to rest of system
+	|       |  :    +------+     }
+	| CPU 1 |  :    | B=2  |     }
+	|       |       +------+     }
+	|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
+	|       |       +------+     }        requires all stores prior to the
+	|       |  :    | E=5  |     }        barrier to be committed before
+	|       |  :    +------+     }        further stores may be take place.
+	|       |------>| D=4  |     }
+	|       |       +------+
+	+-------+       :      :
+	                   |
+	                   | Sequence in which stores committed to memory system
+	                   | by CPU 1
+	                   V
+
+
+Secondly, data dependency barriers act as a partial orderings on data-dependent
+loads.  Consider the following sequence of events:
+
+	CPU 1			CPU 2
+	=======================	=======================
+	STORE A = 1
+	STORE B = 2
+	<write barrier>
+	STORE C = &B		LOAD X
+	STORE D = 4		LOAD C (gets &B)
+				LOAD *C (reads B)
+
+Without intervention, CPU 2 may perceive the events on CPU 1 in some
+effectively random order, despite the write barrier issued by CPU 1:
+
+	+-------+       :      :                :       :
+	|       |       +------+                +-------+  | Sequence of update
+	|       |------>| B=2  |-----       --->| Y->8  |  | of perception on
+	|       |  :    +------+     \          +-------+  | CPU 2
+	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
+	|       |       +------+       |        +-------+
+	|       |   wwwwwwwwwwwwwwww   |        :       :
+	|       |       +------+       |        :       :
+	|       |  :    | C=&B |---    |        :       :       +-------+
+	|       |  :    +------+   \   |        +-------+       |       |
+	|       |------>| D=4  |    ----------->| C->&B |------>|       |
+	|       |       +------+       |        +-------+       |       |
+	+-------+       :      :       |        :       :       |       |
+	                               |        :       :       |       |
+	                               |        :       :       | CPU 2 |
+	                               |        +-------+       |       |
+	    Apparently incorrect --->  |        | B->7  |------>|       |
+	    perception of B (!)        |        +-------+       |       |
+	                               |        :       :       |       |
+	                               |        +-------+       |       |
+	    The load of X holds --->    \       | X->9  |------>|       |
+	    up the maintenance           \      +-------+       |       |
+	    of coherence of B             ----->| B->2  |       +-------+
+	                                        +-------+
+	                                        :       :
+
+
+In the above example, CPU 2 perceives that B is 7, despite the load of *C
+(which would be B) coming after the the LOAD of C.
+
+If, however, a data dependency barrier were to be placed between the load of C
+and the load of *C (ie: B) on CPU 2, then the following will occur:
+
+	+-------+       :      :                :       :
+	|       |       +------+                +-------+
+	|       |------>| B=2  |-----       --->| Y->8  |
+	|       |  :    +------+     \          +-------+
+	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |
+	|       |       +------+       |        +-------+
+	|       |   wwwwwwwwwwwwwwww   |        :       :
+	|       |       +------+       |        :       :
+	|       |  :    | C=&B |---    |        :       :       +-------+
+	|       |  :    +------+   \   |        +-------+       |       |
+	|       |------>| D=4  |    ----------->| C->&B |------>|       |
+	|       |       +------+       |        +-------+       |       |
+	+-------+       :      :       |        :       :       |       |
+	                               |        :       :       |       |
+	                               |        :       :       | CPU 2 |
+	                               |        +-------+       |       |
+	                                \       | X->9  |------>|       |
+	                                 \      +-------+       |       |
+	                                  ----->| B->2  |       |       |
+	                                        +-------+       |       |
+	     Makes sure all effects --->    ddddddddddddddddd   |       |
+	     prior to the store of C            +-------+       |       |
+	     are perceptible to                 | B->2  |------>|       |
+	     successive loads                   +-------+       |       |
+	                                        :       :       +-------+
+
+
+And thirdly, a read barrier acts as a partial order on loads.  Consider the
+following sequence of events:
+
+	CPU 1			CPU 2
+	=======================	=======================
+	STORE A=1
+	STORE B=2
+	STORE C=3
+	<write barrier>
+	STORE D=4
+	STORE E=5
+				LOAD A
+				LOAD B
+				LOAD C
+				LOAD D
+				LOAD E
+
+Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
+some effectively random order, despite the write barrier issued by CPU 1:
+
+	+-------+       :      :
+	|       |       +------+
+	|       |------>| C=3  | }
+	|       |  :    +------+ }
+	|       |  :    | A=1  | }
+	|       |  :    +------+ }
+	| CPU 1 |  :    | B=2  | }---
+	|       |       +------+ }   \
+	|       |   wwwwwwwwwwwww}    \
+	|       |       +------+ }     \          :       :       +-------+
+	|       |  :    | E=5  | }      \         +-------+       |       |
+	|       |  :    +------+ }       \      { | C->3  |------>|       |
+	|       |------>| D=4  | }        \     { +-------+    :  |       |
+	|       |       +------+           \    { | E->5  |    :  |       |
+	+-------+       :      :            \   { +-------+    :  |       |
+	                           Transfer  -->{ | A->1  |    :  | CPU 2 |
+	                          from CPU 1    { +-------+    :  |       |
+	                           to CPU 2     { | D->4  |    :  |       |
+	                                        { +-------+    :  |       |
+	                                        { | B->2  |------>|       |
+	                                          +-------+       |       |
+	                                          :       :       +-------+
+
+
+If, however, a read barrier were to be placed between the load of C and the
+load of D on CPU 2, then the partial ordering imposed by CPU 1 will be
+perceived correctly by CPU 2.
+
+	+-------+       :      :
+	|       |       +------+
+	|       |------>| C=3  | }
+	|       |  :    +------+ }
+	|       |  :    | A=1  | }---
+	|       |  :    +------+ }   \
+	| CPU 1 |  :    | B=2  | }    \
+	|       |       +------+       \
+	|       |   wwwwwwwwwwwwwwww    \
+	|       |       +------+         \        :       :       +-------+
+	|       |  :    | E=5  | }        \       +-------+       |       |
+	|       |  :    +------+ }---      \    { | C->3  |------>|       |
+	|       |------>| D=4  | }   \      \   { +-------+    :  |       |
+	|       |       +------+      \      -->{ | B->2  |    :  |       |
+	+-------+       :      :       \        { +-------+    :  |       |
+	                                \       { | A->1  |    :  | CPU 2 |
+	                                 \        +-------+       |       |
+	   At this point the read ---->   \   rrrrrrrrrrrrrrrrr   |       |
+	   barrier causes all effects      \      +-------+       |       |
+	   prior to the storage of C        \   { | E->5  |    :  |       |
+	   to be perceptible to CPU 2        -->{ +-------+    :  |       |
+	                                        { | D->4  |------>|       |
+	                                          +-------+       |       |
+	                                          :       :       +-------+
+
+
+========================
+EXPLICIT KERNEL BARRIERS
+========================
+
+The Linux kernel has a variety of different barriers that act at different
+levels:
+
+  (*) Compiler barrier.
+
+  (*) CPU memory barriers.
+
+  (*) MMIO write barrier.
+
+
+COMPILER BARRIER
+----------------
+
+The Linux kernel has an explicit compiler barrier function that prevents the
+compiler from moving the memory accesses either side of it to the other side:
+
+	barrier();
+
+This a general barrier - lesser varieties of compiler barrier do not exist.
+
+The compiler barrier has no direct effect on the CPU, which may then reorder
+things however it wishes.
+
+
+CPU MEMORY BARRIERS
+-------------------
+
+The Linux kernel has eight basic CPU memory barriers:
+
+	TYPE		MANDATORY		SMP CONDITIONAL
+	===============	=======================	===========================
+	GENERAL		mb()			smp_mb()
+	WRITE		wmb()			smp_wmb()
+	READ		rmb()			smp_rmb()
+	DATA DEPENDENCY	read_barrier_depends()	smp_read_barrier_depends()
+
+
+All CPU memory barriers unconditionally imply compiler barriers.
+
+SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
+systems because it is assumed that a CPU will be appear to be self-consistent,
+and will order overlapping accesses correctly with respect to itself.
+
+[!] Note that SMP memory barriers _must_ be used to control the ordering of
+references to shared memory on SMP systems, though the use of locking instead
+is sufficient.
+
+Mandatory barriers should not be used to control SMP effects, since mandatory
+barriers unnecessarily impose overhead on UP systems. They may, however, be
+used to control MMIO effects on accesses through relaxed memory I/O windows.
+These are required even on non-SMP systems as they affect the order in which
+memory operations appear to a device by prohibiting both the compiler and the
+CPU from reordering them.
+
+
+There are some more advanced barrier functions:
+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+     These assign the value to the variable and then insert at least a write
+     barrier after it, depending on the function.  They aren't guaranteed to
+     insert anything more than a compiler barrier in a UP compilation.
+
+
+ (*) smp_mb__before_atomic_dec();
+ (*) smp_mb__after_atomic_dec();
+ (*) smp_mb__before_atomic_inc();
+ (*) smp_mb__after_atomic_inc();
+
+     These are for use with atomic add, subtract, increment and decrement
+     functions, especially when used for reference counting.  These functions
+     do not imply memory barriers.
+
+     As an example, consider a piece of code that marks an object as being dead
+     and then decrements the object's reference count:
+
+	obj->dead = 1;
+	smp_mb__before_atomic_dec();
+	atomic_dec(&obj->ref_count);
+
+     This makes sure that the death mark on the object is perceived to be set
+     *before* the reference counter is decremented.
+
+     See Documentation/atomic_ops.txt for more information.  See the "Atomic
+     operations" subsection for information on where to use these.
+
+
+ (*) smp_mb__before_clear_bit(void);
+ (*) smp_mb__after_clear_bit(void);
+
+     These are for use similar to the atomic inc/dec barriers.  These are
+     typically used for bitwise unlocking operations, so care must be taken as
+     there are no implicit memory barriers here either.
+
+     Consider implementing an unlock operation of some nature by clearing a
+     locking bit.  The clear_bit() would then need to be barriered like this:
+
+	smp_mb__before_clear_bit();
+	clear_bit( ... );
+
+     This prevents memory operations before the clear leaking to after it.  See
+     the subsection on "Locking Functions" with reference to UNLOCK operation
+     implications.
+
+     See Documentation/atomic_ops.txt for more information.  See the "Atomic
+     operations" subsection for information on where to use these.
+
+
+MMIO WRITE BARRIER
+------------------
+
+The Linux kernel also has a special barrier for use with memory-mapped I/O
+writes:
+
+	mmiowb();
+
+This is a variation on the mandatory write barrier that causes writes to weakly
+ordered I/O regions to be partially ordered.  Its effects may go beyond the
+CPU->Hardware interface and actually affect the hardware at some level.
+
+See the subsection "Locks vs I/O accesses" for more information.
+
+
+===============================
+IMPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+Some of the other functions in the linux kernel imply memory barriers, amongst
+which are locking, scheduling and memory allocation functions.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+LOCKING FUNCTIONS
+-----------------
+
+The Linux kernel has a number of locking constructs:
+
+ (*) spin locks
+ (*) R/W spin locks
+ (*) mutexes
+ (*) semaphores
+ (*) R/W semaphores
+ (*) RCU
+
+In all cases there are variants on "LOCK" operations and "UNLOCK" operations
+for each construct.  These operations all imply certain barriers:
+
+ (1) LOCK operation implication:
+
+     Memory operations issued after the LOCK will be completed after the LOCK
+     operation has completed.
+
+     Memory operations issued before the LOCK may be completed after the LOCK
+     operation has completed.
+
+ (2) UNLOCK operation implication:
+
+     Memory operations issued before the UNLOCK will be completed before the
+     UNLOCK operation has completed.
+
+     Memory operations issued after the UNLOCK may be completed before the
+     UNLOCK operation has completed.
+
+ (3) LOCK vs LOCK implication:
+
+     All LOCK operations issued before another LOCK operation will be completed
+     before that LOCK operation.
+
+ (4) LOCK vs UNLOCK implication:
+
+     All LOCK operations issued before an UNLOCK operation will be completed
+     before the UNLOCK operation.
+
+     All UNLOCK operations issued before a LOCK operation will be completed
+     before the LOCK operation.
+
+ (5) Failed conditional LOCK implication:
+
+     Certain variants of the LOCK operation may fail, either due to being
+     unable to get the lock immediately, or due to receiving an unblocked
+     signal whilst asleep waiting for the lock to become available.  Failed
+     locks do not imply any sort of barrier.
+
+Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is
+equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.
+
+[!] Note: one of the consequence of LOCKs and UNLOCKs being only one-way
+    barriers is that the effects instructions outside of a critical section may
+    seep into the inside of the critical section.
+
+Locks and semaphores may not provide any guarantee of ordering on UP compiled
+systems, and so cannot be counted on in such a situation to actually achieve
+anything at all - especially with respect to I/O accesses - unless combined
+with interrupt disabling operations.
+
+See also the section on "Inter-CPU locking barrier effects".
+
+
+As an example, consider the following:
+
+	*A = a;
+	*B = b;
+	LOCK
+	*C = c;
+	*D = d;
+	UNLOCK
+	*E = e;
+	*F = f;
+
+The following sequence of events is acceptable:
+
+	LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
+
+	[+] Note that {*F,*A} indicates a combined access.
+
+But none of the following are:
+
+	{*F,*A}, *B,	LOCK, *C, *D,	UNLOCK, *E
+	*A, *B, *C,	LOCK, *D,	UNLOCK, *E, *F
+	*A, *B,		LOCK, *C,	UNLOCK, *D, *E, *F
+	*B,		LOCK, *C, *D,	UNLOCK, {*F,*A}, *E
+
+
+
+INTERRUPT DISABLING FUNCTIONS
+-----------------------------
+
+Functions that disable interrupts (LOCK equivalent) and enable interrupts
+(UNLOCK equivalent) will act as compiler barriers only.  So if memory or I/O
+barriers are required in such a situation, they must be provided from some
+other means.
+
+
+MISCELLANEOUS FUNCTIONS
+-----------------------
+
+Other functions that imply barriers:
+
+ (*) schedule() and similar imply full memory barriers.
+
+ (*) Memory allocation and release functions imply full memory barriers.
+
+
+=================================
+INTER-CPU LOCKING BARRIER EFFECTS
+=================================
+
+On SMP systems locking primitives give a more substantial form of barrier: one
+that does affect memory access ordering on other CPUs, within the context of
+conflict on any particular lock.
+
+
+LOCKS VS MEMORY ACCESSES
+------------------------
+
+Consider the following: the system has a pair of spinlocks (N) and (Q), and
+three CPUs; then should the following sequence of events occur:
+
+	CPU 1				CPU 2
+	===============================	===============================
+	*A = a;				*E = e;
+	LOCK M				LOCK Q
+	*B = b;				*F = f;
+	*C = c;				*G = g;
+	UNLOCK M			UNLOCK Q
+	*D = d;				*H = h;
+
+Then there is no guarantee as to what order CPU #3 will see the accesses to *A
+through *H occur in, other than the constraints imposed by the separate locks
+on the separate CPUs. It might, for example, see:
+
+	*E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M
+
+But it won't see any of:
+
+	*B, *C or *D preceding LOCK M