From 62da1921292ef789c23a7bf01d671d7572baf377 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 20 Sep 2012 16:02:49 -0700 Subject: rcu: Accelerate callbacks for CPU initiating a grace period Because grace-period initialization is carried out by a separate kthread, it might happen on a different CPU than the one that had the callback needing a grace period -- which is where the callback acceleration needs to happen. Fortunately, rcu_start_gp() holds the root rcu_node structure's ->lock, which prevents a new grace period from starting. This allows this function to safely determine that a grace period has not yet started, which in turn allows it to fully accelerate any callbacks that it has pending. This commit adds this acceleration. Signed-off-by: Paul E. McKenney --- kernel/rcutree.c | 26 ++++++++++++++++++++++++-- 1 file changed, 24 insertions(+), 2 deletions(-) (limited to 'kernel/rcutree.c') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 74df86bd920..93d6871bf7f 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -1404,15 +1404,37 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags) !cpu_needs_another_gp(rsp, rdp)) { /* * Either we have not yet spawned the grace-period - * task or this CPU does not need another grace period. + * task, this CPU does not need another grace period, + * or a grace period is already in progress. * Either way, don't start a new grace period. */ raw_spin_unlock_irqrestore(&rnp->lock, flags); return; } + /* + * Because there is no grace period in progress right now, + * any callbacks we have up to this point will be satisfied + * by the next grace period. So promote all callbacks to be + * handled after the end of the next grace period. If the + * CPU is not yet aware of the end of the previous grace period, + * we need to allow for the callback advancement that will + * occur when it does become aware. Deadlock prevents us from + * making it aware at this point: We cannot acquire a leaf + * rcu_node ->lock while holding the root rcu_node ->lock. + */ + rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL]; + if (rdp->completed == rsp->completed) + rdp->nxttail[RCU_WAIT_TAIL] = rdp->nxttail[RCU_NEXT_TAIL]; + rsp->gp_flags = RCU_GP_FLAG_INIT; - raw_spin_unlock_irqrestore(&rnp->lock, flags); + raw_spin_unlock(&rnp->lock); /* Interrupts remain disabled. */ + + /* Ensure that CPU is aware of completion of last grace period. */ + rcu_process_gp_end(rsp, rdp); + local_irq_restore(flags); + + /* Wake up rcu_gp_kthread() to start the grace period. */ wake_up(&rsp->gp_wq); } -- cgit v1.2.3-18-g5258 From abfd6e58aed4f89fd69b9b17bc4b4527efe3a645 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 20 Sep 2012 16:59:47 -0700 Subject: rcu: Fix comment about _rcu_barrier()/orphanage exclusion In the old days, _rcu_barrier() acquired ->onofflock to exclude rcu_send_cbs_to_orphanage(), which allowed the latter to avoid memory barriers in callback handling. However, _rcu_barrier() recently started doing get_online_cpus() to lock out CPU-hotplug operations entirely, which means that the comment in rcu_send_cbs_to_orphanage() that talks about ->onofflock is now obsolete. This commit therefore fixes the comment. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. 
McKenney --- kernel/rcutree.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel/rcutree.c') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 74df86bd920..ac8aed8ee41 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -1581,8 +1581,8 @@ rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp, { /* * Orphan the callbacks. First adjust the counts. This is safe - * because ->onofflock excludes _rcu_barrier()'s adoption of - * the callbacks, thus no memory barrier is required. + * because _rcu_barrier() excludes CPU-hotplug operations, so it + * cannot be running now. Thus no memory barrier is required. */ if (rdp->nxtlist != NULL) { rsp->qlen_lazy += rdp->qlen_lazy; -- cgit v1.2.3-18-g5258 From 3705b88db0d7cc4a097c32d9e554054103d3f807 Mon Sep 17 00:00:00 2001 From: Antti P Miettinen Date: Fri, 5 Oct 2012 09:59:15 +0300 Subject: rcu: Add a module parameter to force use of expedited RCU primitives There have been some embedded applications that would benefit from use of expedited grace-period primitives. In some ways, this is similar to synchronize_net() doing either a normal or an expedited grace period depending on lock state, but with control outside of the kernel. This commit therefore adds rcu_expedited boot and sysfs parameters that cause the kernel to substitute expedited primitives for the normal grace-period primitives. [ paulmck: Add trace/event/rcu.h to kernel/srcu.c to avoid build error. Get rid of infinite loop through contention path.] Signed-off-by: Antti P Miettinen Signed-off-by: Paul E. McKenney --- kernel/rcutree.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) (limited to 'kernel/rcutree.c') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 74df86bd920..f9c17c39953 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -2224,7 +2224,10 @@ void synchronize_sched(void) "Illegal synchronize_sched() in RCU-sched read-side critical section"); if (rcu_blocking_is_gp()) return; - wait_rcu_gp(call_rcu_sched); + if (rcu_expedited) + synchronize_sched_expedited(); + else + wait_rcu_gp(call_rcu_sched); } EXPORT_SYMBOL_GPL(synchronize_sched); @@ -2245,7 +2248,10 @@ void synchronize_rcu_bh(void) "Illegal synchronize_rcu_bh() in RCU-bh read-side critical section"); if (rcu_blocking_is_gp()) return; - wait_rcu_gp(call_rcu_bh); + if (rcu_expedited) + synchronize_rcu_bh_expedited(); + else + wait_rcu_gp(call_rcu_bh); } EXPORT_SYMBOL_GPL(synchronize_rcu_bh); @@ -2328,7 +2334,7 @@ void synchronize_sched_expedited(void) if (trycount++ < 10) { udelay(trycount * num_online_cpus()); } else { - synchronize_sched(); + wait_rcu_gp(call_rcu_sched); return; } -- cgit v1.2.3-18-g5258 From 340f588bbaed6cb518aa65e7a330dcc3fff911f8 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 15 Oct 2012 08:24:54 -0700 Subject: rcu: Fix precedence error in cpu_needs_another_gp() The fix introduced by a10d206e (rcu: Fix day-one dyntick-idle stall-warning bug) has a C-language precedence error. It turns out that this error is harmless in that the same result is computed for all inputs, but the code is nevertheless a potential source of confusion. This commit therefore introduces parentheses in order to force the execution of the code to reflect the intent. Reported-by: Ben Hutchings Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. 
McKenney --- kernel/rcutree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel/rcutree.c') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index f9c17c39953..effd47a54b3 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -313,7 +313,7 @@ static int cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp) { return *rdp->nxttail[RCU_DONE_TAIL + - ACCESS_ONCE(rsp->completed) != rdp->completed] && + (ACCESS_ONCE(rsp->completed) != rdp->completed)] && !rcu_gp_in_progress(rsp); } -- cgit v1.2.3-18-g5258 From b637a328bd4f43a0e146d1eef0142b650ba0d644 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 19 Sep 2012 16:58:38 -0700 Subject: rcu: Print remote CPU's stacks in stall warnings The RCU CPU stall warnings rely on trigger_all_cpu_backtrace() to do NMI-based dump of the stack traces of all CPUs. Unfortunately, a number of architectures do not implement trigger_all_cpu_backtrace(), in which case RCU falls back to just dumping the stack of the running CPU. This is unhelpful in the case where the running CPU has detected that some other CPU has stalled. This commit therefore makes the running CPU dump the stacks of the tasks running on the stalled CPUs. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney --- kernel/rcutree.c | 25 ++++++++++++++++++++++++- 1 file changed, 24 insertions(+), 1 deletion(-) (limited to 'kernel/rcutree.c') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 74df86bd920..e78538712df 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -873,6 +873,29 @@ static void record_gp_stall_check_time(struct rcu_state *rsp) rsp->jiffies_stall = jiffies + jiffies_till_stall_check(); } +/* + * Dump stacks of all tasks running on stalled CPUs. This is a fallback + * for architectures that do not implement trigger_all_cpu_backtrace(). + * The NMI-triggered stack traces are more accurate because they are + * printed by the target CPU. + */ +static void rcu_dump_cpu_stacks(struct rcu_state *rsp) +{ + int cpu; + unsigned long flags; + struct rcu_node *rnp; + + rcu_for_each_leaf_node(rsp, rnp) { + raw_spin_lock_irqsave(&rnp->lock, flags); + if (rnp->qsmask != 0) { + for (cpu = 0; cpu <= rnp->grphi - rnp->grplo; cpu++) + if (rnp->qsmask & (1UL << cpu)) + dump_cpu_task(rnp->grplo + cpu); + } + raw_spin_unlock_irqrestore(&rnp->lock, flags); + } +} + static void print_other_cpu_stall(struct rcu_state *rsp) { int cpu; @@ -929,7 +952,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp) if (ndetected == 0) printk(KERN_ERR "INFO: Stall ended before state dump start\n"); else if (!trigger_all_cpu_backtrace()) - dump_stack(); + rcu_dump_cpu_stacks(rsp); /* Complain about tasks blocking the grace period. */ -- cgit v1.2.3-18-g5258 From eee058826100e5a344f11601e0c47baeaf07c77b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 21 Sep 2012 14:15:05 -0700 Subject: rcu: Add grace-period information to RCU CPU stall warnings This commit causes the last grace period started and completed to be printed on RCU CPU stall warning messages in order to aid diagnosis. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. 
McKenney --- kernel/rcutree.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'kernel/rcutree.c') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index e78538712df..8f4de3a7c1f 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -947,8 +947,9 @@ static void print_other_cpu_stall(struct rcu_state *rsp) raw_spin_unlock_irqrestore(&rnp->lock, flags); print_cpu_stall_info_end(); - printk(KERN_CONT "(detected by %d, t=%ld jiffies)\n", - smp_processor_id(), (long)(jiffies - rsp->gp_start)); + pr_cont("(detected by %d, t=%ld jiffies, g=%lu, c=%lu)\n", + smp_processor_id(), (long)(jiffies - rsp->gp_start), + rsp->gpnum, rsp->completed); if (ndetected == 0) printk(KERN_ERR "INFO: Stall ended before state dump start\n"); else if (!trigger_all_cpu_backtrace()) @@ -975,7 +976,8 @@ static void print_cpu_stall(struct rcu_state *rsp) print_cpu_stall_info_begin(); print_cpu_stall_info(rsp, smp_processor_id()); print_cpu_stall_info_end(); - printk(KERN_CONT " (t=%lu jiffies)\n", jiffies - rsp->gp_start); + pr_cont(" (t=%lu jiffies g=%lu c=%lu)\n", + jiffies - rsp->gp_start, rsp->gpnum, rsp->completed); if (!trigger_all_cpu_backtrace()) dump_stack(); -- cgit v1.2.3-18-g5258 From 53bb857c373d6b7936f8a7b4451f0a99703c308e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 21 Sep 2012 16:35:25 -0700 Subject: rcu: Dump number of callbacks in stall warning messages In theory, if a grace period manages to get started despite there being no callbacks on any of the CPUs, all CPUs could go into dyntick-idle mode, so that the grace period would never end. This commit updates the RCU CPU stall warning messages to detect this condition by summing up the number of callbacks on all CPUs. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney --- kernel/rcutree.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) (limited to 'kernel/rcutree.c') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 8f4de3a7c1f..24b21cba2cc 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -903,6 +903,7 @@ static void print_other_cpu_stall(struct rcu_state *rsp) unsigned long flags; int ndetected = 0; struct rcu_node *rnp = rcu_get_root(rsp); + long totqlen = 0; /* Only let one CPU complain about others per time interval. */ @@ -947,9 +948,11 @@ static void print_other_cpu_stall(struct rcu_state *rsp) raw_spin_unlock_irqrestore(&rnp->lock, flags); print_cpu_stall_info_end(); - pr_cont("(detected by %d, t=%ld jiffies, g=%lu, c=%lu)\n", + for_each_possible_cpu(cpu) + totqlen += per_cpu_ptr(rsp->rda, cpu)->qlen; + pr_cont("(detected by %d, t=%ld jiffies, g=%lu, c=%lu, q=%lu)\n", smp_processor_id(), (long)(jiffies - rsp->gp_start), - rsp->gpnum, rsp->completed); + rsp->gpnum, rsp->completed, totqlen); if (ndetected == 0) printk(KERN_ERR "INFO: Stall ended before state dump start\n"); else if (!trigger_all_cpu_backtrace()) @@ -964,8 +967,10 @@ static void print_other_cpu_stall(struct rcu_state *rsp) static void print_cpu_stall(struct rcu_state *rsp) { + int cpu; unsigned long flags; struct rcu_node *rnp = rcu_get_root(rsp); + long totqlen = 0; /* * OK, time to rat on ourselves... 
@@ -976,8 +981,10 @@ static void print_cpu_stall(struct rcu_state *rsp) print_cpu_stall_info_begin(); print_cpu_stall_info(rsp, smp_processor_id()); print_cpu_stall_info_end(); - pr_cont(" (t=%lu jiffies g=%lu c=%lu)\n", - jiffies - rsp->gp_start, rsp->gpnum, rsp->completed); + for_each_possible_cpu(cpu) + totqlen += per_cpu_ptr(rsp->rda, cpu)->qlen; + pr_cont(" (t=%lu jiffies g=%lu c=%lu q=%lu)\n", + jiffies - rsp->gp_start, rsp->gpnum, rsp->completed, totqlen); if (!trigger_all_cpu_backtrace()) dump_stack(); -- cgit v1.2.3-18-g5258 From 7b2e6011f150c42235c4a541d20cf6891afe878a Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 8 Oct 2012 10:54:03 -0700 Subject: rcu: Rename ->onofflock to ->orphan_lock The ->onofflock field in the rcu_state structure at one time synchronized CPU-hotplug operations for RCU. However, its scope has decreased over time so that it now only protects the lists of orphaned RCU callbacks. This commit therefore renames it to ->orphan_lock to reflect its current use. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney --- kernel/rcutree.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel/rcutree.c') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index ac8aed8ee41..89148860e2b 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -70,7 +70,7 @@ static struct lock_class_key rcu_fqs_class[RCU_NUM_LVLS]; .fqs_state = RCU_GP_IDLE, \ .gpnum = -300, \ .completed = -300, \ - .onofflock = __RAW_SPIN_LOCK_UNLOCKED(&sname##_state.onofflock), \ + .orphan_lock = __RAW_SPIN_LOCK_UNLOCKED(&sname##_state.orphan_lock), \ .orphan_nxttail = &sname##_state.orphan_nxtlist, \ .orphan_donetail = &sname##_state.orphan_donelist, \ .barrier_mutex = __MUTEX_INITIALIZER(sname##_state.barrier_mutex), \ @@ -1573,7 +1573,7 @@ rcu_check_quiescent_state(struct rcu_state *rsp, struct rcu_data *rdp) /* * Send the specified CPU's RCU callbacks to the orphanage. The * specified CPU must be offline, and the caller must hold the - * ->onofflock. + * ->orphan_lock. */ static void rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp, @@ -1623,7 +1623,7 @@ rcu_send_cbs_to_orphanage(int cpu, struct rcu_state *rsp, /* * Adopt the RCU callbacks from the specified rcu_state structure's - * orphanage. The caller must hold the ->onofflock. + * orphanage. The caller must hold the ->orphan_lock. */ static void rcu_adopt_orphan_cbs(struct rcu_state *rsp) { @@ -1702,7 +1702,7 @@ static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp) /* Exclude any attempts to start a new grace period. */ mutex_lock(&rsp->onoff_mutex); - raw_spin_lock_irqsave(&rsp->onofflock, flags); + raw_spin_lock_irqsave(&rsp->orphan_lock, flags); /* Orphan the dead CPU's callbacks, and adopt them if appropriate. */ rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp); @@ -1729,10 +1729,10 @@ static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp) /* * We still hold the leaf rcu_node structure lock here, and * irqs are still disabled. The reason for this subterfuge is - * because invoking rcu_report_unblock_qs_rnp() with ->onofflock + * because invoking rcu_report_unblock_qs_rnp() with ->orphan_lock * held leads to deadlock. */ - raw_spin_unlock(&rsp->onofflock); /* irqs remain disabled. */ + raw_spin_unlock(&rsp->orphan_lock); /* irqs remain disabled. */ rnp = rdp->mynode; if (need_report & RCU_OFL_TASKS_NORM_GP) rcu_report_unblock_qs_rnp(rnp, flags); -- cgit v1.2.3-18-g5258 From 1924bcb0259711eea98491a7942d1ffbf677e114 Mon Sep 17 00:00:00 2001 From: "Paul E. 
McKenney" Date: Thu, 11 Oct 2012 12:30:37 -0700 Subject: rcu: Avoid counter wrap in synchronize_sched_expedited() There is a counter scheme similar to ticket locking that synchronize_sched_expedited() uses to service multiple concurrent callers with the same expedited grace period. Upon entry, a sync_sched_expedited_started variable is atomically incremented, and upon completion of a expedited grace period a separate sync_sched_expedited_done variable is atomically incremented. However, if a synchronize_sched_expedited() is delayed while in try_stop_cpus(), concurrent invocations will increment the sync_sched_expedited_started counter, which will eventually overflow. If the original synchronize_sched_expedited() resumes execution just as the counter overflows, a concurrent invocation could incorrectly conclude that an expedited grace period elapsed in zero time, which would be bad. One could rely on counter size to prevent this from happening in practice, but the goal is to formally validate this code, so it needs to be fixed anyway. This commit therefore checks the gap between the two counters before incrementing sync_sched_expedited_started, and if the gap is too large, does a normal grace period instead. Overflow is thus only possible if there are more than about 3.5 billion threads on 32-bit systems, which can be excluded until such time as task_struct fits into a single byte and 4G/4G patches are accepted into mainline. It is also easy to encode this limitation into mechanical theorem provers. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. McKenney --- kernel/rcutree.c | 62 ++++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 44 insertions(+), 18 deletions(-) (limited to 'kernel/rcutree.c') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 89148860e2b..678905555ca 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -2249,8 +2249,8 @@ void synchronize_rcu_bh(void) } EXPORT_SYMBOL_GPL(synchronize_rcu_bh); -static atomic_t sync_sched_expedited_started = ATOMIC_INIT(0); -static atomic_t sync_sched_expedited_done = ATOMIC_INIT(0); +static atomic_long_t sync_sched_expedited_started = ATOMIC_LONG_INIT(0); +static atomic_long_t sync_sched_expedited_done = ATOMIC_LONG_INIT(0); static int synchronize_sched_expedited_cpu_stop(void *data) { @@ -2308,10 +2308,30 @@ static int synchronize_sched_expedited_cpu_stop(void *data) */ void synchronize_sched_expedited(void) { - int firstsnap, s, snap, trycount = 0; + long firstsnap, s, snap; + int trycount = 0; - /* Note that atomic_inc_return() implies full memory barrier. */ - firstsnap = snap = atomic_inc_return(&sync_sched_expedited_started); + /* + * If we are in danger of counter wrap, just do synchronize_sched(). + * By allowing sync_sched_expedited_started to advance no more than + * ULONG_MAX/8 ahead of sync_sched_expedited_done, we are ensuring + * that more than 3.5 billion CPUs would be required to force a + * counter wrap on a 32-bit system. Quite a few more CPUs would of + * course be required on a 64-bit system. + */ + if (ULONG_CMP_GE((ulong)atomic_read(&sync_sched_expedited_started), + (ulong)atomic_read(&sync_sched_expedited_done) + + ULONG_MAX / 8)) { + synchronize_sched(); + return; + } + + /* + * Take a ticket. Note that atomic_inc_return() implies a + * full memory barrier. 
+ */ + snap = atomic_long_inc_return(&sync_sched_expedited_started); + firstsnap = snap; get_online_cpus(); WARN_ON_ONCE(cpu_is_offline(raw_smp_processor_id())); @@ -2324,6 +2344,13 @@ void synchronize_sched_expedited(void) NULL) == -EAGAIN) { put_online_cpus(); + /* Check to see if someone else did our work for us. */ + s = atomic_long_read(&sync_sched_expedited_done); + if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) { + smp_mb(); /* ensure test happens before caller kfree */ + return; + } + /* No joy, try again later. Or just synchronize_sched(). */ if (trycount++ < 10) { udelay(trycount * num_online_cpus()); @@ -2332,23 +2359,22 @@ void synchronize_sched_expedited(void) return; } - /* Check to see if someone else did our work for us. */ - s = atomic_read(&sync_sched_expedited_done); - if (UINT_CMP_GE((unsigned)s, (unsigned)firstsnap)) { + /* Recheck to see if someone else did our work for us. */ + s = atomic_long_read(&sync_sched_expedited_done); + if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) { smp_mb(); /* ensure test happens before caller kfree */ return; } /* * Refetching sync_sched_expedited_started allows later - * callers to piggyback on our grace period. We subtract - * 1 to get the same token that the last incrementer got. - * We retry after they started, so our grace period works - * for them, and they started after our first try, so their - * grace period works for us. + * callers to piggyback on our grace period. We retry + * after they started, so our grace period works for them, + * and they started after our first try, so their grace + * period works for us. */ get_online_cpus(); - snap = atomic_read(&sync_sched_expedited_started); + snap = atomic_long_read(&sync_sched_expedited_started); smp_mb(); /* ensure read is before try_stop_cpus(). */ } @@ -2356,15 +2382,15 @@ void synchronize_sched_expedited(void) * Everyone up to our most recent fetch is covered by our grace * period. Update the counter, but only if our work is still * relevant -- which it won't be if someone who started later - * than we did beat us to the punch. + * than we did already did their update. */ do { - s = atomic_read(&sync_sched_expedited_done); - if (UINT_CMP_GE((unsigned)s, (unsigned)snap)) { + s = atomic_long_read(&sync_sched_expedited_done); + if (ULONG_CMP_GE((ulong)s, (ulong)snap)) { smp_mb(); /* ensure test happens before caller kfree */ break; } - } while (atomic_cmpxchg(&sync_sched_expedited_done, s, snap) != s); + } while (atomic_long_cmpxchg(&sync_sched_expedited_done, s, snap) != s); put_online_cpus(); } -- cgit v1.2.3-18-g5258 From 40694d6644d5cca28531707559466122eb212d8b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 11 Oct 2012 15:24:03 -0700 Subject: rcu: Move synchronize_sched_expedited() state to rcu_state Tracing (debugfs) of expedited RCU primitives is required, which in turn requires that the relevant data be located where the tracing code can find it, not in its current static global variables in kernel/rcutree.c. This commit therefore moves sync_sched_expedited_started and sync_sched_expedited_done to the rcu_state structure, as fields ->expedited_start and ->expedited_done, respectively. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. 
McKenney --- kernel/rcutree.c | 20 +++++++++----------- 1 file changed, 9 insertions(+), 11 deletions(-) (limited to 'kernel/rcutree.c') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 678905555ca..3c72e5e5528 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -2249,9 +2249,6 @@ void synchronize_rcu_bh(void) } EXPORT_SYMBOL_GPL(synchronize_rcu_bh); -static atomic_long_t sync_sched_expedited_started = ATOMIC_LONG_INIT(0); -static atomic_long_t sync_sched_expedited_done = ATOMIC_LONG_INIT(0); - static int synchronize_sched_expedited_cpu_stop(void *data) { /* @@ -2310,6 +2307,7 @@ void synchronize_sched_expedited(void) { long firstsnap, s, snap; int trycount = 0; + struct rcu_state *rsp = &rcu_sched_state; /* * If we are in danger of counter wrap, just do synchronize_sched(). @@ -2319,8 +2317,8 @@ void synchronize_sched_expedited(void) * counter wrap on a 32-bit system. Quite a few more CPUs would of * course be required on a 64-bit system. */ - if (ULONG_CMP_GE((ulong)atomic_read(&sync_sched_expedited_started), - (ulong)atomic_read(&sync_sched_expedited_done) + + if (ULONG_CMP_GE((ulong)atomic_long_read(&rsp->expedited_start), + (ulong)atomic_long_read(&rsp->expedited_done) + ULONG_MAX / 8)) { synchronize_sched(); return; @@ -2330,7 +2328,7 @@ void synchronize_sched_expedited(void) * Take a ticket. Note that atomic_inc_return() implies a * full memory barrier. */ - snap = atomic_long_inc_return(&sync_sched_expedited_started); + snap = atomic_long_inc_return(&rsp->expedited_start); firstsnap = snap; get_online_cpus(); WARN_ON_ONCE(cpu_is_offline(raw_smp_processor_id())); @@ -2345,7 +2343,7 @@ void synchronize_sched_expedited(void) put_online_cpus(); /* Check to see if someone else did our work for us. */ - s = atomic_long_read(&sync_sched_expedited_done); + s = atomic_long_read(&rsp->expedited_done); if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) { smp_mb(); /* ensure test happens before caller kfree */ return; @@ -2360,7 +2358,7 @@ void synchronize_sched_expedited(void) } /* Recheck to see if someone else did our work for us. */ - s = atomic_long_read(&sync_sched_expedited_done); + s = atomic_long_read(&rsp->expedited_done); if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) { smp_mb(); /* ensure test happens before caller kfree */ return; @@ -2374,7 +2372,7 @@ void synchronize_sched_expedited(void) * period works for us. */ get_online_cpus(); - snap = atomic_long_read(&sync_sched_expedited_started); + snap = atomic_long_read(&rsp->expedited_start); smp_mb(); /* ensure read is before try_stop_cpus(). */ } @@ -2385,12 +2383,12 @@ void synchronize_sched_expedited(void) * than we did already did their update. */ do { - s = atomic_long_read(&sync_sched_expedited_done); + s = atomic_long_read(&rsp->expedited_done); if (ULONG_CMP_GE((ulong)s, (ulong)snap)) { smp_mb(); /* ensure test happens before caller kfree */ break; } - } while (atomic_long_cmpxchg(&sync_sched_expedited_done, s, snap) != s); + } while (atomic_long_cmpxchg(&rsp->expedited_done, s, snap) != s); put_online_cpus(); } -- cgit v1.2.3-18-g5258 From a30489c5228fba6f16b4c740a0292879ef13371e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 11 Oct 2012 16:18:09 -0700 Subject: rcu: Instrument synchronize_rcu_expedited() for debugfs tracing This commit adds the counters to rcu_state and updates them in synchronize_rcu_expedited() to provide the data needed for debugfs tracing. Signed-off-by: Paul E. McKenney Signed-off-by: Paul E. 
McKenney --- kernel/rcutree.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) (limited to 'kernel/rcutree.c') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 3c72e5e5528..b966d56ebb5 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -2321,6 +2321,7 @@ void synchronize_sched_expedited(void) (ulong)atomic_long_read(&rsp->expedited_done) + ULONG_MAX / 8)) { synchronize_sched(); + atomic_long_inc(&rsp->expedited_wrap); return; } @@ -2341,11 +2342,14 @@ void synchronize_sched_expedited(void) synchronize_sched_expedited_cpu_stop, NULL) == -EAGAIN) { put_online_cpus(); + atomic_long_inc(&rsp->expedited_tryfail); /* Check to see if someone else did our work for us. */ s = atomic_long_read(&rsp->expedited_done); if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) { - smp_mb(); /* ensure test happens before caller kfree */ + /* ensure test happens before caller kfree */ + smp_mb__before_atomic_inc(); /* ^^^ */ + atomic_long_inc(&rsp->expedited_workdone1); return; } @@ -2354,13 +2358,16 @@ void synchronize_sched_expedited(void) udelay(trycount * num_online_cpus()); } else { synchronize_sched(); + atomic_long_inc(&rsp->expedited_normal); return; } /* Recheck to see if someone else did our work for us. */ s = atomic_long_read(&rsp->expedited_done); if (ULONG_CMP_GE((ulong)s, (ulong)firstsnap)) { - smp_mb(); /* ensure test happens before caller kfree */ + /* ensure test happens before caller kfree */ + smp_mb__before_atomic_inc(); /* ^^^ */ + atomic_long_inc(&rsp->expedited_workdone2); return; } @@ -2375,6 +2382,7 @@ void synchronize_sched_expedited(void) snap = atomic_long_read(&rsp->expedited_start); smp_mb(); /* ensure read is before try_stop_cpus(). */ } + atomic_long_inc(&rsp->expedited_stoppedcpus); /* * Everyone up to our most recent fetch is covered by our grace @@ -2383,12 +2391,16 @@ void synchronize_sched_expedited(void) * than we did already did their update. */ do { + atomic_long_inc(&rsp->expedited_done_tries); s = atomic_long_read(&rsp->expedited_done); if (ULONG_CMP_GE((ulong)s, (ulong)snap)) { - smp_mb(); /* ensure test happens before caller kfree */ + /* ensure test happens before caller kfree */ + smp_mb__before_atomic_inc(); /* ^^^ */ + atomic_long_inc(&rsp->expedited_done_lost); break; } } while (atomic_long_cmpxchg(&rsp->expedited_done, s, snap) != s); + atomic_long_inc(&rsp->expedited_done_exit); put_online_cpus(); } -- cgit v1.2.3-18-g5258 From 42c3533eee88e012e1aa3c4d6d2cc53354130e24 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 28 Sep 2012 10:49:58 -0700 Subject: rcu: Fix tracing formatting The rcu_state structure's ->completed field is unsigned long, so this commit adjusts show_one_rcugp()'s printf() format to suit. Also add the required ACCESS_ONCE() directives while we are in this function. Signed-off-by: Paul E. 
McKenney --- kernel/rcutree.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel/rcutree.c') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index b966d56ebb5..8ed9c481db0 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -68,8 +68,8 @@ static struct lock_class_key rcu_fqs_class[RCU_NUM_LVLS]; .level = { &sname##_state.node[0] }, \ .call = cr, \ .fqs_state = RCU_GP_IDLE, \ - .gpnum = -300, \ - .completed = -300, \ + .gpnum = 0UL - 300UL, \ + .completed = 0UL - 300UL, \ .orphan_lock = __RAW_SPIN_LOCK_UNLOCKED(&sname##_state.orphan_lock), \ .orphan_nxttail = &sname##_state.orphan_nxtlist, \ .orphan_donetail = &sname##_state.orphan_donelist, \ -- cgit v1.2.3-18-g5258 From 878d7439d0f45a95869e417576774673d1fa243f Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 18 Oct 2012 04:55:36 -0700 Subject: rcu: Fix batch-limit size problem Commit 29c00b4a1d9e27 (rcu: Add event-tracing for RCU callback invocation) added a regression in rcu_do_batch() Under stress, RCU is supposed to allow to process all items in queue, instead of a batch of 10 items (blimit), but an integer overflow makes the effective limit being 1. So, unless there is frequent idle periods (during which RCU ignores batch limits), RCU can be forced into a state where it cannot keep up with the callback-generation rate, eventually resulting in OOM. This commit therefore converts a few variables in rcu_do_batch() from int to long to fix this problem, along with the module parameters controlling the batch limits. Signed-off-by: Eric Dumazet Signed-off-by: Paul E. McKenney Cc: # 3.2 + --- kernel/rcutree.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) (limited to 'kernel/rcutree.c') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 93d6871bf7f..e4c2192b47c 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -212,13 +212,13 @@ DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = { #endif }; -static int blimit = 10; /* Maximum callbacks per rcu_do_batch. */ -static int qhimark = 10000; /* If this many pending, ignore blimit. */ -static int qlowmark = 100; /* Once only this many pending, use blimit. */ +static long blimit = 10; /* Maximum callbacks per rcu_do_batch. */ +static long qhimark = 10000; /* If this many pending, ignore blimit. */ +static long qlowmark = 100; /* Once only this many pending, use blimit. */ -module_param(blimit, int, 0444); -module_param(qhimark, int, 0444); -module_param(qlowmark, int, 0444); +module_param(blimit, long, 0444); +module_param(qhimark, long, 0444); +module_param(qlowmark, long, 0444); int rcu_cpu_stall_suppress __read_mostly; /* 1 = suppress stall warnings. */ int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT; @@ -1791,7 +1791,8 @@ static void rcu_do_batch(struct rcu_state *rsp, struct rcu_data *rdp) { unsigned long flags; struct rcu_head *next, *list, **tail; - int bl, count, count_lazy, i; + long bl, count, count_lazy; + int i; /* If no callbacks are ready, just return.*/ if (!cpu_has_callbacks_ready_to_invoke(rdp)) { -- cgit v1.2.3-18-g5258 From f0a0e6f282c72247e7c8ec17c68d528c1bb4d49e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 23 Oct 2012 13:47:01 -0700 Subject: rcu: Clarify memory-ordering properties of grace-period primitives This commit explicitly states the memory-ordering properties of the RCU grace-period primitives. 
Although these properties were in some sense implied by the fundmental property of RCU ("a grace period must wait for all pre-existing RCU read-side critical sections to complete"), stating it explicitly will be a great labor-saving device. Reported-by: Oleg Nesterov Signed-off-by: Paul E. McKenney Reviewed-by: Oleg Nesterov --- kernel/rcutree.c | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-) (limited to 'kernel/rcutree.c') diff --git a/kernel/rcutree.c b/kernel/rcutree.c index e4c2192b47c..15a2beec320 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -2228,10 +2228,28 @@ static inline int rcu_blocking_is_gp(void) * rcu_read_lock_sched(). * * This means that all preempt_disable code sequences, including NMI and - * hardware-interrupt handlers, in progress on entry will have completed - * before this primitive returns. However, this does not guarantee that - * softirq handlers will have completed, since in some kernels, these - * handlers can run in process context, and can block. + * non-threaded hardware-interrupt handlers, in progress on entry will + * have completed before this primitive returns. However, this does not + * guarantee that softirq handlers will have completed, since in some + * kernels, these handlers can run in process context, and can block. + * + * Note that this guarantee implies further memory-ordering guarantees. + * On systems with more than one CPU, when synchronize_sched() returns, + * each CPU is guaranteed to have executed a full memory barrier since the + * end of its last RCU-sched read-side critical section whose beginning + * preceded the call to synchronize_sched(). In addition, each CPU having + * an RCU read-side critical section that extends beyond the return from + * synchronize_sched() is guaranteed to have executed a full memory barrier + * after the beginning of synchronize_sched() and before the beginning of + * that RCU read-side critical section. Note that these guarantees include + * CPUs that are offline, idle, or executing in user mode, as well as CPUs + * that are executing in the kernel. + * + * Furthermore, if CPU A invoked synchronize_sched(), which returned + * to its caller on CPU B, then both CPU A and CPU B are guaranteed + * to have executed a full memory barrier during the execution of + * synchronize_sched() -- even if CPU A and CPU B are the same CPU (but + * again only if the system has more than one CPU). * * This primitive provides the guarantees made by the (now removed) * synchronize_kernel() API. In contrast, synchronize_rcu() only @@ -2259,6 +2277,9 @@ EXPORT_SYMBOL_GPL(synchronize_sched); * read-side critical sections have completed. RCU read-side critical * sections are delimited by rcu_read_lock_bh() and rcu_read_unlock_bh(), * and may be nested. + * + * See the description of synchronize_sched() for more detailed information + * on memory ordering guarantees. */ void synchronize_rcu_bh(void) { -- cgit v1.2.3-18-g5258
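The guarantee documented by this last patch is easiest to appreciate from the caller's side. The sketch below is hypothetical and not part of the series -- struct foo, foo_mutex, read_foo_a() and update_foo_a() are invented for illustration -- but it uses only long-standing RCU-sched APIs: because synchronize_sched() implies the full memory barriers described in the new comment, the updater needs no barriers of its own between publishing the new version and freeing the old one.

        /*
         * Hypothetical caller-side sketch (not part of the patches above)
         * showing what the documented guarantee buys an updater that uses
         * RCU-sched to protect a dynamically allocated structure.
         */
        #include <linux/errno.h>
        #include <linux/mutex.h>
        #include <linux/rcupdate.h>
        #include <linux/slab.h>

        struct foo {
                int a;
        };

        static struct foo __rcu *global_foo;
        static DEFINE_MUTEX(foo_mutex);         /* serializes updaters */

        /* Reader: runs inside an RCU-sched read-side critical section. */
        int read_foo_a(void)
        {
                struct foo *p;
                int ret = 0;

                rcu_read_lock_sched();
                p = rcu_dereference_sched(global_foo);
                if (p)
                        ret = p->a;
                rcu_read_unlock_sched();
                return ret;
        }

        /* Updater: publish a new version, wait, then free the old one. */
        int update_foo_a(int new_a)
        {
                struct foo *newp, *oldp;

                newp = kmalloc(sizeof(*newp), GFP_KERNEL);
                if (!newp)
                        return -ENOMEM;
                newp->a = new_a;

                mutex_lock(&foo_mutex);
                oldp = rcu_dereference_protected(global_foo,
                                                 lockdep_is_held(&foo_mutex));
                rcu_assign_pointer(global_foo, newp);
                mutex_unlock(&foo_mutex);

                /*
                 * Per the comment added above: every CPU has executed a full
                 * memory barrier since the end of any RCU-sched read-side
                 * critical section that began before this call, so no reader
                 * can still be using oldp once synchronize_sched() returns.
                 */
                synchronize_sched();
                kfree(oldp);
                return 0;
        }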
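The counter-wrap argument in synchronize_sched_expedited() (the 1924bcb0 and 40694d66 patches above) can also be checked in isolation. What follows is a self-contained user-space model rather than kernel code; the variables and helper are stand-ins for ->expedited_start and ->expedited_done, and ULONG_CMP_GE() is the kernel's wrap-safe modulo comparison. The point is that the "did someone else already do our grace period?" test keeps giving the right answer even when the counters wrap.

        /*
         * Self-contained user-space model (not kernel code) of the ticket
         * comparison used by synchronize_sched_expedited().  Only the
         * wrap-safe comparison is demonstrated here, not the concurrency.
         */
        #include <limits.h>
        #include <stdio.h>

        /* Wrap-safe "a >= b" for free-running unsigned long counters. */
        #define ULONG_CMP_GE(a, b)      (ULONG_MAX / 2 >= (a) - (b))

        static unsigned long expedited_start;   /* tickets handed out */
        static unsigned long expedited_done;    /* most recent completed ticket */

        /*
         * A caller holding ticket "snap" may return early if a later caller's
         * expedited grace period has already been recorded as done.
         */
        static int work_done_for_us(unsigned long snap)
        {
                return ULONG_CMP_GE(expedited_done, snap);
        }

        int main(void)
        {
                unsigned long snap;

                /* Normal case: take a ticket, someone else finishes for us. */
                snap = ++expedited_start;
                expedited_done = expedited_start;
                printf("piggyback works: %d\n", work_done_for_us(snap));

                /* The same test near the wrap point still gives the right answer. */
                expedited_start = ULONG_MAX - 1;
                snap = ++expedited_start;               /* snap == ULONG_MAX */
                expedited_done = expedited_start + 2;   /* wraps to 1 */
                printf("piggyback across wrap: %d\n", work_done_for_us(snap));
                return 0;
        }

The ULONG_MAX/8 headroom check added by the patch then ensures that the start counter can never run far enough ahead of the done counter for this comparison to become ambiguous.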