diff options
author | Ingo Molnar <mingo@kernel.org> | 2013-04-10 12:55:49 +0200 |
---|---|---|
committer | Ingo Molnar <mingo@kernel.org> | 2013-04-10 12:55:49 +0200 |
commit | 8fcfae31719c0a6c03f2cf63f815b46d378d8be4 (patch) | |
tree | 3da9d65885de6a2b046fbd5eebc0d19def0c1e2c | |
parent | d02a9a89db3437467de45a451739e520877f4a48 (diff) | |
parent | 6d87669357936bffa1e8fea7a4e7743e76905736 (diff) |
Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:
* Remove restrictions on no-CBs CPUs, make RCU_FAST_NO_HZ
take advantage of numbered callbacks, do additional callback
accelerations based on numbered callbacks. Posted to LKML
at https://lkml.org/lkml/2013/3/18/960.
* RCU documentation updates. Posted to LKML at
https://lkml.org/lkml/2013/3/18/570.
* Miscellaneous fixes. Posted to LKML at
https://lkml.org/lkml/2013/3/18/594.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-rw-r--r-- | Documentation/RCU/checklist.txt | 26 | ||||
-rw-r--r-- | Documentation/RCU/lockdep.txt | 5 | ||||
-rw-r--r-- | Documentation/RCU/rcubarrier.txt | 15 | ||||
-rw-r--r-- | Documentation/RCU/stallwarn.txt | 33 | ||||
-rw-r--r-- | Documentation/RCU/whatisRCU.txt | 4 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 35 | ||||
-rw-r--r-- | include/linux/list_bl.h | 5 | ||||
-rw-r--r-- | include/linux/rculist_bl.h | 2 | ||||
-rw-r--r-- | include/linux/rcupdate.h | 1 | ||||
-rw-r--r-- | include/trace/events/rcu.h | 55 | ||||
-rw-r--r-- | init/Kconfig | 73 | ||||
-rw-r--r-- | kernel/rcutree.c | 260 | ||||
-rw-r--r-- | kernel/rcutree.h | 41 | ||||
-rw-r--r-- | kernel/rcutree_plugin.h | 601 | ||||
-rw-r--r-- | kernel/rcutree_trace.c | 2 |
15 files changed, 640 insertions, 518 deletions
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index 31ef8fe07f8..79e789b8b8e 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt @@ -217,9 +217,14 @@ over a rather long period of time, but improvements are always welcome! whether the increased speed is worth it. 8. Although synchronize_rcu() is slower than is call_rcu(), it - usually results in simpler code. So, unless update performance - is critically important or the updaters cannot block, - synchronize_rcu() should be used in preference to call_rcu(). + usually results in simpler code. So, unless update performance is + critically important, the updaters cannot block, or the latency of + synchronize_rcu() is visible from userspace, synchronize_rcu() + should be used in preference to call_rcu(). Furthermore, + kfree_rcu() usually results in even simpler code than does + synchronize_rcu() without synchronize_rcu()'s multi-millisecond + latency. So please take advantage of kfree_rcu()'s "fire and + forget" memory-freeing capabilities where it applies. An especially important property of the synchronize_rcu() primitive is that it automatically self-limits: if grace periods @@ -268,7 +273,8 @@ over a rather long period of time, but improvements are always welcome! e. Periodically invoke synchronize_rcu(), permitting a limited number of updates per grace period. - The same cautions apply to call_rcu_bh() and call_rcu_sched(). + The same cautions apply to call_rcu_bh(), call_rcu_sched(), + call_srcu(), and kfree_rcu(). 9. All RCU list-traversal primitives, which include rcu_dereference(), list_for_each_entry_rcu(), and @@ -296,9 +302,9 @@ over a rather long period of time, but improvements are always welcome! all currently executing rcu_read_lock()-protected RCU read-side critical sections complete. It does -not- necessarily guarantee that all currently running interrupts, NMIs, preempt_disable() - code, or idle loops will complete. Therefore, if you do not have - rcu_read_lock()-protected read-side critical sections, do -not- - use synchronize_rcu(). + code, or idle loops will complete. Therefore, if your + read-side critical sections are protected by something other + than rcu_read_lock(), do -not- use synchronize_rcu(). Similarly, disabling preemption is not an acceptable substitute for rcu_read_lock(). Code that attempts to use preemption @@ -401,9 +407,9 @@ over a rather long period of time, but improvements are always welcome! read-side critical sections. It is the responsibility of the RCU update-side primitives to deal with this. -17. Use CONFIG_PROVE_RCU, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and - the __rcu sparse checks to validate your RCU code. These - can help find problems as follows: +17. Use CONFIG_PROVE_RCU, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the + __rcu sparse checks (enabled by CONFIG_SPARSE_RCU_POINTER) to + validate your RCU code. These can help find problems as follows: CONFIG_PROVE_RCU: check that accesses to RCU-protected data structures are carried out under the proper RCU diff --git a/Documentation/RCU/lockdep.txt b/Documentation/RCU/lockdep.txt index a102d4b3724..cd83d2348fe 100644 --- a/Documentation/RCU/lockdep.txt +++ b/Documentation/RCU/lockdep.txt @@ -64,6 +64,11 @@ checking of rcu_dereference() primitives: but retain the compiler constraints that prevent duplicating or coalescsing. This is useful when when testing the value of the pointer itself, for example, against NULL. + rcu_access_index(idx): + Return the value of the index and omit all barriers, but + retain the compiler constraints that prevent duplicating + or coalescsing. This is useful when when testing the + value of the index itself, for example, against -1. The rcu_dereference_check() check expression can be any boolean expression, but would normally include a lockdep expression. However, diff --git a/Documentation/RCU/rcubarrier.txt b/Documentation/RCU/rcubarrier.txt index 38428c12513..2e319d1b9ef 100644 --- a/Documentation/RCU/rcubarrier.txt +++ b/Documentation/RCU/rcubarrier.txt @@ -79,7 +79,20 @@ complete. Pseudo-code using rcu_barrier() is as follows: 2. Execute rcu_barrier(). 3. Allow the module to be unloaded. -The rcutorture module makes use of rcu_barrier in its exit function +There are also rcu_barrier_bh(), rcu_barrier_sched(), and srcu_barrier() +functions for the other flavors of RCU, and you of course must match +the flavor of rcu_barrier() with that of call_rcu(). If your module +uses multiple flavors of call_rcu(), then it must also use multiple +flavors of rcu_barrier() when unloading that module. For example, if +it uses call_rcu_bh(), call_srcu() on srcu_struct_1, and call_srcu() on +srcu_struct_2(), then the following three lines of code will be required +when unloading: + + 1 rcu_barrier_bh(); + 2 srcu_barrier(&srcu_struct_1); + 3 srcu_barrier(&srcu_struct_2); + +The rcutorture module makes use of rcu_barrier() in its exit function as follows: 1 static void diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt index 1927151b386..e38b8df3d72 100644 --- a/Documentation/RCU/stallwarn.txt +++ b/Documentation/RCU/stallwarn.txt @@ -92,14 +92,14 @@ If the CONFIG_RCU_CPU_STALL_INFO kernel configuration parameter is set, more information is printed with the stall-warning message, for example: INFO: rcu_preempt detected stall on CPU - 0: (63959 ticks this GP) idle=241/3fffffffffffffff/0 + 0: (63959 ticks this GP) idle=241/3fffffffffffffff/0 softirq=82/543 (t=65000 jiffies) In kernels with CONFIG_RCU_FAST_NO_HZ, even more information is printed: INFO: rcu_preempt detected stall on CPU - 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 drain=0 . timer not pending + 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 nonlazy_posted: 25 .D (t=65000 jiffies) The "(64628 ticks this GP)" indicates that this CPU has taken more @@ -116,13 +116,28 @@ number between the two "/"s is the value of the nesting, which will be a small positive number if in the idle loop and a very large positive number (as shown above) otherwise. -For CONFIG_RCU_FAST_NO_HZ kernels, the "drain=0" indicates that the CPU is -not in the process of trying to force itself into dyntick-idle state, the -"." indicates that the CPU has not given up forcing RCU into dyntick-idle -mode (it would be "H" otherwise), and the "timer not pending" indicates -that the CPU has not recently forced RCU into dyntick-idle mode (it -would otherwise indicate the number of microseconds remaining in this -forced state). +The "softirq=" portion of the message tracks the number of RCU softirq +handlers that the stalled CPU has executed. The number before the "/" +is the number that had executed since boot at the time that this CPU +last noted the beginning of a grace period, which might be the current +(stalled) grace period, or it might be some earlier grace period (for +example, if the CPU might have been in dyntick-idle mode for an extended +time period. The number after the "/" is the number that have executed +since boot until the current time. If this latter number stays constant +across repeated stall-warning messages, it is possible that RCU's softirq +handlers are no longer able to execute on this CPU. This can happen if +the stalled CPU is spinning with interrupts are disabled, or, in -rt +kernels, if a high-priority process is starving RCU's softirq handler. + +For CONFIG_RCU_FAST_NO_HZ kernels, the "last_accelerate:" prints the +low-order 16 bits (in hex) of the jiffies counter when this CPU last +invoked rcu_try_advance_all_cbs() from rcu_needs_cpu() or last invoked +rcu_accelerate_cbs() from rcu_prepare_for_idle(). The "nonlazy_posted:" +prints the number of non-lazy callbacks posted since the last call to +rcu_needs_cpu(). Finally, an "L" indicates that there are currently +no non-lazy callbacks ("." is printed otherwise, as shown above) and +"D" indicates that dyntick-idle processing is enabled ("." is printed +otherwise, for example, if disabled via the "nohz=" kernel boot parameter). Multiple Warnings From One Stall diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 0cc7820967f..10df0b82f45 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -265,9 +265,9 @@ rcu_dereference() rcu_read_lock(); p = rcu_dereference(head.next); rcu_read_unlock(); - x = p->address; + x = p->address; /* BUG!!! */ rcu_read_lock(); - y = p->data; + y = p->data; /* BUG!!! */ rcu_read_unlock(); Holding a reference from one RCU read-side critical section diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 4609e81dbc3..22303b2e74b 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2461,9 +2461,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. In kernels built with CONFIG_RCU_NOCB_CPU=y, set the specified list of CPUs to be no-callback CPUs. Invocation of these CPUs' RCU callbacks will - be offloaded to "rcuoN" kthreads created for - that purpose. This reduces OS jitter on the + be offloaded to "rcuox/N" kthreads created for + that purpose, where "x" is "b" for RCU-bh, "p" + for RCU-preempt, and "s" for RCU-sched, and "N" + is the CPU number. This reduces OS jitter on the offloaded CPUs, which can be useful for HPC and + real-time workloads. It can also improve energy efficiency for asymmetric multiprocessors. @@ -2487,6 +2490,17 @@ bytes respectively. Such letter suffixes can also be entirely omitted. leaf rcu_node structure. Useful for very large systems. + rcutree.jiffies_till_first_fqs= [KNL,BOOT] + Set delay from grace-period initialization to + first attempt to force quiescent states. + Units are jiffies, minimum value is zero, + and maximum value is HZ. + + rcutree.jiffies_till_next_fqs= [KNL,BOOT] + Set delay between subsequent attempts to force + quiescent states. Units are jiffies, minimum + value is one, and maximum value is HZ. + rcutree.qhimark= [KNL,BOOT] Set threshold of queued RCU callbacks over which batch limiting is disabled. @@ -2501,16 +2515,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted. rcutree.rcu_cpu_stall_timeout= [KNL,BOOT] Set timeout for RCU CPU stall warning messages. - rcutree.jiffies_till_first_fqs= [KNL,BOOT] - Set delay from grace-period initialization to - first attempt to force quiescent states. - Units are jiffies, minimum value is zero, - and maximum value is HZ. + rcutree.rcu_idle_gp_delay= [KNL,BOOT] + Set wakeup interval for idle CPUs that have + RCU callbacks (RCU_FAST_NO_HZ=y). - rcutree.jiffies_till_next_fqs= [KNL,BOOT] - Set delay between subsequent attempts to force - quiescent states. Units are jiffies, minimum - value is one, and maximum value is HZ. + rcutree.rcu_idle_lazy_gp_delay= [KNL,BOOT] + Set wakeup interval for idle CPUs that have + only "lazy" RCU callbacks (RCU_FAST_NO_HZ=y). + Lazy RCU callbacks are those which RCU can + prove do nothing more than free memory. rcutorture.fqs_duration= [KNL,BOOT] Set duration of force_quiescent_state bursts. diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h index 31f9d75adc5..2eb88556c5c 100644 --- a/include/linux/list_bl.h +++ b/include/linux/list_bl.h @@ -125,6 +125,11 @@ static inline void hlist_bl_unlock(struct hlist_bl_head *b) __bit_spin_unlock(0, (unsigned long *)b); } +static inline bool hlist_bl_is_locked(struct hlist_bl_head *b) +{ + return bit_spin_is_locked(0, (unsigned long *)b); +} + /** * hlist_bl_for_each_entry - iterate over list of given type * @tpos: the type * to use as a loop cursor. diff --git a/include/linux/rculist_bl.h b/include/linux/rculist_bl.h index cf1244fbf3b..4f216c59e7d 100644 --- a/include/linux/rculist_bl.h +++ b/include/linux/rculist_bl.h @@ -20,7 +20,7 @@ static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h, static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h) { return (struct hlist_bl_node *) - ((unsigned long)rcu_dereference(h->first) & ~LIST_BL_LOCKMASK); + ((unsigned long)rcu_dereference_check(h->first, hlist_bl_is_locked(h)) & ~LIST_BL_LOCKMASK); } /** diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index b758ce17b30..9ed2c9a4de4 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -80,6 +80,7 @@ extern void do_trace_rcu_torture_read(char *rcutorturename, #define UINT_CMP_LT(a, b) (UINT_MAX / 2 < (a) - (b)) #define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b)) #define ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b)) +#define ulong2long(a) (*(long *)(&(a))) /* Exported common interfaces */ diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index 1918e832da4..59ebcc89f14 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -72,6 +72,58 @@ TRACE_EVENT(rcu_grace_period, ); /* + * Tracepoint for future grace-period events, including those for no-callbacks + * CPUs. The caller should pull the data from the rcu_node structure, + * other than rcuname, which comes from the rcu_state structure, and event, + * which is one of the following: + * + * "Startleaf": Request a nocb grace period based on leaf-node data. + * "Startedleaf": Leaf-node start proved sufficient. + * "Startedleafroot": Leaf-node start proved sufficient after checking root. + * "Startedroot": Requested a nocb grace period based on root-node data. + * "StartWait": Start waiting for the requested grace period. + * "ResumeWait": Resume waiting after signal. + * "EndWait": Complete wait. + * "Cleanup": Clean up rcu_node structure after previous GP. + * "CleanupMore": Clean up, and another no-CB GP is needed. + */ +TRACE_EVENT(rcu_future_grace_period, + + TP_PROTO(char *rcuname, unsigned long gpnum, unsigned long completed, + unsigned long c, u8 level, int grplo, int grphi, + char *gpevent), + + TP_ARGS(rcuname, gpnum, completed, c, level, grplo, grphi, gpevent), + + TP_STRUCT__entry( + __field(char *, rcuname) + __field(unsigned long, gpnum) + __field(unsigned long, completed) + __field(unsigned long, c) + __field(u8, level) + __field(int, grplo) + __field(int, grphi) + __field(char *, gpevent) + ), + + TP_fast_assign( + __entry->rcuname = rcuname; + __entry->gpnum = gpnum; + __entry->completed = completed; + __entry->c = c; + __entry->level = level; + __entry->grplo = grplo; + __entry->grphi = grphi; + __entry->gpevent = gpevent; + ), + + TP_printk("%s %lu %lu %lu %u %d %d %s", + __entry->rcuname, __entry->gpnum, __entry->completed, + __entry->c, __entry->level, __entry->grplo, __entry->grphi, + __entry->gpevent) +); + +/* * Tracepoint for grace-period-initialization events. These are * distinguished by the type of RCU, the new grace-period number, the * rcu_node structure level, the starting and ending CPU covered by the @@ -601,6 +653,9 @@ TRACE_EVENT(rcu_barrier, #define trace_rcu_grace_period(rcuname, gpnum, gpevent) do { } while (0) #define trace_rcu_grace_period_init(rcuname, gpnum, level, grplo, grphi, \ qsmask) do { } while (0) +#define trace_rcu_future_grace_period(rcuname, gpnum, completed, c, \ + level, grplo, grphi, event) \ + do { } while (0) #define trace_rcu_preempt_task(rcuname, pid, gpnum) do { } while (0) #define trace_rcu_unlock_preempted_task(rcuname, gpnum, pid) do { } while (0) #define trace_rcu_quiescent_state_report(rcuname, gpnum, mask, qsmask, level, \ diff --git a/init/Kconfig b/init/Kconfig index 5341d7232c3..71bb9e73011 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -578,13 +578,16 @@ config RCU_FAST_NO_HZ depends on NO_HZ && SMP default n help - This option causes RCU to attempt to accelerate grace periods in - order to allow CPUs to enter dynticks-idle state more quickly. - On the other hand, this option increases the overhead of the - dynticks-idle checking, thus degrading scheduling latency. + This option permits CPUs to enter dynticks-idle state even if + they have RCU callbacks queued, and prevents RCU from waking + these CPUs up more than roughly once every four jiffies (by + default, you can adjust this using the rcutree.rcu_idle_gp_delay + parameter), thus improving energy efficiency. On the other + hand, this option increases the duration of RCU grace periods, + for example, slowing down synchronize_rcu(). - Say Y if energy efficiency is critically important, and you don't - care about real-time response. + Say Y if energy efficiency is critically important, and you + don't care about increased grace-period durations. Say N if you are unsure. @@ -651,7 +654,7 @@ config RCU_BOOST_DELAY Accept the default if unsure. config RCU_NOCB_CPU - bool "Offload RCU callback processing from boot-selected CPUs" + bool "Offload RCU callback processing from boot-selected CPUs (EXPERIMENTAL" depends on TREE_RCU || TREE_PREEMPT_RCU default n help @@ -662,16 +665,56 @@ config RCU_NOCB_CPU This option offloads callback invocation from the set of CPUs specified at boot time by the rcu_nocbs parameter. - For each such CPU, a kthread ("rcuoN") will be created to - invoke callbacks, where the "N" is the CPU being offloaded. - Nothing prevents this kthread from running on the specified - CPUs, but (1) the kthreads may be preempted between each - callback, and (2) affinity or cgroups can be used to force - the kthreads to run on whatever set of CPUs is desired. - - Say Y here if you want reduced OS jitter on selected CPUs. + For each such CPU, a kthread ("rcuox/N") will be created to + invoke callbacks, where the "N" is the CPU being offloaded, + and where the "x" is "b" for RCU-bh, "p" for RCU-preempt, and + "s" for RCU-sched. Nothing prevents this kthread from running + on the specified CPUs, but (1) the kthreads may be preempted + between each callback, and (2) affinity or cgroups can be used + to force the kthreads to run on whatever set of CPUs is desired. + + Say Y here if you want to help to debug reduced OS jitter. Say N here if you are unsure. +choice + prompt "Build-forced no-CBs CPUs" + default RCU_NOCB_CPU_NONE + help + This option allows no-CBs CPUs to be specified at build time. + Additional no-CBs CPUs may be specified by the rcu_nocbs= + boot parameter. + +config RCU_NOCB_CPU_NONE + bool "No build_forced no-CBs CPUs" + depends on RCU_NOCB_CPU + help + This option does not force any of the CPUs to be no-CBs CPUs. + Only CPUs designated by the rcu_nocbs= boot parameter will be + no-CBs CPUs. + +config RCU_NOCB_CPU_ZERO + bool "CPU 0 is a build_forced no-CBs CPU" + depends on RCU_NOCB_CPU + help + This option forces CPU 0 to be a no-CBs CPU. Additional CPUs + may be designated as no-CBs CPUs using the rcu_nocbs= boot + parameter will be no-CBs CPUs. + + Select this if CPU 0 needs to be a no-CBs CPU for real-time + or energy-efficiency reasons. + +config RCU_NOCB_CPU_ALL + bool "All CPUs are build_forced no-CBs CPUs" + depends on RCU_NOCB_CPU + help + This option forces all CPUs to be no-CBs CPUs. The rcu_nocbs= + boot parameter will be ignored. + + Select this if all CPUs need to be no-CBs CPUs for real-time + or energy-efficiency reasons. + +endchoice + endmenu # "RCU Subsystem" config IKCONFIG diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 5b8ad827fd8..2d5f94c1c7f 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -64,7 +64,7 @@ static struct lock_class_key rcu_node_class[RCU_NUM_LVLS]; static struct lock_class_key rcu_fqs_class[RCU_NUM_LVLS]; -#define RCU_STATE_INITIALIZER(sname, cr) { \ +#define RCU_STATE_INITIALIZER(sname, sabbr, cr) { \ .level = { &sname##_state.node[0] }, \ .call = cr, \ .fqs_state = RCU_GP_IDLE, \ @@ -76,13 +76,14 @@ static struct lock_class_key rcu_fqs_class[RCU_NUM_LVLS]; .barrier_mutex = __MUTEX_INITIALIZER(sname##_state.barrier_mutex), \ .onoff_mutex = __MUTEX_INITIALIZER(sname##_state.onoff_mutex), \ .name = #sname, \ + .abbr = sabbr, \ } struct rcu_state rcu_sched_state = - RCU_STATE_INITIALIZER(rcu_sched, call_rcu_sched); + RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched); DEFINE_PER_CPU(struct rcu_data, rcu_sched_data); -struct rcu_state rcu_bh_state = RCU_STATE_INITIALIZER(rcu_bh, call_rcu_bh); +struct rcu_state rcu_bh_state = RCU_STATE_INITIALIZER(rcu_bh, 'b', call_rcu_bh); DEFINE_PER_CPU(struct rcu_data, rcu_bh_data); static struct rcu_state *rcu_state; @@ -223,6 +224,8 @@ static ulong jiffies_till_next_fqs = RCU_JIFFIES_TILL_FORCE_QS; module_param(jiffies_till_first_fqs, ulong, 0644); module_param(jiffies_till_next_fqs, ulong, 0644); +static void rcu_start_gp_advanced(struct rcu_state *rsp, struct rcu_node *rnp, + struct rcu_data *rdp); static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *)); static void force_quiescent_state(struct rcu_state *rsp); static int rcu_pending(int cpu); @@ -310,6 +313,8 @@ cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp) if (rcu_gp_in_progress(rsp)) return 0; /* No, a grace period is already in progress. */ + if (rcu_nocb_needs_gp(rsp)) + return 1; /* Yes, a no-CBs CPU needs one. */ if (!rdp->nxttail[RCU_NEXT_TAIL]) return 0; /* No, this is a no-CBs (or offline) CPU. */ if (*rdp->nxttail[RCU_NEXT_READY_TAIL]) @@ -1035,10 +1040,11 @@ static void init_callback_list(struct rcu_data *rdp) { int i; + if (init_nocb_callback_list(rdp)) + return; rdp->nxtlist = NULL; for (i = 0; i < RCU_NEXT_SIZE; i++) rdp->nxttail[i] = &rdp->nxtlist; - init_nocb_callback_list(rdp); } /* @@ -1071,6 +1077,120 @@ static unsigned long rcu_cbs_completed(struct rcu_state *rsp, } /* + * Trace-event helper function for rcu_start_future_gp() and + * rcu_nocb_wait_gp(). + */ +static void trace_rcu_future_gp(struct rcu_node *rnp, struct rcu_data *rdp, + unsigned long c, char *s) +{ + trace_rcu_future_grace_period(rdp->rsp->name, rnp->gpnum, + rnp->completed, c, rnp->level, + rnp->grplo, rnp->grphi, s); +} + +/* + * Start some future grace period, as needed to handle newly arrived + * callbacks. The required future grace periods are recorded in each + * rcu_node structure's ->need_future_gp field. + * + * The caller must hold the specified rcu_node structure's ->lock. + */ +static unsigned long __maybe_unused +rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp) +{ + unsigned long c; + int i; + struct rcu_node *rnp_root = rcu_get_root(rdp->rsp); + + /* + * Pick up grace-period number for new callbacks. If this + * grace period is already marked as needed, return to the caller. + */ + c = rcu_cbs_completed(rdp->rsp, rnp); + trace_rcu_future_gp(rnp, rdp, c, "Startleaf"); + if (rnp->need_future_gp[c & 0x1]) { + trace_rcu_future_gp(rnp, rdp, c, "Prestartleaf"); + return c; + } + + /* + * If either this rcu_node structure or the root rcu_node structure + * believe that a grace period is in progress, then we must wait + * for the one following, which is in "c". Because our request + * will be noticed at the end of the current grace period, we don't + * need to explicitly start one. + */ + if (rnp->gpnum != rnp->completed || + ACCESS_ONCE(rnp->gpnum) != ACCESS_ONCE(rnp->completed)) { + rnp->need_future_gp[c & 0x1]++; + trace_rcu_future_gp(rnp, rdp, c, "Startedleaf"); + return c; + } + + /* + * There might be no grace period in progress. If we don't already + * hold it, acquire the root rcu_node structure's lock in order to + * start one (if needed). + */ + if (rnp != rnp_root) + raw_spin_lock(&rnp_root->lock); + + /* + * Get a new grace-period number. If there really is no grace + * period in progress, it will be smaller than the one we obtained + * earlier. Adjust callbacks as needed. Note that even no-CBs + * CPUs have a ->nxtcompleted[] array, so no no-CBs checks needed. + */ + c = rcu_cbs_completed(rdp->rsp, rnp_root); + for (i = RCU_DONE_TAIL; i < RCU_NEXT_TAIL; i++) + if (ULONG_CMP_LT(c, rdp->nxtcompleted[i])) + rdp->nxtcompleted[i] = c; + + /* + * If the needed for the required grace period is already + * recorded, trace and leave. + */ + if (rnp_root->need_future_gp[c & 0x1]) { + trace_rcu_future_gp(rnp, rdp, c, "Prestartedroot"); + goto unlock_out; + } + + /* Record the need for the future grace period. */ + rnp_root->need_future_gp[c & 0x1]++; + + /* If a grace period is not already in progress, start one. */ + if (rnp_root->gpnum != rnp_root->completed) { + trace_rcu_future_gp(rnp, rdp, c, "Startedleafroot"); + } else { + trace_rcu_future_gp(rnp, rdp, c, "Startedroot"); + rcu_start_gp_advanced(rdp->rsp, rnp_root, rdp); + } +unlock_out: + if (rnp != rnp_root) + raw_spin_unlock(&rnp_root->lock); + return c; +} + +/* + * Clean up any old requests for the just-ended grace period. Also return + * whether any additional grace periods have been requested. Also invoke + * rcu_nocb_gp_cleanup() in order to wake up any no-callbacks kthreads + * waiting for this grace period to complete. + */ +static int rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp) +{ + int c = rnp->completed; + int needmore; + struct rcu_data *rdp = this_cpu_ptr(rsp->rda); + + rcu_nocb_gp_cleanup(rsp, rnp); + rnp->need_future_gp[c & 0x1] = 0; + needmore = rnp->need_future_gp[(c + 1) & 0x1]; + trace_rcu_future_gp(rnp, rdp, c, needmore ? "CleanupMore" : "Cleanup"); + return needmore; +} + +/* * If there is room, assign a ->completed number to any callbacks on * this CPU that have not already been assigned. Also accelerate any * callbacks that were previously assigned a ->completed number that has @@ -1129,6 +1249,8 @@ static void rcu_accelerate_cbs(struct rcu_state *rsp, struct rcu_node *rnp, rdp->nxttail[i] = rdp->nxttail[RCU_NEXT_TAIL]; rdp->nxtcompleted[i] = c; } + /* Record any needed additional grace periods. */ + rcu_start_future_gp(rnp, rdp); /* Trace depending on how much we were able to accelerate. */ if (!*rdp->nxttail[RCU_WAIT_TAIL]) @@ -1308,9 +1430,9 @@ static int rcu_gp_init(struct rcu_state *rsp) rdp = this_cpu_ptr(rsp->rda); rcu_preempt_check_blocked_tasks(rnp); rnp->qsmask = rnp->qsmaskinit; - rnp->gpnum = rsp->gpnum; + ACCESS_ONCE(rnp->gpnum) = rsp->gpnum; WARN_ON_ONCE(rnp->completed != rsp->completed); - rnp->completed = rsp->completed; + ACCESS_ONCE(rnp->completed) = rsp->completed; if (rnp == rdp->mynode) rcu_start_gp_per_cpu(rsp, rnp, rdp); rcu_preempt_boost_start_gp(rnp); @@ -1319,7 +1441,8 @@ static int rcu_gp_init(struct rcu_state *rsp) rnp->grphi, rnp->qsmask); raw_spin_unlock_irq(&rnp->lock); #ifdef CONFIG_PROVE_RCU_DELAY - if ((random32() % (rcu_num_nodes * 8)) == 0) + if ((random32() % (rcu_num_nodes * 8)) == 0 && + system_state == SYSTEM_RUNNING) schedule_timeout_uninterruptible(2); #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */ cond_resched(); @@ -1361,6 +1484,7 @@ int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in) static void rcu_gp_cleanup(struct rcu_state *rsp) { unsigned long gp_duration; + int nocb = 0; struct rcu_data *rdp; struct rcu_node *rnp = rcu_get_root(rsp); @@ -1390,17 +1514,23 @@ static void rcu_gp_cleanup(struct rcu_state *rsp) */ rcu_for_each_node_breadth_first(rsp, rnp) { raw_spin_lock_irq(&rnp->lock); - rnp->completed = rsp->gpnum; + ACCESS_ONCE(rnp->completed) = rsp->gpnum; + rdp = this_cpu_ptr(rsp->rda); + if (rnp == rdp->mynode) + __rcu_process_gp_end(rsp, rnp, rdp); + nocb += rcu_future_gp_cleanup(rsp, rnp); raw_spin_unlock_irq(&rnp->lock); cond_resched(); } rnp = rcu_get_root(rsp); raw_spin_lock_irq(&rnp->lock); + rcu_nocb_gp_set(rnp, nocb); rsp->completed = rsp->gpnum; /* Declare grace period done. */ trace_rcu_grace_period(rsp->name, rsp->completed, "end"); rsp->fqs_state = RCU_GP_IDLE; rdp = this_cpu_ptr(rsp->rda); + rcu_advance_cbs(rsp, rnp, rdp); /* Reduce false positives below. */ if (cpu_needs_another_gp(rsp, rdp)) rsp->gp_flags = 1; raw_spin_unlock_irq(&rnp->lock); @@ -1476,57 +1606,62 @@ static int __noreturn rcu_gp_kthread(void *arg) /* * Start a new RCU grace period if warranted, re-initializing the hierarchy * in preparation for detecting the next grace period. The caller must hold - * the root node's ->lock, which is released before return. Hard irqs must - * be disabled. + * the root node's ->lock and hard irqs must be disabled. * * Note that it is legal for a dying CPU (which is marked as offline) to * invoke this function. This can happen when the dying CPU reports its * quiescent state. */ static void -rcu_start_gp(struct rcu_state *rsp, unsigned long flags) - __releases(rcu_get_root(rsp)->lock) +rcu_start_gp_advanced(struct rcu_state *rsp, struct rcu_node *rnp, + struct rcu_data *rdp) { - struct rcu_data *rdp = this_cpu_ptr(rsp->rda); - struct rcu_node *rnp = rcu_get_root(rsp); - - if (!rsp->gp_kthread || - !cpu_needs_another_gp(rsp, rdp)) { + if (!rsp->gp_kthread || !cpu_needs_another_gp(rsp, rdp)) { /* * Either we have not yet spawned the grace-period * task, this CPU does not need another grace period, * or a grace period is already in progress. * Either way, don't start a new grace period. */ - raw_spin_unlock_irqrestore(&rnp->lock, flags); return; } - - /* - * Because there is no grace period in progress right now, - * any callbacks we have up to this point will be satisfied - * by the next grace period. So this is a good place to - * assign a grace period number to recently posted callbacks. - */ - rcu_accelerate_cbs(rsp, rnp, rdp); - rsp->gp_flags = RCU_GP_FLAG_INIT; - raw_spin_unlock(&rnp->lock); /* Interrupts remain disabled. */ - - /* Ensure that CPU is aware of completion of last grace period. */ - rcu_process_gp_end(rsp, rdp); - local_irq_restore(flags); /* Wake up rcu_gp_kthread() to start the grace period. */ wake_up(&rsp->gp_wq); } /* + * Similar to rcu_start_gp_advanced(), but also advance the calling CPU's + * callbacks. Note that rcu_start_gp_advanced() cannot do this because it + * is invoked indirectly from rcu_advance_cbs(), which would result in + * endless recursion -- or would do so if it wasn't for the self-deadlock + * that is encountered beforehand. + */ +static void +rcu_start_gp(struct rcu_state *rsp) +{ + struct rcu_data *rdp = this_cpu_ptr(rsp->rda); + struct rcu_node *rnp = rcu_get_root(rsp); + + /* + * If there is no grace period in progress right now, any + * callbacks we have up to this point will be satisfied by the + * next grace period. Also, advancing the callbacks reduces the + * probability of false positives from cpu_needs_another_gp() + * resulting in pointless grace periods. So, advance callbacks + * then start the grace period! + */ + rcu_advance_cbs(rsp, rnp, rdp); + rcu_start_gp_advanced(rsp, rnp, rdp); +} + +/* * Report a full set of quiescent states to the specified rcu_state * data structure. This involves cleaning up after the prior grace * period and letting rcu_start_gp() start up the next grace period - * if one is needed. Note that the caller must hold rnp->lock, as - * required by rcu_start_gp(), which will release it. + * if one is needed. Note that the caller must hold rnp->lock, which + * is released before return. */ static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags) __releases(rcu_get_root(rsp)->lock) @@ -2124,7 +2259,8 @@ __rcu_process_callbacks(struct rcu_state *rsp) local_irq_save(flags); if (cpu_needs_another_gp(rsp, rdp)) { raw_spin_lock(&rcu_get_root(rsp)->lock); /* irqs disabled. */ - rcu_start_gp(rsp, flags); /* releases above lock */ + rcu_start_gp(rsp); + raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags); } else { local_irq_restore(flags); } @@ -2169,7 +2305,8 @@ static void invoke_rcu_callbacks(struct rcu_state *rsp, struct rcu_data *rdp) static void invoke_rcu_core(void) { - raise_softirq(RCU_SOFTIRQ); + if (cpu_online(smp_processor_id())) + raise_softirq(RCU_SOFTIRQ); } /* @@ -2204,11 +2341,11 @@ static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp, /* Start a new grace period if one not already started. */ if (!rcu_gp_in_progress(rsp)) { - unsigned long nestflag; struct rcu_node *rnp_root = rcu_get_root(rsp); - raw_spin_lock_irqsave(&rnp_root->lock, nestflag); - rcu_start_gp(rsp, nestflag); /* rlses rnp_root->lock */ + raw_spin_lock(&rnp_root->lock); + rcu_start_gp(rsp); + raw_spin_unlock(&rnp_root->lock); } else { /* Give the grace period a kick. */ rdp->blimit = LONG_MAX; @@ -2628,19 +2765,27 @@ static int rcu_pending(int cpu) } /* - * Check to see if |