Age | Commit message (Collapse) | Author |
|
commit ff451961a8b2a17667a7bfa39c86fb9b351445db upstream.
Commit a695cb58162 "tracing: Prevent deleting instances when they are being read"
tried to fix a race between deleting a trace instance and reading contents
of a trace file. But it wasn't good enough. The following could crash the kernel:
# cd /sys/kernel/debug/tracing/instances
# ( while :; do mkdir foo; rmdir foo; done ) &
# ( while :; do cat foo/trace &> /dev/null; done ) &
Luckily this can only be done by root user, but it should be fixed regardless.
The problem is that a delete of the file can happen after the reader starts
to open the file but before it grabs the trace_types_mutex.
The solution is to validate the trace array before using it. If the trace
array does not exist in the list of trace arrays, then it returns -ENODEV.
There's a possibility that a trace_array could be deleted and a new one
created and the open would open its file instead. But that is very minor as
it will just return the data of the new trace array, it may confuse the user
but it will not crash the system. As this can only be done by root anyway,
the race will only occur if root is deleting what its trying to read at
the same time.
Reported-by: Alexander Lam <azl@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a82274151af2b075163e3c42c828529dee311487 upstream.
There are multiple places where the ftrace_trace_arrays list is accessed in
trace_events.c without the trace_types_lock held.
Link: http://lkml.kernel.org/r/1372732674-22726-1-git-send-email-azl@google.com
Signed-off-by: Alexander Z Lam <azl@google.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2d71619c59fac95a5415a326162fa046161b938c upstream.
The trace_marker file was present for each new instance created, but it
added the trace mark to the global trace buffer instead of to
the instance's buffer.
Link: http://lkml.kernel.org/r/1372717885-4543-2-git-send-email-azl@google.com
Signed-off-by: Alexander Z Lam <azl@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Alexander Z Lam <lambchop468@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 11034ae9c20f4057a6127fc965906417978e69b2 upstream.
All syscall tracing irqs-off tags are wrong, the syscall enter entry doesn't
disable irqs.
[root@jovi tracing]#echo "syscalls:sys_enter_open" > set_event
[root@jovi tracing]# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 13/13 #P:2
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
irqbalance-513 [000] d... 56115.496766: sys_open(filename: 804e1a6, flags: 0, mode: 1b6)
irqbalance-513 [000] d... 56115.497008: sys_open(filename: 804e1bb, flags: 0, mode: 1b6)
sendmail-771 [000] d... 56115.827982: sys_open(filename: b770e6d1, flags: 0, mode: 1b6)
The reason is syscall tracing doesn't record irq_flags into buffer.
The proper display is:
[root@jovi tracing]#echo "syscalls:sys_enter_open" > set_event
[root@jovi tracing]# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 14/14 #P:2
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
irqbalance-514 [001] .... 46.213921: sys_open(filename: 804e1a6, flags: 0, mode: 1b6)
irqbalance-514 [001] .... 46.214160: sys_open(filename: 804e1bb, flags: 0, mode: 1b6)
<...>-920 [001] .... 47.307260: sys_open(filename: 4e82a0c5, flags: 80000, mode: 0)
Link: http://lkml.kernel.org/r/1365564393-10972-3-git-send-email-jovi.zhangwei@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6e94a780374ed31b280f939d4757e8d7858dff16 upstream.
Running the following:
# cd /sys/kernel/debug/tracing
# echo p:i do_sys_open > kprobe_events
# echo p:j schedule >> kprobe_events
# cat kprobe_events
p:kprobes/i do_sys_open
p:kprobes/j schedule
# echo p:i do_sys_open >> kprobe_events
# cat kprobe_events
p:kprobes/j schedule
p:kprobes/i do_sys_open
# ls /sys/kernel/debug/tracing/events/kprobes/
enable filter j
Notice that the 'i' is missing from the kprobes directory.
The console produces:
"Failed to create system directory kprobes"
This is because kprobes passes in a allocated name for the system
and the ftrace event subsystem saves off that name instead of creating
a duplicate for it. But the kprobes may free the system name making
the pointer to it invalid.
This bug was introduced by 92edca073c37 "tracing: Use direct field, type
and system names" which switched from using kstrdup() on the system name
in favor of just keeping apointer to it, as the internal ftrace event
system names are static and exist for the life of the computer being booted.
Instead of reverting back to duplicating system names again, we can use
core_kernel_data() to determine if the passed in name was allocated or
static. Then use the MSB of the ref_count to be a flag to keep track if
the name was allocated or not. Then we can still save from having to duplicate
strings that will always exist, but still copy the ones that may be freed.
Reported-by: "zhangwei(Jovi)" <jovi.zhangwei@huawei.com>
Reported-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 058ebd0eba3aff16b144eabf4510ed9510e1416e upstream.
Jiri managed to trigger this warning:
[] ======================================================
[] [ INFO: possible circular locking dependency detected ]
[] 3.10.0+ #228 Tainted: G W
[] -------------------------------------------------------
[] p/6613 is trying to acquire lock:
[] (rcu_node_0){..-...}, at: [<ffffffff810ca797>] rcu_read_unlock_special+0xa7/0x250
[]
[] but task is already holding lock:
[] (&ctx->lock){-.-...}, at: [<ffffffff810f2879>] perf_lock_task_context+0xd9/0x2c0
[]
[] which lock already depends on the new lock.
[]
[] the existing dependency chain (in reverse order) is:
[]
[] -> #4 (&ctx->lock){-.-...}:
[] -> #3 (&rq->lock){-.-.-.}:
[] -> #2 (&p->pi_lock){-.-.-.}:
[] -> #1 (&rnp->nocb_gp_wq[1]){......}:
[] -> #0 (rcu_node_0){..-...}:
Paul was quick to explain that due to preemptible RCU we cannot call
rcu_read_unlock() while holding scheduler (or nested) locks when part
of the read side critical section was preemptible.
Therefore solve it by making the entire RCU read side non-preemptible.
Also pull out the retry from under the non-preempt to play nice with RT.
Reported-by: Jiri Olsa <jolsa@redhat.com>
Helped-out-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 06f417968beac6e6b614e17b37d347aa6a6b1d30 upstream.
The '!ctx->is_active' check has a valid scenario, so
there's no need for the warning.
The reason is that there's a time window between the
'ctx->is_active' check in the perf_event_enable() function
and the __perf_event_enable() function having:
- IRQs on
- ctx->lock unlocked
where the task could be killed and 'ctx' deactivated by
perf_event_exit_task(), ending up with the warning below.
So remove the WARN_ON_ONCE() check and add comments to
explain it all.
This addresses the following warning reported by Vince Weaver:
[ 324.983534] ------------[ cut here ]------------
[ 324.984420] WARNING: at kernel/events/core.c:1953 __perf_event_enable+0x187/0x190()
[ 324.984420] Modules linked in:
[ 324.984420] CPU: 19 PID: 2715 Comm: nmi_bug_snb Not tainted 3.10.0+ #246
[ 324.984420] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010
[ 324.984420] 0000000000000009 ffff88043fce3ec8 ffffffff8160ea0b ffff88043fce3f00
[ 324.984420] ffffffff81080ff0 ffff8802314fdc00 ffff880231a8f800 ffff88043fcf7860
[ 324.984420] 0000000000000286 ffff880231a8f800 ffff88043fce3f10 ffffffff8108103a
[ 324.984420] Call Trace:
[ 324.984420] <IRQ> [<ffffffff8160ea0b>] dump_stack+0x19/0x1b
[ 324.984420] [<ffffffff81080ff0>] warn_slowpath_common+0x70/0xa0
[ 324.984420] [<ffffffff8108103a>] warn_slowpath_null+0x1a/0x20
[ 324.984420] [<ffffffff81134437>] __perf_event_enable+0x187/0x190
[ 324.984420] [<ffffffff81130030>] remote_function+0x40/0x50
[ 324.984420] [<ffffffff810e51de>] generic_smp_call_function_single_interrupt+0xbe/0x130
[ 324.984420] [<ffffffff81066a47>] smp_call_function_single_interrupt+0x27/0x40
[ 324.984420] [<ffffffff8161fd2f>] call_function_single_interrupt+0x6f/0x80
[ 324.984420] <EOI> [<ffffffff816161a1>] ? _raw_spin_unlock_irqrestore+0x41/0x70
[ 324.984420] [<ffffffff8113799d>] perf_event_exit_task+0x14d/0x210
[ 324.984420] [<ffffffff810acd04>] ? switch_task_namespaces+0x24/0x60
[ 324.984420] [<ffffffff81086946>] do_exit+0x2b6/0xa40
[ 324.984420] [<ffffffff8161615c>] ? _raw_spin_unlock_irq+0x2c/0x30
[ 324.984420] [<ffffffff81087279>] do_group_exit+0x49/0xc0
[ 324.984420] [<ffffffff81096854>] get_signal_to_deliver+0x254/0x620
[ 324.984420] [<ffffffff81043057>] do_signal+0x57/0x5a0
[ 324.984420] [<ffffffff8161a164>] ? __do_page_fault+0x2a4/0x4e0
[ 324.984420] [<ffffffff8161665c>] ? retint_restore_args+0xe/0xe
[ 324.984420] [<ffffffff816166cd>] ? retint_signal+0x11/0x84
[ 324.984420] [<ffffffff81043605>] do_notify_resume+0x65/0x80
[ 324.984420] [<ffffffff81616702>] retint_signal+0x46/0x84
[ 324.984420] ---[ end trace 442ec2f04db3771a ]---
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1373384651-6109-2-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 734df5ab549ca44f40de0f07af1c8803856dfb18 upstream.
Currently when the child context for inherited events is
created, it's based on the pmu object of the first event
of the parent context.
This is wrong for the following scenario:
- HW context having HW and SW event
- HW event got removed (closed)
- SW event stays in HW context as the only event
and its pmu is used to clone the child context
The issue starts when the cpu context object is touched
based on the pmu context object (__get_cpu_context). In
this case the HW context will work with SW cpu context
ending up with following WARN below.
Fixing this by using parent context pmu object to clone
from child context.
Addresses the following warning reported by Vince Weaver:
[ 2716.472065] ------------[ cut here ]------------
[ 2716.476035] WARNING: at kernel/events/core.c:2122 task_ctx_sched_out+0x3c/0x)
[ 2716.476035] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs locn
[ 2716.476035] CPU: 0 PID: 3164 Comm: perf_fuzzer Not tainted 3.10.0-rc4 #2
[ 2716.476035] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BI2
[ 2716.476035] 0000000000000000 ffffffff8102e215 0000000000000000 ffff88011fc18
[ 2716.476035] ffff8801175557f0 0000000000000000 ffff880119fda88c ffffffff810ad
[ 2716.476035] ffff880119fda880 ffffffff810af02a 0000000000000009 ffff880117550
[ 2716.476035] Call Trace:
[ 2716.476035] [<ffffffff8102e215>] ? warn_slowpath_common+0x5b/0x70
[ 2716.476035] [<ffffffff810ab2bd>] ? task_ctx_sched_out+0x3c/0x5f
[ 2716.476035] [<ffffffff810af02a>] ? perf_event_exit_task+0xbf/0x194
[ 2716.476035] [<ffffffff81032a37>] ? do_exit+0x3e7/0x90c
[ 2716.476035] [<ffffffff810cd5ab>] ? __do_fault+0x359/0x394
[ 2716.476035] [<ffffffff81032fe6>] ? do_group_exit+0x66/0x98
[ 2716.476035] [<ffffffff8103dbcd>] ? get_signal_to_deliver+0x479/0x4ad
[ 2716.476035] [<ffffffff810ac05c>] ? __perf_event_task_sched_out+0x230/0x2d1
[ 2716.476035] [<ffffffff8100205d>] ? do_signal+0x3c/0x432
[ 2716.476035] [<ffffffff810abbf9>] ? ctx_sched_in+0x43/0x141
[ 2716.476035] [<ffffffff810ac2ca>] ? perf_event_context_sched_in+0x7a/0x90
[ 2716.476035] [<ffffffff810ac311>] ? __perf_event_task_sched_in+0x31/0x118
[ 2716.476035] [<ffffffff81050dd9>] ? mmdrop+0xd/0x1c
[ 2716.476035] [<ffffffff81051a39>] ? finish_task_switch+0x7d/0xa6
[ 2716.476035] [<ffffffff81002473>] ? do_notify_resume+0x20/0x5d
[ 2716.476035] [<ffffffff813654f5>] ? retint_signal+0x3d/0x78
[ 2716.476035] ---[ end trace 827178d8a5966c3d ]---
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1373384651-6109-1-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fa44063f9ef163c3a4c8d8c0465bb8a056b42035 upstream.
When wrong argument is passed into uprobe_events it does not return
an error:
[root@jovi tracing]# echo 'p:myprobe /bin/bash' > uprobe_events
[root@jovi tracing]#
The proper response is:
[root@jovi tracing]# echo 'p:myprobe /bin/bash' > uprobe_events
-bash: echo: write error: Invalid argument
Link: http://lkml.kernel.org/r/51B964FF.5000106@huawei.com
Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <srikar@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dbda92d16f8655044e082930e4e9d244b87fde77 upstream.
commit 07354eb1a74d1 ("locking printk: Annotate logbuf_lock as raw")
reintroduced a lock inversion problem which was fixed in commit
0b5e1c5255 ("printk: Release console_sem after logbuf_lock"). This
happened probably when fixing up patch rejects.
Restore the ordering and unlock logbuf_lock before releasing
console_sem.
Signed-off-by: ybu <ybu@qti.qualcomm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/E807E903FE6CBE4D95E420FBFCC273B827413C@nasanexd01h.na.qualcomm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1f73a9806bdd07a5106409bbcab3884078bd34fe upstream.
When the system switches from periodic to oneshot mode, the broadcast
logic causes a possibility that a CPU which has not yet switched to
oneshot mode puts its own clock event device into oneshot mode without
updating the state and the timer handler.
CPU0 CPU1
per cpu tickdev is in periodic mode
and switched to broadcast
Switch to oneshot mode
tick_broadcast_switch_to_oneshot()
cpumask_copy(tick_oneshot_broacast_mask,
tick_broadcast_mask);
broadcast device mode = oneshot
Timer interrupt
irq_enter()
tick_check_oneshot_broadcast()
dev->set_mode(ONESHOT);
tick_handle_periodic()
if (dev->mode == ONESHOT)
dev->next_event += period;
FAIL.
We fail, because dev->next_event contains KTIME_MAX, if the device was
in periodic mode before the uncontrolled switch to oneshot happened.
We must copy the broadcast bits over to the oneshot mask, because
otherwise a CPU which relies on the broadcast would not been woken up
anymore after the broadcast device switched to oneshot mode.
So we need to verify in tick_check_oneshot_broadcast() whether the CPU
has already switched to oneshot mode. If not, leave the device
untouched and let the CPU switch controlled into oneshot mode.
This is a long standing bug, which was never noticed, because the main
user of the broadcast x86 cannot run into that scenario, AFAICT. The
nonarchitected timer mess of ARM creates a gazillion of differently
broken abominations which trigger the shortcomings of that broadcast
code, which better had never been necessary in the first place.
Reported-and-tested-by: Stehle Vincent-B46079 <B46079@freescale.com>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: John Stultz <john.stultz@linaro.org>,
Cc: Mark Rutland <mark.rutland@arm.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1307012153060.4013@ionos.tec.linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 07bd1172902e782f288e4d44b1fde7dec0f08b6f upstream.
The recent implementation of a generic dummy timer resulted in a
different registration order of per cpu local timers which made the
broadcast control logic go belly up.
If the dummy timer is the first clock event device which is registered
for a CPU, then it is installed, the broadcast timer is initialized
and the CPU is marked as broadcast target.
If a real clock event device is installed after that, we can fail to
take the CPU out of the broadcast mask. In the worst case we end up
with two periodic timer events firing for the same CPU. One from the
per cpu hardware device and one from the broadcast.
Now the problem is that we have no way to distinguish whether the
system is in a state which makes broadcasting necessary or the
broadcast bit was set due to the nonfunctional dummy timer
installment.
To solve this we need to keep track of the system state seperately and
provide a more detailed decision logic whether we keep the CPU in
broadcast mode or not.
The old decision logic only clears the broadcast mode, if the newly
installed clock event device is not affected by power states.
The new logic clears the broadcast mode if one of the following is
true:
- The new device is not affected by power states.
- The system is not in a power state affected mode
- The system has switched to oneshot mode. The oneshot broadcast is
controlled from the deep idle state. The CPU is not in idle at
this point, so it's safe to remove it from the mask.
If we clear the broadcast bit for the CPU when a new device is
installed, we also shutdown the broadcast device when this was the
last CPU in the broadcast mask.
If the broadcast bit is kept, then we leave the new device in shutdown
state and rely on the broadcast to deliver the timer interrupts via
the broadcast ipis.
Reported-and-tested-by: Stehle Vincent-B46079 <B46079@freescale.com>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: John Stultz <john.stultz@linaro.org>,
Cc: Mark Rutland <mark.rutland@arm.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1307012153060.4013@ionos.tec.linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9e04d3804d3ac97d8c03a41d78d0f0674b5d01e1 upstream.
Direct compare of jiffies related values does not work in the wrap
around case. Replace it with time_is_after_jiffies().
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/519BC066.5080600@acm.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2779db8d37d4b542d9ca2575f5f178dbeaca6c86 upstream.
Commit 02725e7471b8 ('genirq: Use irq_get/put functions'),
inadvertently changed can_request_irq() to return 0 for IRQs that have
no action. This causes pcibios_lookup_irq() to select only IRQs that
already have an action with IRQF_SHARED set, or to fail if there are
none. Change can_request_irq() to return 1 for IRQs that have no
action (if the first two conditions are met).
Reported-by: Bjarni Ingi Gislason <bjarniig@rhi.hi.is>
Tested-by: Bjarni Ingi Gislason <bjarniig@rhi.hi.is> (against 3.2)
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: 709647@bugs.debian.org
Link: http://bugs.debian.org/709647
Link: http://lkml.kernel.org/r/1372383630.23847.40.camel@deadeye.wl.decadent.org.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1c8158eeae0f37d0eee9f1fbe68080df6a408df2 upstream.
commit 5db9a4d99b0157a513944e9a44d29c9cec2e91dc
Author: Tejun Heo <tj@kernel.org>
Date: Sat Jul 7 16:08:18 2012 -0700
cgroup: fix cgroup hierarchy umount race
This commit fixed a race caused by the dput() in css_dput_fn(), but
the dput() in cgroup_event_remove() can also lead to the same BUG().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 13d60f4b6ab5b702dc8d2ee20999f98a93728aec upstream.
The futex_keys of process shared futexes are generated from the page
offset, the mapping host and the mapping index of the futex user space
address. This should result in an unique identifier for each futex.
Though this is not true when futexes are located in different subpages
of an hugepage. The reason is, that the mapping index for all those
futexes evaluates to the index of the base page of the hugetlbfs
mapping. So a futex at offset 0 of the hugepage mapping and another
one at offset PAGE_SIZE of the same hugepage mapping have identical
futex_keys. This happens because the futex code blindly uses
page->index.
Steps to reproduce the bug:
1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
mapping.
The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
their keys solely depend on the user space address.
2. Lock mutex1 and mutex2
3. Create thread1 and in the thread function lock mutex1, which
results in thread1 blocking on the locked mutex1.
4. Create thread2 and in the thread function lock mutex2, which
results in thread2 blocking on the locked mutex2.
5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
still blocks on mutex2 because the futex_key points to mutex1.
To solve this issue we need to take the normal page index of the page
which contains the futex into account, if the futex is in an hugetlbfs
mapping. In other words, we calculate the normal page mapping index of
the subpage in the hugetlbfs mapping.
Mappings which are not based on hugetlbfs are not affected and still
use page->index.
Thanks to Mel Gorman who provided a patch for adding proper evaluation
functions to the hugetlbfs code to avoid exposing hugetlbfs specific
details to the futex code.
[ tglx: Massaged changelog ]
Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Tested-by: Ma Chenggong <ma.chenggong@zte.com.cn>
Reviewed-by: 'Mel Gorman' <mgorman@suse.de>
Acked-by: 'Darren Hart' <dvhart@linux.intel.com>
Cc: 'Peter Zijlstra' <peterz@infradead.org>
Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8d8022e8aba85192e937f1f0f7450e256d66ae5c upstream.
v3.8-rc1-5-g1fb9341 was supposed to stop parallel kvm loads exhausting
percpu memory on large machines:
Now we have a new state MODULE_STATE_UNFORMED, we can insert the
module into the list (and thus guarantee its uniqueness) before we
allocate the per-cpu region.
In my defence, it didn't actually say the patch did this. Just that
we "can".
This patch actually *does* it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Tested-by: Jim Hull <jim.hull@hp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This __put_user() could be used by unprivileged processes to write into
kernel memory. The issue here is that even if copy_siginfo_to_user()
fails, the error code is not checked before __put_user() is executed.
Luckily, ptrace_peek_siginfo() has been added within the 3.10-rc cycle,
so it has not hit a stable release yet.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"Correct an ordering issue in the tick broadcast code. I really wish
we'd get compensation for pain and suffering for each line of code we
write to work around dysfunctional timer hardware."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick: Fix tick_broadcast_pending_mask not cleared
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Three small fixlets"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hw_breakpoint: Use cpu_possible_mask in {reserve,release}_bp_slot()
hw_breakpoint: Fix cpu check in task_bp_pinned(cpu)
kprobes: Fix arch_prepare_kprobe to handle copy insn failures
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:
"This series fixes a couple of build failures, and fixes MTRR cleanup
and memory setup on very specific memory maps.
Finally, it fixes triggering backtraces on all CPUs, which was
inadvertently disabled on x86."
* 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/efi: Fix dummy variable buffer allocation
x86: Fix trigger_all_cpu_backtrace() implementation
x86: Fix section mismatch on load_ucode_ap
x86: fix build error and kconfig for ia32_emulation and binfmt
range: Do not add new blank slot with add_range_with_merge
x86, mtrr: Fix original mtrr range get for mtrr_cleanup
|
|
The recent modification in the cpuidle framework consolidated the
timer broadcast code across the different drivers by setting a new
flag in the idle state. It tells the cpuidle core code to enter/exit
the broadcast mode for the cpu when entering a deep idle state. The
broadcast timer enter/exit is no longer handled by the back-end
driver.
This change made the local interrupt to be enabled *before* calling
CLOCK_EVENT_NOTIFY_EXIT.
On a tegra114, a four cores system, when the flag has been introduced
in the driver, the following warning appeared:
WARNING: at kernel/time/tick-broadcast.c:578 tick_broadcast_oneshot_control
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.10.0-rc3-next-20130529+ #15
[<c00667f8>] (tick_broadcast_oneshot_control+0x1a4/0x1d0) from [<c0065cd0>] (tick_notify+0x240/0x40c)
[<c0065cd0>] (tick_notify+0x240/0x40c) from [<c0044724>] (notifier_call_chain+0x44/0x84)
[<c0044724>] (notifier_call_chain+0x44/0x84) from [<c0044828>] (raw_notifier_call_chain+0x18/0x20)
[<c0044828>] (raw_notifier_call_chain+0x18/0x20) from [<c00650cc>] (clockevents_notify+0x28/0x170)
[<c00650cc>] (clockevents_notify+0x28/0x170) from [<c033f1f0>] (cpuidle_idle_call+0x11c/0x168)
[<c033f1f0>] (cpuidle_idle_call+0x11c/0x168) from [<c000ea94>] (arch_cpu_idle+0x8/0x38)
[<c000ea94>] (arch_cpu_idle+0x8/0x38) from [<c005ea80>] (cpu_startup_entry+0x60/0x134)
[<c005ea80>] (cpu_startup_entry+0x60/0x134) from [<804fe9a4>] (0x804fe9a4)
I don't have the hardware, so I wasn't able to reproduce the warning
but after looking a while at the code, I deduced the following:
1. the CPU2 enters a deep idle state and sets the broadcast timer
2. the timer expires, the tick_handle_oneshot_broadcast function is
called, setting the tick_broadcast_pending_mask and waking up the
idle cpu CPU2
3. the CPU2 exits idle handles the interrupt and then invokes
tick_broadcast_oneshot_control with CLOCK_EVENT_NOTIFY_EXIT which
runs the following code:
[...]
if (dev->next_event.tv64 == KTIME_MAX)
goto out;
if (cpumask_test_and_clear_cpu(cpu,
tick_broadcast_pending_mask))
goto out;
[...]
So if there is no next event scheduled for CPU2, we fulfil the
first condition and jump out without clearing the
tick_broadcast_pending_mask.
4. CPU2 goes to deep idle again and calls
tick_broadcast_oneshot_control with CLOCK_NOTIFY_EVENT_ENTER but
with the tick_broadcast_pending_mask set for CPU2, triggering the
warning.
The issue only surfaced due to the modifications of the cpuidle
framework, which resulted in interrupts being enabled before the call
to the clockevents code. If the call happens before interrupts have
been enabled, the warning cannot trigger, because there is still the
event pending which caused the broadcast timer expiry.
Move the check for the next event below the check for the pending bit,
so the pending bit gets cleared whether an event is scheduled on the
cpu or not.
[ tglx: Massaged changelog ]
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reported-and-tested-by: Joseph Lo <josephl@nvidia.com>
Cc: Stephen Warren <swarren@nvidia.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linaro-kernel@lists.linaro.org
Link: http://lkml.kernel.org/r/1371485735-31249-1-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Two smaller fixes - plus a context tracking tracing fix that is a bit
bigger"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tracing/context-tracking: Add preempt_schedule_context() for tracing
sched: Fix clear NOHZ_BALANCE_KICK
sched/x86: Construct all sibling maps if smt
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Four fixes. The mmap ones are unfortunately larger than desired -
fuzzing uncovered bugs that needed perf context life time management
changes to fix properly"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix broken PEBS-LL support on SNB-EP/IVB-EP
perf: Fix mmap() accounting hole
perf: Fix perf mmap bugs
kprobes: Fix to free gone and unused optprobes
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull cpu idle fixes from Thomas Gleixner:
- Add a missing irq enable. Fallout of the idle conversion
- Fix stackprotector wreckage caused by the idle conversion
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
idle: Enable interrupts in the weak arch_cpu_idle() implementation
idle: Add the stack canary init to cpu_startup_entry()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
- Fix inconstinant clock usage in virtual time accounting
- Fix a build error in KVM caused by the NOHZ work
- Remove a pointless timekeeping duty assignment which breaks NOHZ
- Use a proper notifier return value to avoid random behaviour
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick: Remove useless timekeeping duty attribution to broadcast source
nohz: Fix notifier return val that enforce timekeeping
kvm: Move guest entry/exit APIs to context_tracking
vtime: Use consistent clocks among nohz accounting
|
|
fetch_bp_busy_slots() and toggle_bp_slot() use
for_each_online_cpu(), this is obviously wrong wrt cpu_up() or
cpu_down(), we can over/under account the per-cpu numbers.
For example:
# echo 0 >> /sys/devices/system/cpu/cpu1/online
# perf record -e mem:0x10 -p 1 &
# echo 1 >> /sys/devices/system/cpu/cpu1/online
# perf record -e mem:0x10,mem:0x10,mem:0x10,mem:0x10 -C1 -a &
# taskset -p 0x2 1
triggers the same WARN_ONCE("Can't find any breakpoint slot") in
arch_install_hw_breakpoint().
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20130620155009.GA6327@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
trinity fuzzer triggered WARN_ONCE("Can't find any breakpoint
slot") in arch_install_hw_breakpoint() but the problem is not
arch-specific.
The problem is, task_bp_pinned(cpu) checks "cpu == iter->cpu"
but this doesn't account the "all cpus" events with iter->cpu <
0.
This means that, say, register_user_hw_breakpoint(tsk) can
happily create the arbitrary number > HBP_NUM of breakpoints
which can not be activated. toggle_bp_task_slot() is equally
wrong by the same reason and nr_task_bp_pinned[] can have
negative entries.
Simple test:
# perl -e 'sleep 1 while 1' &
# perf record -e mem:0x10,mem:0x10,mem:0x10,mem:0x10,mem:0x10 -p `pidof perl`
Before this patch this triggers the same problem/WARN_ON(),
after the patch it correctly fails with -ENOSPC.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20130620155006.GA6324@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Dave Jones hit the following bug report:
===============================
[ INFO: suspicious RCU usage. ]
3.10.0-rc2+ #1 Not tainted
-------------------------------
include/linux/rcupdate.h:771 rcu_read_lock() used illegally while idle!
other info that might help us debug this:
RCU used illegally from idle CPU! rcu_scheduler_active = 1, debug_locks = 0
RCU used illegally from extended quiescent state!
2 locks held by cc1/63645:
#0: (&rq->lock){-.-.-.}, at: [<ffffffff816b39fd>] __schedule+0xed/0x9b0
#1: (rcu_read_lock){.+.+..}, at: [<ffffffff8109d645>] cpuacct_charge+0x5/0x1f0
CPU: 1 PID: 63645 Comm: cc1 Not tainted 3.10.0-rc2+ #1 [loadavg: 40.57 27.55 13.39 25/277 64369]
Hardware name: Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H, BIOS F12a 04/23/2010
0000000000000000 ffff88010f78fcf8 ffffffff816ae383 ffff88010f78fd28
ffffffff810b698d ffff88011c092548 000000000023d073 ffff88011c092500
0000000000000001 ffff88010f78fd60 ffffffff8109d7c5 ffffffff8109d645
Call Trace:
[<ffffffff816ae383>] dump_stack+0x19/0x1b
[<ffffffff810b698d>] lockdep_rcu_suspicious+0xfd/0x130
[<ffffffff8109d7c5>] cpuacct_charge+0x185/0x1f0
[<ffffffff8109d645>] ? cpuacct_charge+0x5/0x1f0
[<ffffffff8108dffc>] update_curr+0xec/0x240
[<ffffffff8108f528>] put_prev_task_fair+0x228/0x480
[<ffffffff816b3a71>] __schedule+0x161/0x9b0
[<ffffffff816b4721>] preempt_schedule+0x51/0x80
[<ffffffff816b4800>] ? __cond_resched_softirq+0x60/0x60
[<ffffffff816b6824>] ? retint_careful+0x12/0x2e
[<ffffffff810ff3cc>] ftrace_ops_control_func+0x1dc/0x210
[<ffffffff816be280>] ftrace_call+0x5/0x2f
[<ffffffff816b681d>] ? retint_careful+0xb/0x2e
[<ffffffff816b4805>] ? schedule_user+0x5/0x70
[<ffffffff816b4805>] ? schedule_user+0x5/0x70
[<ffffffff816b6824>] ? retint_careful+0x12/0x2e
------------[ cut here ]------------
What happened was that the function tracer traced the schedule_user() code
that tells RCU that the system is coming back from userspace, and to
add the CPU back to the RCU monitoring.
Because the function tracer does a preempt_disable/enable_notrace() calls
the preempt_enable_notrace() checks the NEED_RESCHED flag. If it is set,
then preempt_schedule() is called. But this is called before the user_exit()
function can inform the kernel that the CPU is no longer in user mode and
needs to be accounted for by RCU.
The fix is to create a new preempt_schedule_context() that checks if
the kernel is still in user mode and if so to switch it to kernel mode
before calling schedule. It also switches back to user mode coming back
from schedule in need be.
The only user of this currently is the preempt_enable_notrace(), which is
only used by the tracing subsystem.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1369423420.6828.226.camel@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
I have faced a sequence where the Idle Load Balance was sometime not
triggered for a while on my platform, in the following scenario:
CPU 0 and CPU 1 are running tasks and CPU 2 is idle
CPU 1 kicks the Idle Load Balance
CPU 1 selects CPU 2 as the new Idle Load Balancer
CPU 2 sets NOHZ_BALANCE_KICK for CPU 2
CPU 2 sends a reschedule IPI to CPU 2
While CPU 3 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2
CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
task A quickly goes back to sleep (before a tick occurs on CPU 2)
CPU 2 goes back to idle with NOHZ_BALANCE_KICK set
Whenever CPU 2 will be selected as the ILB, no reschedule IPI will be sent
because NOHZ_BALANCE_KICK is already set and no Idle Load Balance will be
performed.
We must wait for the sched softirq to be raised on CPU 2 thanks to another
part the kernel to come back to clear NOHZ_BALANCE_KICK.
The proposed solution clears NOHZ_BALANCE_KICK in schedule_ipi if
we can't raise the sched_softirq for the Idle Load Balance.
Change since V1:
- move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
can't run on this CPU (as suggested by Peter)
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1370419991-13870-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Vince's fuzzer once again found holes. This time it spotted a leak in
the locked page accounting.
When an event had redirected output and its close() was the last
reference to the buffer we didn't have a vm context to undo accounting.
Change the code to destroy the buffer on the last munmap() and detach
all redirected events at that time. This provides us the right context
to undo the vm accounting.
Reported-and-tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130604084421.GI8923@twins.programming.kicks-ass.net
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Joshua reported: Commit cd7b304dfaf1 (x86, range: fix missing merge
during add range) broke mtrr cleanup on his setup in 3.9.5.
corresponding commit in upstream is fbe06b7bae7c.
The reason is add_range_with_merge could generate blank spot.
We could avoid that by searching new expanded start/end, that
new range should include all connected ranges in range array.
At last add the new expanded start/end to the range array.
Also move up left array so do not add new blank slot in the
range array.
-v2: move left array to avoid enhance add_range()
-v3: include fix from Joshua about memmove declaring when
DYN_DEBUG is used.
Reported-by: Joshua Covington <joshuacov@googlemail.com>
Tested-by: Joshua Covington <joshuacov@googlemail.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1371154622-8929-3-git-send-email-yinghai@kernel.org
Cc: <stable@vger.kernel.org> v3.9
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fixes from Al Viro:
"Several fixes + obvious cleanup (you've missed a couple of open-coded
can_lookup() back then)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
snd_pcm_link(): fix a leak...
use can_lookup() instead of direct checks of ->i_op->lookup
move exit_task_namespaces() outside of exit_notify()
fput: task_work_add() can fail if the caller has passed exit_task_work()
ncpfs: fix rmdir returns Device or resource busy
|
|
exit_notify() does exit_task_namespaces() after
forget_original_parent(). This was needed to ensure that ->nsproxy
can't be cleared prematurely, an exiting child we are going to
reparent can do do_notify_parent() and use the parent's (ours) pid_ns.
However, after 32084504 "pidns: use task_active_pid_ns in
do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
on task_active_pid_ns().
Move exit_task_namespaces() from exit_notify() to do_exit(), after
exit_fs() and before exit_task_work().
This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
does fput() which needs task_work_add().
Note: this particular problem can be fixed if we change fput(), and
that change makes sense anyway. But there is another reason to move
the callsite. The original reason for exit_task_namespaces() from
the middle of exit_notify() was subtle and it has already gone away,
now this looks confusing. And this allows us do simplify exit_notify(),
we can avoid unlock/lock(tasklist) and we can use ->exit_state instead
of PF_EXITING in forget_original_parent().
Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
PARISC bootup triggers the warning at kernel/cpu/idle.c:96. That's
caused by the weak arch_cpu_idle() implementation, which is provided
to avoid that architectures implement idle_poll over and over.
The switchover to polling mode happens in the first call of the weak
arch_cpu_idle() implementation, but that code fails to reenable
interrupts and therefor triggers the warning.
Fix this by enabling interrupts in the weak arch_cpu_idle() code.
[ tglx: Made the changelog match the patch ]
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1371236142.2726.43.camel@dabdike
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fixes from Paul McKenney:
"I must confess that this past merge window was not RCU's best showing.
This series contains three more fixes for RCU regressions:
1. A fix to __DECLARE_TRACE_RCU() that causes it to act as an
interrupt from idle rather than as a task switch from idle.
This change is needed due to the recent use of _rcuidle()
tracepoints that can be invoked from interrupt handlers as well
as from idle. Without this fix, invoking _rcuidle() tracepoints
from interrupt handlers results in splats and (more seriously)
confusion on RCU's part as to whether a given CPU is idle or not.
This confusion can in turn result in too-short grace periods and
therefore random memory corruption.
2. A fix to a subtle deadlock that could result due to RCU doing
a wakeup while holding one of its rcu_node structure's locks.
Although the probability of occurrence is low, it really
does happen. The fix, courtesy of Steven Rostedt, uses
irq_work_queue() to avoid the deadlock.
3. A fix to a silent deadlock (invisible to lockdep) due to the
interaction of timeouts posted by RCU debug code enabled by
CONFIG_PROVE_RCU_DELAY=y, grace-period initialization, and CPU
hotplug operations. This will not occur in production kernels,
but really does occur in randconfig testing. Diagnosis courtesy
of Steven Rostedt"
* 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
rcu: Fix deadlock with CPU hotplug, RCU GP init, and timer migration
rcu: Don't call wakeup() with rcu_node structure ->lock held
trace: Allow idle-safe tracepoints to be called from irq
|
|
Merge misc fixes from Andrew Morton:
"Bunch of fixes and one little addition to math64.h"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (27 commits)
include/linux/math64.h: add div64_ul()
mm: memcontrol: fix lockless reclaim hierarchy iterator
frontswap: fix incorrect zeroing and allocation size for frontswap_map
kernel/audit_tree.c:audit_add_tree_rule(): protect `rule' from kill_rules()
mm: migration: add migrate_entry_wait_huge()
ocfs2: add missing lockres put in dlm_mig_lockres_handler
mm/page_alloc.c: fix watermark check in __zone_watermark_ok()
drivers/misc/sgi-gru/grufile.c: fix info leak in gru_get_config_info()
aio: fix io_destroy() regression by using call_rcu()
rtc-at91rm9200: use shadow IMR on at91sam9x5
rtc-at91rm9200: add shadow interrupt mask
rtc-at91rm9200: refactor interrupt-register handling
rtc-at91rm9200: add configuration support
rtc-at91rm9200: add match-table compile guard
fs/ocfs2/namei.c: remove unecessary ERROR when removing non-empty directory
swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion
drivers/rtc/rtc-twl.c: fix missing device_init_wakeup() when booted with device tree
cciss: fix broken mutex usage in ioctl
audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE
drivers/rtc/rtc-cmos.c: fix accidentally enabling rtc channel
...
|
|
audit_add_tree_rule() must set 'rule->tree = NULL;' firstly, to protect
the rule itself freed in kill_rules().
The reason is when it is killed, the 'rule' itself may have already
released, we should not access it. one example: we add a rule to an
inode, just at the same time the other task is deleting this inode.
The work flow for adding a rule:
audit_receive() -> (need audit_cmd_mutex lock)
audit_receive_skb() ->
audit_receive_msg() ->
audit_receive_filter() ->
audit_add_rule() ->
audit_add_tree_rule() -> (need audit_filter_mutex lock)
...
unlock audit_filter_mutex
get_tree()
...
iterate_mounts() -> (iterate all related inodes)
tag_mount() ->
tag_trunk() ->
create_trunk() -> (assume it is 1st rule)
fsnotify_add_mark() ->
fsnotify_add_inode_mark() -> (add mark to inode->i_fsnotify_marks)
...
get_tree(); (each inode will get one)
...
lock audit_filter_mutex
The work flow for deleting an inode:
__destroy_inode() ->
fsnotify_inode_delete() ->
__fsnotify_inode_delete() ->
fsnotify_clear_marks_by_inode() -> (get mark from inode->i_fsnotify_marks)
fsnotify_destroy_mark() ->
fsnotify_destroy_mark_locked() ->
audit_tree_freeing_mark() ->
evict_chunk() ->
...
tree->goner = 1
...
kill_rules() -> (assume current->audit_context == NULL)
call_rcu() -> (rule->tree != NULL)
audit_free_rule_rcu() ->
audit_free_rule()
...
audit_schedule_prune() -> (assume current->audit_context == NULL)
kthread_run() -> (need audit_cmd_mutex and audit_filter_mutex lock)
prune_one() -> (delete it from prue_list)
put_tree(); (match the original get_tree above)
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
audit_log_start() does wait_for_auditd() in a loop until
audit_backlog_wait_time passes or audit_skb_queue has a room.
If signal_pending() is true this becomes a busy-wait loop, schedule() in
TASK_INTERRUPTIBLE won't block.
Thanks to Guy for fully investigating and explaining the problem.
(akpm: that'll cause the system to lock up on a non-preemptible
uniprocessor kernel)
(Guy: "Our customer was in fact running a uniprocessor machine, and they
reported a system hang.")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Guy Streeter <streeter@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The dmesg_restrict sysctl currently covers the syslog method for access
dmesg, however /dev/kmsg isn't covered by the same protections. Most
people haven't noticed because util-linux dmesg(1) defaults to using the
syslog method for access in older versions. With util-linux dmesg(1)
defaults to reading directly from /dev/kmsg.
To fix /dev/kmsg, let's compare the existing interfaces and what they
allow:
- /proc/kmsg allows:
- open (SYSLOG_ACTION_OPEN) if CAP_SYSLOG since it uses a destructive
single-reader interface (SYSLOG_ACTION_READ).
- everything, after an open.
- syslog syscall allows:
- anything, if CAP_SYSLOG.
- SYSLOG_ACTION_READ_ALL and SYSLOG_ACTION_SIZE_BUFFER, if
dmesg_restrict==0.
- nothing else (EPERM).
The use-cases were:
- dmesg(1) needs to do non-destructive SYSLOG_ACTION_READ_ALLs.
- sysklog(1) needs to open /proc/kmsg, drop privs, and still issue the
destructive SYSLOG_ACTION_READs.
AIUI, dmesg(1) is moving to /dev/kmsg, and systemd-journald doesn't
clear the ring buffer.
Based on the comments in devkmsg_llseek, it sounds like actions besides
reading aren't going to be supported by /dev/kmsg (i.e.
SYSLOG_ACTION_CLEAR), so we have a strict subset of the non-destructive
syslog syscall actions.
To this end, move the check as Josh had done, but also rename the
constants to reflect their new uses (SYSLOG_FROM_CALL becomes
SYSLOG_FROM_READER, and SYSLOG_FROM_FILE becomes SYSLOG_FROM_PROC).
SYSLOG_FROM_READER allows non-destructive actions, and SYSLOG_FROM_PROC
allows destructive actions after a capabilities-constrained
SYSLOG_ACTION_OPEN check.
- /dev/kmsg allows:
- open if CAP_SYSLOG or dmesg_restrict==0
- reading/polling, after open
Addresses https://bugzilla.redhat.com/show_bug.cgi?id=903192
[akpm@linux-foundation.org: use pr_warn_once()]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Tested-by: Josh Boyer <jwboyer@redhat.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We recently noticed that reboot of a 1024 cpu machine takes approx 16
minutes of just stopping the cpus. The slowdown was tracked to commit
f96972f2dc63 ("kernel/sys.c: call disable_nonboot_cpus() in
kernel_restart()").
The current implementation does all the work of hot removing the cpus
before halting the system. We are switching to just migrating to the
boot cpu and then continuing with shutdown/reboot.
This also has the effect of not breaking x86's command line parameter
for specifying the reboot cpu. Note, this code was shamelessly copied
from arch/x86/kernel/reboot.c with bits removed pertaining to the
reboot_cpu command line parameter.
Signed-off-by: Robin Holt <holt@sgi.com>
Tested-by: Shawn Guo <shawn.guo@linaro.org>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Russ Anderson <rja@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There are instances in the kernel where we would like to disable CPU
hotplug (from sysfs) during some important operation. Today the freezer
code depends on this and the code to do it was kinda tailor-made for
that.
Restructure the code and make it generic enough to be useful for other
usecases too.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Robin Holt <holt@sgi.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Russ Anderson <rja@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Shawn Guo <shawn.guo@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Yoshihiro Yunomae fixed a regression in the output format when using
one of the counter clocks.
The new multibuffer code changed the trace_clock file to update the
trace instances tr->clock_id but the actual traces still used the
value from the obsolete global variable trace_clock_id"
* tag 'trace-fixes-v3.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix outputting formats of x86-tsc and counter when use trace_clock
|
|
Moving x86 to the generic idle implementation (commit 7d1a9417 "x86:
Use generic idle loop") wreckaged the stack protector.
I stupidly missed that boot_init_stack_canary() must be inlined from a
function which never returns, but I put that call into
arch_cpu_idle_prepare() which of course returns.
I pondered to play tricks with arch_cpu_idle_prepare() first, but then
I noticed, that the other archs which have implemented the
stackprotector (ARM and SH) do not initialize the canary for the
non-boot cpus.
So I decided to move the boot_init_stack_canary() call into
cpu_startup_entry() ifdeffed with an CONFIG_X86 for now. This #ifdef
is just a temporary measure as I don't want to inflict the
boot_init_stack_canary() call on ARM and SH that late in the cycle.
I'll queue a patch for 3.11 which removes the #ifdef if the ARM/SH
maintainers have no objection.
Reported-by: Wouter van Kesteren <woutershep@gmail.com>
Cc: x86@kernel.org
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Outputting formats of x86-tsc and counter should be a raw format, but after
applying the patch(2b6080f28c7cc3efc8625ab71495aae89aeb63a0), the format was
changed to nanosec. This is because the global variable trace_clock_id was used.
When we use multiple buffers, clock_id of each sub-buffer should be used. Then,
this patch uses tr->clock_id instead of the global variable trace_clock_id.
[ Basically, this fixes a regression where the multibuffer code changed the
trace_clock file to update tr->clock_id but the traces still use the old
global trace_clock_id variable, negating the file's effect. The global
trace_clock_id variable is obsolete and removed. - SR ]
Link: http://lkml.kernel.org/r/20130423013239.22334.7394.stgit@yunodevel
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
The stop machine logic can lock up if all but one of the migration
threads make it through the disable-irq step and the one remaining
thread gets stuck in __do_softirq. The reason __do_softirq can hang is
that it has a bail-out based on jiffies timeout, but in the lockup case,
jiffies itself is not incremented.
To work around this, re-add the max_restart counter in __do_irq and stop
processing irqs after 10 restarts.
Thanks to Tejun Heo and Rusty Russell and others for helping me track
this down.
This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce
latencies").
It may be worth looking into ath9k to see if it has issues with its irq
handler at a later date.
The hang stack traces look something like this:
------------[ cut here ]------------
WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
Watchdog detected hard LOCKUP on cpu 2
Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
Pid: 23, comm: migration/2 Tainted: G C 3.9.4+ #11
Call Trace:
<NMI> warn_slowpath_common+0x85/0x9f
warn_slowpath_fmt+0x46/0x48
watchdog_overflow_callback+0x9c/0xa7
__perf_event_overflow+0x137/0x1cb
perf_event_overflow+0x14/0x16
intel_pmu_handle_irq+0x2dc/0x359
perf_event_nmi_handler+0x19/0x1b
nmi_handle+0x7f/0xc2
do_nmi+0xbc/0x304
end_repeat_nmi+0x1e/0x2e
<<EOE>>
cpu_stopper_thread+0xae/0x162
smpboot_thread_fn+0x258/0x260
kthread+0xc7/0xcf
ret_from_fork+0x7c/0xb0
---[ end trace 4947dfa9b0a4cec3 ]---
BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
irq event stamp: 835637905
hardirqs last enabled at (835637904): __do_softirq+0x9f/0x257
hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80
softirqs last enabled at (5654720): __do_softirq+0x1ff/0x257
softirqs last disabled at (5654725): irq_exit+0x5f/0xbb
CPU 1
Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: tasklet_hi_action+0xf0/0xf0
Process migration/1
Call Trace:
<IRQ>
__do_softirq+0x117/0x257
irq_exit+0x5f/0xbb
smp_apic_timer_interrupt+0x8a/0x98
apic_timer_interrupt+0x72/0x80
<EOI>
printk+0x4d/0x4f
stop_machine_cpu_stop+0x22c/0x274
cpu_stopper_thread+0xae/0x162
smpboot_thread_fn+0x258/0x260
kthread+0xc7/0xcf
ret_from_fork+0x7c/0xb0
Signed-off-by: Ben Greear <greearb@candelatech.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Pekka Riikonen <priikone@iki.fi>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In Steven Rostedt's words:
> I've been debugging the last couple of days why my tests have been
> locking up. One of my tracing tests, runs all available tracers. The
> lockup always happened with the mmiotrace, which is used to trace
> interactions between priority drivers and the kernel. But to do this
> easily, when the tracer gets registered, it disables all but the boot
> CPUs. The lockup always happened after it got done disabling the CPUs.
>
> Then I decided to try this:
>
> while :; do
> for i in 1 2 3; do
> echo 0 > /sys/devices/system/cpu/cpu$i/online
> done
> for i in 1 2 3; do
> echo 1 > /sys/devices/system/cpu/cpu$i/online
> done
> done
>
> Well, sure enough, that locked up too, with the same users. Doing a
> sysrq-w (showing all blocked tasks):
>
> [ 2991.344562] task PC stack pid father
> [ 2991.344562] rcu_preempt D ffff88007986fdf8 0 10 2 0x00000000
> [ 2991.344562] ffff88007986fc98 0000000000000002 ffff88007986fc48 0000000000000908
> [ 2991.344562] ffff88007986c280 ffff88007986ffd8 ffff88007986ffd8 00000000001d3c80
> [ 2991.344562] ffff880079248a40 ffff88007986c280 0000000000000000 00000000fffd4295
> [ 2991.344562] Call Trace:
> [ 2991.344562] [<ffffffff815437ba>] schedule+0x64/0x66
> [ 2991.344562] [<ffffffff81541750>] schedule_timeout+0xbc/0xf9
> [ 2991.344562] [<ffffffff8154bec0>] ? ftrace_call+0x5/0x2f
> [ 2991.344562] [<ffffffff81049513>] ? cascade+0xa8/0xa8
> [ 2991.344562] [<ffffffff815417ab>] schedule_timeout_uninterruptible+0x1e/0x20
> [ 2991.344562] [<ffffffff810c980c>] rcu_gp_kthread+0x502/0x94b
> [ 2991.344562] [<ffffffff81062791>] ? __init_waitqueue_head+0x50/0x50
> [ 2991.344562] [<ffffffff810c930a>] ? rcu_gp_fqs+0x64/0x64
> [ 2991.344562] [<ffffffff81061cdb>] kthread+0xb1/0xb9
> [ 2991.344562] [<ffffffff81091e31>] ? lock_release_holdtime.part.23+0x4e/0x55
> [ 2991.344562] [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58
> [ 2991.344562] [<ffffffff8154c1dc>] ret_from_fork+0x7c/0xb0
> [ 2991.344562] [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58
> [ 2991.344562] kworker/0:1 D ffffffff81a30680 0 47 2 0x00000000
> [ 2991.344562] Workqueue: events cpuset_hotplug_workfn
> [ 2991.344562] ffff880078dbbb58 0000000000000002 0000000000000006 00000000000000d8
> [ 2991.344562] ffff880078db8100 ffff880078dbbfd8 ffff880078dbbfd8 00000000001d3c80
> [ 2991.344562] ffff8800779ca5c0 ffff880078db8100 ffffffff81541fcf 0000000000000000
> [ 2991.344562] Call Trace:
> [ 2991.344562] [<ffffffff81541fcf>] ? __mutex_lock_common+0x3d4/0x609
> [ 2991.344562] [<ffffffff815437ba>] schedule+0x64/0x66
> [ 2991.344562] [<ffffffff81543a39>] schedule_preempt_disabled+0x18/0x24
> [ 2991.344562] [<ffffffff81541fcf>] __mutex_lock_common+0x3d4/0x609
> [ 2991.344562] [<ffffffff8103d11b>] ? get_online_cpus+0x3c/0x50
> [ 2991.344562] [<ffffffff8103d11b>] ? get_online_cpus+0x3c/0x50
> [ 2991.344562] [<ffffffff815422ff>] mutex_lock_nested+0x3b/0x40
> [ 2991.344562] [<ffffffff8103d11b>] get_online_cpus+0x3c/0x50
> [ 2991.344562] [<ffffffff810af7e6>] rebuild_sched_domains_locked+0x6e/0x3a8
> [ 2991.344562] [<ffffffff810b0ec6>] rebuild_sched_domains+0x1c/0x2a
> [ 2991.344562] [<ffffffff810b109b>] cpuset_hotplug_workfn+0x1c7/0x1d3
> [ 2991.344562] [<ffffffff810b0ed9>] ? cpuset_hotplug_workfn+0x5/0x1d3
> [ 2991.344562] [<ffffffff81058e07>] process_one_work+0x2d4/0x4d1
> [ 2991.344562] [<ffffffff81058d3a>] ? process_one_work+0x207/0x4d1
> [ 2991.344562] [<ffffffff8105964c>] worker_thread+0x2e7/0x3b5
> [ 2991.344562] [<ffffffff81059365>] ? rescuer_thread+0x332/0x332
> [ 2991.344562] [<ffffffff81061cdb>] kthread+0xb1/0xb9
> [ 2991.344562] [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58
> [ 2991.344562] [<ffffffff8154c1dc>] ret_from_fork+0x7c/0xb0
> [ 2991.344562] [<ffffffff81061c2a>] ? __init_kthread_worker+0x58/0x58
> [ 2991.344562] bash D ffffffff81a4aa80 0 2618 2612 0x10000000
> [ 2991.344562] ffff8800379abb58 0000000000000002 0000000000000006 0000000000000c2c
> [ 2991.344562] ffff880077fea140 ffff8800379abfd8 ffff8800379abfd8 00000000001d3c80
> [ 2991.344562] ffff8800779ca5c0 ffff880077fea140 ffffffff81541fcf 0000000000000000
> [ 2991.344562] Call Trace:
> [ 2991.344562] [<ffffffff81541fcf>] ? __mutex_lock_common+0x3d4/0x609
> [ 2991.344562] [<ffffffff815437ba>] schedule+0x64/0x66
> [ 2991.344562] [<ffffffff81543a39>] schedule_preempt_disabled+0x18/0x24
> [ 2991.344562] [<ffffffff81541fcf>] __mutex_lock_common+0x3d4/0x609
> [ 2991.344562] [<ffffffff81530078>] ? rcu_cpu_notify+0x2f5/0x86e
> [ 2991.344562] [<ffffffff81530078>] ? rcu_cpu_notify+0x2f5/0x86e
> [ 2991.344562] [<ffffffff815422ff>] mutex_lock_nested+0x3b/0x40
> [ 2991.344562] [<ffffffff81530078>] rcu_cpu_notify+0x2f5/0x86e
> [ 2991.344562] [<ffffffff81091c99>] ? __lock_is_held+0x32/0x53
> [ 2991.344562] [<ffffffff81548912>] notifier_call_chain+0x6b/0x98
> [ 2991.344562] [<ffffffff810671fd>] __raw_notifier_call_chain+0xe/0x10
> [ 2991.344562] [<ffffffff8103cf64>] __cpu_notify+0x20/0x32
> [ 2991.344562] [<ffffffff8103cf8d>] cpu_notify_nofail+0x17/0x36
> [ 2991.344562] [<ffffffff815225de>] _cpu_down+0x154/0x259
> [ 2991.344562] [<ffffffff81522710>] cpu_down+0x2d/0x3a
> [ 2991.344562] [<ffffffff81526351>] store_online+0x4e/0xe7
> [ 2991.344562] [<ffffffff8134d764>] dev_attr_store+0x20/0x22
> [ 2991.344562] [<ffffffff811b3c5f>] sysfs_write_file+0x108/0x144
> [ 2991.344562] [<ffffffff8114c5ef>] vfs_write+0xfd/0x158
> [ 2991.344562] [<ffffffff8114c928>] SyS_write+0x5c/0x83
> [ 2991.344562] [<ffffffff8154c494>] tracesys+0xdd/0xe2
>
> As well as held locks:
>
> [ 3034.728033] Showing all locks held in the system:
> [ 3034.728033] 1 lock held by rcu_preempt/10:
> [ 3034.728033] #0: (rcu_preempt_state.onoff_mutex){+.+...}, at: [<ffffffff810c9471>] rcu_gp_kthread+0x167/0x94b
> [ 3034.728033] 4 locks held by kworker/0:1/47:
> [ 3034.728033] #0: (events){.+.+.+}, at: [<ffffffff81058d3a>] process_one_work+0x207/0x4d1
> [ 3034.728033] #1: (cpuset_hotplug_work){+.+.+.}, at: [<ffffffff81058d3a>] process_one_work+0x207/0x4d1
> [ 3034.728033] #2: (cpuset_mutex){+.+.+.}, at: [<ffffffff810b0ec1>] rebuild_sched_domains+0x17/0x2a
> [ 3034.728033] #3: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8103d11b>] get_online_cpus+0x3c/0x50
> [ 3034.728033] 1 lock held by mingetty/2563:
> [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
> [ 3034.728033] 1 lock held by mingetty/2565:
> [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
> [ 3034.728033] 1 lock held by mingetty/2569:
> [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
> [ 3034.728033] 1 lock held by mingetty/2572:
> [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
> [ 3034.728033] 1 lock held by mingetty/2575:
> [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
> [ 3034.728033] 7 locks held by bash/2618:
> [ 3034.728033] #0: (sb_writers#5){.+.+.+}, at: [<ffffffff8114bc3f>] file_start_write+0x2a/0x2c
> [ 3034.728033] #1: (&buffer->mutex#2){+.+.+.}, at: [<ffffffff811b3b93>] sysfs_write_file+0x3c/0x144
> [ 3034.728033] #2: (s_active#54){.+.+.+}, at: [<ffffffff811b3c3e>] sysfs_write_file+0xe7/0x144
> [ 3034.728033] #3: (x86_cpu_hotplug_driver_mutex){+.+.+.}, at: [<ffffffff810217c2>] cpu_hotplug_driver_lock+0x17/0x19
> [ 3034.728033] #4: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff8103d196>] cpu_maps_update_begin+0x17/0x19
> [ 3034.728033] #5: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8103cfd8>] cpu_hotplug_begin+0x2c/0x6d
> [ 3034.728033] #6: (rcu_preempt_state.onoff_mutex){+.+...}, at: [<ffffffff81530078>] rcu_cpu_notify+0x2f5/0x86e
> [ 3034.728033] 1 lock held by bash/2980:
> [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff8131e28a>] n_tty_read+0x252/0x7e8
>
> Things looked a little weird. Also, this is a deadlock that lockdep did
> not catch. But what we have here does not look like a circular lock
> issue:
>
> Bash is blocked in rcu_cpu_notify():
>
> 1961 /* Exclude any attempts to start a new grace period. */
> 1962 mutex_lock(&rsp->onoff_mutex);
>
>
> kworker is blocked in get_online_cpus(), which makes sense as we are
> currently taking down a CPU.
>
> But rcu_preempt is not blocked on anything. It is simply sleeping in
> rcu_gp_kthread (really rcu_gp_init) here:
>
> 1453 #ifdef CONFIG_PROVE_RCU_DELAY
> 1454 if ((prandom_u32() % (rcu_num_nodes * 8)) == 0 &&
> 1455 system_state == SYSTEM_RUNNING)
> 1456 schedule_timeout_uninterruptible(2);
> 1457 #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
>
> And it does this while holding the onoff_mutex that bash is waiting for.
>
> Doing a function trace, it showed me where it happened:
>
> [ 125.940066] rcu_pree-10 3.... 28384115273: schedule_timeout_uninterruptible <-rcu_gp_kthread
> [...]
> [ 125.940066] rcu_pree-10 3d..3 28384202439: sched_switch: prev_comm=rcu_preempt prev_pid=10 prev_prio=120 prev_state=D ==> next_comm=watchdog/3 next_pid=38 next_prio=120
>
> The watchdog ran, and then:
>
> [ 125.940066] watchdog-38 3d..3 28384692863: sched_switch: prev_comm=watchdog/3 prev_pid=38 prev_prio=120 prev_state=P ==> next_comm=modprobe next_pid=2848 next_prio=118
>
> Not sure what modprobe was doing, but shortly after that:
>
> [ 125.940066] modprobe-2848 3d..3 28385041749: sched_switch: prev_comm=modprobe prev_pid=2848 prev_prio=118 prev_state=R+ ==> next_comm=migration/3 next_pid=40 next_prio=0
>
> Where the migration thread took down the CPU:
>
> [ 125.940066] migratio-40 3d..3 28389148276: sched_switch: prev_comm=migration/3 prev_pid=40 prev_prio=0 prev_state=P ==> next_comm=swapper/3 next_pid=0 next_prio=120
>
> which finally did:
>
> [ 125.940066] <idle>-0 3...1 28389282142: arch_cpu_idle_dead <-cpu_startup_entry
> [ 125.940066] <idle>-0 3...1 28389282548: native_play_dead <-arch_cpu_idle_dead
> [ 125.940066] <idle>-0 3...1 28389282924: play_dead_common <-native_play_dead
> [ 125.940066] <idle>-0 3...1 28389283468: idle_task_exit <-play_dead_common
> [ 125.940066] <idle>-0 3...1 28389284644: amd_e400_remove_cpu <-play_dead_common
>
>
> CPU 3 is now offline, the rcu_preempt thread that ran on CPU 3 is still
> doing a schedule_timeout_uninterruptible() and it registered it's
> timeout to the timer base for CPU 3. You would think that it would get
> migrated right? The issue here is that the timer migration happens at
> the CPU notifier for CPU_DEAD. The problem is that the rcu notifier for
> CPU_DOWN is blocked waiting for the onoff_mutex to be released, which is
> held by the thread that just put itself into a uninterruptible sleep,
> that wont wake up until the CPU_DEAD notifier of the timer
> infrastructure is called, which wont happen until the rcu notifier
> finishes. Here's our deadlock!
This commit breaks this deadlock cycle by substituting a shorter udelay()
for the previous schedule_timeout_uninterruptible(), while at the same
time increasing the probability of the delay. This maintains the intensity
of the testing.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
|
|
This commit fixes a lockdep-detected deadlock by moving a wake_up()
call out from a rnp->lock critical section. Please see below for
the long version of this story.
On Tue, 2013-05-28 at 16:13 -0400, Dave Jones wrote:
> [12572.705832] ======================================================
> [12572.750317] [ INFO: possible circular locking dependency detected ]
> [12572.796978] 3.10.0-rc3+ #39 Not tainted
> [12572.833381] -------------------------------------------------------
> [12572.862233] trinity-child17/31341 is trying to acquire lock:
> [12572.870390] (rcu_node_0){..-.-.}, at: [<ffffffff811054ff>] rcu_read_unlock_special+0x9f/0x4c0
> [12572.878859]
> but task is already holding lock:
> [12572.894894] (&ctx->lock){-.-...}, at: [<ffffffff811390ed>] perf_lock_task_context+0x7d/0x2d0
> [12572.903381]
> which lock already depends on the new lock.
>
> [12572.927541]
> the existing dependency chain (in reverse order) is:
> [12572.943736]
> -> #4 (&ctx->lock){-.-...}:
> [12572.960032] [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
> [12572.968337] [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
> [12572.976633] [<ffffffff8113c987>] __perf_event_task_sched_out+0x2e7/0x5e0
> [12572.984969] [<ffffffff81088953>] perf_event_task_sched_out+0x93/0xa0
> [12572.993326] [<ffffffff816ea0bf>] __schedule+0x2cf/0x9c0
> [12573.001652] [<ffffffff816eacfe>] schedule_user+0x2e/0x70
> [12573.009998] [<ffffffff816ecd64>] retint_careful+0x12/0x2e
> [12573.018321]
> -> #3 (&rq->lock){-.-.-.}:
> [12573.034628] [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
> [12573.042930] [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
> [12573.051248] [<ffffffff8108e6a7>] wake_up_new_task+0xb7/0x260
> [12573.059579] [<ffffffff810492f5>] do_fork+0x105/0x470
> [12573.067880] [<ffffffff81049686>] kernel_thread+0x26/0x30
> [12573.076202] [<ffffffff816cee63>] rest_init+0x23/0x140
> [12573.084508] [<ffffffff81ed8e1f>] start_kernel+0x3f1/0x3fe
> [12573.092852] [<ffffffff81ed856f>] x86_64_start_reservations+0x2a/0x2c
> [12573.101233] [<ffffffff81ed863d>] x86_64_start_kernel+0xcc/0xcf
> [12573.109528]
> -> #2 (&p->pi_lock){-.-.-.}:
> [12573.125675] [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
> [12573.133829] [<ffffffff816ebe9b>] _raw_spin_lock_irqsave+0x4b/0x90
> [12573.141964] [<ffffffff8108e881>] try_to_wake_up+0x31/0x320
> [12573.150065] [<ffffffff8108ebe2>] default_wake_function+0x12/0x20
> [12573.158151] [<ffffffff8107bbf8>] autoremove_wake_function+0x18/0x40
> [12573.166195] [<ffffffff81085398>] __wake_up_common+0x58/0x90
> [12573.174215] [<ffffffff81086909>] __wake_up+0x39/0x50
> [12573.182146] [<ffffffff810fc3da>] rcu_start_gp_advanced.isra.11+0x4a/0x50
> [12573.190119] [<ffffffff810fdb09>] rcu_start_future_gp+0x1c9/0x1f0
> [12573.198023] [<ffffffff810fe2c4>] rcu_nocb_kthread+0x114/0x930
> [12573.205860] [<ffffffff8107a91d>] kthread+0xed/0x100
> [12573.213656] [<ffffffff816f4b1c>] ret_from_fork+0x7c/0xb0
> [12573.221379]
> -> #1 (&rsp->gp_wq){..-.-.}:
> [12573.236329] [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
> [12573.243783] [<ffffffff816ebe9b>] _raw_spin_lock_irqsave+0x4b/0x90
> [12573.251178] [<ffffffff810868f3>] __wake_up+0x23/0x50
> [12573.258505] [<ffffffff810fc3da>] rcu_start_gp_advanced.isra.11+0x4a/0x50
> [12573.265891] [<ffffffff810fdb09>] rcu_start_future_gp+0x1c9/0x1f0
> [12573.273248] [<ffffffff810fe2c4>] rcu_nocb_kthread+0x114/0x930
> [12573.280564] [<ffffffff8107a91d>] kthread+0xed/0x100
> [12573.287807] [<ffffffff816f4b1c>] ret_from_fork+0x7c/0xb0
Notice the above call chain.
rcu_start_future_gp() is called with the rnp->lock held. Then it calls
rcu_start_gp_advance, which does a wakeup.
You can't do wakeups while holding the rnp->lock, as that would mean
that you could not do a rcu_read_unlock() while holding the rq lock, or
any lock that was taken while holding the rq lock. This is because...
(See below).
> [12573.295067]
> -> #0 (rcu_node_0){..-.-.}:
> [12573.309293] [<ffffffff810b8d36>] __lock_acquire+0x1786/0x1af0
> [12573.316568] [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
> [12573.323825] [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
> [12573.331081] [<ffffffff811054ff>] rcu_read_unlock_special+0x9f/0x4c0
> [12573.338377] [<ffffffff810760a6>] __rcu_read_unlock+0x96/0xa0
> [12573.345648] [<ffffffff811391b3>] perf_lock_task_context+0x143/0x2d0
> [12573.352942] [<ffffffff8113938e>] find_get_context+0x4e/0x1f0
> [12573.360211] [<ffffffff811403f4>] SYSC_perf_event_open+0x514/0xbd0
> [12573.367514] [<ffffffff81140e49>] SyS_perf_event_open+0x9/0x10
> [12573.374816] [<ffffffff816f4dd4>] tracesys+0xdd/0xe2
Notice the above trace.
perf took its own ctx->lock, which can be taken while holding the rq
lock. While holding this lock, it did a rcu_read_unlock(). The
perf_lock_task_context() basically looks like:
rcu_read_lock();
raw_spin_lock(ctx->lock);
rcu_read_unlock();
Now, what looks to have happened, is that we scheduled after taking that
first rcu_read_lock() but before taking the spin lock. When we scheduled
back in and took the ctx->lock, the following rcu_read_unlock()
triggered the "special" code.
The rcu_read_unlock_special() takes the rnp->lock, which gives us a
possible deadlock scenario.
CPU0 CPU1 CPU2
---- ---- ----
rcu_nocb_kthread()
lock(rq->lock);
lock(ctx->lock);
lock(rnp->lock);
wake_up();
lock(rq->lock);
rcu_read_unlock();
rcu_read_unlock_special();
lock(rnp->lock);
lock(ctx->lock);
**** DEADLOCK ****
> [12573.382068]
> other info that might help us debug this:
>
> [12573.403229] Chain exists of:
> rcu_node_0 --> &rq->lock --> &ctx->lock
>
> [12573.424471] Possible unsafe locking scenario:
>
> [12573.438499] CPU0 CPU1
> [12573.445599] ---- ----
> [12573.452691] lock(&ctx->lock);
> [12573.459799] lock(&rq->lock);
> [12573.467010] lock(&ctx->lock);
> [12573.474192] lock(rcu_node_0);
> [12573.481262]
> *** DEADLOCK ***
>
> [12573.501931] 1 lock held by trinity-child17/31341:
> [12573.508990] #0: (&ctx->lock){-.-...}, at: [<ffffffff811390ed>] perf_lock_task_context+0x7d/0x2d0
> [12573.516475]
> stack backtrace:
> [12573.530395] CPU: 1 PID: 31341 Comm: trinity-child17 Not tainted 3.10.0-rc3+ #39
> [12573.545357] ffffffff825b4f90 ffff880219f1dbc0 ffffffff816e375b ffff880219f1dc00
> [12573.552868] ffffffff816dfa5d ffff880219f1dc50 ffff88023ce4d1f8 ffff88023ce4ca40
> [12573.560353] 0000000000000001 0000000000000001 ffff88023ce4d1f8 ffff880219f1dcc0
> [12573.567856] Call Trace:
> [12573.575011] [<ffffffff816e375b>] dump_stack+0x19/0x1b
> [12573.582284] [<ffffffff816dfa5d>] print_circular_bug+0x200/0x20f
> [12573.589637] [<ffffffff810b8d36>] __lock_acquire+0x1786/0x1af0
> [12573.596982] [<ffffffff810918f5>] ? sched_clock_cpu+0xb5/0x100
> [12573.604344] [<ffffffff810b9851>] lock_acquire+0x91/0x1f0
> [12573.611652] [<ffffffff811054ff>] ? rcu_read_unlock_special+0x9f/0x4c0
> [12573.619030] [<ffffffff816ebc90>] _raw_spin_lock+0x40/0x80
> [12573.626331] [<ffffffff811054ff>] ? rcu_read_unlock_special+0x9f/0x4c0
> [12573.633671] [<ffffffff811054ff>] rcu_read_unlock_special+0x9f/0x4c0
> [12573.640992] [<ffffffff811390ed>] ? perf_lock_task_context+0x7d/0x2d0
> [12573.648330] [<ffffffff810b429e>] ? put_lock_stats.isra.29+0xe/0x40
> [12573.655662] [<ffffffff813095a0>] ? delay_tsc+0x90/0xe0
> [12573.662964] [<ffffffff810760a6>] __rcu_read_unlock+0x96/0xa0
> [12573.670276] [<ffffffff811391b3>] perf_lock_task_context+0x143/0x2d0
> [12573.677622] [<ffffffff81139070>] ? __perf_event_enable+0x370/0x370
> [12573.684981] [<ffffffff8113938e>] find_get_context+0x4e/0x1f0
> [12573.692358] [<ffffffff811403f4>] SYSC_perf_event_open+0x514/0xbd0
> [12573.699753] [<ffffffff8108cd9d>] ? get_parent_ip+0xd/0x50
> [12573.707135] [<ffffffff810b71fd>] ? trace_hardirqs_on_caller+0xfd/0x1c0
> [12573.714599] [<ffffffff81140e49>] SyS_perf_event_open+0x9/0x10
> [12573.721996] [<ffffffff816f4dd4>] tracesys+0xdd/0xe2
This commit delays the wakeup via irq_work(), which is what
perf and ftrace use to perform wakeups in critical sections.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
- Trivial: unused variable removal
- Posix-timers: Add the clock ID to the new proc interface to make it
useful. The interface is new and should be functional when we reach
the final 3.10 release.
- Cure a false positive warning in the tick code introduced by the
overhaul in 3.10
- Fix for a persistent clock detection regression introduced in this
cycle
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Correct run-time detection of persistent_clock.
ntp: Remove unused variable flags in __hardpps
posix-timers: Show clock ID in proc file
tick: Cure broadcast false positive pending bit warning
|
|
Pull irqdomain bug fixes from Grant Likely:
"This branch contains a set of straight forward bug fixes to the
irqdomain code and to a couple of drivers that make use of it."
* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux:
irqchip: Return -EPERM for reserved IRQs
irqdomain: document the simple domain first_irq
kernel/irq/irqdomain.c: before use 'irq_data', need check it whether valid.
irqdomain: export irq_domain_add_simple
|