path: root/kernel
Age  Commit message  Author
2014-05-05exit: call disassociate_ctty() before exit_task_namespaces()Oleg Nesterov
commit c39df5fa37b0623589508c95515b4aa1531c524e upstream. Commit 8aac62706ada ("move exit_task_namespaces() outside of exit_notify()") breaks pppd and the exiting service crashes the kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 IP: ppp_register_channel+0x13/0x20 [ppp_generic] Call Trace: ppp_asynctty_open+0x12b/0x170 [ppp_async] tty_ldisc_open.isra.2+0x27/0x60 tty_ldisc_hangup+0x1e3/0x220 __tty_hangup+0x2c4/0x440 disassociate_ctty+0x61/0x270 do_exit+0x7f2/0xa50 ppp_register_channel() needs ->net_ns and current->nsproxy == NULL. Move disassociate_ctty() before exit_task_namespaces(), it doesn't make sense to delay it after perf_event_exit_task() or cgroup_exit(). This also allows to use task_work_add() inside the (nontrivial) code paths in disassociate_ctty(). Investigated by Peter Hurley. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Sree Harsha Totakura <sreeharsha@totakura.in> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Sree Harsha Totakura <sreeharsha@totakura.in> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andrey Vagin <avagin@openvz.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-05-05wait: fix reparent_leader() vs EXIT_DEAD->EXIT_ZOMBIE raceOleg Nesterov
commit dfccbb5e49a621c1b21a62527d61fc4305617aca upstream. wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and drops tasklist_lock. If this task is not the natural child and it is traced, we change its state back to EXIT_ZOMBIE for ->real_parent. The last transition is racy, this is even documented in 50b8d257486a "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE race". wait_consider_task() tries to detect this transition and clear ->notask_error but we can't rely on ptrace_reparented(), the debugger can exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE. And there is another problem which was missed before: this transition can also race with reparent_leader() which doesn't reset ->exit_signal if EXIT_DEAD, assuming that this task must be reaped by someone else. So the tracee can be re-parented with ->exit_signal != SIGCHLD, and if /sbin/init doesn't use __WALL it becomes unreapable. Change reparent_leader() to update ->exit_signal even if EXIT_DEAD. Note: this is the simple temporary hack for -stable, it doesn't try to solve all problems, it will be reverted by the next changes. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com> Reported-by: Michal Schmidt <mschmidt@redhat.com> Tested-by: Michal Schmidt <mschmidt@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Lennart Poettering <lpoetter@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-05-05pid_namespace: pidns_get() should check task_active_pid_ns() != NULLOleg Nesterov
commit d23082257d83e4bc89727d5aedee197e907999d2 upstream. pidns_get()->get_pid_ns() can hit ns == NULL. This task_struct can't go away, but task_active_pid_ns(task) is NULL if release_task(task) was already called. Alternatively we could change get_pid_ns(ns) to check ns != NULL, but it seems that other callers are fine. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-05-05user namespace: fix incorrect memory barriersMikulas Patocka
commit e79323bd87808fdfbc68ce6c5371bd224d9672ee upstream. smp_read_barrier_depends() can be used if there is a data dependency between the readers - i.e. if the read operation after the barrier uses an address that was obtained from the read operation before the barrier. In this file, there is only a control dependency, no data dependency, so the use of smp_read_barrier_depends() is incorrect. The code could fail in the following way: * the cpu predicts that idx < extents is true and starts executing the body of the for loop * the cpu fetches map->extent[0].first and map->extent[0].count * the cpu fetches map->nr_extents * the cpu verifies that idx < extents is true, so it commits the instructions in the body of the for loop The problem is that in this scenario, the cpu read map->extent[0].first and map->nr_extents in the wrong order. We need a full read memory barrier to prevent it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
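A minimal sketch of the reader pattern the fix implies (a hypothetical lookup helper, not the actual kernel/user_namespace.c code): because the loop bound and the extent array are read by independent loads with only a control dependency between them, a full smp_rmb() is needed to keep them ordered.

```c
#include <linux/user_namespace.h>

/* Hypothetical id lookup illustrating where the read barrier belongs. */
static u32 map_lookup(struct uid_gid_map *map, u32 id)
{
	unsigned idx;
	unsigned extents = map->nr_extents;

	/*
	 * No data dependency links the load of nr_extents to the loads of
	 * the extent array, only a control dependency, so
	 * smp_read_barrier_depends() is not enough here.
	 */
	smp_rmb();

	for (idx = 0; idx < extents; idx++) {
		u32 first = map->extent[idx].first;
		u32 count = map->extent[idx].count;

		if (id >= first && id - first < count)
			return idx;
	}
	return extents;	/* not found */
}
```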
2014-04-13futex: Allow architectures to skip futex_atomic_cmpxchg_inatomic() testHeiko Carstens
commit 03b8c7b623c80af264c4c8d6111e5c6289933666 upstream. If an architecture has futex_atomic_cmpxchg_inatomic() implemented and there is no runtime check necessary, allow to skip the test within futex_init(). This allows to get rid of some code which would always give the same result, and also allows the compiler to optimize a couple of if statements away. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Finn Thain <fthain@telegraphics.com.au> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [geert: Backported to v3.10..v3.13] Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-31printk: fix syslog() overflowing user bufferLinus Torvalds
commit e4178d809fdaee32a56833fff1f5056c99e90a1a upstream. This is not a buffer overflow in the traditional sense: we don't overflow any *kernel* buffers, but we do mis-count the amount of data we copy back to user space for the SYSLOG_ACTION_READ_ALL case. In particular, if the user buffer is too small to hold everything, and *if* there is a continuation line at just the right place, we can end up giving the user more data than he asked for. The reason is that we first count up the number of bytes all the log records contains, then we walk the records again until we've skipped the records at the beginning that won't fit, and then we walk the rest of the records and copy them to the user space buffer. And in between that "skip the initial records that won't fit" and the "copy the records that *will* fit to user space", we reset the 'prev' variable that contained the record information for the last record not copied. That meant that when we started copying to user space, we now had a different character count than what we had originally calculated in the first record walk-through. The fix is to simply not clear the 'prev' flags value (in both cases where we had the same logic: syslog_print_all and kmsg_dump_get_buffer: the latter is used for pstore-like dumping) Reported-and-tested-by: Debabrata Banerjee <dbanerje@akamai.com> Acked-by: Kay Sievers <kay@vrfy.org> Cc: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Josh Hunt <joshhunt00@gmail.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-31tracing: Fix array size mismatch in format stringVaibhav Nagarnaik
commit 87291347c49dc40aa339f587b209618201c2e527 upstream. In event format strings, the array size is reported in two locations. One in array subscript and then via the "size:" attribute. The values reported there have a mismatch. For e.g., in sched:sched_switch the prev_comm and next_comm character arrays have subscript values as [32] where as the actual field size is 16. name: sched_switch ID: 301 format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1;signed:0; field:int common_pid; offset:4; size:4; signed:1; field:char prev_comm[32]; offset:8; size:16; signed:1; field:pid_t prev_pid; offset:24; size:4; signed:1; field:int prev_prio; offset:28; size:4; signed:1; field:long prev_state; offset:32; size:8; signed:1; field:char next_comm[32]; offset:40; size:16; signed:1; field:pid_t next_pid; offset:56; size:4; signed:1; field:int next_prio; offset:60; size:4; signed:1; After bisection, the following commit was blamed: 92edca0 tracing: Use direct field, type and system names This commit removes the duplication of strings for field->name and field->type assuming that all the strings passed in __trace_define_field() are immutable. This is not true for arrays, where the type string is created in event_storage variable and field->type for all array fields points to event_storage. Use __stringify() to create a string constant for the type string. Also, get rid of event_storage and event_storage_mutex that are not needed anymore. also, an added benefit is that this reduces the overhead of events a bit more: text data bss dec hex filename 8424787 2036472 1302528 11763787 b3804b vmlinux 8420814 2036408 1302528 11759750 b37086 vmlinux.patched Link: http://lkml.kernel.org/r/1392349908-29685-1-git-send-email-vnagarnaik@google.com Cc: Laurent Chavey <chavey@google.com> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
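As a reminder of the mechanism the fix relies on (a sketch, not the patched tracing macros themselves): __stringify() turns its argument into a string constant at preprocessing time, so the array length can be baked into the type string with no mutable event_storage buffer behind it.

```c
#include <linux/stringify.h>

#define MY_COMM_LEN 16	/* stands in for TASK_COMM_LEN here */

/*
 * "char[" __stringify(MY_COMM_LEN) "]" expands to "char[" "16" "]",
 * which the compiler concatenates into the constant string "char[16]".
 */
static const char my_comm_type[] = "char[" __stringify(MY_COMM_LEN) "]";
```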
2014-03-26cgroup: protect modifications to cgroup_idr with cgroup_mutexLi Zefan
commit 0ab02ca8f887908152d1a96db5130fc661d36a1e upstream. Setup cgroupfs like this: # mount -t cgroup -o cpuacct xxx /cgroup # mkdir /cgroup/sub1 # mkdir /cgroup/sub2 Then run these two commands: # for ((; ;)) { mkdir /cgroup/sub1/tmp && rmdir /mnt/sub1/tmp; } & # for ((; ;)) { mkdir /cgroup/sub2/tmp && rmdir /mnt/sub2/tmp; } & After seconds you may see this warning: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 25243 at lib/idr.c:527 sub_remove+0x87/0x1b0() idr_remove called for id=6 which is not allocated. ... Call Trace: [<ffffffff8156063c>] dump_stack+0x7a/0x96 [<ffffffff810591ac>] warn_slowpath_common+0x8c/0xc0 [<ffffffff81059296>] warn_slowpath_fmt+0x46/0x50 [<ffffffff81300aa7>] sub_remove+0x87/0x1b0 [<ffffffff810f3f02>] ? css_killed_work_fn+0x32/0x1b0 [<ffffffff81300bf5>] idr_remove+0x25/0xd0 [<ffffffff810f2bab>] cgroup_destroy_css_killed+0x5b/0xc0 [<ffffffff810f4000>] css_killed_work_fn+0x130/0x1b0 [<ffffffff8107cdbc>] process_one_work+0x26c/0x550 [<ffffffff8107eefe>] worker_thread+0x12e/0x3b0 [<ffffffff81085f96>] kthread+0xe6/0xf0 [<ffffffff81570bac>] ret_from_fork+0x7c/0xb0 ---[ end trace 2d1577ec10cf80d0 ]--- It's because allocating/removing cgroup ID is not properly synchronized. The bug was introduced when we converted cgroup_ida to cgroup_idr. While synchronization is already done inside ida_simple_{get,remove}(), users are responsible for concurrent calls to idr_{alloc,remove}(). [mhocko@suse.cz: ported to 3.12] Fixes: 4e96ee8e981b ("cgroup: convert cgroup_ida to cgroup_idr") Cc: <stable@vger.kernel.org> #3.12+ Reported-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
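The locking rule being enforced, shown as a generic sketch with hypothetical names (my_idr/my_mutex stand in for cgroup_idr/cgroup_mutex): unlike ida_simple_get()/ida_simple_remove(), the idr_alloc()/idr_remove() pair leaves serialization to the caller.

```c
#include <linux/idr.h>
#include <linux/mutex.h>

static DEFINE_IDR(my_idr);
static DEFINE_MUTEX(my_mutex);	/* plays the role cgroup_mutex plays above */

static int my_obj_get_id(void *obj)
{
	int id;

	idr_preload(GFP_KERNEL);
	mutex_lock(&my_mutex);
	/* idr_alloc() must not race with the idr_remove() below. */
	id = idr_alloc(&my_idr, obj, 1, 0, GFP_NOWAIT);
	mutex_unlock(&my_mutex);
	idr_preload_end();

	return id;	/* negative errno on failure */
}

static void my_obj_put_id(int id)
{
	mutex_lock(&my_mutex);
	idr_remove(&my_idr, id);
	mutex_unlock(&my_mutex);
}
```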
2014-03-22tracing: Do not add event files for modules that fail tracepointsSteven Rostedt (Red Hat)
commit 45ab2813d40d88fc575e753c38478de242d03f88 upstream. If a module fails to add its tracepoints due to module tainting, do not create the module event infrastructure in the debugfs directory. As the events will not work and worse yet, they will silently fail, making the user wonder why the events they enable do not display anything. Having a warning on module load and the events not visible to the users will make the cause of the problem much clearer. Link: http://lkml.kernel.org/r/20140227154923.265882695@goodmis.org Fixes: 6d723736e472 "tracing/events: add support for modules to TRACE_EVENT" Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-22cpuset: fix a race condition in __cpuset_node_allowed_softwall()Li Zefan
commit 99afb0fd5f05aac467ffa85c36778fec4396209b upstream. It's not safe to access task's cpuset after releasing task_lock(). Holding callback_mutex won't help. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-22cpuset: fix a locking issue in cpuset_migrate_mm()Li Zefan
commit 4729583006772b9530404bc1bb7c3aa4a10ffd4d upstream. I can trigger a lockdep warning: # mount -t cgroup -o cpuset xxx /cgroup # mkdir /cgroup/cpuset # mkdir /cgroup/tmp # echo 0 > /cgroup/tmp/cpuset.cpus # echo 0 > /cgroup/tmp/cpuset.mems # echo 1 > /cgroup/tmp/cpuset.memory_migrate # echo $$ > /cgroup/tmp/tasks # echo 1 > /cgroup/tmp/cpuset.mems =============================== [ INFO: suspicious RCU usage. ] 3.14.0-rc1-0.1-default+ #32 Not tainted ------------------------------- include/linux/cgroup.h:682 suspicious rcu_dereference_check() usage! ... [<ffffffff81582174>] dump_stack+0x72/0x86 [<ffffffff810b8f01>] lockdep_rcu_suspicious+0x101/0x140 [<ffffffff81105ba1>] cpuset_migrate_mm+0xb1/0xe0 ... We used to hold cgroup_mutex when calling cpuset_migrate_mm(), but now we hold cpuset_mutex, which causes task_css() to complain. This is not a false-positive but a real issue. Holding cpuset_mutex won't prevent a task from migrating to another cpuset, and it won't prevent the original task->cgroup from being destroyed during this change. Fixes: 5d21cc2db040 (cpuset: replace cgroup_mutex locking with cpuset internal locking) Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-22genirq: Remove racy waitqueue_active checkChuansheng Liu
commit c685689fd24d310343ac33942e9a54a974ae9c43 upstream. We hit one rare case below: T1 calling disable_irq(), but hanging at synchronize_irq() always; The corresponding irq thread is in sleeping state; And all CPUs are in idle state; After analysis, we found there is one possible scenario which causes T1 to wait there forever: CPU0 CPU1 synchronize_irq() wait_event() spin_lock() atomic_dec_and_test(&threads_active) insert the __wait into queue spin_unlock() if(waitqueue_active) atomic_read(&threads_active) wake_up() Here, after inserting the __wait into the queue on CPU0, and before testing if the queue is empty on CPU1, there is no barrier; this may cause the update to not be visible to CPU1 immediately, although CPU0 has updated the queue list. The same applies to CPU0's atomic_read() of threads_active. So we'd need one smp_mb() before the waitqueue_active() check, but removing the waitqueue_active() check solves it as well and makes things simple and clear. Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com> Cc: Xiaoming Wang <xiaoming.wang@intel.com> Link: http://lkml.kernel.org/r/1393212590-32543-1-git-send-email-chuansheng.liu@intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
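A sketch of the fixed waker side, assuming a structure with the same two fields the message describes (threads_active and wait_for_threads); the real code lives in kernel/irq/manage.c and may differ in detail. Calling wake_up() unconditionally removes the unordered waitqueue_active() peek, and wake_up() takes the waitqueue lock itself, which provides the ordering the peek lacked.

```c
#include <linux/atomic.h>
#include <linux/wait.h>

struct thread_waiters {			/* hypothetical, mirrors the irq_desc fields */
	atomic_t threads_active;
	wait_queue_head_t wait_for_threads;
};

static void thread_finished(struct thread_waiters *tw)
{
	/* No waitqueue_active() check: just wake unconditionally. */
	if (atomic_dec_and_test(&tw->threads_active))
		wake_up(&tw->wait_for_threads);
}
```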
2014-03-22sched: Fix double normalization of vruntimeGeorge McCollister
commit 791c9e0292671a3bfa95286bb5c08129d8605618 upstream. dequeue_entity() is called when p->on_rq and sets se->on_rq = 0 which appears to guarantee that the !se->on_rq condition is met. If the task has done set_current_state(TASK_INTERRUPTIBLE) without schedule() the second condition will be met and vruntime will be incorrectly adjusted twice. In certain cases this can result in the task's vruntime never increasing past the vruntime of other tasks on the CFS' run queue, starving them of CPU time. This patch changes switched_from_fair() to use !p->on_rq instead of !se->on_rq. I'm able to cause a task with a priority of 120 to starve all other tasks with the same priority on an ARM platform running 3.2.51-rt72 PREEMPT RT by writing one character at a time to a serial tty (16550 UART) in a tight loop. I'm also able to verify making this change corrects the problem on that platform and kernel version. Signed-off-by: George McCollister <george.mccollister@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392767811-28916-1-git-send-email-george.mccollister@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12sched/rt: Remove redundant nr_cpus_allowed testShawn Bohrer
commit 6bfa687c19b7ab8adee03f0d43c197c2945dd869 upstream. In 76854c7e8f3f4172fef091e78d88b3b751463ac6 ("sched: Use rt.nr_cpus_allowed to recover select_task_rq() cycles") an optimization was added to select_task_rq_rt() that immediately returns when p->nr_cpus_allowed == 1 at the beginning of the function. This makes the latter p->nr_cpus_allowed > 1 check redundant, which can now be removed. Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <mgalbraith@suse.de> Cc: tomk@rgmadvisors.com Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1380914693-24634-1-git-send-email-shawn.bohrer@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12sched/rt: Add missing rmb()Peter Zijlstra
commit 7c3f2ab7b844f1a859afbc3d41925e8a0faba5fa upstream. While discussing the proposed SCHED_DEADLINE patches which in parts mimic the existing FIFO code it was noticed that the wmb in rt_set_overloaded() didn't have a matching barrier. The only site using rt_overloaded() to test the rto_count is pull_rt_task() and we should issue a matching rmb before then assuming there's an rto_mask bit set. Without that smp_rmb() in there we could actually miss seeing the rto_mask bit. Also, change to using smp_[wr]mb(), even though this is SMP only code; memory barriers without smp_ always make me think they're against hardware of some sort. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: vincent.guittot@linaro.org Cc: luca.abeni@unitn.it Cc: bruce.ashfield@windriver.com Cc: dhaval.giani@gmail.com Cc: rostedt@goodmis.org Cc: hgu1972@gmail.com Cc: oleg@redhat.com Cc: fweisbec@gmail.com Cc: darren@dvhart.com Cc: johan.eker@ericsson.com Cc: p.faure@akatech.ch Cc: paulmck@linux.vnet.ibm.com Cc: raistlin@linux.it Cc: claudio@evidence.eu.com Cc: insop.song@gmail.com Cc: michael@amarulasolutions.com Cc: liming.wang@windriver.com Cc: fchecconi@gmail.com Cc: jkacur@redhat.com Cc: tommaso.cucinotta@sssup.it Cc: Juri Lelli <juri.lelli@gmail.com> Cc: harald.gustafsson@ericsson.com Cc: nicola.manica@disi.unitn.it Cc: tglx@linutronix.de Link: http://lkml.kernel.org/r/20131015103507.GF10651@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
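The barrier pairing being added, shown as a generic two-CPU sketch (a hypothetical runqueue-like structure, not the exact scheduler code): the writer publishes the mask bit before bumping the count, and the reader must order its read of the count before its read of the mask.

```c
#include <linux/atomic.h>
#include <linux/cpumask.h>

struct rt_overload {			/* hypothetical stand-in for the rt root domain */
	atomic_t rto_count;
	cpumask_var_t rto_mask;
};

/* Writer: mark @cpu overloaded. */
static void set_overloaded(struct rt_overload *ov, int cpu)
{
	cpumask_set_cpu(cpu, ov->rto_mask);
	/* Make the mask bit visible before the count that advertises it. */
	smp_wmb();
	atomic_inc(&ov->rto_count);
}

/* Reader: only trust rto_mask after the count says it is non-zero. */
static bool overloaded(struct rt_overload *ov)
{
	if (!atomic_read(&ov->rto_count))
		return false;
	/* Pairs with the smp_wmb() above. */
	smp_rmb();
	return true;	/* rto_mask may now be inspected */
}
```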
2014-03-12sched: Assign correct scheduling domain to 'sd_llc'Mel Gorman
commit 5d4cf996cf134e8ddb4f906b8197feb9267c2b77 upstream. Commit 42eb088e (sched: Avoid NULL dereference on sd_busy) corrected a NULL dereference on sd_busy but the fix also altered what scheduling domain it used for the 'sd_llc' percpu variable. One impact of this is that a task selecting a runqueue may consider idle CPUs that are not cache siblings as candidates for running. Tasks are then running on CPUs that are not cache hot. This was found through bisection where ebizzy threads were not seeing equal performance and it looked like a scheduling fairness issue. This patch mitigates but does not completely fix the problem on all machines tested implying there may be an additional bug or a common root cause. Here are the average range of performance seen by individual ebizzy threads. It was tested on top of candidate patches related to x86 TLB range flushing. 4-core machine 3.13.0-rc3 3.13.0-rc3 vanilla fixsd-v3r3 Mean 1 0.00 ( 0.00%) 0.00 ( 0.00%) Mean 2 0.34 ( 0.00%) 0.10 ( 70.59%) Mean 3 1.29 ( 0.00%) 0.93 ( 27.91%) Mean 4 7.08 ( 0.00%) 0.77 ( 89.12%) Mean 5 193.54 ( 0.00%) 2.14 ( 98.89%) Mean 6 151.12 ( 0.00%) 2.06 ( 98.64%) Mean 7 115.38 ( 0.00%) 2.04 ( 98.23%) Mean 8 108.65 ( 0.00%) 1.92 ( 98.23%) 8-core machine Mean 1 0.00 ( 0.00%) 0.00 ( 0.00%) Mean 2 0.40 ( 0.00%) 0.21 ( 47.50%) Mean 3 23.73 ( 0.00%) 0.89 ( 96.25%) Mean 4 12.79 ( 0.00%) 1.04 ( 91.87%) Mean 5 13.08 ( 0.00%) 2.42 ( 81.50%) Mean 6 23.21 ( 0.00%) 69.46 (-199.27%) Mean 7 15.85 ( 0.00%) 101.72 (-541.77%) Mean 8 109.37 ( 0.00%) 19.13 ( 82.51%) Mean 12 124.84 ( 0.00%) 28.62 ( 77.07%) Mean 16 113.50 ( 0.00%) 24.16 ( 78.71%) It's eliminated for one machine and reduced for another. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Alex Shi <alex.shi@linaro.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: H Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12sched: Initialize power_orig for overlapping groupsPeter Zijlstra
commit 8e8339a3a1069141985daaa2521ba304509ddecd upstream. Yinghai reported that he saw a /0 in sg_capacity on his EX parts. Make sure to always initialize power_orig now that we actually use it. Ideally build_sched_domains() -> init_sched_groups_power() would also initialize this; but for some yet unexplained reason some setups seem to miss updates there. Reported-by: Yinghai Lu <yinghai@kernel.org> Tested-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12sched: Avoid NULL dereference on sd_busyPeter Zijlstra
commit 42eb088ed246a5a817bb45a8b32fe234cf1c0f8b upstream. Commit 37dc6b50cee9 ("sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus") forgot to clear 'sd_busy' under some conditions leading to a possible NULL deref in set_cpu_sd_state_idle(). Reported-by: Anton Blanchard <anton@samba.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131118113701.GF3866@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12sched: Remove unnecessary iteration over sched domains to update nr_busy_cpusPreeti U Murthy
commit 37dc6b50cee97954c4e6edcd5b1fa614b76038ee upstream. nr_busy_cpus parameter is used by nohz_kick_needed() to find out the number of busy cpus in a sched domain which has SD_SHARE_PKG_RESOURCES flag set. Therefore instead of updating nr_busy_cpus at every level of sched domain, since it is irrelevant, we can update this parameter only at the parent domain of the sd which has this flag set. Introduce a per-cpu parameter sd_busy which represents this parent domain. In nohz_kick_needed() we directly query the nr_busy_cpus parameter associated with the groups of sd_busy. By associating sd_busy with the highest domain which has SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains which could have this flag set and trigger nohz_idle_balancing if any of the levels have more than one busy cpu. sd_busy is irrelevant for asymmetric load balancing. However sd_asym has been introduced to represent the highest sched domain which has SD_ASYM_PACKING flag set so that it can be queried directly when required. While we are at it, we might as well change the nohz_idle parameter to be updated at the sd_busy domain level alone and not the base domain level of a CPU. This will unify the concept of busy cpus at just one level of sched domain where it is currently used. Signed-off-by: Preeti U Murthy<preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: svaidy@linux.vnet.ibm.com Cc: vincent.guittot@linaro.org Cc: bitbucket@online.de Cc: benh@kernel.crashing.org Cc: anton@samba.org Cc: Morten.Rasmussen@arm.com Cc: pjt@google.com Cc: peterz@infradead.org Cc: mikey@neuling.org Link: http://lkml.kernel.org/r/20131030031252.23426.4417.stgit@preeti.in.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12rcu: Throttle rcu_try_advance_all_cbs() executionPaul E. McKenney
commit c229828ca6bc62d6c654f64b1d1b8a9ebd8a56f3 upstream. The rcu_try_advance_all_cbs() function is invoked on each attempted entry to and every exit from idle. If this function determines that there are callbacks ready to invoke, the caller will invoke the RCU core, which in turn will result in a pair of context switches. If a CPU enters and exits idle extremely frequently, this can result in an excessive number of context switches and high CPU overhead. This commit therefore causes rcu_try_advance_all_cbs() to throttle itself, refusing to do work more than once per jiffy. Reported-by: Tibor Billes <tbilles@gmx.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Tibor Billes <tbilles@gmx.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
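The throttling idiom described above, reduced to a minimal sketch (illustrative names; the real state lives in per-CPU rcu_data/rcu_dynticks fields): remember the jiffies value of the last pass and refuse to repeat the work within the same jiffy.

```c
#include <linux/jiffies.h>

static unsigned long last_advance_jiffies;	/* per-CPU in the real code */

static bool try_advance_callbacks(void)
{
	/* Do the (comparatively expensive) scan at most once per jiffy. */
	if (last_advance_jiffies == jiffies)
		return false;
	last_advance_jiffies = jiffies;

	/* ... advance callbacks, possibly invoking the RCU core ... */
	return true;
}
```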
2014-03-12rcu: Throttle invoke_rcu_core() invocations due to non-lazy callbacksPaul E. McKenney
commit c337f8f58ed7cf150651d232af8222421a71463d upstream. If a non-lazy callback arrives on a CPU that has previously gone idle with no non-lazy callbacks, invoke_rcu_core() forces the RCU core to run. However, it does not update the conditions, which could result in several closely spaced invocations of the RCU core, which in turn could result in an excessively high context-switch rate and resulting high overhead. This commit therefore updates the ->all_lazy and ->nonlazy_posted_snap fields to prevent closely spaced invocations. Reported-by: Tibor Billes <tbilles@gmx.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Tibor Billes <tbilles@gmx.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12sched: Fix asymmetric scheduling for POWER7Vaidyanathan Srinivasan
commit 2042abe7977222ef606306faa2dce8fd51e98e65 upstream. Asymmetric scheduling within a core is a scheduler loadbalancing feature that is triggered when SD_ASYM_PACKING flag is set. The goal for the load balancer is to move tasks to lower order idle SMT threads within a core on a POWER7 system. In nohz_kick_needed(), we intend to check if our sched domain (core) is completely busy or we have idle cpu. The following check for SD_ASYM_PACKING: (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu) already covers the case of checking if the domain has an idle cpu, because cpumask_first_and() will not yield any set bits if this domain has no idle cpu. Hence, nr_busy check against group weight can be removed. Reported-by: Michael Neuling <michael.neuling@au1.ibm.com> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Tested-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: vincent.guittot@linaro.org Cc: bitbucket@online.de Cc: benh@kernel.crashing.org Cc: anton@samba.org Cc: Morten.Rasmussen@arm.com Cc: pjt@google.com Link: http://lkml.kernel.org/r/20131030031242.23426.13019.stgit@preeti.in.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12NOHZ: Check for nohz active instead of nohz enabledThomas Gleixner
commit d689fe222a858c767cb8594faf280048e532b53f upstream. RCU and the fine grained idle time accounting functions check tick_nohz_enabled. But that variable is merely telling that NOHZ has been enabled in the config and not been disabled on the command line. But it does not tell anything about nohz being active. That's what all this should check for. Matthew reported that the idle accounting on his old P1 machine showed bogus values, when he enabled NOHZ in the config and did not disable it on the kernel command line. The reason is that his machine uses (refined) jiffies as a clocksource which explains why the "fine" grained accounting went into lala land, because it depends on when the system goes and leaves idle relative to the jiffies increment. Provide a tick_nohz_active indicator and let RCU and the accounting code use this instead of tick_nohz_enabled. Reported-and-tested-by: Matthew Whitehead <tedheadster@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: john.stultz@linaro.org Cc: mwhitehe@redhat.com Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1311132052240.30673@ionos.tec.linutronix.de Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12nohz: Fix another inconsistency between CONFIG_NO_HZ=n and nohz=offThomas Gleixner
commit 0e576acbc1d9600cf2d9b4a141a2554639959d50 upstream. If CONFIG_NO_HZ=n tick_nohz_get_sleep_length() returns NSEC_PER_SEC/HZ. If CONFIG_NO_HZ=y and the nohz functionality is disabled via the command line option "nohz=off" or not enabled due to missing hardware support, then tick_nohz_get_sleep_length() returns 0. That happens because ts->sleep_length is never set in that case. Set it to NSEC_PER_SEC/HZ when the NOHZ mode is inactive. Reported-by: Michal Hocko <mhocko@suse.cz> Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-12futex: move user address verification up to common codeLinus Torvalds
commit 5cdec2d833748fbd27d3682f7209225c504c79c5 upstream. When debugging the read-only hugepage case, I was confused by the fact that get_futex_key() did an access_ok() only for the non-shared futex case, since the user address checking really isn't in any way specific to the private key handling. Now, it turns out that the shared key handling does effectively do the equivalent checks inside get_user_pages_fast() (it doesn't actually check the address range on x86, but does check the page protections for being a user page). So it wasn't actually a bug, but the fact that we treat the address differently for private and shared futexes threw me for a loop. Just move the check up, so that it gets done for both cases. Also, use the 'rw' parameter for the type, even if it doesn't actually matter any more (it's a historical artifact of the old racy i386 "page faults from kernel space don't check write protections"). Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
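A sketch of the resulting shape of the key-lookup path (heavily simplified, with hypothetical helper names; the real get_futex_key() does far more): the user-address check happens once, before the private/shared split.

```c
#include <linux/uaccess.h>

static int get_private_key(u32 __user *uaddr);		/* hypothetical helper */
static int get_shared_key(u32 __user *uaddr, int rw);	/* hypothetical helper */

static int get_futex_key_like(u32 __user *uaddr, int fshared, int rw)
{
	/* Checked up front now, for private and shared futexes alike. */
	if (unlikely(!access_ok(rw, uaddr, sizeof(u32))))
		return -EFAULT;

	if (!fshared)
		return get_private_key(uaddr);

	/* Shared case: get_user_pages_fast() revalidates the page anyway. */
	return get_shared_key(uaddr, rw);
}
```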
2014-03-12blktrace: Send BLK_TN_PROCESS events to all running tracesJan Kara
commit a404d5576bbe586a1097a8bc2f32c5f22651b0aa upstream. Currently each task sends BLK_TN_PROCESS event to the first traced device it interacts with after a new trace is started. When there are several traced devices and the task accesses more devices, this logic can result in BLK_TN_PROCESS being sent several times to some devices while it is never sent to other devices. Thus blkparse doesn't display command name when parsing some blktrace files. Fix the problem by sending BLK_TN_PROCESS event to all traced devices when a task interacts with any of them. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05perf: Fix hotplug splatPeter Zijlstra
commit e3703f8cdfcf39c25c4338c3ad8e68891cca3731 upstream. Drew Richardson reported that he could make the kernel go *boom* when hotplugging while having perf events active. It turned out that when you have a group event, the code in __perf_event_exit_context() fails to remove the group siblings from the context. We then proceed with destroying and freeing the event, and when you re-plug the CPU and try and add another event to that CPU, things go *boom* because you've still got dead entries there. Reported-by: Drew Richardson <drew.richardson@arm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/n/tip-k6v5wundvusvcseqj1si0oz0@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05workqueue: ensure @task is valid across kthread_stop()Lai Jiangshan
commit 5bdfff96c69a4d5ab9c49e60abf9e070ecd2acbb upstream. When a kworker should die, the kworker is notified through WORKER_DIE flag instead of kthread_should_stop(). This, IIRC, is primarily to keep the test synchronized inside worker_pool lock. WORKER_DIE is first set while holding pool->lock, the lock is dropped and kthread_stop() is called. Unfortunately, this means that there's a slight chance that the target kworker may see WORKER_DIE before kthread_stop() finishes and exits and frees the target task before or during kthread_stop(). Fix it by pinning the target task before setting WORKER_DIE and putting it after kthread_stop() is done. tj: Improved patch description and comment. Moved pinning above WORKER_DIE to better signify what it's protecting. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
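The pinning pattern the fix describes, as a sketch with a hypothetical flag name: take a reference on the task before the "please die" flag can become visible, and drop it only after kthread_stop() returns, so the task_struct cannot be freed underneath kthread_stop().

```c
#include <linux/kthread.h>
#include <linux/sched.h>

#define MY_WORKER_DIE	0x1	/* hypothetical stand-in for WORKER_DIE */

static void destroy_worker_like(struct task_struct *task, unsigned int *flags)
{
	/* Pin the task first so it stays valid across kthread_stop(). */
	get_task_struct(task);

	*flags |= MY_WORKER_DIE;	/* done under pool->lock in the real code */

	kthread_stop(task);		/* may sleep; worker may already be exiting */
	put_task_struct(task);		/* our reference kept the task_struct alive */
}
```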
2014-03-05cgroup: update cgroup_enable_task_cg_lists() to grab siglockTejun Heo
commit 532de3fc72adc2a6525c4d53c07bf81e1732083d upstream. Currently, there's nothing preventing cgroup_enable_task_cg_lists() from missing set PF_EXITING and race against cgroup_exit(). Depending on the timing, cgroup_exit() may finish with the task still linked on css_set leading to list corruption. Fix it by grabbing siglock in cgroup_enable_task_cg_lists() so that PF_EXITING is guaranteed to be visible. This whole on-demand cg_list optimization is extremely fragile and has ample possibility to lead to bugs which can cause things like once-a-year oops during boot. I'm wondering whether the better approach would be just adding "cgroup_disable=all" handling which disables the whole cgroup rather than tempting fate with this on-demand craziness. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05cgroup: fix locking in cgroup_cfts_commit()Tejun Heo
commit 48573a893303986e3b0b2974d6fb11f3d1bb7064 upstream. cgroup_cfts_commit() walks the cgroup hierarchy that the target subsystem is attached to and tries to apply the file changes. Due to the convolution with inode locking, it can't keep cgroup_mutex locked while iterating. It currently holds only RCU read lock around the actual iteration and then pins the found cgroup using dget(). Unfortunately, this is incorrect. Although the iteration does check cgroup_is_dead() before invoking dget(), there's nothing which prevents the dentry from going away in between. Note that this is different from the usual css iterations where css_tryget() is used to pin the css - css_tryget() tests whether the css can be pinned and fails if not. The problem can be solved by simply holding cgroup_mutex instead of RCU read lock around the iteration, which actually reduces LOC. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05cgroup: fix error return from cgroup_create()Tejun Heo
commit b58c89986a77a23658682a100eb15d8edb571ebb upstream. cgroup_create() was returning 0 after allocation failures. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05cgroup: fix error return value in cgroup_mount()Tejun Heo
commit eb46bf89696972b856a9adb6aebd5c7b65c266e4 upstream. When cgroup_mount() fails to allocate an id for the root, it didn't set ret before jumping to unlock_drop ending up returning 0 after a failure. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-02-26perf: Enforce 1 as lower limit for perf_event_max_sample_rateKnut Petersen
commit 723478c8a471403c53cf144999701f6e0c4bbd11 upstream. /proc/sys/kernel/perf_event_max_sample_rate will accept negative values as well as 0. Negative values are unreasonable, and 0 causes a divide by zero exception in perf_proc_update_handler. This patch enforces a lower limit of 1. Signed-off-by: Knut Petersen <Knut_Petersen@t-online.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5242DB0C.4070005@t-online.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
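A sketch of how such a lower bound is typically wired up through a sysctl table (assumed table and data names; the real knob is defined next to perf_proc_update_handler): .extra1 supplies the minimum that proc_dointvec_minmax() clamps against.

```c
#include <linux/sysctl.h>

static int one = 1;
static int my_max_sample_rate = 100000;	/* stands in for the perf sysctl */

static struct ctl_table my_table[] = {
	{
		.procname	= "perf_event_max_sample_rate",
		.data		= &my_max_sample_rate,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &one,		/* rejects 0 and negative values */
	},
	{ }
};
```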
2014-02-22genirq: Add missing irq_to_desc export for CONFIG_SPARSE_IRQ=nPaul Gortmaker
commit 2c45aada341121438affc4cb8d5b4cfaa2813d3d upstream. In allmodconfig builds for sparc and any other arch which does not set CONFIG_SPARSE_IRQ, the following will be seen at modpost: CC [M] lib/cpu-notifier-error-inject.o CC [M] lib/pm-notifier-error-inject.o ERROR: "irq_to_desc" [drivers/gpio/gpio-mcp23s08.ko] undefined! make[2]: *** [__modpost] Error 1 This happens because commit 3911ff30f5 ("genirq: export handle_edge_irq() and irq_to_desc()") added one export for it, but there were actually two instances of it, in an if/else clause for CONFIG_SPARSE_IRQ. Add the second one. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Jiri Kosina <jkosina@suse.cz> Link: http://lkml.kernel.org/r/1392057610-11514-1-git-send-email-paul.gortmaker@windriver.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22ring-buffer: Fix first commit on sub-buffer having non-zero deltaSteven Rostedt (Red Hat)
commit d651aa1d68a2f0a7ee65697b04c6a92f8c0a12f2 upstream. Each sub-buffer (buffer page) has a full 64 bit timestamp. The events on that page use a 27 bit delta against that timestamp in order to save on bits written to the ring buffer. If the time between events is larger than what the 27 bits can hold, a "time extend" event is added to hold the entire 64 bit timestamp again and the events after that hold a delta from that timestamp. As a "time extend" is always paired with an event, it is logical to just allocate the event with the time extend, to make things a bit more efficient. Unfortunately, when the pairing code was written, it removed the "delta = 0" from the first commit on a page, causing the events on the page to be slightly skewed. Fixes: 69d1b839f7ee "ring-buffer: Bind time extend and data events together" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22time: Fix overflow when HZ is smaller than 60Mikulas Patocka
commit 80d767d770fd9c697e434fd080c2db7b5c60c6dd upstream. When compiling for the IA-64 ski emulator, HZ is set to 32 because the emulation is slow and we don't want to waste too many cycles processing timers. Alpha also has an option to set HZ to 32. This causes integer overflow in kernel/time/jiffies.c: kernel/time/jiffies.c:66:2: warning: large integer implicitly truncated to unsigned type [-Woverflow] .mult = NSEC_PER_JIFFY << JIFFIES_SHIFT, /* details above */ ^ This patch reduces the JIFFIES_SHIFT value to avoid the overflow. Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1401241639100.23871@file01.intranet.prod.int.rdu2.redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
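The overflow, with illustrative numbers (the concrete JIFFIES_SHIFT values here are assumptions, not taken from the patch):

```c
/*
 * HZ = 32  =>  NSEC_PER_JIFFY = 1,000,000,000 / 32 = 31,250,000
 *
 * With a JIFFIES_SHIFT of 8:
 *     31,250,000 << 8 = 8,000,000,000  >  U32_MAX (4,294,967,295)
 * so the u32 .mult of the jiffies clocksource is silently truncated.
 *
 * Lowering the shift keeps the product in range, e.g.:
 *     31,250,000 << 7 = 4,000,000,000  <  U32_MAX
 */
```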
2014-02-22tick: Clear broadcast pending bit when switching to oneshotThomas Gleixner
commit dd5fd9b91a77b4c9c28b7ef9c181b1a875820d0a upstream. AMD systems which use the C1E workaround in the amd_e400_idle routine trigger the WARN_ON_ONCE in the broadcast code when onlining a CPU. The reason is that the idle routine of those AMD systems switches the cpu into forced broadcast mode early on before the newly brought up CPU can switch over to high resolution / NOHZ mode. The timer related CPU1 bringup looks like this: clockevent_register_device(local_apic); tick_setup(local_apic); ... idle() tick_broadcast_on_off(FORCE); tick_broadcast_oneshot_control(ENTER) cpumask_set(cpu, broadcast_oneshot_mask); halt(); Now the broadcast interrupt on CPU0 sets CPU1 in the broadcast_pending_mask and wakes CPU1. So CPU1 continues: local_apic_timer_interrupt() tick_handle_periodic(); softirq() tick_init_highres(); cpumask_clr(cpu, broadcast_oneshot_mask); tick_broadcast_oneshot_control(ENTER) WARN_ON(cpumask_test(cpu, broadcast_pending_mask); So while we remove CPU1 from the broadcast_oneshot_mask when we switch over to highres mode, we do not clear the pending bit, which then triggers the warning when we go back to idle. The reason why this is only visible on C1E affected AMD systems is that the other machines enter the deep sleep states via acpi_idle/intel_idle and exit the broadcast mode before executing the remote triggered local_apic_timer_interrupt. So the pending bit is already cleared when the switch over to highres mode is clearing the oneshot mask. The solution is simple: Clear the pending bit together with the mask bit when we switch over to highres mode. Stanislaw came up independently with the same patch by enforcing the C1E workaround and debugging the fallout. I picked mine, because mine has a changelog :) Reported-by: poma <pomidorabelisima@gmail.com> Debugged-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Olaf Hering <olaf@aepfle.de> Cc: Dave Jones <davej@redhat.com> Cc: Justin M. Forbes <jforbes@redhat.com> Cc: Josh Boyer <jwboyer@redhat.com> Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1402111434180.21991@ionos.tec.linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-20genirq: Generic irq chip requires IRQ_DOMAINNitin A Kamble
commit 923fa4ea382f592dee2ba3b205befb90cbddf3af upstream. The generic_chip.c uses interfaces from irq_domain.c which is controlled by the IRQ_DOMAIN config option, but there is no Kconfig dependency so the build can fail: linux/kernel/irq/generic-chip.c:400:11: error: 'irq_domain_xlate_onetwocell' undeclared here (not in a function) Select IRQ_DOMAIN when GENERIC_IRQ_CHIP is selected. Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com> Link: http://lkml.kernel.org/r/1391129410-54548-2-git-send-email-nitin.a.kamble@intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13timekeeping: Fix missing timekeeping_update in suspend pathJohn Stultz
commit 330a1617b0a6268d427aa5922c94d082b1d3e96d upstream. Since 48cdc135d4840 (Implement a shadow timekeeper), we have to call timekeeping_update() after any adjustment to the timekeeping structure in order to make sure that any adjustments to the structure persist. In the timekeeping suspend path, we update the timekeeper structure, so we should be sure to update the shadow-timekeeper before releasing the timekeeping locks. Currently this isn't done. In most cases, the next time related code to run would be timekeeping_resume, which does update the shadow-timekeeper, but in an abundance of caution, this patch adds the call to timekeeping_update() in the suspend path. Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13timekeeping: Fix CLOCK_TAI timer/nanosleep delaysJohn Stultz
commit 04005f6011e3b504cd4d791d9769f7cb9a3b2eae upstream. A think-o in the calculation of the monotonic -> tai time offset results in CLOCK_TAI timers and nanosleeps expiring late (the latency is ~2x the tai offset). Fix this by adding the tai offset to the realtime offset instead of subtracting it. Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-133.13.y: timekeeping: Fix clock_set/clock_was_set think-oJohn Stultz
In backporting 6fdda9a9c5db367130cf32df5d6618d08b89f46a (timekeeping: Avoid possible deadlock from clock_was_set_delayed), I realized the patch had a think-o where instead of checking clock_set I accidentally typed clock_was_set (which is a function - so the conditional is always true). Upstream this was resolved in the immediately following patch 47a1b796306356f358e515149d86baf0cc6bf007 (tick/timekeeping: Call update_wall_time outside the jiffies lock). But since that patch really isn't -stable material, this patch only pulls the name change. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13timekeeping: Avoid possible deadlock from clock_was_set_delayedJohn Stultz
commit 6fdda9a9c5db367130cf32df5d6618d08b89f46a upstream. As part of normal operations, the hrtimer subsystem frequently calls into the timekeeping code, creating a locking order of hrtimer locks -> timekeeping locks. clock_was_set_delayed() was supposed to allow us to avoid deadlocks between the timekeeping and hrtimer subsystems, so that we could notify the hrtimer subsystem that the time had changed while holding the timekeeping locks. This was done by scheduling delayed work that would run later once we were out of the timekeeping code. But unfortunately the lock chains are complex enough that in scheduling delayed work, we end up eventually trying to grab an hrtimer lock. Sasha Levin noticed this in testing when the new seqlock lockdep enablement triggered the following (somewhat abbreviated) message: [ 251.100221] ====================================================== [ 251.100221] [ INFO: possible circular locking dependency detected ] [ 251.100221] 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053 Not tainted [ 251.101967] ------------------------------------------------------- [ 251.101967] kworker/10:1/4506 is trying to acquire lock: [ 251.101967] (timekeeper_seq){----..}, at: [<ffffffff81160e96>] retrigger_next_event+0x56/0x70 [ 251.101967] [ 251.101967] but task is already holding lock: [ 251.101967] (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70 [ 251.101967] [ 251.101967] which lock already depends on the new lock. [ 251.101967] [ 251.101967] [ 251.101967] the existing dependency chain (in reverse order) is: [ 251.101967] -> #5 (hrtimer_bases.lock#11){-.-...}: [snipped] -> #4 (&rt_b->rt_runtime_lock){-.-...}: [snipped] -> #3 (&rq->lock){-.-.-.}: [snipped] -> #2 (&p->pi_lock){-.-.-.}: [snipped] -> #1 (&(&pool->lock)->rlock){-.-...}: [ 251.101967] [<ffffffff81194803>] validate_chain+0x6c3/0x7b0 [ 251.101967] [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580 [ 251.101967] [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0 [ 251.101967] [<ffffffff84398500>] _raw_spin_lock+0x40/0x80 [ 251.101967] [<ffffffff81153e69>] __queue_work+0x1a9/0x3f0 [ 251.101967] [<ffffffff81154168>] queue_work_on+0x98/0x120 [ 251.101967] [<ffffffff81161351>] clock_was_set_delayed+0x21/0x30 [ 251.101967] [<ffffffff811c4bd1>] do_adjtimex+0x111/0x160 [ 251.101967] [<ffffffff811e2711>] compat_sys_adjtimex+0x41/0x70 [ 251.101967] [<ffffffff843a4b49>] ia32_sysret+0x0/0x5 [ 251.101967] -> #0 (timekeeper_seq){----..}: [snipped] [ 251.101967] other info that might help us debug this: [ 251.101967] [ 251.101967] Chain exists of: timekeeper_seq --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock#11 [ 251.101967] Possible unsafe locking scenario: [ 251.101967] [ 251.101967] CPU0 CPU1 [ 251.101967] ---- ---- [ 251.101967] lock(hrtimer_bases.lock#11); [ 251.101967] lock(&rt_b->rt_runtime_lock); [ 251.101967] lock(hrtimer_bases.lock#11); [ 251.101967] lock(timekeeper_seq); [ 251.101967] [ 251.101967] *** DEADLOCK *** [ 251.101967] [ 251.101967] 3 locks held by kworker/10:1/4506: [ 251.101967] #0: (events){.+.+.+}, at: [<ffffffff81154960>] process_one_work+0x200/0x530 [ 251.101967] #1: (hrtimer_work){+.+...}, at: [<ffffffff81154960>] process_one_work+0x200/0x530 [ 251.101967] #2: (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70 [ 251.101967] [ 251.101967] stack backtrace: [ 251.101967] CPU: 10 PID: 4506 Comm: kworker/10:1 Not tainted 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053 [ 251.101967] Workqueue: events clock_was_set_work So the best solution is to avoid calling clock_was_set_delayed() while holding the timekeeping lock, and instead use a flag variable to decide if we should call clock_was_set() once we've released the locks. This works for the case here, where do_adjtimex() was the deadlock trigger point. Unfortunately, in update_wall_time() we still hold the jiffies lock, which would deadlock with the ipi triggered by clock_was_set(), preventing us from calling it even after we drop the timekeeping lock. So instead call clock_was_set_delayed() at that point. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
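The "flag now, notify later" pattern the fix describes, as a generic sketch (hypothetical lock, structure and helper names): remember that a notification is needed while the timekeeping-style lock is held, and only call the routine that takes other locks after everything has been dropped.

```c
#include <linux/hrtimer.h>
#include <linux/spinlock.h>

struct my_timekeeper {			/* hypothetical */
	raw_spinlock_t lock;
	/* ... */
};

static bool apply_adjustment(struct my_timekeeper *tk);	/* hypothetical */

static void adjust_time_like(struct my_timekeeper *tk)
{
	unsigned long flags;
	bool clock_set = false;

	raw_spin_lock_irqsave(&tk->lock, flags);

	if (apply_adjustment(tk))	/* e.g. a leap second was inserted */
		clock_set = true;	/* only record the fact for now */

	raw_spin_unlock_irqrestore(&tk->lock, flags);

	/*
	 * Safe here: no timekeeping locks are held, so reaching hrtimer
	 * locks (directly or via a workqueue) cannot deadlock.
	 */
	if (clock_set)
		clock_was_set();
}
```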
2014-02-13timekeeping: Fix potential lost pv notification of time changeJohn Stultz
commit 5258d3f25c76f6ab86e9333abf97a55a877d3870 upstream. In 780427f0e11 (Indicate that clock was set in the pvclock gtod notifier), logic was added to pass a CLOCK_WAS_SET notification to the pvclock notifier chain. While that patch added an action flag returned from accumulate_nsecs_to_secs(), it only uses the returned value in one location, and not in the logarithmic accumulation. This means if a leap second triggered during the logarithmic accumulation (which is most likely where it would happen), the notification that the clock was set would not make it to the pv notifiers. This patch extends logarithmic_accumulation() to pass down that action flag so proper notification will occur. This patch also changes the variable action -> clock_set per Ingo's suggestion. Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: <xen-devel@lists.xen.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13timekeeping: Fix lost updates to tai adjustmentJohn Stultz
commit f55c07607a38f84b5c7e6066ee1cfe433fa5643c upstream. Since 48cdc135d4840 (Implement a shadow timekeeper), we have to call timekeeping_update() after any adjustment to the timekeeping structure in order to make sure that any adjustments to the structure persist. Unfortunately, the updates to the tai offset via adjtimex do not trigger this update, causing adjustments to the tai offset to be made and then over-written by the previous value at the next update_wall_time() call. This patch resolves the issue by calling timekeeping_update() right after setting the tai offset. Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13ftrace: Have function graph only trace based on global_ops filtersSteven Rostedt (Red Hat)
commit 23a8e8441a0a74dd612edf81dc89d1600bc0a3d1 upstream. Doing some different tests, I discovered that function graph tracing, when filtered via the set_ftrace_filter and set_ftrace_notrace files, does not always keep with them if another function ftrace_ops is registered to trace functions. The reason is that function graph just happens to trace all functions that the function tracer enables. When there was only one user of function tracing, the function graph tracer did not need to worry about being called by functions that it did not want to trace. But now that there are other users, this becomes a problem. For example, one just needs to do the following: # cd /sys/kernel/debug/tracing # echo schedule > set_ftrace_filter # echo function_graph > current_tracer # cat trace [..] 0) | schedule() { ------------------------------------------ 0) <idle>-0 => rcu_pre-7 ------------------------------------------ 0) ! 2980.314 us | } 0) | schedule() { ------------------------------------------ 0) rcu_pre-7 => <idle>-0 ------------------------------------------ 0) + 20.701 us | } # echo 1 > /proc/sys/kernel/stack_tracer_enabled # cat trace [..] 1) + 20.825 us | } 1) + 21.651 us | } 1) + 30.924 us | } /* SyS_ioctl */ 1) | do_page_fault() { 1) | __do_page_fault() { 1) 0.274 us | down_read_trylock(); 1) 0.098 us | find_vma(); 1) | handle_mm_fault() { 1) | _raw_spin_lock() { 1) 0.102 us | preempt_count_add(); 1) 0.097 us | do_raw_spin_lock(); 1) 2.173 us | } 1) | do_wp_page() { 1) 0.079 us | vm_normal_page(); 1) 0.086 us | reuse_swap_page(); 1) 0.076 us | page_move_anon_rmap(); 1) | unlock_page() { 1) 0.082 us | page_waitqueue(); 1) 0.086 us | __wake_up_bit(); 1) 1.801 us | } 1) 0.075 us | ptep_set_access_flags(); 1) | _raw_spin_unlock() { 1) 0.098 us | do_raw_spin_unlock(); 1) 0.105 us | preempt_count_sub(); 1) 1.884 us | } 1) 9.149 us | } 1) + 13.083 us | } 1) 0.146 us | up_read(); When the stack tracer was enabled, it enabled all functions to be traced, which now the function graph tracer also traces. This is a side effect that should not occur. To fix this a test is added when the function tracing is changed, as well as when the graph tracer is enabled, to see if anything other than the ftrace global_ops function tracer is enabled. If so, then the graph tracer calls a test trampoline that will look at the function that is being traced and compare it with the filters defined by the global_ops. As an optimization, if there's no other function tracers registered, or if the only registered function tracers also use the global ops, the function graph infrastructure will call the registered function graph callback directly and not go through the test trampoline. Fixes: d2d45c7a03a2 "tracing: Have stack_tracer use a separate list of functions" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13ftrace: Fix synchronization location disabling and freeing ftrace_opsSteven Rostedt (Red Hat)
commit a4c35ed241129dd142be4cadb1e5a474a56d5464 upstream. The synchronization needed after ftrace_ops are unregistered must happen after the callback is disabled from being called by functions. The current location happens after the function is being removed from the internal lists, but not after the function callbacks were disabled, leaving the functions susceptible to being called after their callbacks are freed. This affects perf and any external users of function tracing (LTTng and SystemTap). Fixes: cdbe61bfe704 "ftrace: Allow dynamically allocated function tracers" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13ftrace: Synchronize setting function_trace_op with ftrace_trace_functionSteven Rostedt (Red Hat)
commit 405e1d834807e51b2ebd3dea81cb51e53fb61504 upstream. ftrace_trace_function is a variable that holds what function will be called directly by the assembly code (mcount). If just a single function is registered and it handles recursion itself, then the assembly will call that function directly without any helper function. It also passes in the ftrace_op that was registered with the callback. The ftrace_op to send is stored in the function_trace_op variable. The ftrace_trace_function and function_trace_op need to be coordinated such that the called callback won't be called with the wrong ftrace_op, otherwise bad things can happen if it expected a different op. Luckily, there's no callback that doesn't use the helper functions that requires this. But there soon will be and this needs to be fixed. Use a set_function_trace_op to store the ftrace_op to set the function_trace_op to when it is safe to do so (during the update function within the breakpoint or stop machine calls). Or if dynamic ftrace is not being used (static tracing) then we have to do a bit more synchronization when the ftrace_trace_function is set as that takes effect immediately (as opposed to dynamic ftrace doing it with the modification of the trampoline). Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13audit: reset audit backlog wait time after error recoveryRichard Guy Briggs
commit e789e561a50de0aaa8c695662d97aaa5eac9d55f upstream. When the audit queue overflows and times out (audit_backlog_wait_time), the audit queue overflow timeout is set to zero. Once the audit queue overflow timeout condition recovers, the timeout should be reset to the original value. See also: https://lkml.org/lkml/2013/9/2/473 Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Dan Duval <dan.duval@oracle.com> Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13tracing: Check if tracing is enabled in trace_puts()Steven Rostedt (Red Hat)
commit 3132e107d608f8753240d82d61303c500fd515b4 upstream. If trace_puts() is used very early in boot up, it can crash the machine if it is called before the ring buffer is allocated. If a trace_printk() is used with no arguments, then it will be converted into a trace_puts() and suffer the same fate. Fixes: 09ae72348ecc "tracing: Add trace_puts() for even faster trace_printk() tracing" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13tracing: Have trace buffer point back to trace_arraySteven Rostedt (Red Hat)
commit dced341b2d4f06668efaab33f88de5d287c0f45b upstream. The trace buffer has a descriptor pointer that goes back to the trace array. But it was never assigned. Luckily, nothing uses it (yet), but it will in the future. Although nothing currently uses this, if any of the new features get backported to older kernels, and because this is such a simple change, I'm marking it for stable too. Fixes: 12883efb670c "tracing: Consolidate max_tr into main trace_array structure" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>