linux/kernel, branch v3.12.4

ntp: Make periodic RTC update more reliable

2013-12-08T15:29:16Z

commit a97ad0c4b447a132a322cedc3a5f7fa4cab4b304 upstream. The current code requires that the scheduled update of the RTC happens in the closest tick to the half of the second. This seems to be difficult to achieve reliably. The scheduled work may be missing the target time by a tick or two and be constantly rescheduled every second. Relax the limit to 10 ticks. As a typical RTC drifts in the 11-minute update interval by several milliseconds, this shouldn't affect the overall accuracy of the RTC much. Signed-off-by: Miroslav Lichvar Signed-off-by: John Stultz Cc: Josh Boyer Signed-off-by: Greg Kroah-Hartman

cpuset: Fix memory allocator deadlock

2013-12-04T19:05:55Z

commit 0fc0287c9ed1ffd3706f8b4d9b314aa102ef1245 upstream. Juri hit the below lockdep report: [ 4.303391] ====================================================== [ 4.303392] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ] [ 4.303394] 3.12.0-dl-peterz+ #144 Not tainted [ 4.303395] ------------------------------------------------------ [ 4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: [ 4.303399] (&p->mems_allowed_seq){+.+...}, at: [] new_slab+0x6c/0x290 [ 4.303417] [ 4.303417] and this task is already holding: [ 4.303418] (&(&q->__queue_lock)->rlock){..-...}, at: [] blk_execute_rq_nowait+0x5b/0x100 [ 4.303431] which would create a new lock dependency: [ 4.303432] (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...} [ 4.303436] [ 4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock: [ 4.303918] -> (&p->mems_allowed_seq){+.+...} ops: 2762 { [ 4.303922] HARDIRQ-ON-W at: [ 4.303923] [] __lock_acquire+0x65a/0x1ff0 [ 4.303926] [] lock_acquire+0x93/0x140 [ 4.303929] [] kthreadd+0x86/0x180 [ 4.303931] [] ret_from_fork+0x7c/0xb0 [ 4.303933] SOFTIRQ-ON-W at: [ 4.303933] [] __lock_acquire+0x68c/0x1ff0 [ 4.303935] [] lock_acquire+0x93/0x140 [ 4.303940] [] kthreadd+0x86/0x180 [ 4.303955] [] ret_from_fork+0x7c/0xb0 [ 4.303959] INITIAL USE at: [ 4.303960] [] __lock_acquire+0x344/0x1ff0 [ 4.303963] [] lock_acquire+0x93/0x140 [ 4.303966] [] kthreadd+0x86/0x180 [ 4.303969] [] ret_from_fork+0x7c/0xb0 [ 4.303972] } Which reports that we take mems_allowed_seq with interrupts enabled. A little digging found that this can only be from cpuset_change_task_nodemask(). This is an actual deadlock because an interrupt doing an allocation will hit get_mems_allowed()->...->__read_seqcount_begin(), which will spin forever waiting for the write side to complete. Cc: John Stultz Cc: Mel Gorman Reported-by: Juri Lelli Signed-off-by: Peter Zijlstra Tested-by: Juri Lelli Acked-by: Li Zefan Acked-by: Mel Gorman Signed-off-by: Tejun Heo Signed-off-by: Greg Kroah-Hartman

cgroup: fix cgroup_subsys_state leak for seq_files

2013-12-04T19:05:55Z

commit e605b36575e896edd8161534550c9ea021b03bc0 upstream. If a cgroup file implements either read_map() or read_seq_string(), such file is served using seq_file by overriding file->f_op to cgroup_seqfile_operations, which also overrides the release method to single_release() from cgroup_file_release(). Because cgroup_file_open() didn't use to acquire any resources, this used to be fine, but since f7d58818ba42 ("cgroup: pin cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open() pins the css (cgroup_subsys_state) which is put by cgroup_file_release(). The patch forgot to update the release path for seq_files and each open/release cycle leaks a css reference. Fix it by updating cgroup_file_release() to also handle seq_files and using it for seq_file release path too. Signed-off-by: Tejun Heo Signed-off-by: Greg Kroah-Hartman

cgroup: use a dedicated workqueue for cgroup destruction

2013-12-04T19:05:55Z

commit e5fca243abae1445afbfceebda5f08462ef869d3 upstream. Since be44562613851 ("cgroup: remove synchronize_rcu() from cgroup_diput()"), cgroup destruction path makes use of workqueue. css freeing is performed from a work item from that point on and a later commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two steps"), moves css offlining to workqueue too. As cgroup destruction isn't depended upon for memory reclaim, the destruction work items were put on the system_wq; unfortunately, some controller may block in the destruction path for considerable duration while holding cgroup_mutex. As large part of destruction path is synchronized through cgroup_mutex, when combined with high rate of cgroup removals, this has potential to fill up system_wq's max_active of 256. Also, it turns out that memcg's css destruction path ends up queueing and waiting for work items on system_wq through work_on_cpu(). If such operation happens while system_wq is fully occupied by cgroup destruction work items, work_on_cpu() can't make forward progress because system_wq is full and other destruction work items on system_wq can't make forward progress because the work item waiting for work_on_cpu() is holding cgroup_mutex, leading to deadlock. This can be fixed by queueing destruction work items on a separate workqueue. This patch creates a dedicated workqueue - cgroup_destroy_wq - for this purpose. As these work items shouldn't have inter-dependencies and mostly serialized by cgroup_mutex anyway, giving high concurrency level doesn't buy anything and the workqueue's @max_active is set to 1 so that destruction work items are executed one by one on each CPU. Hugh Dickins: Because cgroup_init() is run before init_workqueues(), cgroup_destroy_wq can't be allocated from cgroup_init(). Do it from a separate core_initcall(). In the future, we probably want to reorder so that workqueue init happens before cgroup_init(). Signed-off-by: Tejun Heo Reported-by: Hugh Dickins Reported-by: Shawn Bohrer Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils Signed-off-by: Greg Kroah-Hartman

workqueue: fix ordered workqueues in NUMA setups

2013-12-04T19:05:55Z

commit 8a2b75384444488fc4f2cbb9f0921b6a0794838f upstream. An ordered workqueue implements execution ordering by using single pool_workqueue with max_active == 1. On a given pool_workqueue, work items are processed in FIFO order and limiting max_active to 1 enforces the queued work items to be processed one by one. Unfortunately, 4c16bd327c ("workqueue: implement NUMA affinity for unbound workqueues") accidentally broke this guarantee by applying NUMA affinity to ordered workqueues too. On NUMA setups, an ordered workqueue would end up with separate pool_workqueues for different nodes. Each pool_workqueue still limits max_active to 1 but multiple work items may be executed concurrently and out of order depending on which node they are queued to. Fix it by using dedicated ordered_wq_attrs[] when creating ordered workqueues. The new attrs match the unbound ones except that no_numa is always set thus forcing all NUMA nodes to share the default pool_workqueue. While at it, add sanity check in workqueue creation path which verifies that an ordered workqueues has only the default pool_workqueue. Signed-off-by: Tejun Heo Reported-by: Libin Cc: Lai Jiangshan Signed-off-by: Greg Kroah-Hartman

ftrace: Fix function graph with loading of modules

2013-12-04T19:05:41Z

commit 8a56d7761d2d041ae5e8215d20b4167d8aa93f51 upstream. Commit 8c4f3c3fa9681 "ftrace: Check module functions being traced on reload" fixed module loading and unloading with respect to function tracing, but it missed the function graph tracer. If you perform the following # cd /sys/kernel/debug/tracing # echo function_graph > current_tracer # modprobe nfsd # echo nop > current_tracer You'll get the following oops message: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 2910 at /linux.git/kernel/trace/ftrace.c:1640 __ftrace_hash_rec_update.part.35+0x168/0x1b9() Modules linked in: nfsd exportfs nfs_acl lockd ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt CPU: 2 PID: 2910 Comm: bash Not tainted 3.13.0-rc1-test #7 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007 0000000000000668 ffff8800787efcf8 ffffffff814fe193 ffff88007d500000 0000000000000000 ffff8800787efd38 ffffffff8103b80a 0000000000000668 ffffffff810b2b9a ffffffff81a48370 0000000000000001 ffff880037aea000 Call Trace: [] dump_stack+0x4f/0x7c [] warn_slowpath_common+0x81/0x9b [] ? __ftrace_hash_rec_update.part.35+0x168/0x1b9 [] warn_slowpath_null+0x1a/0x1c [] __ftrace_hash_rec_update.part.35+0x168/0x1b9 [] ? __mutex_lock_slowpath+0x364/0x364 [] ftrace_shutdown+0xd7/0x12b [] unregister_ftrace_graph+0x49/0x78 [] graph_trace_reset+0xe/0x10 [] tracing_set_tracer+0xa7/0x26a [] tracing_set_trace_write+0x8b/0xbd [] ? ftrace_return_to_handler+0xb2/0xde [] ? __sb_end_write+0x5e/0x5e [] vfs_write+0xab/0xf6 [] ftrace_graph_caller+0x85/0x85 [] SyS_write+0x59/0x82 [] ftrace_graph_caller+0x85/0x85 [] system_call_fastpath+0x16/0x1b ---[ end trace 940358030751eafb ]--- The above mentioned commit didn't go far enough. Well, it covered the function tracer by adding checks in __register_ftrace_function(). The problem is that the function graph tracer circumvents that (for a slight efficiency gain when function graph trace is running with a function tracer. The gain was not worth this). The problem came with ftrace_startup() which should always be called after __register_ftrace_function(), if you want this bug to be completely fixed. Anyway, this solution moves __register_ftrace_function() inside of ftrace_startup() and removes the need to call them both. Reported-by: Dave Wysochanski Fixes: ed926f9b35cd ("ftrace: Use counters to enable functions to trace") Signed-off-by: Steven Rostedt Signed-off-by: Greg Kroah-Hartman

audit: log the audit_names record type

2013-12-04T19:05:39Z

commit d3aea84a4ace5ff9ce7fb7714cee07bebef681c2 upstream. ...to make it clear what the intent behind each record's operation was. In many cases you can infer this, based on the context of the syscall and the result. In other cases it's not so obvious. For instance, in the case where you have a file being renamed over another, you'll have two different records with the same filename but different inode info. By logging this information we can clearly tell which one was created and which was deleted. This fixes what was broken in commit bfcec708. Commit 79f6530c should also be backported to stable v3.7+. Signed-off-by: Jeff Layton Signed-off-by: Eric Paris Signed-off-by: Richard Guy Briggs Signed-off-by: Eric Paris Signed-off-by: Greg Kroah-Hartman

audit: fix info leak in AUDIT_GET requests

2013-12-04T19:05:39Z

commit 64fbff9ae0a0a843365d922e0057fc785f23f0e3 upstream. We leak 4 bytes of kernel stack in response to an AUDIT_GET request as we miss to initialize the mask member of status_set. Fix that. Cc: Al Viro Cc: Eric Paris Signed-off-by: Mathias Krause Signed-off-by: Richard Guy Briggs Signed-off-by: Eric Paris Signed-off-by: Greg Kroah-Hartman

audit: use nlmsg_len() to get message payload length

2013-12-04T19:05:39Z

commit 4d8fe7376a12bf4524783dd95cbc00f1fece6232 upstream. Using the nlmsg_len member of the netlink header to test if the message is valid is wrong as it includes the size of the netlink header itself. Thereby allowing to send short netlink messages that pass those checks. Use nlmsg_len() instead to test for the right message length. The result of nlmsg_len() is guaranteed to be non-negative as the netlink message already passed the checks of nlmsg_ok(). Also switch to min_t() to please checkpatch.pl. Cc: Al Viro Cc: Eric Paris Signed-off-by: Mathias Krause Signed-off-by: Richard Guy Briggs Signed-off-by: Eric Paris Signed-off-by: Greg Kroah-Hartman

audit: printk USER_AVC messages when audit isn't enabled

2013-12-04T19:05:39Z

commit 0868a5e150bc4c47e7a003367cd755811eb41e0b upstream. When the audit=1 kernel parameter is absent and auditd is not running, AUDIT_USER_AVC messages are being silently discarded. AUDIT_USER_AVC messages should be sent to userspace using printk(), as mentioned in the commit message of 4a4cd633 ("AUDIT: Optimise the audit-disabled case for discarding user messages"). When audit_enabled is 0, audit_receive_msg() discards all user messages except for AUDIT_USER_AVC messages. However, audit_log_common_recv_msg() refuses to allocate an audit_buffer if audit_enabled is 0. The fix is to special case AUDIT_USER_AVC messages in both functions. It looks like commit 50397bd1 ("[AUDIT] clean up audit_receive_msg()") introduced this bug. Signed-off-by: Tyler Hicks Cc: Al Viro Cc: Eric Paris Cc: linux-audit@redhat.com Acked-by: Kees Cook Signed-off-by: Richard Guy Briggs Signed-off-by: Eric Paris Signed-off-by: Greg Kroah-Hartman