linux/kernel, branch v2.6.38.4

next_pidmap: fix overflow condition

2011-04-21T21:32:52Z

commit c78193e9c7bcbf25b8237ad0dec82f805c4ea69b upstream. next_pidmap() just quietly accepted whatever 'last' pid that was passed in, which is not all that safe when one of the users is /proc. Admittedly the proc code should do some sanity checking on the range (and that will be the next commit), but that doesn't mean that the helper functions should just do that pidmap pointer arithmetic without checking the range of its arguments. So clamp 'last' to PID_MAX_LIMIT. The fact that we then do "last+1" doesn't really matter, the for-loop does check against the end of the pidmap array properly (it's only the actual pointer arithmetic overflow case we need to worry about, and going one bit beyond isn't going to overflow). [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ] Reported-by: Tavis Ormandy Analyzed-by: Robert Święcki Cc: Eric W. Biederman Cc: Pavel Emelyanov Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

sched: Fix erroneous all_pinned logic

2011-04-21T21:32:48Z

commit b30aef17f71cf9e24b10c11cbb5e5f0ebe8a85ab upstream. The scheduler load balancer has specific code to deal with cases of unbalanced system due to lots of unmovable tasks (for example because of hard CPU affinity). In those situation, it excludes the busiest CPU that has pinned tasks for load balance consideration such that it can perform second 2nd load balance pass on the rest of the system. This all works as designed if there is only one cgroup in the system. However, when we have multiple cgroups, this logic has false positives and triggers multiple load balance passes despite there are actually no pinned tasks at all. The reason it has false positives is that the all pinned logic is deep in the lowest function of can_migrate_task() and is too low level: load_balance_fair() iterates each task group and calls balance_tasks() to migrate target load. Along the way, balance_tasks() will also set a all_pinned variable. Given that task-groups are iterated, this all_pinned variable is essentially the status of last group in the scanning process. Task group can have number of reasons that no load being migrated, none due to cpu affinity. However, this status bit is being propagated back up to the higher level load_balance(), which incorrectly think that no tasks were moved. It kick off the all pinned logic and start multiple passes attempt to move load onto puller CPU. To fix this, move the all_pinned aggregation up at the iterator level. This ensures that the status is aggregated over all task-groups, not just last one in the list. Signed-off-by: Ken Chen Signed-off-by: Peter Zijlstra Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com Signed-off-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman

futex: Set FLAGS_HAS_TIMEOUT during futex_wait restart setup

2011-04-21T21:32:41Z

commit 0cd9c6494ee5c19aef085152bc37f3a4e774a9e1 upstream. The FLAGS_HAS_TIMEOUT flag was not getting set, causing the restart_block to restart futex_wait() without a timeout after a signal. Commit b41277dc7a18ee332d in 2.6.38 introduced the regression by accidentally removing the the FLAGS_HAS_TIMEOUT assignment from futex_wait() during the setup of the restart block. Restore the originaly behavior. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=32922 Reported-by: Tim Smith Reported-by: Torsten Hilbrich Signed-off-by: Darren Hart Signed-off-by: Eric Dumazet Cc: Peter Zijlstra Cc: John Kacur Link: http://lkml.kernel.org/r/%3Cdaac0eb3af607f72b9a4d3126b2ba8fb5ed3b883.1302820917.git.dvhart%40linux.intel.com%3E Signed-off-by: Thomas Gleixner Signed-off-by: Greg Kroah-Hartman

perf: Rebase max unprivileged mlock threshold on top of page size

2011-04-14T20:02:14Z

commit 20443384fe090c5f8aeb016e7e85659c5bbdd69f upstream. Ensure we allow 512 kiB + 1 page for user control without assuming a 4096 bytes page size. Reported-by: Peter Zijlstra Signed-off-by: Frederic Weisbecker Signed-off-by: Peter Zijlstra Cc: Arnaldo Carvalho de Melo Cc: Paul Mackerras Cc: Stephane Eranian LKML-Reference: <1301535209-9679-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman

perf: Fix task_struct reference leak

2011-04-14T20:02:14Z

commit fd1edb3aa2c1d92618d8f0c6d15d44ea41fcac6a upstream. sys_perf_event_open() had an imbalance in the number of task refs it took causing memory leakage Cc: Jiri Olsa Cc: Oleg Nesterov Signed-off-by: Peter Zijlstra LKML-Reference: Signed-off-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman

Relax si_code check in rt_sigqueueinfo and rt_tgsigqueueinfo

2011-04-14T20:02:02Z

commit 243b422af9ea9af4ead07a8ad54c90d4f9b6081a upstream. Commit da48524eb206 ("Prevent rt_sigqueueinfo and rt_tgsigqueueinfo from spoofing the signal code") made the check on si_code too strict. There are several legitimate places where glibc wants to queue a negative si_code different from SI_QUEUE: - This was first noticed with glibc's aio implementation, which wants to queue a signal with si_code SI_ASYNCIO; the current kernel causes glibc's tst-aio4 test to fail because rt_sigqueueinfo() fails with EPERM. - Further examination of the glibc source shows that getaddrinfo_a() wants to use SI_ASYNCNL (which the kernel does not even define). The timer_create() fallback code wants to queue signals with SI_TIMER. As suggested by Oleg Nesterov , loosen the check to forbid only the problematic SI_TKILL case. Reported-by: Klaus Dittrich Acked-by: Julien Tinnes Signed-off-by: Roland Dreier Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

perf: Better fit max unprivileged mlock pages for tools needs

2011-04-14T20:01:58Z

commit 880f57318450dbead6a03f9e31a1468924d6dd88 upstream. The maximum kilobytes of locked memory that an unprivileged user can reserve is of 512 kB = 128 pages by default, scaled to the number of onlined CPUs, which fits well with the tools that use 128 data pages by default. However tools actually use 129 pages, because they need one more for the user control page. Thus the default mlock threshold is not sufficient for the default tools needs and we always end up to evaluate the constant mlock rlimit policy, which doesn't have this scaling with the number of online CPUs. Hence, on systems that have more than 16 CPUs, we overlap the rlimit threshold and fail to mmap: $ perf record ls Error: failed to mmap with 1 (Operation not permitted) Just increase the max unprivileged mlock threshold by one page so that it supports well perf tools even after 16 CPUs. Reported-by: Han Pingtian Reported-by: Peter Zijlstra Reported-by: Arnaldo Carvalho de Melo Signed-off-by: Frederic Weisbecker Acked-by: Arnaldo Carvalho de Melo Cc: Stephane Eranian LKML-Reference: <1300904979-5508-1-git-send-email-fweisbec@gmail.com> Signed-off-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman

perf: Fix tear-down of inherited group events

2011-03-27T18:36:35Z

commit 38b435b16c36b0d863efcf3f07b34a6fac9873fd upstream. When destroying inherited events, we need to destroy groups too, otherwise the event iteration in perf_event_exit_task_context() will miss group siblings and we leak events with all the consequences. Reported-and-tested-by: Vince Weaver Signed-off-by: Peter Zijlstra LKML-Reference: <1300196470.2203.61.camel@twins> Signed-off-by: Ingo Molnar Signed-off-by: Greg Kroah-Hartman

sysctl: restrict write access to dmesg_restrict

2011-03-27T18:36:01Z

commit bfdc0b497faa82a0ba2f9dddcf109231dd519fcc upstream. When dmesg_restrict is set to 1 CAP_SYS_ADMIN is needed to read the kernel ring buffer. But a root user without CAP_SYS_ADMIN is able to reset dmesg_restrict to 0. This is an issue when e.g. LXC (Linux Containers) are used and complete user space is running without CAP_SYS_ADMIN. A unprivileged and jailed root user can bypass the dmesg_restrict protection. With this patch writing to dmesg_restrict is only allowed when root has CAP_SYS_ADMIN. Signed-off-by: Richard Weinberger Acked-by: Dan Rosenberg Acked-by: Serge E. Hallyn Cc: Eric Paris Cc: Kees Cook Cc: James Morris Cc: Eugene Teo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

Prevent rt_sigqueueinfo and rt_tgsigqueueinfo from spoofing the signal code

2011-03-27T18:35:55Z

commit da48524eb20662618854bb3df2db01fc65f3070c upstream. Userland should be able to trust the pid and uid of the sender of a signal if the si_code is SI_TKILL. Unfortunately, the kernel has historically allowed sigqueueinfo() to send any si_code at all (as long as it was negative - to distinguish it from kernel-generated signals like SIGILL etc), so it could spoof a SI_TKILL with incorrect siginfo values. Happily, it looks like glibc has always set si_code to the appropriate SI_QUEUE, so there are probably no actual user code that ever uses anything but the appropriate SI_QUEUE flag. So just tighten the check for si_code (we used to allow any negative value), and add a (one-time) warning in case there are binaries out there that might depend on using other si_code values. Signed-off-by: Julien Tinnes Acked-by: Oleg Nesterov Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman