aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2008-07-02sched: fix cpu hotplugDmitry Adamushko
Commit 79c537998d143b127c8c662a403c3356cb885f1c upstream the CPU hotplug problems (crashes under high-volume unplug+replug tests) seem to be related to migrate_dead_tasks(). Firstly I added traces to see all tasks being migrated with migrate_live_tasks() and migrate_dead_tasks(). On my setup the problem pops up (the one with "se == NULL" in the loop of pick_next_task_fair()) shortly after the traces indicate that some has been migrated with migrate_dead_tasks()). btw., I can reproduce it much faster now with just a plain cpu down/up loop. [disclaimer] Well, unless I'm really missing something important in this late hour [/desclaimer] pick_next_task() is not something appropriate for migrate_dead_tasks() :-) the following change seems to eliminate the problem on my setup (although, I kept it running only for a few minutes to get a few messages indicating migrate_dead_tasks() does move tasks and the system is still ok) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-07-02futexes: fix fault handling in futex_lock_piThomas Gleixner
commit 1b7558e457ed0de61023cfc913d2c342c7c3d9f2 upstream This patch addresses a very sporadic pi-futex related failure in highly threaded java apps on large SMP systems. David Holmes reported that the pi_state consistency check in lookup_pi_state triggered with his test application. This means that the kernel internal pi_state and the user space futex variable are out of sync. First we assumed that this is a user space data corruption, but deeper investigation revieled that the problem happend because the pi-futex code is not handling a fault in the futex_lock_pi path when the user space variable needs to be fixed up. The fault happens when a fork mapped the anon memory which contains the futex readonly for COW or the page got swapped out exactly between the unlock of the futex and the return of either the new futex owner or the task which was the expected owner but failed to acquire the kernel internal rtmutex. The current futex_lock_pi() code drops out with an inconsistent in case it faults and returns -EFAULT to user space. User space has no way to fixup that state. When we wrote this code we thought that we could not drop the hash bucket lock at this point to handle the fault. After analysing the code again it turned out to be wrong because there are only two tasks involved which might modify the pi_state and the user space variable: - the task which acquired the rtmutex - the pending owner of the pi_state which did not get the rtmutex Both tasks drop into the fixup_pi_state() function before returning to user space. The first task which acquired the hash bucket lock faults in the fixup of the user space variable, drops the spinlock and calls futex_handle_fault() to fault in the page. Now the second task could acquire the hash bucket lock and tries to fixup the user space variable as well. It either faults as well or it succeeds because the first task already faulted the page in. One caveat is to avoid a double fixup. After returning from the fault handling we reacquire the hash bucket lock and check whether the pi_state owner has been modified already. Reported-by: David Holmes <david.holmes@sun.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Holmes <david.holmes@sun.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-06-09capabilities: remain source compatible with 32-bit raw legacy capability ↵Andrew G. Morgan
support. upstream commit: ca05a99a54db1db5bca72eccb5866d2a86f8517f Source code out there hard-codes a notion of what the _LINUX_CAPABILITY_VERSION #define means in terms of the semantics of the raw capability system calls capget() and capset(). Its unfortunate, but true. Since the confusing header file has been in a released kernel, there is software that is erroneously using 64-bit capabilities with the semantics of 32-bit compatibilities. These recently compiled programs may suffer corruption of their memory when sys_getcap() overwrites more memory than they are coded to expect, and the raising of added capabilities when using sys_capset(). As such, this patch does a number of things to clean up the situation for all. It 1. forces the _LINUX_CAPABILITY_VERSION define to always retain its legacy value. 2. adopts a new #define strategy for the kernel's internal implementation of the preferred magic. 3. deprecates v2 capability magic in favor of a new (v3) magic number. The functionality of v3 is entirely equivalent to v2, the only difference being that the v2 magic causes the kernel to log a "deprecated" warning so the admin can find applications that may be using v2 inappropriately. [User space code continues to be encouraged to use the libcap API which protects the application from details like this. libcap-2.10 is the first to support v3 capabilities.] Fixes issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=447518. Thanks to Bojan Smojver for the report. [akpm@linux-foundation.org: s/depreciate/deprecate/g] [akpm@linux-foundation.org: be robust about put_user size] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Andrew G. Morgan <morgan@kernel.org> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Bojan Smojver <bojan@rexursive.com> Cc: stable@kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2008-06-09cgroups: remove node_ prefix_from ns subsystemCedric Le Goater
upstream commit: 5c02b575780d0d785815a1e7b79a98edddee895a This is a slight change in the namespace cgroup subsystem api. The change is that previously when cgroup_clone() was called (currently only from the unshare path in ns_proxy cgroup, you'd get a new group named "node_$pid" whereas now you'll get a group named after just your pid.) The only users who would notice it are those who are using the ns_proxy cgroup subsystem to auto-create cgroups when namespaces are unshared - something of an experimental feature, which I think really needs more complete container/namespace support in order to be useful. I suspect the only users are Cedric and Serge, or maybe a few others on containers@lists.linux-foundation.org. And in fact it would only be noticed by the users who make the assumption about how the name is generated, rather than getting it from the /proc/<pid>/cgroups file for the process in question. Whether the change is actually needed or not I'm fairly agnostic on, but I guess it is more elegant to just use the pid as the new group name rather than adding a fairly arbitrary "node_" prefix on the front. [menage@google.com: provided changelog] Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: "Paul Menage" <menage@google.com> Cc: "Serge E. Hallyn" <serue@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2008-05-09sched: fix hrtick_start_fair and CPU-HotplugPeter Zijlstra
commit: b328ca182f01c2a04b85e0ee8a410720b104fbcc upstream Gautham R Shenoy reported: > While running the usual CPU-Hotplug stress tests on linux-2.6.25, > I noticed the following in the console logs. > > This is a wee bit difficult to reproduce. In the past 10 runs I hit this > only once. > > ------------[ cut here ]------------ > > WARNING: at kernel/sched.c:962 hrtick+0x2e/0x65() > > Just wondering if we are doing a good job at handling the cancellation > of any per-cpu scheduler timers during CPU-Hotplug. This looks like its indeed not cancelled at all and migrates the it to another cpu. Fix it via a proper hotplug notifier mechanism. Reported-by: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-05-01hrtimer: raise softirq unlocked to avoid circular lock dependencyThomas Gleixner
commit 0c96c5979a522c3323c30a078a70120e29b5bdbc upstream The scheduler hrtimer bits in 2.6.25 introduced a circular lock dependency in a rare code path: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.25-sched-devel.git-x86-latest.git #19 ------------------------------------------------------- X/2980 is trying to acquire lock: (&rq->rq_lock_key#2){++..}, at: [<ffffffff80230146>] task_rq_lock+0x56/0xa0 but task is already holding lock: (&cpu_base->lock){++..}, at: [<ffffffff80257ae1>] lock_hrtimer_base+0x31/0x60 which lock already depends on the new lock. The scenario which leads to this is: posix-timer signal is delivered -> posix-timer is rearmed timer is already expired in hrtimer_enqueue() -> softirq is raised To prevent this we need to move the raise of the softirq out of the base->lock protected code path. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-05-01hrtimer: timeout too long when using HRTIMER_CB_SOFTIRQBodo Stroesser
commit d7b41a24bfb5d7fa02f7b49be1293d468814e424 upstream When using hrtimer with timer->cb_mode == HRTIMER_CB_SOFTIRQ in some cases the clockevent is not programmed. This happens, if: - a timer is rearmed while it's state is HRTIMER_STATE_CALLBACK - hrtimer_reprogram() returns -ETIME, when it is called after CALLBACK is finished. This occurs if the new timer->expires is in the past when CALLBACK is done. In this case, the timer needs to be removed from the tree and put onto the pending list again. The patch is against 2.6.22.5, but AFAICS, it is relevant for 2.6.25 also (in run_hrtimer_pending()). Signed-off-by: Bodo Stroesser <bstroesser@fujitsu-siemens.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-05-01cgroup: fix a race condition in manipulating tsk->cg_listLi Zefan
commit: 0e04388f0189fa1f6812a8e1cb6172136eada87e When I ran a test program to fork mass processes and at the same time 'cat /cgroup/tasks', I got the following oops: ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:72! invalid opcode: 0000 [#1] SMP Pid: 4178, comm: a.out Not tainted (2.6.25-rc9 #72) ... Call Trace: [<c044a5f9>] ? cgroup_exit+0x55/0x94 [<c0427acf>] ? do_exit+0x217/0x5ba [<c0427ed7>] ? do_group_exit+0.65/0x7c [<c0427efd>] ? sys_exit_group+0xf/0x11 [<c0404842>] ? syscall_call+0x7/0xb [<c05e0000>] ? init_cyrix+0x2fa/0x479 ... EIP: [<c04df671>] list_del+0x35/0x53 SS:ESP 0068:ebc7df4 ---[ end trace caffb7332252612b ]--- Fixing recursive fault but reboot is needed! After digging into the code and debugging, I finlly found out a race situation: do_exit() ->cgroup_exit() ->if (!list_empty(&tsk->cg_list)) list_del(&tsk->cg_list); cgroup_iter_start() ->cgroup_enable_task_cg_list() ->list_add(&tsk->cg_list, ..); In this case the list won't be deleted though the process has exited. We got two bug reports in the past, which seem to be the same bug as this one: http://lkml.org/lkml/2008/3/5/332 http://lkml.org/lkml/2007/10/17/224 Actually sometimes I got oops on list_del, sometimes oops on list_add. And I can change my test program a bit to trigger other oops. The patch has been tested both on x86_32 and x86_64. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-04-15Fix locking bug in "acquire_console_semaphore_for_printk()"Linus Torvalds
When I cleaned up printk() and split up the printk locking logic in commit 266c2e0abeca649fa6667a1a427ad1da507c6375 ("Make printk() console semaphore accesses sensible") I had incorrectly moved the call to have_callable_console() outside of the console semaphore. That was buggy. The console semaphore protects the console_drivers list that is used by have_callable_console(). Thanks go to Bongani Hlope who saw this as a hang on shutdown and reboot and bisected the bug to the right commit, and tested this patch. See http://lkml.org/lkml/2008/4/11/315 Bisected-and-tested-by: Bongani Hlope <bonganilinux@mweb.co.za> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-14revert "sched: fix fair sleepers"Ingo Molnar
revert "sched: fix fair sleepers" (e22ecef1d2658ba54ed7d3fdb5d60829fb434c23), because it is causing audio skipping, see: http://bugzilla.kernel.org/show_bug.cgi?id=10428 the patch is correct and the real cause of the skipping is not understood (tracing makes it go away), but time has run out so we'll revert it and re-try in 2.6.26. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-11cgroups: include hierarchy ids in /proc/<pid>/cgroupPaul Menage
Extend the /proc/<pid>/cgroup file to include the appropriate hierarchy ID on each line. Currently this ID isn't really needed since a hierarchy can be completely identified by the set of subsystems bound to it, but this is likely to change in the near future in order to support stateless subsystems and merging/rebinding of subsystems. Getting this change into 2.6.25 reduces the need for an API change later. Signed-off-by: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-10asmlinkage_protect replaces prevent_tail_callRoland McGrath
The prevent_tail_call() macro works around the problem of the compiler clobbering argument words on the stack, which for asmlinkage functions is the caller's (user's) struct pt_regs. The tail/sibling-call optimization is not the only way that the compiler can decide to use stack argument words as scratch space, which we have to prevent. Other optimizations can do it too. Until we have new compiler support to make "asmlinkage" binding on the compiler's own use of the stack argument frame, we have work around all the manifestations of this issue that crop up. More cases seem to be prevented by also keeping the incoming argument variables live at the end of the function. This makes their original stack slots attractive places to leave those variables, so the compiler tends not clobber them for something else. It's still no guarantee, but it handles some observed cases that prevent_tail_call() did not. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-04cgroups: add cgroup support for enabling controllers at boot timePaul Menage
The effects of cgroup_disable=foo are: - foo isn't auto-mounted if you mount all cgroups in a single hierarchy - foo isn't visible as an individually mountable subsystem As a result there will only ever be one call to foo->create(), at init time; all processes will stay in this group, and the group will never be mounted on a visible hierarchy. Any additional effects (e.g. not allocating metadata) are up to the foo subsystem. This doesn't handle early_init subsystems (their "disabled" bit isn't set be, but it could easily be extended to do so if any of the early_init systems wanted it - I think it would just involve some nastier parameter processing since it would occur before the command-line argument parser had been run. Hugh said: Ballpark figures, I'm trying to get this question out rather than processing the exact numbers: CONFIG_CGROUP_MEM_RES_CTLR adds 15% overhead to the affected paths, booting with cgroup_disable=memory cuts that back to 1% overhead (due to slightly bigger struct page). I'm no expert on distros, they may have no interest whatever in CONFIG_CGROUP_MEM_RES_CTLR=y; and the rest of us can easily build with or without it, or apply the cgroup_disable=memory patches. Unix bench's execl test result on x86_64 was == just after boot without mounting any cgroup fs.== mem_cgorup=off : Execl Throughput 43.0 3150.1 732.6 mem_cgroup=on : Execl Throughput 43.0 2932.6 682.0 == [lizf@cn.fujitsu.com: fix boot option parsing] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-02markers: use synchronize_sched()Mathieu Desnoyers
Markers do not mix well with CONFIG_PREEMPT_RCU because it uses preempt_disable/enable() and not rcu_read_lock/unlock for minimal intrusiveness. We would need call_sched and sched_barrier primitives. Currently, the modification (connection and disconnection) of probes from markers requires changes to the data structure done in RCU-style : a new data structure is created, the pointer is changed atomically, a quiescent state is reached and then the old data structure is freed. The quiescent state is reached once all the currently running preempt_disable regions are done running. We use the call_rcu mechanism to execute kfree() after such quiescent state has been reached. However, the new CONFIG_PREEMPT_RCU version of call_rcu and rcu_barrier does not guarantee that all preempt_disable code regions have finished, hence the race. The "proper" way to do this is to use rcu_read_lock/unlock, but we don't want to use it to minimize intrusiveness on the traced system. (we do not want the marker code to call into much of the OS code, because it would quickly restrict what can and cannot be instrumented, such as the scheduler). The temporary fix, until we get call_rcu_sched and rcu_barrier_sched in mainline, is to use synchronize_sched before each call_rcu calls, so we wait for the quiescent state in the system call code path. It will slow down batch marker enable/disable, but will make sure the race is gone. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-30futex_compat __user annotationAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-30NULL noise: fs/*, mm/*, kernel/*Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-28audit: silence two kerneldoc warnings in kernel/audit.cDave Jones
Silence two kerneldoc warnings. Warning(kernel/audit.c:1276): No description found for parameter 'string' Warning(kernel/audit.c:1276): No description found for parameter 'len' [also fix a typo for bonus points] Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-28memcgroup: fix spurious EBUSY on memory cgroup removalYAMAMOTO Takashi
Call mm_free_cgroup earlier. Otherwise a reference due to lazy mm switching can prevent cgroup removal. Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-27Give futex init a proper nameBenjamin Herrenschmidt
The futex init function is called init(). This is a pain in the neck when debugging when you code dies in ... init :-) This renames it to futex_init(). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-26Merge branch 'master' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt: NOHZ: reevaluate idle sleep length after add_timer_on() clocksource: revert: use init_timer_deferrable for clocksource_watchdog
2008-03-26relay: set an spd_release() hook for spliceJens Axboe
relay doesn't reference the pages it adds, however we need a non-NULL hook or splice_to_pipe() can oops. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-26set relay file can not be read by pread(2)Lai Jiangshan
I found that relay files can be read by pread(2). I fix it, for relay files are not capable of seeking. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-26NOHZ: reevaluate idle sleep length after add_timer_on()Thomas Gleixner
add_timer_on() can add a timer on a CPU which is currently in a long idle sleep, but the timer wheel is not reevaluated by the nohz code on that CPU. So a timer can be delayed for quite a long time. This triggered a false positive in the clocksource watchdog code. To avoid this we need to wake up the idle CPU and enforce the reevaluation of the timer wheel for the next timer event. Add a function, which checks a given CPU for idle state, marks the idle task with NEED_RESCHED and sends a reschedule IPI to notify the other CPU of the change in the timer wheel. Call this function from add_timer_on(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: stable@kernel.org -- include/linux/sched.h | 6 ++++++ kernel/sched.c | 43 +++++++++++++++++++++++++++++++++++++++++++ kernel/timer.c | 10 +++++++++- 3 files changed, 58 insertions(+), 1 deletion(-)
2008-03-25clocksource: revert: use init_timer_deferrable for clocksource_watchdogThomas Gleixner
Revert commit 1077f5a917b7c630231037826b344b2f7f5b903f Author: Parag Warudkar <parag.warudkar@gmail.com> Date: Wed Jan 30 13:30:01 2008 +0100 clocksource.c: use init_timer_deferrable for clocksource_watchdog clocksource_watchdog can use a deferrable timer - reduces wakeups from idle per second. The watchdog timer needs to run with the specified interval. Otherwise it will miss the possible wrap of the watchdog clocksource. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org
2008-03-24Make printk() console semaphore accesses sensibleLinus Torvalds
The printk() logic on when/how to get the console semaphore was unreadable, this splits the code up into a few helper functions and makes it easier to follow what is going on. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-24bsd_acct: using task_struct->tgid is not right in pid-namespacesPavel Emelyanov
In case we're accounting from a sub-namespace, the tgids reported will not refer to the right namespace. Save the pid_namespace we're accounting in on the acct_glbs and use it in do_acct_process. Two less :) places using the task_struct.tgid member. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-24bsd_acct: plain current->real_parent access is not always safePavel Emelyanov
This is minor, but dereferencing even current real_parent is not safe on debug kernels, since the memory, this points to, can be unmapped - RCU protection is required. Besides, the tgid field is deprecated and is to be replaced with task_tgid_xxx call (the 2nd patch), so RCU will be required anyway. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-24markers: remove ACCESS_ONCEMathieu Desnoyers
As Paul pointed out, the ACCESS_ONCE are not needed because we already have the explicit surrounding memory barriers. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mike Mason <mmlnx@us.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: David Smith <dsmith@redhat.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Adrian Bunk <adrian.bunk@movial.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-24markers: update preempt_disable. call_rcu, rcu_barrier commentsMathieu Desnoyers
Add comments requested by Andrew. Updated comments about synchronize_sched(). Since we use call_rcu and rcu_barrier now, these comments were out of sync with the code. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mike Mason <mmlnx@us.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: David Smith <dsmith@redhat.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Adrian Bunk <adrian.bunk@movial.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-24Don't 'printk()' while holding xtime lock for writingLinus Torvalds
The printk() can deadlock because it can wake up klogd(), and task enqueueing will try to read the time in order to set a hrtimer. Reported-by: Marcin Slusarz <marcin.slusarz@gmail.com> Debugged-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-21sched: add arch_update_cpu_topology hook.Heiko Carstens
Will be called each time the scheduling domains are rebuild. Needed for architectures that don't have a static cpu topology. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-21sched: add exported arch_reinit_sched_domains() to header file.Heiko Carstens
Needed so it can be called from outside of sched.c. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-21sched: remove double unlikely from schedule()Roel Kluin
Combine two unlikely's Signed-off-by: Roel Kluin <12o3l@tiscali.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-21sched: cleanup old and rarely used 'debug' features.Peter Zijlstra
TREE_AVG and APPROX_AVG are initial task placement policies that have been disabled for a long while.. time to remove them. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (46 commits) [NET] ifb: set separate lockdep classes for queue locks [IPV6] KCONFIG: Fix description about IPV6_TUNNEL. [TCP]: Fix shrinking windows with window scaling netpoll: zap_completion_queue: adjust skb->users counter bridge: use time_before() in br_fdb_cleanup() [TG3]: Fix build warning on sparc32. MAINTAINERS: bluez-devel is subscribers-only audit: netlink socket can be auto-bound to pid other than current->pid (v2) [NET]: Fix permissions of /proc/net [SCTP]: Fix a race between module load and protosw access [NETFILTER]: ipt_recent: sanity check hit count [NETFILTER]: nf_conntrack_h323: logical-bitwise & confusion in process_setup() [RT2X00] drivers/net/wireless/rt2x00/rt2x00dev.c: remove dead code, fix warning [IPV4]: esp_output() misannotations [8021Q]: vlan_dev misannotations xfrm: ->eth_proto is __be16 [IPV4]: ipv4_is_lbcast() misannotations [SUNRPC]: net/* NULL noise [SCTP]: fix misannotated __sctp_rcv_asconf_lookup() [PKT_SCHED]: annotate cls_u32 ...
2008-03-20audit: netlink socket can be auto-bound to pid other than current->pid (v2)Pavel Emelyanov
From: Pavel Emelyanov <xemul@openvz.org> This patch is based on the one from Thomas. The kauditd_thread() calls the netlink_unicast() and passes the audit_pid to it. The audit_pid, in turn, is received from the user space and the tool (I've checked the audit v1.6.9) uses getpid() to pass one in the kernel. Besides, this tool doesn't bind the netlink socket to this id, but simply creates it allowing the kernel to auto-bind one. That's the preamble. The problem is that netlink_autobind() _does_not_ guarantees that the socket will be auto-bound to the current pid. Instead it uses the current pid as a hint to start looking for a free id. So, in case of conflict, the audit messages can be sent to a wrong socket. This can happen (it's unlikely, but can be) in case some task opens more than one netlink sockets and then the audit one starts - in this case the audit's pid can be busy and its socket will be bound to another id. The proposal is to introduce an audit_nlk_pid in audit subsys, that will point to the netlink socket to send packets to. It will most often be equal to audit_pid. The socket id can be got from the skb's netlink CB right in the audit_receive_msg. The audit_nlk_pid reset to 0 is not required, since all the decisions are taken based on audit_pid value only. Later, if the audit tools will bind the socket themselves, the kernel will have to provide a way to setup the audit_nlk_pid as well. A good side effect of this patch is that audit_pid can later be converted to struct pid, as it is not longer safe to use pid_t-s in the presence of pid namespaces. But audit code still uses the tgid from task_struct in the audit_signal_info and in the audit_filter_syscall. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-19revert "clocksource: make clocksource watchdog cycle through online CPUs"Andrew Morton
Revert commit 1ada5cba6a0318f90e45b38557e7b5206a9cba38 ("clocksource: make clocksource watchdog cycle through online CPUs") due to the regression reported by Gabriel C at http://lkml.org/lkml/2008/2/24/281 (short vesion: it makes TSC be marked as always unstable on his machine). Cc: Andi Kleen <ak@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Robert Hancock <hancockr@shaw.ca> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Gabriel C <nix.or.die@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-19sched: retune wake granularityIngo Molnar
reduce wake-up granularity for better interactivity. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-19sched: wakeup-buddy tasks are cache-hotIngo Molnar
Wakeup-buddy tasks are cache-hot - this makes it a bit harder for the load-balancer to tear them apart. (but it's still possible, if the load is sufficiently assymetric) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-19sched: improve affine wakeupsIngo Molnar
improve affine wakeups. Maintain the 'overlap' metric based on CFS's sum_exec_runtime - which means the amount of time a task executes after it wakes up some other task. Use the 'overlap' for the wakeup decisions: if the 'overlap' is short, it means there's strong workload coupling between this task and the woken up task. If the 'overlap' is large then the workload is decoupled and the scheduler will move them to separate CPUs more easily. ( Also slightly move the preempt_check within try_to_wake_up() - this has no effect on functionality but allows 'early wakeups' (for still-on-rq tasks) to be correctly accounted as well.) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-19sched: clean up wakeup balancing, code flowIngo Molnar
Clean up the code flow. No code changed: kernel/sched.o: text data bss dec hex filename 42521 2858 232 45611 b22b sched.o.before 42521 2858 232 45611 b22b sched.o.after md5: 09b31c44e9aff8666f72773dc433e2df sched.o.before.asm 09b31c44e9aff8666f72773dc433e2df sched.o.after.asm Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-19sched: clean up wakeup balancing, rename variablesIngo Molnar
rename 'cpu' to 'prev_cpu'. No code changed: kernel/sched.o: text data bss dec hex filename 42521 2858 232 45611 b22b sched.o.before 42521 2858 232 45611 b22b sched.o.after md5: 09b31c44e9aff8666f72773dc433e2df sched.o.before.asm 09b31c44e9aff8666f72773dc433e2df sched.o.after.asm Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-19sched: clean up wakeup balancing, move wake_affine()Ingo Molnar
split out the affine-wakeup bits. No code changed: kernel/sched.o: text data bss dec hex filename 42521 2858 232 45611 b22b sched.o.before 42521 2858 232 45611 b22b sched.o.after md5: 9d76738f1272aa82f0b7affd2f51df6b sched.o.before.asm 09b31c44e9aff8666f72773dc433e2df sched.o.after.asm (the md5's changed because stack slots changed and some registers get scheduled by gcc in a different order - but otherwise the before and after assembly is instruction for instruction equivalent.) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-17relay: fix subbuf_splice_actor() adding too many pagesJens Axboe
If subbuf_pages was larger than the max number of pages the pipe buffer will hold, subbuf_splice_actor() would happily go beyond the array size. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-15sched: simplify sched_slice()Ingo Molnar
Use the existing calc_delta_mine() calculation for sched_slice(). This saves a divide and simplifies the code because we share it with the other /cfs_rq->load users. It also improves code size: text data bss dec hex filename 42659 2740 144 45543 b1e7 sched.o.before 42093 2740 144 44977 afb1 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-03-15sched: fix fair sleepersIngo Molnar
Fair sleepers need to scale their latency target down by runqueue weight. Otherwise busy systems will gain ever larger sleep bonus. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-03-15sched: fix overload performance: buddy wakeupsPeter Zijlstra
Currently we schedule to the leftmost task in the runqueue. When the runtimes are very short because of some server/client ping-pong, especially in over-saturated workloads, this will cycle through all tasks trashing the cache. Reduce cache trashing by keeping dependent tasks together by running newly woken tasks first. However, by not running the leftmost task first we could starve tasks because the wakee can gain unlimited runtime. Therefore we only run the wakee if its within a small (wakeup_granularity) window of the leftmost task. This preserves fairness, but does alternate server/client task groups. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-15sched: fix calc_delta_mine()Ingo Molnar
lw->weight can be 0 for a short time during bootup. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-03-15sched: fix update_load_add()/sub()Ingo Molnar
Clear the cached inverse value when updating load. This is needed for calc_delta_mine() to work correctly when using the rq load. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-03-15sched: min_vruntime fixPeter Zijlstra
Current min_vruntime tracking is incorrect and will cause serious problems when we don't run the leftmost task for some reason. min_vruntime does two things; 1) it's used to determine a forward direction when the u64 vruntime wraps, 2) it's used to track the leftmost vruntime to position newly enqueued tasks from. The current logic advances min_vruntime whenever the current task's vruntime advance. Because the current task may pass the leftmost task still waiting we're failing the second goal. This causes new tasks to be placed too far ahead and thus penalizes their runtime. Fix this by making min_vruntime the min_vruntime of the waiting tasks by tracking it in enqueue/dequeue, and compare against current's vruntime to obtain the absolute minimum when placing new tasks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>