aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2007-10-17softlockup: improve debug outputIngo Molnar
Improve the debuggability of kernel lockups by enhancing the debug output of the softlockup detector: print the task that causes the lockup and try to print a more intelligent backtrace. The old format was: BUG: soft lockup detected on CPU#1! [<c0105e4a>] show_trace_log_lvl+0x19/0x2e [<c0105f43>] show_trace+0x12/0x14 [<c0105f59>] dump_stack+0x14/0x16 [<c015f6bc>] softlockup_tick+0xbe/0xd0 [<c013457d>] run_local_timers+0x12/0x14 [<c01346b8>] update_process_times+0x3e/0x63 [<c0145fb8>] tick_sched_timer+0x7c/0xc0 [<c0140a75>] hrtimer_interrupt+0x135/0x1ba [<c011bde7>] smp_apic_timer_interrupt+0x6e/0x80 [<c0105aa3>] apic_timer_interrupt+0x33/0x38 [<c0104f8a>] syscall_call+0x7/0xb ======================= The new format is: BUG: soft lockup detected on CPU#1! [prctl:2363] Pid: 2363, comm: prctl EIP: 0060:[<c013915f>] CPU: 1 EIP is at sys_prctl+0x24/0x18c EFLAGS: 00000213 Not tainted (2.6.22-cfs-v20 #26) EAX: 00000001 EBX: 000003e7 ECX: 00000001 EDX: f6df0000 ESI: 000003e7 EDI: 000003e7 EBP: f6df0fb0 DS: 007b ES: 007b FS: 00d8 CR0: 8005003b CR2: 4d8c3340 CR3: 3731d000 CR4: 000006d0 [<c0105e4a>] show_trace_log_lvl+0x19/0x2e [<c0105f43>] show_trace+0x12/0x14 [<c01040be>] show_regs+0x1ab/0x1b3 [<c015f807>] softlockup_tick+0xef/0x108 [<c013457d>] run_local_timers+0x12/0x14 [<c01346b8>] update_process_times+0x3e/0x63 [<c0145fcc>] tick_sched_timer+0x7c/0xc0 [<c0140a89>] hrtimer_interrupt+0x135/0x1ba [<c011bde7>] smp_apic_timer_interrupt+0x6e/0x80 [<c0105aa3>] apic_timer_interrupt+0x33/0x38 [<c0104f8a>] syscall_call+0x7/0xb ======================= Note that in the old format we only knew that some system call locked up, we didnt know _which_. With the new format we know that it's at a specific place in sys_prctl(). [which was where i created an artificial kernel lockup to test the new format.] This is also useful if the lockup happens in user-space - the user-space EIP (and other registers) will be printed too. (such a lockup would either suggest that the task was running at SCHED_FIFO:99 and looping for more than 10 seconds, or that the softlockup detector has a false-positive.) The task name is printed too first, just in case we dont manage to print a useful backtrace. [satyam@infradead.org: fix warning] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Satyam Sharma <satyam@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17fix the softlockup watchdog to actually workIngo Molnar
this Xen related commit: commit 966812dc98e6a7fcdf759cbfa0efab77500a8868 Author: Jeremy Fitzhardinge <jeremy@goop.org> Date: Tue May 8 00:28:02 2007 -0700 Ignore stolen time in the softlockup watchdog broke the softlockup watchdog to never report any lockups. (!) print_timestamp defaults to 0, this makes the following condition always true: if (print_timestamp < (touch_timestamp + 1) || and we'll in essence never report soft lockups. apparently the functionality of the soft lockup watchdog was never actually tested with that patch applied ... Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17softlockup: use cpu_clock() instead of sched_clock()Ingo Molnar
sched_clock() is not a reliable time-source, use cpu_clock() instead. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17oom: compare cpuset mems_allowed instead of exclusive ancestorsDavid Rientjes
Instead of testing for overlap in the memory nodes of the the nearest exclusive ancestor of both current and the candidate task, it is better to simply test for intersection between the task's mems_allowed in their task descriptors. This does not require taking callback_mutex since it is only used as a hint in the badness scoring. Tasks that do not have an intersection in their mems_allowed with the current task are not explicitly restricted from being OOM killed because it is quite possible that the candidate task has allocated memory there before and has since changed its mems_allowed. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17oom: add oom_kill_allocating_task sysctlDavid Rientjes
Adds a new sysctl, 'oom_kill_allocating_task', which will automatically kill the OOM-triggering task instead of scanning through the tasklist to find a memory-hogging target. This is helpful for systems with an insanely large number of tasks where scanning the tasklist significantly degrades performance. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17Slab API: remove useless ctor parameter and reorder parametersChristoph Lameter
Slab constructors currently have a flags parameter that is never used. And the order of the arguments is opposite to other slab functions. The object pointer is placed before the kmem_cache pointer. Convert ctor(void *object, struct kmem_cache *s, unsigned long flags) to ctor(struct kmem_cache *s, void *object) throughout the kernel [akpm@linux-foundation.org: coupla fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17mm: dirty balancing for tasksPeter Zijlstra
Based on ideas of Andrew: http://marc.info/?l=linux-kernel&m=102912915020543&w=2 Scale the bdi dirty limit inversly with the tasks dirty rate. This makes heavy writers have a lower dirty limit than the occasional writer. Andrea proposed something similar: http://lwn.net/Articles/152277/ The main disadvantage to his patch is that he uses an unrelated quantity to measure time, which leaves him with a workload dependant tunable. Other than that the two approaches appear quite similar. [akpm@linux-foundation.org: fix warning] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17mm: per device dirty thresholdPeter Zijlstra
Scale writeback cache per backing device, proportional to its writeout speed. By decoupling the BDI dirty thresholds a number of problems we currently have will go away, namely: - mutual interference starvation (for any number of BDIs); - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts). It might be that all dirty pages are for a single BDI while other BDIs are idling. By giving each BDI a 'fair' share of the dirty limit, each one can have dirty pages outstanding and make progress. A global threshold also creates a deadlock for stacked BDIs; when A writes to B, and A generates enough dirty pages to get throttled, B will never start writeback until the dirty pages go away. Again, by giving each BDI its own 'independent' dirty limit, this problem is avoided. So the problem is to determine how to distribute the total dirty limit across the BDIs fairly and efficiently. A DBI that has a large dirty limit but does not have any dirty pages outstanding is a waste. What is done is to keep a floating proportion between the DBIs based on writeback completions. This way faster/more active devices get a larger share than slower/idle devices. [akpm@linux-foundation.org: fix warnings] [hugh@veritas.com: Fix occasional hang when a task couldn't get out of balance_dirty_pages] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16time: introduce xtime_secondsIngo Molnar
improve performance of sys_time(). sys_time() returns time in seconds, but it does so by calling do_gettimeofday() and then returning the tv_sec portion of the GTOD time. But the data structure "xtime", which is updated by every timer/scheduler tick, already offers HZ granularity time. the patch improves the sysbench oltp macrobenchmark by 4-5% on an AMD dual-core system: v2.6.23: #threads 1: transactions: 4073 (407.23 per sec.) 2: transactions: 8530 (852.81 per sec.) 3: transactions: 8321 (831.88 per sec.) 4: transactions: 8407 (840.58 per sec.) 5: transactions: 8070 (806.74 per sec.) v2.6.23 + sys_time-speedup.patch: 1: transactions: 4281 (428.09 per sec.) 2: transactions: 8910 (890.85 per sec.) 3: transactions: 8659 (865.79 per sec.) 4: transactions: 8676 (867.34 per sec.) 5: transactions: 8532 (852.91 per sec.) and by 4-5% on an Intel dual-core system too: 2.6.23: 1: transactions: 4560 (455.94 per sec.) 2: transactions: 10094 (1009.30 per sec.) 3: transactions: 9755 (975.36 per sec.) 4: transactions: 9859 (985.78 per sec.) 5: transactions: 9701 (969.72 per sec.) 2.6.23 + sys_time-speedup.patch: 1: transactions: 4779 (477.84 per sec.) 2: transactions: 10103 (1010.14 per sec.) 3: transactions: 10141 (1013.93 per sec.) 4: transactions: 10371 (1036.89 per sec.) 5: transactions: 10178 (1017.50 per sec.) (the more CPUs the system has, the more speedup this patch gives for this particular workload.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16kprobes: support kretprobe blacklistMasami Hiramatsu
Introduce architecture dependent kretprobe blacklists to prohibit users from inserting return probes on the function in which kprobes can be inserted but kretprobes can not. This patch also removes "__kprobes" mark from "__switch_to" on x86_64 and registers "__switch_to" to the blacklist on x86-64, because that mark is to prohibit user from inserting only kretprobe. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16cpuset: remove sched domain hooks from cpusetsPaul Jackson
Remove the cpuset hooks that defined sched domains depending on the setting of the 'cpu_exclusive' flag. The cpu_exclusive flag can only be set on a child if it is set on the parent. This made that flag painfully unsuitable for use as a flag defining a partitioning of a system. It was entirely unobvious to a cpuset user what partitioning of sched domains they would be causing when they set that one cpu_exclusive bit on one cpuset, because it depended on what CPUs were in the remainder of that cpusets siblings and child cpusets, after subtracting out other cpu_exclusive cpusets. Furthermore, there was no way on production systems to query the result. Using the cpu_exclusive flag for this was simply wrong from the get go. Fortunately, it was sufficiently borked that so far as I know, almost no successful use has been made of this. One real time group did use it to affectively isolate CPUs from any load balancing efforts. They are willing to adapt to alternative mechanisms for this, such as someway to manipulate the list of isolated CPUs on a running system. They can do without this present cpu_exclusive based mechanism while we develop an alternative. There is a real risk, to the best of my understanding, of users accidentally setting up a partitioned scheduler domains, inhibiting desired load balancing across all their CPUs, due to the nonobvious (from the cpuset perspective) side affects of the cpu_exclusive flag. Furthermore, since there was no way on a running system to see what one was doing with sched domains, this change will be invisible to any using code. Unless they have real insight to the scheduler load balancing choices, they will be unable to detect that this change has been made in the kernel's behaviour. Initial discussion on lkml of this patch has generated much comment. My (probably controversial) take on that discussion is that it has reached a rough concensus that the current cpuset cpu_exclusive mechanism for defining sched domains is borked. There is no concensus on the replacement. But since we can remove this mechanism, and since its continued presence risks causing unwanted partitioning of the schedulers load balancing, we should remove it while we can, as we proceed to work the replacement scheduler domain mechanisms. Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16m32r: convert to generic sys_ptraceChristoph Hellwig
Convert m32r to the generic sys_ptrace. The conversion requires an architecture hook after ptrace_attach which this patch adds. The hook will also be needed for a conersion of ia64 to the generic ptrace code. Thanks to Hirokazu Takata for fixing a bug in the first version of this code. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16hugetlb: Add hugetlb_dynamic_pool sysctlAdam Litke
The maximum size of the huge page pool can be controlled using the overall size of the hugetlb filesystem (via its 'size' mount option). However in the common case the this will not be set as the pool is traditionally fixed in size at boot time. In order to maintain the expected semantics, we need to prevent the pool expanding by default. This patch introduces a new sysctl controlling dynamic pool resizing. When this is enabled the pool will expand beyond its base size up to the size of the hugetlb filesystem. It is disabled by default. Signed-off-by: Adam Litke <agl@us.ibm.com> Acked-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Dave McCracken <dave.mccracken@oracle.com> Cc: William Irwin <bill.irwin@oracle.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16memory unplug: memory hotplug cleanupKAMEZAWA Hiroyuki
A clean up patch for "scanning memory resource [start, end)" operation. Now, find_next_system_ram() function is used in memory hotplug, but this interface is not easy to use and codes are complicated. This patch adds walk_memory_resouce(start,len,arg,func) function. The function 'func' is called per valid memory resouce range in [start,pfn). [pbadari@us.ibm.com: Error handling in walk_memory_resource()] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16Group short-lived and reclaimable kernel allocationsMel Gorman
This patch marks a number of allocations that are either short-lived such as network buffers or are reclaimable such as inode allocations. When something like updatedb is called, long-lived and unmovable kernel allocations tend to be spread throughout the address space which increases fragmentation. This patch groups these allocations together as much as possible by adding a new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be reclaimed on demand, but not moved. i.e. they can be migrated by deleting them and re-reading the information from elsewhere. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16Memoryless nodes: Use N_HIGH_MEMORY for cpusetsChristoph Lameter
cpusets try to ensure that any node added to a cpuset's mems_allowed is on-line and contains memory. The assumption was that online nodes contained memory. Thus, it is possible to add memoryless nodes to a cpuset and then add tasks to this cpuset. This results in continuous series of oom-kill and apparent system hang. Change cpusets to use node_states[N_HIGH_MEMORY] [a.k.a. node_memory_map] in place of node_online_map when vetting memories. Return error if admin attempts to write a non-empty mems_allowed node mask containing only memoryless-nodes. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16Memoryless nodes: Allow profiling data to fall back to other nodesChristoph Lameter
Processors on memoryless nodes must be able to fall back to remote nodes in order to get a profiling buffer. This may lead to excessive NUMA traffic but I think we should allow this rather than failing. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Bob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16x86: optimize page faults like all other achitectures and kill notifier cruftChristoph Hellwig
x86(-64) are the last architectures still using the page fault notifier cruft for the kprobes page fault hook. This patch converts them to the proper direct calls, and removes the now unused pagefault notifier bits aswell as the cruft in kprobes.c that was related to this mess. I know Andi didn't really like this, but all other architecture maintainers agreed the direct calls are much better and besides the obvious cruft removal a common way of dealing with kprobes across architectures is important aswell. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: fix sparc64] Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Andi Kleen <ak@suse.de> Cc: <linux-arch@vger.kernel.org> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16Convert cpu_sibling_map to be a per cpu variableMike Travis
Convert cpu_sibling_map from a static array sized by NR_CPUS to a per_cpu variable. This saves sizeof(cpumask_t) * NR unused cpus. Access is mostly from startup and CPU HOTPLUG functions. Signed-off-by: Mike Travis <travis@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16slow down printk during bootRandy Dunlap
Optionally add a boot delay after each kernel printk() call, crudely measured in milliseconds, with a maximum delay of 10 seconds per printk. Enable CONFIG_BOOT_PRINTK_DELAY=y and then add (e.g.): "lpj=loops_per_jiffy boot_delay=100" to the kernel command line. It has been useful in cases like "during boot, my machine just reboots or the screen goes black" by slowing down printk, (and adding initcall_debug), we can usually see the last thing that happened before the lights went out which is usually a valuable clue. [akpm@linux-foundation.org: not all architectures implement CONFIG_HZ] [akpm@linux-foundation.org: fix lots of stuff] [bunk@stusta.de: kernel/printk.c: make 2 variables static] [heiko.carstens@de.ibm.com: fix slow down printk on boot compile error] Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16Consolidate PTRACE_DETACHAlexey Dobriyan
Identical handlers of PTRACE_DETACH go into ptrace_request(). Not touching compat code. Not touching archs that don't call ptrace_request. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-15Merge git://git.linux-nfs.org/pub/linux/nfs-2.6Linus Torvalds
* git://git.linux-nfs.org/pub/linux/nfs-2.6: (131 commits) NFSv4: Fix a typo in nfs_inode_reclaim_delegation NFS: Add a boot parameter to disable 64 bit inode numbers NFS: nfs_refresh_inode should clear cache_validity flags on success NFS: Fix a connectathon regression in NFSv3 and NFSv4 NFS: Use nfs_refresh_inode() in ops that aren't expected to change the inode SUNRPC: Don't call xprt_release in call refresh SUNRPC: Don't call xprt_release() if call_allocate fails SUNRPC: Fix buggy UDP transmission [23/37] Clean up duplicate includes in [2.6 patch] net/sunrpc/rpcb_clnt.c: make struct rpcb_program static SUNRPC: Use correct type in buffer length calculations SUNRPC: Fix default hostname created in rpc_create() nfs: add server port to rpc_pipe info file NFS: Get rid of some obsolete macros NFS: Simplify filehandle revalidation NFS: Ensure that nfs_link() returns a hashed dentry NFS: Be strict about dentry revalidation when doing exclusive create NFS: Don't zap the readdir caches upon error NFS: Remove the redundant nfs_reval_fsid() NFSv3: Always use directory post-op attributes in nfs3_proc_lookup ... Fix up trivial conflict due to sock_owned_by_user() cleanup manually in net/sunrpc/xprtsock.c
2007-10-15Merge branch 'v2.6.24-lockdep' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep * 'v2.6.24-lockdep' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep: lockdep: annotate dir vs file i_mutex lockdep: per filesystem inode lock class lockdep: annotate kprobes irq fiddling lockdep: annotate rcu_read_{,un}lock{,_bh} lockdep: annotate journal_start() lockdep: s390: connect the sysexit hook lockdep: x86_64: connect the sysexit hook lockdep: i386: connect the sysexit hook lockdep: syscall exit check lockdep: fixup mutex annotations lockdep: fix mismatched lockdep_depth/curr_chain_hash lockdep: Avoid /proc/lockdep & lock_stat infinite output lockdep: maintainers
2007-10-15sched: sync wakeups preempt tooIngo Molnar
make sure sync wakeups preempt too - the scheduler will not overschedule as we've got various throttles against that. As a result, sync wakeups can be used more widely in the kernel (to signal wakeup affinity between tasks), and no arbitrary latencies will be introduced either. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: affine sync wakeupsIngo Molnar
make sync wakeups affine for cache-cold tasks: if a cache-cold task is woken up by a sync wakeup then use the opportunity to migrate it straight away. (the two tasks are 'related' because they communicate) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: guest CPU accounting: maintain stats in account_system_time()Laurent Vivier
modify account_system_time() to add cputime to cpustat->guest if we are running a VCPU. We add this cputime to cpustat->user instead of cpustat->system because this part of KVM code is in fact user code although it is executed in the kernel. We duplicate VCPU time between guest and user to allow an unmodified "top(1)" to display correct value. A modified "top(1)" is able to display good cpu user time and cpu guest time by subtracting cpu guest time from cpu user time. Update "gtime" in task_struct accordingly. Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Acked-by: Avi Kivity <avi@qumranet.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: guest CPU accounting: add guest-CPU /proc/<pid>/stat fieldsLaurent Vivier
like for cpustat, introduce the "gtime" (guest time of the task) and "cgtime" (guest time of the task children) fields for the tasks. Modify signal_struct and task_struct. Modify /proc/<pid>/stat to display these new fields. Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Acked-by: Avi Kivity <avi@qumranet.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: domain sysctl fixes: add terminator commentMilton Miller
we had an incorrect-terminator bug in sd_alloc_ctl_domain_table() before, so add a comment that documents it. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: domain sysctl fixes: do not crash on allocation failureMilton Miller
Now that we are calling this at runtime, a more relaxed error path is suggested. If an allocation fails, we just register the partial table, which will show empty directories. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: domain sysctl fixes: unregister the sysctl table before domainsMilton Miller
Unregister and free the sysctl table before destroying domains, then rebuild and register after creating the new domains. This prevents the sysctl table from pointing to freed memory for root to write. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: domain sysctl fixes: use for_each_online_cpu()Milton Miller
init_sched_domain_sysctl was walking cpus 0-n and referencing per_cpu variables. If the cpus_possible mask is not contigious this will result in a crash referencing unallocated data. If the online mask is not contigious then we would show offline cpus and miss online ones. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: domain sysctl fixes: use kcalloc()Milton Miller
kcalloc checks for n * sizeof(element) overflows and it zeros. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15Make scheduler debug file operations constArjan van de Ven
In general, struct file_operations are const in the kernel, to not have false cacheline sharing and to catch bugs at compiletime with accidental writes to them. The new scheduler code introduces a new non-const one; fix this up. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: allow the immediate migration of cache-cold tasksIngo Molnar
allow the immediate migration of cache-cold tasks. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: debug, improve migration statisticsIngo Molnar
add new migration statistics when SCHED_DEBUG and SCHEDSTATS is enabled. Available in /proc/<PID>/sched. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: debug: increase width of debug lineIngo Molnar
increase width of debug line - in preparation of more debugging info. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: activate task_hot() only on fair-scheduled tasksPeter Zijlstra
activate task_hot() only for fair-scheduled tasks (i.e. disable it for RT tasks). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: reintroduce cache-hot affinityIngo Molnar
reintroduce a simplified version of cache-hot/cold scheduling affinity. This improves performance with certain SMP workloads, such as sysbench. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: speed up context-switches a bitIngo Molnar
speed up context-switches a bit by not clearing p->exec_start. (as a side-effect, this also makes p->exec_start a universal timestamp available to cache-hot estimations.) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: do not wakeup-preempt with SCHED_BATCH tasksIngo Molnar
do not wakeup-preempt with SCHED_BATCH tasks, their preemption is batched too, driven by the tick. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: generate uevents for user creation/destructionSrivatsa Vaddagiri
Generate uevents when a user is being created/destroyed. These events can be used to configure cpu share of a new user. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: do not normalize kernel threads via SysRq-NIngo Molnar
do not normalize kernel threads via SysRq-N: the migration threads, softlockup threads, etc. might be essential for the system to function properly. So only zap user tasks. pointed out by Andi Kleen. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: remove stale comment from sched_group_set_shares()Andi Kleen
remove stale comment from sched_group_set_shares(). Function never returns -EINVAL. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: clean up is_migration_thread()Ingo Molnar
clean up is_migration_thread() and turn it into an inline function. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: cleanup: refactor normalize_rt_tasksAndi Kleen
Replace a particularly ugly ifdef with an inline and a new macro. Also split up the function to be easier to read. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: cleanup: refactor common code of sleep_on / wait_for_completionAndi Kleen
Refactor common code of sleep_on / wait_for_completion These functions were largely cut'n'pasted. This moves the common code into single helpers instead. Advantage is about 1k less code on x86-64 and 91 lines of code removed. It adds one function call to the non timeout version of the functions; i don't expect this to be measurable. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: cleanup: remove unnecessary gotosAndi Kleen
Replace loops implemented with gotos with real loops. Replace err = ...; goto x; x: return err; with return ...; No functional changes. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: update commentIngo Molnar
update comment: clarify time-slices and remove obsolete tuning detail. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: prevent wakeup over-schedulingMike Galbraith
Prevent wakeup over-scheduling. Once a task has been preempted by a task of the same or lower priority, it becomes ineligible for repeated preemption by same until it has been ticked, or slept. Instead, the task is marked for preemption at the next tick. Tasks of higher priority still preempt immediately. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-15sched: disable forced preemption by defaultPeter Zijlstra
Implement feature bit to disable forced preemption. This way it can be checked whether a workload is overscheduling or not. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>