Age | Commit message (Collapse) | Author |
|
commit 65ecc14a30ad21bed9aabdfd6a2ae1a1aaaa6a00 upstream, tweaked to get
it to apply to 2.6.27
For some unknown reason at Steven Rostedt added in disabling of the SPE
instruction generation for e500 based PPC cores in commit
6ec562328fda585be2d7f472cfac99d3b44d362a.
We are removing it because:
1. It generates e500 kernels that don't work
2. its not the correct set of flags to do this
3. we handle this in the arch/powerpc/Makefile already
4. its unknown in talking to Steven why he did this
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Tested-and-Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 8f7b0ba1c853919b85b54774775f567f30006107 upstream.
Inotify watch removals suck violently.
To kick the watch out we need (in this order) inode->inotify_mutex and
ih->mutex. That's fine if we have a hold on inode; however, for all
other cases we need to make damn sure we don't race with umount. We can
*NOT* just grab a reference to a watch - inotify_unmount_inodes() will
happily sail past it and we'll end with reference to inode potentially
outliving its superblock.
Ideally we just want to grab an active reference to superblock if we
can; that will make sure we won't go into inotify_umount_inodes() until
we are done. Cleanup is just deactivate_super().
However, that leaves a messy case - what if we *are* racing with
umount() and active references to superblock can't be acquired anymore?
We can bump ->s_count, grab ->s_umount, which will almost certainly wait
until the superblock is shut down and the watch in question is pining
for fjords. That's fine, but there is a problem - we might have hit the
window between ->s_active getting to 0 / ->s_count - below S_BIAS (i.e.
the moment when superblock is past the point of no return and is heading
for shutdown) and the moment when deactivate_super() acquires
->s_umount.
We could just do drop_super() yield() and retry, but that's rather
antisocial and this stuff is luser-triggerable. OTOH, having grabbed
->s_umount and having found that we'd got there first (i.e. that
->s_root is non-NULL) we know that we won't race with
inotify_umount_inodes().
So we could grab a reference to watch and do the rest as above, just
with drop_super() instead of deactivate_super(), right? Wrong. We had
to drop ih->mutex before we could grab ->s_umount. So the watch
could've been gone already.
That still can be dealt with - we need to save watch->wd, do idr_find()
and compare its result with our pointer. If they match, we either have
the damn thing still alive or we'd lost not one but two races at once,
the watch had been killed and a new one got created with the same ->wd
at the same address. That couldn't have happened in inotify_destroy(),
but inotify_rm_wd() could run into that. Still, "new one got created"
is not a problem - we have every right to kill it or leave it alone,
whatever's more convenient.
So we can use idr_find(...) == watch && watch->inode->i_sb == sb as
"grab it and kill it" check. If it's been our original watch, we are
fine, if it's a newcomer - nevermind, just pretend that we'd won the
race and kill the fscker anyway; we are safe since we know that its
superblock won't be going away.
And yes, this is far beyond mere "not very pretty"; so's the entire
concept of inotify to start with.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 7ef9964e6d1b911b78709f144000aacadd0ebc21 upstream.
It has been thought that the per-user file descriptors limit would also
limit the resources that a normal user can request via the epoll
interface. Vegard Nossum reported a very simple program (a modified
version attached) that can make a normal user to request a pretty large
amount of kernel memory, well within the its maximum number of fds. To
solve such problem, default limits are now imposed, and /proc based
configuration has been introduced. A new directory has been created,
named /proc/sys/fs/epoll/ and inside there, there are two configuration
points:
max_user_instances = Maximum number of devices - per user
max_user_watches = Maximum number of "watched" fds - per user
The current default for "max_user_watches" limits the memory used by epoll
to store "watches", to 1/32 of the amount of the low RAM. As example, a
256MB 32bit machine, will have "max_user_watches" set to roughly 90000.
That should be enough to not break existing heavy epoll users. The
default value for "max_user_instances" is set to 128, that should be
enough too.
This also changes the userspace, because a new error code can now come out
from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already
listed, so that should be ok.
[akpm@linux-foundation.org: use get_current_user()]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Reported-by: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 33d283bef23132c48195eafc21449f8ba88fce6b upstream.
Try this, and you'll get oops immediately:
# cd Documentation/accounting/
# gcc -o getdelays getdelays.c
# mount -t cgroup -o debug xxx /mnt
# ./getdelays -C /mnt/tasks
Because a normal file's dentry->d_fsdata is a pointer to struct cftype,
not struct cgroup.
After the patch, it returns EINVAL if we try to get cgroupstats
from a normal file.
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 700018e0a77b4113172257fcdaa1c58e27a5074f upstream.
Impact: properly rebuild sched-domains on kmalloc() failure
When cpuset failed to generate sched domains due to kmalloc()
failure, the scheduler should fallback to the single partition
'fallback_doms' and rebuild sched domains, but now it only
destroys but not rebuilds sched domains.
The regression was introduced by:
| commit dfb512ec4834116124da61d6c1ee10fd0aa32bd6
| Author: Max Krasnyansky <maxk@qualcomm.com>
| Date: Fri Aug 29 13:11:41 2008 -0700
|
| sched: arch_reinit_sched_domains() must destroy domains to force rebuild
After the above commit, partition_sched_domains(0, NULL, NULL) will
only destroy sched domains and partition_sched_domains(1, NULL, NULL)
will create the default sched domain.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 24eb089950ce44603b30a3145a2c8520e2b55bb1 upstream
This fixes an oops when reading /proc/sched_debug.
A cgroup won't be removed completely until finishing cgroup_diput(), so we
shouldn't invalidate cgrp->dentry in cgroup_rmdir(). Otherwise, when a
group is being removed while cgroup_path() gets called, we may trigger
NULL dereference BUG.
The bug can be reproduced:
# cat test.sh
#!/bin/sh
mount -t cgroup -o cpu xxx /mnt
for (( ; ; ))
{
mkdir /mnt/sub
rmdir /mnt/sub
}
# ./test.sh &
# cat /proc/sched_debug
BUG: unable to handle kernel NULL pointer dereference at 00000038
IP: [<c045a47f>] cgroup_path+0x39/0x90
..
Call Trace:
[<c0420344>] ? print_cfs_rq+0x6e/0x75d
[<c0421160>] ? sched_debug_show+0x72d/0xc1e
..
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 5b7dba4ff834259a5623e03a565748704a8fe449 upstream
sched_clock: prevent scd->clock from moving backwards
When sched_clock_cpu() couples the clocks between two cpus, it may
increment scd->clock beyond the GTOD tick window that __update_sched_clock()
uses to clamp the clock. A later call to __update_sched_clock() may move
the clock back to scd->tick_gtod + TICK_NSEC, violating the clock's
monotonic property.
This patch ensures that scd->clock will not be set backward.
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 0c4b83da58ec2e96ce9c44c211d6eac5f9dae478 upstream
sched: disable the hrtick for now
David Miller reported that hrtick update overhead has tripled the
wakeup overhead on Sparc64.
That is too much - disable the HRTICK feature for now by default,
until a faster implementation is found.
Reported-by: David Miller <davem@davemloft.net>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit c851c8676bd7ae456e9b3af8e6bb2c434eddcc75 upstream
If NR_CPUS isn't a multiple of 32, we get a truncated string of sched
domains by catting /proc/schedstat. This is caused by the wrong mask_len.
This patch fixes it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit e94320939f44e0cbaccc3f259a5778abced4949c upstream
Fix "notes" kobject leak
It happens every rmmod if KALLSYMS=y and SYSFS=y.
# modprobe foo
kobject: 'foo' (ffffffffa00743d0): kobject_add_internal: parent: 'module', set: 'module'
kobject: 'holders' (ffff88017e7c5770): kobject_add_internal: parent: 'foo', set: '<NULL>'
kobject: 'foo' (ffffffffa00743d0): kobject_uevent_env
kobject: 'foo' (ffffffffa00743d0): fill_kobj_path: path = '/module/foo'
kobject: 'notes' (ffff88017fa9b668): kobject_add_internal: parent: 'foo', set: '<NULL>'
^^^^^
# rmmod foo
kobject: 'holders' (ffff88017e7c5770): kobject_cleanup
kobject: 'holders' (ffff88017e7c5770): auto cleanup kobject_del
kobject: 'holders' (ffff88017e7c5770): calling ktype release
kobject: (ffff88017e7c5770): dynamic_kobj_release
kobject: 'holders': free name
kobject: 'foo' (ffffffffa00743d0): kobject_cleanup
kobject: 'foo' (ffffffffa00743d0): does not have a release() function, it is broken and must be fixed.
kobject: 'foo' (ffffffffa00743d0): auto cleanup 'remove' event
kobject: 'foo' (ffffffffa00743d0): kobject_uevent_env
kobject: 'foo' (ffffffffa00743d0): fill_kobj_path: path = '/module/foo'
kobject: 'foo' (ffffffffa00743d0): auto cleanup kobject_del
kobject: 'foo': free name
[whooops]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit f6121f4f8708195e88cbdf8dd8d171b226b3f858 upstream
While working on the new version of the code for SCHED_SPORADIC I
noticed something strange in the present throttling mechanism. More
specifically in the throttling timer handler in sched_rt.c
(do_sched_rt_period_timer()) and in rt_rq_enqueue().
The problem is that, when unthrottling a runqueue, rt_rq_enqueue() only
asks for rescheduling if the runqueue has a sched_entity associated to
it (i.e., rt_rq->rt_se != NULL).
Now, if the runqueue is the root rq (which has a rt_se = NULL)
rescheduling does not take place, and it is delayed to some undefined
instant in the future.
This imply some random bandwidth usage by the RT tasks under throttling.
For instance, setting rt_runtime_us/rt_period_us = 950ms/1000ms an RT
task will get less than 95%. In our tests we got something varying
between 70% to 95%.
Using smaller time values, e.g., 95ms/100ms, things are even worse, and
I can see values also going down to 20-25%!!
The tests we performed are simply running 'yes' as a SCHED_FIFO task,
and checking the CPU usage with top, but we can investigate thoroughly
if you think it is needed.
Things go much better, for us, with the attached patch... Don't know if
it is the best approach, but it solved the issue for us.
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
While debugging the e1000e corruption bug with Intel, we discovered
today that the dynamic ftrace code in mainline is the likely source of
this bug.
For the stable kernel we are providing the only viable fix patch: labeling
CONFIG_DYNAMIC_FTRACE as broken. (see the patch below)
We will follow up with a backport patch that contains the fixes. But since
the fixes are not a one liner, the safest approach for now is to
disable the code in question.
The cause of the bug is due to the way the current code in mainline
handles dynamic ftrace. When dynamic ftrace is turned on, it also
turns on CONFIG_FTRACE which enables the -pg config in gcc that places
a call to mcount at every function call. With just CONFIG_FTRACE this
causes a noticeable overhead. CONFIG_DYNAMIC_FTRACE works to ease this
overhead by dynamically updating the mcount call sites into nops.
The problem arises when we trace functions and modules are unloaded.
The first time a function is called, it will call mcount and the mcount
call will call ftrace_record_ip. This records the calling site and
stores it in a preallocated hash table. Later on a daemon will
wake up and call kstop_machine and convert any mcount callers into
nops.
The evolution of this code first tried to do this without the kstop_machine
and used cmpxchg to update the callers as they were called. But I
was informed that this is dangerous to do on SMP machines if another
CPU is running that same code. The solution was to do this with
kstop_machine.
We still used cmpxchg to test if the code that we are modifying is
indeed code that we expect to be before updating it - as a final
line of defense.
But on 32bit machines, ioremapped memory and modules share the same
address space. When a module would load its code into memory and execute
some code, that would register the function.
On module unload, ftrace incorrectly did not zap these functions from
its hash (this was the bug). The cmpxchg could have saved us in most
cases (via luck) - but with ioremap-ed memory that was exactly the wrong
thing to do - the results of cmpxchg on device memory are undefined.
(and will likely result in a write)
The pending .28 ftrace tree does not have this bug anymore, as a general push
towards more robustness of code patching, this is done differently: we do not
use cmpxchg and we do a WARN_ON and turn the tracer off if anything deviates
from its expected state. Furthermore, patch sites are statically identified
during build time so there's no runtime discovery of dynamic code areas
anymore, and no room for code unmaps to cause the hash to become out of date.
We believe the fragility of dynamic patching has been sufficiently
addressed in the development code via the static patching method, but further
suggestions to make it more robust are welcome.
Signed-off-by: Steven Rostedt <srostedt@goodmis.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
The softlockup watchdog needs to be touched when resuming the from the
kgdb stopped state to avoid the printk that a CPU is stuck if the
debugger was active for longer than the softlockup threshold.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
|
|
Impact: jiffies increment too fast.
Hugh Dickins noted that with NOHZ=n and HIGHRES=n jiffies get
incremented too fast. The reason is a wrong check in the broadcast
enter/exit code, which keeps the local apic timer in periodic mode
when the switch happens.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Found by static checker (http://repo.or.cz/w/smatch.git).
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
hrtimer: prevent migration of per CPU hrtimers
hrtimer: mark migration state
hrtimer: fix migration of CB_IRQSAFE_NO_SOFTIRQ hrtimers
hrtimer: migrate pending list on cpu offline
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
|
There's a race between mm->owner assignment and swapoff, more easily
seen when task slab poisoning is turned on. The condition occurs when
try_to_unuse() runs in parallel with an exiting task. A similar race
can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
or ptrace or page migration.
CPU0 CPU1
try_to_unuse
looks at mm = task0->mm
increments mm->mm_users
task 0 exits
mm->owner needs to be updated, but no
new owner is found (mm_users > 1, but
no other task has task->mm = task0->mm)
mm_update_next_owner() leaves
mmput(mm) decrements mm->mm_users
task0 freed
dereferencing mm->owner fails
The fix is to notify the subsystem via mm_owner_changed callback(),
if no new owner is found, by specifying the new task as NULL.
Jiri Slaby:
mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
must be set after that, so as not to pass NULL as old owner causing oops.
Daisuke Nishimura:
mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
and its callers need to take account of this situation to avoid oops.
Hugh Dickins:
Lockdep warning and hang below exec_mmap() when testing these patches.
exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
so exec_mmap() now needs to do the same. And with that repositioning,
there's now no point in mm_need_new_owner() allowing for NULL mm.
Reported-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Impact: per CPU hrtimers can be migrated from a dead CPU
The hrtimer code has no knowledge about per CPU timers, but we need to
prevent the migration of such timers and warn when such a timer is
active at migration time.
Explicitely mark the timers as per CPU and use a more understandable
mode descriptor for the interrupts safe unlocked callback mode, which
is used by hrtimer_sleeper and the scheduler code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Impact: during migration active hrtimers can be seen as inactive
The migration code removes the hrtimers from the queues of the dead
CPU and sets the state temporary to INACTIVE. The enqueue code sets it
to ACTIVE/PENDING again.
Prevent that the wrong state can be seen by using a separate migration
state bit.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Impact: Stale timers after a CPU went offline.
commit 37bb6cb4097e29ffee970065b74499cbf10603a3
hrtimer: unlock hrtimer_wakeup
changed the hrtimer sleeper callback mode to CB_IRQSAFE_NO_SOFTIRQ due
to locking problems. A result of this change is that when enqueue is
called for an already expired hrtimer the callback function is not
longer called directly from the enqueue code. The normal callers have
been fixed in the code, but the migration code which moves hrtimers
from a dead CPU to a live CPU was not made aware of this.
This can be fixed by checking the timer state after the call to
enqueue in the migration code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Impact: hrtimers which are on the pending list are not migrated at cpu
offline and can be stale forever
Add the pending list migration when CONFIG_HIGH_RES_TIMERS is enabled
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
On the x86 arch, user space single step exceptions should be ignored
if they occur in the kernel space, such as ptrace stepping through a
system call.
First check if it is kgdb that is executing a single step, then ensure
it is not an accidental traversal into the user space, while in kgdb,
any other time the TIF_SINGLESTEP is set, kgdb should ignore the
exception.
On x86, arm, mips and powerpc, the kgdb_contthread usage was
inconsistent with the way single stepping is implemented in the kgdb
core. The arch specific stub should always set the
kgdb_cpu_doing_single_step correctly if it is single stepping. This
allows kgdb to correctly process an instruction steps if ptrace
happens to be requesting an instruction step over a system call.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
|
|
On the ARM architecture, kgdb will crash the kernel if the last byte
of valid memory is written due to a flush_icache_range flushing
beyond the memory boundary.
Signed-off-by: Atsuo Igarashi <atsuo_igarashi@tripeaks.co.jp>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timers: fix build error in !oneshot case
x86: c1e_idle: don't mark TSC unstable if CPU has invariant TSC
x86: prevent C-states hang on AMD C1E enabled machines
clockevents: prevent mode mismatch on cpu online
clockevents: check broadcast device not tick device
clockevents: prevent stale tick_next_period for onlining CPUs
x86: prevent stale state of c1e_mask across CPU offline/online
clockevents: prevent cpu online to interfere with nohz
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: fix init_hrtick() section mismatch warning
|
|
A segmentation fault can occur in kimage_add_entry in kexec.c when loading
a kernel image into memory. The fault occurs because a page is requested
by calling kimage_alloc_page with gfp_mask GFP_KERNEL and the function may
actually return a page with gfp_mask GFP_HIGHUSER. The high mem page is
returned because it was swapped with the kernel page due to the kernel
page being a page that will shortly be copied to.
This patch ensures that kimage_alloc_page returns a page that was created
with the correct gfp flags.
I have verified the change and fixed the whitespace damage of the original
patch. Jonathan did a great job of tracking this down after he hit the
problem. -- Eric
Signed-off-by: Jonathan Steel <jon.steel@esentire.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
kernel/time/tick-common.c: In function ‘tick_setup_periodic’:
kernel/time/tick-common.c:113: error: implicit declaration of function ‘tick_broadcast_oneshot_active’
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Impact: timer hang on CPU online observed on AMD C1E systems
When a CPU is brought online then the broadcast machinery can
be in the one shot state already. Check this and setup the timer
device of the new CPU in one shot mode so the broadcast code
can pick up the next_event value correctly.
Another AMD C1E oddity, as we switch to broadcast immediately and
not after the full bring up via the ACPI cpu idle code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Impact: Possible hang on CPU online observed on AMD C1E machines.
The broadcast setup code looks at the mode of the tick device to
determine whether it needs to be shut down or setup. This is wrong
when the broadcast mode is set to one shot already. This can happen
when a CPU is brought online as it goes through the periodic setup
first.
The problem went unnoticed as sane systems do not call into that code
before the switch to one shot for the clock event device happens.
The AMD C1E idle routine switches over immediately and thereby shuts
down the just setup device before the first interrupt happens.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Impact: possible hang on CPU onlining in timer one shot mode.
The tick_next_period variable is only used during boot on nohz/highres
enabled systems, but for CPU onlining it needs to be maintained when
the per cpu clock events device operates in one shot mode.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Impact: rare hang which can be triggered on CPU online.
tick_do_timer_cpu keeps track of the CPU which updates jiffies
via do_timer. The value -1 is used to signal, that currently no
CPU is doing this. There are two cases, where the variable can
have this state:
boot:
necessary for systems where the boot cpu id can be != 0
nohz long idle sleep:
When the CPU which did the jiffies update last goes into
a long idle sleep it drops the update jiffies duty so
another CPU which is not idle can pick it up and keep
jiffies going.
Using the same value for both situations is wrong, as the CPU online
code can see the -1 state when the timer of the newly onlined CPU is
setup. The setup for a newly onlined CPU goes through periodic mode
and can pick up the do_timer duty without being aware of the nohz /
highres mode of the already running system.
Use two separate states and make them constants to avoid magic
numbers confusion.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
LD kernel/built-in.o
WARNING: kernel/built-in.o(.text+0x326): Section mismatch in reference
from the function init_hrtick() to the variable
.cpuinit.data:hotplug_hrtick_nb.8
The function init_hrtick() references
the variable __cpuinitdata hotplug_hrtick_nb.8.
This is often because init_hrtick lacks a __cpuinitdata
annotation or the annotation of hotplug_hrtick_nb.8 is wrong.
Signed-off-by: Md.Rakib H. Mullick <rakib.mullick@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: fix deadlock in setting scheduler parameter to zero
sched: fix 2.6.27-rc5 couldn't boot on tulsa machine randomly
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
clockevents: make device shutdown robust
clocksource, acpi_pm.c: fix check for monotonicity
clockevents: remove WARN_ON which was used to gather information
|
|
The device shut down does not cleanup the next_event variable of the
clock event device. So when the device is reactivated the possible
stale next_event value can prevent the device to be reprogrammed as it
claims to wait on a event already.
This is the root cause of the resurfacing suspend/resume problem,
where systems need key press to come back to life.
Fix this by setting next_event to KTIME_MAX when the device is shut
down. Use a separate function for shutdown which takes care of that
and only keep the direct set mode call in the broadcast code, where we
can not touch the next_event value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
After the patch:
commit 0b2f630a28d53b5a2082a5275bc3334b10373508
Author: Miao Xie <miaox@cn.fujitsu.com>
Date: Fri Jul 25 01:47:21 2008 -0700
cpusets: restructure the function update_cpumask() and update_nodemask()
It might happen that 'echo 0 > /cpuset/sub/cpus' returned failure but 'cpus'
has been changed, because cpus was changed before calling heap_init() which
may return -ENOMEM.
This patch restores the orginal behavior.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Andrei Gusev wrote:
> I played witch scheduler settings. After doing something like:
> echo -n 1000000 >sched_rt_period_us
>
> command is locked. I found in kernel.log:
>
> Sep 11 00:39:34 zaratustra
> Sep 11 00:39:34 zaratustra Pid: 4495, comm: bash Tainted: G W
> (2.6.26.3 #12)
> Sep 11 00:39:34 zaratustra EIP: 0060:[<c0213fc7>] EFLAGS: 00210246 CPU: 0
> Sep 11 00:39:34 zaratustra EIP is at div64_u64+0x57/0x80
> Sep 11 00:39:34 zaratustra EAX: 0000389f EBX: 00000000 ECX: 00000000
> EDX: 00000000
> Sep 11 00:39:34 zaratustra ESI: d9800000 EDI: d9800000 EBP: 0000389f
> ESP: ea7a6edc
> Sep 11 00:39:34 zaratustra DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
> Sep 11 00:39:34 zaratustra Process bash (pid: 4495, ti=ea7a6000
> task=ea744000 task.ti=ea7a6000)
> Sep 11 00:39:34 zaratustra Stack: 00000000 000003e8 d9800000 0000389f
> c0119042 00000000 00000000 00000001
> Sep 11 00:39:34 zaratustra 00000000 00000000 ea7a6f54 00010000 00000000
> c04d2e80 00000001 000e7ef0
> Sep 11 00:39:34 zaratustra c01191a3 00000000 00000000 ea7a6fa0 00000001
> ffffffff c04d2e80 ea5b2480
> Sep 11 00:39:34 zaratustra Call Trace:
> Sep 11 00:39:34 zaratustra [<c0119042>] __rt_schedulable+0x52/0x130
> Sep 11 00:39:34 zaratustra [<c01191a3>] sched_rt_handler+0x83/0x120
> Sep 11 00:39:34 zaratustra [<c01a76a6>] proc_sys_call_handler+0xb6/0xd0
> Sep 11 00:39:34 zaratustra [<c01a76c0>] proc_sys_write+0x0/0x20
> Sep 11 00:39:34 zaratustra [<c01a76d9>] proc_sys_write+0x19/0x20
> Sep 11 00:39:34 zaratustra [<c016cc68>] vfs_write+0xa8/0x140
> Sep 11 00:39:34 zaratustra [<c016cdd1>] sys_write+0x41/0x80
> Sep 11 00:39:34 zaratustra [<c0103051>] sysenter_past_esp+0x6a/0x91
> Sep 11 00:39:34 zaratustra =======================
> Sep 11 00:39:34 zaratustra Code: c8 41 0f ad f3 d3 ee f6 c1 20 0f 45 de
> 31 f6 0f ad ef d3 ed f6 c1 20 0f 45 fd 0f 45 ee 31 c9 39 eb 89 fe 89 ea
> 77 08 89 e8 31 d2 <f7> f3 89 c1 89 f0 8b 7c 24 08 f7 f3 8b 74 24 04 89
> ca 8b 1c 24
> Sep 11 00:39:34 zaratustra EIP: [<c0213fc7>] div64_u64+0x57/0x80 SS:ESP
> 0068:ea7a6edc
> Sep 11 00:39:34 zaratustra ---[ end trace 4eaa2a86a8e2da22 ]---
fix the boundary condition.
sysctl_sched_rt_period=0 makes exception at to_ratio().
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
On my tulsa x86-64 machine, kernel 2.6.25-rc5 couldn't boot randomly.
Basically, function __enable_runtime forgets to reset rt_rq->rt_throttled
to 0. When every cpu is up, per-cpu migration_thread is created and it runs
very fast, sometimes to mark the corresponding rt_rq->rt_throttled to 1 very
quickly. After all cpus are up, with below calling chain:
sched_init_smp => arch_init_sched_domains => build_sched_domains => ...
=> cpu_attach_domain => rq_attach_root => set_rq_online => ...
=> _enable_runtime
_enable_runtime is called against every rt_rq again, so rt_rq->rt_time is
reset to 0, but rt_rq->rt_throttled might be still 1. Later on function
do_sched_rt_period_timer couldn't reset it, and all RT tasks couldn't be
scheduled to run on that cpu. here is RT task migration_thread which is
woken up when a task is migrated to another cpu.
Below patch fixes it against 2.6.27-rc5.
Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
The issue of the endless reprogramming loop due to a too small
min_delta_ns was fixed with the previous updates of the clock events
code, but we had no information about the spread of this problem. I
added a WARN_ON to get automated information via kerneloops.org and to
get some direct reports, which allowed me to analyse the affected
machines.
The WARN_ON has served its purpose and would be annoying for a release
kernel. Remove it and just keep the information about the increase of
the min_delta_ns value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: arch_reinit_sched_domains() must destroy domains to force rebuild
sched, cpuset: rework sched domains and CPU hotplug handling (v4)
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
clocksource, acpi_pm.c: check for monotonicity
clocksource, acpi_pm.c: use proper read function also in errata mode
ntp: fix calculation of the next jiffie to trigger RTC sync
x86: HPET: read back compare register before reading counter
x86: HPET fix moronic 32/64bit thinko
clockevents: broadcast fixup possible waiters
HPET: make minimum reprogramming delta useful
clockevents: prevent endless loop lockup
clockevents: prevent multiple init/shutdown
clockevents: enforce reprogram in oneshot setup
clockevents: prevent endless loop in periodic broadcast handler
clockevents: prevent clockevent event_handler ending up handler_noop
|
|
|
|
What I realized recently is that calling rebuild_sched_domains() in
arch_reinit_sched_domains() by itself is not enough when cpusets are enabled.
partition_sched_domains() code is trying to avoid unnecessary domain rebuilds
and will not actually rebuild anything if new domain masks match the old ones.
What this means is that doing
echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
on a system with cpusets enabled will not take affect untill something changes
in the cpuset setup (ie new sets created or deleted).
This patch fixes restore correct behaviour where domains must be rebuilt in
order to enable MC powersaving flags.
Test on quad-core Core2 box with both CONFIG_CPUSETS and !CONFIG_CPUSETS.
Also tested on dual-core Core2 laptop. Lockdep is happy and things are working
as expected.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Tested-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
We have a bug in the calculation of the next jiffie to trigger the RTC
synchronisation. The aim here is to run sync_cmos_clock() as close as
possible to the middle of a second. Which means we want this function to
be called less than or equal to half a jiffie away from when now.tv_nsec
equals 5e8 (500000000).
If this is not the case for a given call to the function, for this purpose
instead of updating the RTC we calculate the offset in nanoseconds to the
next point in time where now.tv_nsec will be equal 5e8. The calculated
offset is then converted to jiffies as these are the unit used by the
timer.
Hovewer timespec_to_jiffies() used here uses a ceil()-type rounding mode,
where the resulting value is rounded up. As a result the range of
now.tv_nsec when the timer will trigger is from 5e8 to 5e8 + TICK_NSEC
rather than the desired 5e8 - TICK_NSEC / 2 to 5e8 + TICK_NSEC / 2.
As a result if for example sync_cmos_clock() happens to be called at the
time when now.tv_nsec is between 5e8 + TICK_NSEC / 2 and 5e8 to 5e8 +
TICK_NSEC, it will simply be rescheduled HZ jiffies later, falling in the
same range of now.tv_nsec again. Similarly for cases offsetted by an
integer multiple of TICK_NSEC.
This change addresses the problem by subtracting TICK_NSEC / 2 from the
nanosecond offset to the next point in time where now.tv_nsec will be
equal 5e8, effectively shifting the following rounding in
timespec_to_jiffies() so that it produces a rounded-to-nearest result.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Until the C1E patches arrived there where no users of periodic broadcast
before switching to oneshot mode. Now we need to trigger a possible
waiter for a periodic broadcast when switching to oneshot mode.
Otherwise we can starve them for ever.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Spencer reported a problem where utime and stime were going negative despite
the fixes in commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa. The suspected
reason for the problem is that signal_struct maintains it's own utime and
stime (of exited tasks), these are not updated using the new task_utime()
routine, hence sig->utime can go backwards and cause the same problem
to occur (sig->utime, adds tsk->utime and not task_utime()). This patch
fixes the problem
TODO: using max(task->prev_utime, derived utime) works for now, but a more
generic solution is to implement cputime_max() and use the cputime_gt()
function for comparison.
Reported-by: spencer@bluehost.com
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
If HLT stops the TSC, we'll fail to account idle time, thereby inflating the
actual process times. Fix this by re-calibrating the clock against GTOD when
leaving nohz mode.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
The C1E/HPET bug reports on AMDX2/RS690 systems where tracked down to a
too small value of the HPET minumum delta for programming an event.
The clockevents code needs to enforce an interrupt event on the clock event
device in some cases. The enforcement code was stupid and naive, as it just
added the minimum delta to the current time and tried to reprogram the device.
When the minimum delta is too small, then this loops forever.
Add a sanity check. Allow reprogramming to fail 3 times, then print a warning
and double the minimum delta value to make sure, that this does not happen again.
Use the same function for both tick-oneshot and tick-broadcast code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
While chasing the C1E/HPET bugreports I went through the clock events
code inch by inch and found that the broadcast device can be initialized
and shutdown multiple times. Multiple shutdowns are not critical, but
useless waste of time. Multiple initializations are simply broken. Another
CPU might have the device in use already after the first initialization and
the second init could just render it unusable again.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
In tick_oneshot_setup we program the device to the given next_event,
but we do not check the return value. We need to make sure that the
device is programmed enforced so the interrupt handler engine starts
working. Split out the reprogramming function from tick_program_event()
and call it with the device, which was handed in to tick_setup_oneshot().
Set the force argument, so the devices is firing an interrupt.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|