Age | Commit message (Collapse) | Author |
|
This is a backport of 6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
This should have been backported when it was commited, but I
mistook the problem as requiring the ntp_lock changes
that landed in 3.4 in order for it to occur.
Unfortunately the same issue can happen (with only one cpu)
as follows:
do_adjtimex()
write_seqlock_irq(&xtime_lock);
process_adjtimex_modes()
process_adj_status()
ntp_start_leap_timer()
hrtimer_start()
hrtimer_reprogram()
tick_program_event()
clockevents_program_event()
ktime_get()
seq = req_seqbegin(xtime_lock); [DEADLOCK]
This deadlock will no always occur, as it requires the
leap_timer to force a hrtimer_reprogram which only happens
if its set and there's no sooner timer to expire.
NOTE: This patch, being faithful to the original commit,
introduces a bug (we don't update wall_to_monotonic),
which will be resovled by backporting a following fix.
Original commit message below:
Since commit 7dffa3c673fbcf835cd7be80bb4aec8ad3f51168 the ntp
subsystem has used an hrtimer for triggering the leapsecond
adjustment. However, this can cause a potential livelock.
Thomas diagnosed this as the following pattern:
CPU 0 CPU 1
do_adjtimex()
spin_lock_irq(&ntp_lock);
process_adjtimex_modes(); timer_interrupt()
process_adj_status(); do_timer()
ntp_start_leap_timer(); write_lock(&xtime_lock);
hrtimer_start(); update_wall_time();
hrtimer_reprogram(); ntp_tick_length()
tick_program_event() spin_lock(&ntp_lock);
clockevents_program_event()
ktime_get()
seq = req_seqbegin(xtime_lock);
This patch tries to avoid the problem by reverting back to not using
an hrtimer to inject leapseconds, and instead we handle the leapsecond
processing in the second_overflow() function.
The downside to this change is that on systems that support highres
timers, the leap second processing will occur on a HZ tick boundary,
(ie: ~1-10ms, depending on HZ) after the leap second instead of
possibly sooner (~34us in my tests w/ x86_64 lapic).
This patch applies on top of tip/timers/core.
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Diagnoised-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
|
|
commit e6780f7243eddb133cc20ec37fa69317c218b709 upstream.
It was found (by Sasha) that if you use a futex located in the gate
area we get stuck in an uninterruptible infinite loop, much like the
ZERO_PAGE issue.
While looking at this problem, PeterZ realized you'll get into similar
trouble when hitting any install_special_pages() mapping. And are there
still drivers setting up their own special mmaps without page->mapping,
and without special VM or pte flags to make get_user_pages fail?
In most cases, if page->mapping is NULL, we do not need to retry at all:
Linus points out that even /proc/sys/vm/drop_caches poses no problem,
because it ends up using remove_mapping(), which takes care not to
interfere when the page reference count is raised.
But there is still one case which does need a retry: if memory pressure
called shmem_writepage in between get_user_pages_fast dropping page
table lock and our acquiring page lock, then the page gets switched from
filecache to swapcache (and ->mapping set to NULL) whatever the refcount.
Fault it back in to get the page->mapping needed for key->shared.inode.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[PG: 2.6.34 variable is page, not page_head, since it doesn't have a5b338f2]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
|
|
[ Upstream commit b298d289c79211508f11cb50749b0d1d54eb244a ]
Commit a144c6a (PM: Print a warning if firmware is requested when tasks
are frozen) introduced usermodehelper_is_disabled() to warn and exit
immediately if firmware is requested when usermodehelpers are disabled.
However, it is racy. Consider the following scenario, currently used in
drivers/base/firmware_class.c:
...
if (usermodehelper_is_disabled())
goto out;
/* Do actual work */
...
out:
return err;
Nothing prevents someone from disabling usermodehelpers just after the check
in the 'if' condition, which means that it is quite possible to try doing the
"actual work" with usermodehelpers disabled, leading to undesirable
consequences.
In particular, this race condition in _request_firmware() causes task freezing
failures whenever suspend/hibernation is in progress because, it wrongly waits
to get the firmware/microcode image from userspace when actually the
usermodehelpers are disabled or userspace has been frozen.
Some of the example scenarios that cause freezing failures due to this race
are those that depend on userspace via request_firmware(), such as x86
microcode module initialization and microcode image reload.
Previous discussions about this issue can be found at:
http://thread.gmane.org/gmane.linux.kernel/1198291/focus=1200591
This patch adds proper synchronization to fix this issue.
It is to be noted that this patchset fixes the freezing failures but doesn't
remove the warnings. IOW, it does not attempt to add explicit synchronization
to x86 microcode driver to avoid requesting microcode image at inopportune
moments. Because, the warnings were introduced to highlight such cases, in the
first place. And we need not silence the warnings, since we take care of the
*real* problem (freezing failure) and hence, after that, the warnings are
pretty harmless anyway.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit a144c6a6c924aa1da04dd77fb84b89927354fdff ]
Some drivers erroneously use request_firmware() from their ->resume()
(or ->thaw(), or ->restore()) callbacks, which is not going to work
unless the firmware has been built in. This causes system resume to
stall until the firmware-loading timeout expires, which makes users
think that the resume has failed and reboot their machines
unnecessarily. For this reason, make _request_firmware() print a
warning and return immediately with error code if it has been called
when tasks are frozen and it's impossible to start any new usermode
helpers.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Reviewed-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f6302f1bcd75a042df69866d98b8d775a668f8f1 upstream.
"subbuf_size" and "n_subbufs" come from the user and they need to be
capped to prevent an integer overflow.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d496aab567e7e52b3e974c9192a5de6e77dce32c upstream.
Commit ef53d9c5e ("kprobes: improve kretprobe scalability with hashed
locking") introduced a bug where we can potentially leak
kretprobe_instances since we initialize a hlist head after having used
it.
Initialize the hlist head before using it.
Reported by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Srinivasa D S <srinivasa@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 79cfbdfa87e84992d509e6c1648a18e1d7e68c20 upstream.
The CPU hotplug notifications sent out by the _cpu_up() and _cpu_down()
functions depend on the value of the 'tasks_frozen' argument passed to them
(which indicates whether tasks have been frozen or not).
(Examples for such CPU hotplug notifications: CPU_ONLINE, CPU_ONLINE_FROZEN,
CPU_DEAD, CPU_DEAD_FROZEN).
Thus, it is essential that while the callbacks for those notifications are
running, the state of the system with respect to the tasks being frozen or
not remains unchanged, *throughout that duration*. Hence there is a need for
synchronizing the CPU hotplug code with the freezer subsystem.
Since the freezer is involved only in the Suspend/Hibernate call paths, this
patch hooks the CPU hotplug code to the suspend/hibernate notifiers
PM_[SUSPEND|HIBERNATE]_PREPARE and PM_POST_[SUSPEND|HIBERNATE] to prevent
the race between CPU hotplug and freezer, thus ensuring that CPU hotplug
notifications will always be run with the state of the system really being
what the notifications indicate, _throughout_ their execution time.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit f9fab10bbd768b0e5254e53a4a8477a94bfc4b96 upstream.
vfork parent uninterruptibly and unkillably waits for its child to
exec/exit. This wait is of unbounded length. Ignore such waits
in the hung_task detector.
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
LKML-Reference: <1325344394.28904.43.camel@lappy>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 3b87487ac5008072f138953b07505a7e3493327f upstream.
This reverts commit de28f25e8244c7353abed8de0c7792f5f883588c.
It results in resume problems for various people. See for example
http://thread.gmane.org/gmane.linux.kernel/1233033
http://thread.gmane.org/gmane.linux.kernel/1233389
http://thread.gmane.org/gmane.linux.kernel/1233159
http://thread.gmane.org/gmane.linux.kernel/1227868/focus=1230877
and the fedora and ubuntu bug reports
https://bugzilla.redhat.com/show_bug.cgi?id=767248
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/904569
which got bisected down to the stable version of this commit.
Reported-by: Jonathan Nieder <jrnieder@gmail.com>
Reported-by: Phil Miller <mille121@illinois.edu>
Reported-by: Philip Langdale <philipl@overt.org>
Reported-by: Tim Gardner <tim.gardner@canonical.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 1a51410abe7d0ee4b1d112780f46df87d3621043 upstream.
Ok, this isn't optimal, since it means that 'iotop' needs admin
capabilities, and we may have to work on this some more. But at the
same time it is very much not acceptable to let anybody just read
anybody elses IO statistics quite at this level.
Use of the GENL_ADMIN_PERM suggested by Johannes Berg as an alternative
to checking the capabilities by hand.
Reported-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Moritz Mühlenhoff <jmm@inutil.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit de28f25e8244c7353abed8de0c7792f5f883588c upstream.
If a device is shutdown, then there might be a pending interrupt,
which will be processed after we reenable interrupts, which causes the
original handler to be run. If the old handler is the (broadcast)
periodic handler the shutdown state might hang the kernel completely.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit c1be84309c58b1e7c6d626e28fba41a22b364c3d upstream.
When a better rated broadcast device is installed, then the current
active device is not disabled, which results in two running broadcast
devices.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 550acb19269d65f32e9ac4ddb26c2b2070e37f1c upstream.
In irq_wait_for_interrupt(), the should_stop member is verified before
setting the task's state to TASK_INTERRUPTIBLE and calling schedule().
In case kthread_stop sets should_stop and wakes up the process after
should_stop is checked by the irq thread but before the task's state
is changed, the irq thread might never exit:
kthread_stop irq_wait_for_interrupt
------------ ----------------------
...
... while (!kthread_should_stop()) {
kthread->should_stop = 1;
wake_up_process(k);
wait_for_completion(&kthread->exited);
...
set_current_state(TASK_INTERRUPTIBLE);
...
schedule();
}
Fix this by checking if the thread should stop after modifying the
task's state.
[ tglx: Simplified it a bit ]
Signed-off-by: Ido Yariv <ido@wizery.com>
Link: http://lkml.kernel.org/r/1322740508-22640-1-git-send-email-ido@wizery.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit d004e024058a0eaca097513ce62cbcf978913e0a upstream.
ktime_get and ktime_get_ts were calling timekeeping_get_ns()
but later they were not calling arch_gettimeoffset() so architectures
using this mechanism returned 0 ns when calling these functions.
This happened for example when running Busybox's ping which calls
syscall(__NR_clock_gettime, CLOCK_MONOTONIC, ts) which eventually
calls ktime_get. As a result the returned ping travel time was zero.
Signed-off-by: Hector Palacios <hector.palacios@digi.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 9bab0b7fbaceec47d32db51cd9e59c82fb071f5a upstream
This adds a mechanism to resume selected IRQs during syscore_resume
instead of dpm_resume_noirq.
Under Xen we need to resume IRQs associated with IPIs early enough
that the resched IPI is unmasked and we can therefore schedule
ourselves out of the stop_machine where the suspend/resume takes
place.
This issue was introduced by 676dc3cf5bc3 "xen: Use IRQF_FORCE_RESUME".
Back ported to 2.6.32 (which lacks syscore support) by calling the relavant
resume function directly from sysdev_resume).
v2: Fixed non-x86 build errors.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Jeremy Fitzhardinge <Jeremy.Fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1318713254.11016.52.camel@dagon.hellion.org.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 528f7ce6e439edeac38f6b3f8561f1be129b5e91 upstream.
In enter_state() we use "state" as an offset for the pm_states[]
array. The pm_states[] array only has PM_SUSPEND_MAX elements so
this test is off by one.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
This reverts commit 0f12a6ad9fa3a03f2bcee36c9cb704821e244c40.
It causes too many build errors and needs to be done properly.
Reported-by: Jiri Slaby <jslaby@suse.cz>
Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Jeremy Fitzhardinge <Jeremy.Fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 9bab0b7fbaceec47d32db51cd9e59c82fb071f5a upstream
This adds a mechanism to resume selected IRQs during syscore_resume
instead of dpm_resume_noirq.
Under Xen we need to resume IRQs associated with IPIs early enough
that the resched IPI is unmasked and we can therefore schedule
ourselves out of the stop_machine where the suspend/resume takes
place.
This issue was introduced by 676dc3cf5bc3 "xen: Use IRQF_FORCE_RESUME".
Back ported to 2.6.32 (which lacks syscore support) by calling the relavant
resume function directly from sysdev_resume).
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Jeremy Fitzhardinge <Jeremy.Fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1318713254.11016.52.camel@dagon.hellion.org.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit cbbc719fccdb8cbd87350a05c0d33167c9b79365 upstream.
The parameter's origin type is long. On an i386 architecture, it can
easily be larger than 0x80000000, causing this function to convert it
to a sign-extended u64 type.
Change the type to unsigned long so we get the correct result.
Signed-off-by: hank <pyu@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
[ build fix ]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 37252db6aa576c34fd794a5a54fb32d7a8b3a07a upstream.
Due to post-increment in condition of kmod_loop_msg in __request_module(),
the system log can be spammed by much more than 5 instances of the 'runaway
loop' message if the number of events triggering it makes the kmod_loop_msg
to overflow.
Fix that by making sure we never increment it past the threshold.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 9ea71503a8ed9184d2d0b8ccc4d269d05f7940ae upstream.
commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw
parameter from get_futex_key()) in 2.6.33 fixed two problems: First, It
prevented a loop when encountering a ZERO_PAGE. Second, it fixed RW
MAP_PRIVATE futex operations by forcing the COW to occur by
unconditionally performing a write access get_user_pages_fast() to get
the page. The commit also introduced a user-mode regression in that it
broke futex operations on read-only memory maps. For example, this
breaks workloads that have one or more reader processes doing a
FUTEX_WAIT on a futex within a read only shared file mapping, and a
writer processes that has a writable mapping issuing the FUTEX_WAKE.
This fixes the regression for valid futex operations on RO mappings by
trying a RO get_user_pages_fast() when the RW get_user_pages_fast()
fails. This change makes it necessary to also check for invalid use
cases, such as anonymous RO mappings (which can never change) and the
ZERO_PAGE which the commit referenced above was written to address.
This patch does restore the original behavior with RO MAP_PRIVATE
mappings, which have inherent user-mode usage problems and don't really
make sense. With this patch performing a FUTEX_WAIT within a RO
MAP_PRIVATE mapping will be successfully woken provided another process
updates the region of the underlying mapped file. However, the mmap()
man page states that for a MAP_PRIVATE mapping:
It is unspecified whether changes made to the file after
the mmap() call are visible in the mapped region.
So user-mode users attempting to use futex operations on RO MAP_PRIVATE
mappings are depending on unspecified behavior. Additionally a
RO MAP_PRIVATE mapping could fail to wake up in the following case.
Thread-A: call futex(FUTEX_WAIT, memory-region-A).
get_futex_key() return inode based key.
sleep on the key
Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A)
Thread-B: write memory-region-A.
COW happen. This process's memory-region-A become related
to new COWed private (ie PageAnon=1) page.
Thread-B: call futex(FUETX_WAKE, memory-region-A).
get_futex_key() return mm based key.
IOW, we fail to wake up Thread-A.
Once again doing something like this is just silly and users who do
something like this get what they deserve.
While RO MAP_PRIVATE mappings are nonsensical, checking for a private
mapping requires walking the vmas and was deemed too costly to avoid a
userspace hang.
This Patch is based on Peter Zijlstra's initial patch with modifications to
only allow RO mappings for futex operations that need VERIFY_READ access.
Reported-by: David Oliver <david@rgmadvisors.com>
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: peterz@infradead.org
Cc: eric.dumazet@gmail.com
Cc: zvonler@rgmadvisors.com
Cc: hughd@google.com
Link: http://lkml.kernel.org/r/1309450892-30676-1-git-send-email-sbohrer@rgmadvisors.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
The below patch is for -stable only, upstream has a much larger patch
that contains the below hunk in commit a8b0ca17b80e92faab46ee7179ba9e99ccb61233
Vince found that under certain circumstances software event overflows
go wrong and deadlock. Avoid trying to delete a timer from the timer
callback.
Reported-by: Vince Weaver <vweaver1@eecs.utk.edu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 4d4cf23cdde2f8f9324f5684a7f349e182039529 upstream.
There is a bug in free_unnecessary_pages() that causes it to
attempt to free too many pages in some cases, which triggers the
BUG_ON() in memory_bm_clear_bit() for copy_bm. Namely, if
count_data_pages() is initially greater than alloc_normal, we get
to_free_normal equal to 0 and "save" greater from 0. In that case,
if the sum of "save" and count_highmem_pages() is greater than
alloc_highmem, we subtract a positive number from to_free_normal.
Hence, since to_free_normal was 0 before the subtraction and is
an unsigned int, the result is converted to a huge positive number
that is used as the number of pages to free.
Fix this bug by checking if to_free_normal is actually greater
than or equal to the number we're going to subtract from it.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-and-tested-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 6715045ddc7472a22be5e49d4047d2d89b391f45 upstream.
There is a problem in hibernate_preallocate_memory() that it calls
preallocate_image_memory() with an argument that may be greater than
the total number of available non-highmem memory pages. If that's
the case, the OOM condition is guaranteed to trigger, which in turn
can cause significant slowdown to occur during hibernation.
To avoid that, make preallocate_image_memory() adjust its argument
before calling preallocate_image_pages(), so that the total number of
saveable non-highem pages left is not less than the minimum size of
a hibernation image. Change hibernate_preallocate_memory() to try to
allocate from highmem if the number of pages allocated by
preallocate_image_memory() is too low.
Modify free_unnecessary_pages() to take all possible memory
allocation patterns into account.
Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Tested-by: M. Vefa Bicakci <bicave@superonline.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 26c4caea9d697043cc5a458b96411b86d7f6babd upstream.
Currently a single process may register exit handlers unlimited times.
It may lead to a bloated listeners chain and very slow process
terminations.
Eg after 10KK sent TASKSTATS_CMD_ATTR_REGISTER_CPUMASKs ~300 Mb of
kernel memory is stolen for the handlers chain and "time id" shows 2-7
seconds instead of normal 0.003. It makes it possible to exhaust all
kernel memory and to eat much of CPU time by triggerring numerous exits
on a single CPU.
The patch limits the number of times a single process may register
itself on a single CPU to one.
One little issue is kept unfixed - as taskstats_exit() is called before
exit_files() in do_exit(), the orphaned listener entry (if it was not
explicitly deregistered) is kept until the next someone's exit() and
implicit deregistration in send_cpu_listeners(). So, if a process
registered itself as a listener exits and the next spawned process gets
the same pid, it would inherit taskstats attributes.
Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 8440f4b19494467883f8541b7aa28c7bbf6ac92b upstream.
When opening /dev/snapshot device, snapshot_open() creates memory
bitmaps which are freed in snapshot_release(). But if any of the
callbacks called by pm_notifier_call_chain() returns NOTIFY_BAD, open()
fails, snapshot_release() is never called and bitmaps are not freed.
Next attempt to open /dev/snapshot then triggers BUG_ON() check in
create_basic_memory_bitmaps(). This happens e.g. when vmwatchdog module
is active on s390x.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit b5199515c25cca622495eb9c6a8a1d275e775088 upstream.
The clocksource watchdog code is interruptible and it has been
observed that this can trigger false positives which disable the TSC.
The reason is that an interrupt storm or a long running interrupt
handler between the read of the watchdog source and the read of the
TSC brings the two far enough apart that the delta is larger than the
unstable treshold. Move both reads into a short interrupt disabled
region to avoid that.
Reported-and-tested-by: Vernon Mauery <vernux@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit a386b5af8edda1c742ce9f77891e112eefffc005 upstream.
When the clocksource is not a multiple of HZ, the clock will be off. For
acpi_pm, HZ=1000 the error is 127.111 ppm:
The rounding of cycle_interval ends up generating a false error term in
ntp_error accumulation since xtime_interval is not exactly 1/HZ. So, we
subtract out the error caused by the rounding.
This has been visible since 2.6.32-rc2
commit a092ff0f90cae22b2ac8028ecd2c6f6c1a9e4601
time: Implement logarithmic time accumulation
That commit raised NTP_INTERVAL_FREQ and exposed the rounding error.
testing tool: http://n1.taur.dk/permanent/testpmt.c
Also tested with ntpd and a frequency counter.
Signed-off-by: Kasper Pedersen <kkp2010@kasperkp.dk>
Acked-by: john stultz <johnstul@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Will Tisdale <willtisdale@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit dc5f219e88294b93009eef946251251ffffb6d60 upstream.
Xen needs to reenable interrupts which are marked IRQF_NO_SUSPEND in the
resume path. Add a flag to force the reenabling in the resume code.
Tested-and-acked-by: Ian Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit f2513cde93f0957d5dc6c09bc24b0cccd27d8e1d upstream.
The main lock_is_held() user is lockdep_assert_held(), avoid false
assertions in lockdep_off() sections by unconditionally reporting the
lock is taken.
[ the reason this is important is a lockdep_assert_held() in ttwu()
which triggers a warning under lockdep_off() as in printk() which
can trigger another wakeup and lock up due to spinlock
recursion, as reported and heroically debugged by Arne Jansen ]
Reported-and-tested-by: Arne Jansen <lists@die-jansens.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1307398759.2497.966.camel@laptop
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 058e297d34a404caaa5ed277de15698d8dc43000 upstream.
If function tracing is enabled, a read of the filter files will
cause the call to stop_machine to update the function trace sites.
It should only call stop_machine on write.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 07f4beb0b5bbfaf36a64aa00d59e670ec578a95a upstream.
The first cpu which switches from periodic to oneshot mode switches
also the broadcast device into oneshot mode. The broadcast device
serves as a backup for per cpu timers which stop in deeper
C-states. To avoid starvation of the cpus which might be in idle and
depend on broadcast mode it marks the other cpus as broadcast active
and sets the brodcast expiry value of those cpus to the next tick.
The oneshot mode broadcast bit for the other cpus is sticky and gets
only cleared when those cpus exit idle. If a cpu was not idle while
the bit got set in consequence the bit prevents that the broadcast
device is armed on behalf of that cpu when it enters idle for the
first time after it switched to oneshot mode.
In most cases that goes unnoticed as one of the other cpus has usually
a timer pending which keeps the broadcast device armed with a short
timeout. Now if the only cpu which has a short timer active has the
bit set then the broadcast device will not be armed on behalf of that
cpu and will fire way after the expected timer expiry. In the case of
Christians bug report it took ~145 seconds which is about half of the
wrap around time of HPET (the limit for that device) due to the fact
that all other cpus had no timers armed which expired before the 145
seconds timeframe.
The solution is simply to clear the broadcast active bit
unconditionally when a cpu switches to oneshot mode after the first
cpu switched the broadcast device over. It's not idle at that point
otherwise it would not be executing that code.
[ I fundamentally hate that broadcast crap. Why the heck thought some
folks that when going into deep idle it's a brilliant concept to
switch off the last device which brings the cpu back from that
state? ]
Thanks to Christian for providing all the valuable debug information!
Reported-and-tested-by: Christian Hoffmann <email@christianhoffmann.info>
Cc: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/%3Calpine.LFD.2.02.1105161105170.3078%40ionos%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit e05b2efb82596905ebfe88e8612ee81dec9b6592 upstream.
Christian Hoffmann reported that the command line clocksource override
with acpi_pm timer fails:
Kernel command line: <SNIP> clocksource=acpi_pm
hpet clockevent registered
Switching to clocksource hpet
Override clocksource acpi_pm is not HRT compatible.
Cannot switch while in HRT/NOHZ mode.
The watchdog code is what enables CLOCK_SOURCE_VALID_FOR_HRES, but we
actually end up selecting the clocksource before we enqueue it into
the watchdog list, so that's why we see the warning and fail to switch
to acpi_pm timer as requested. That's particularly bad when we want to
debug timekeeping related problems in early boot.
Put the selection call last.
Reported-by: Christian Hoffmann <email@christianhoffmann.info>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/%3C1304558210.2943.24.camel%40work-vm%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
Currently with 2.6.32-longterm, its possible for time() to occasionally
return values one second earlier then the previous time() call.
This happens because update_xtime_cache() does:
xtime_cache = xtime;
timespec_add_ns(&xtime_cache, nsec);
Its possible that xtime is 1sec,999msecs, and nsecs is 1ms, resulting in
a xtime_cache that is 2sec,0ms.
get_seconds() (which is used by sys_time()) does not take the
xtime_lock, which is ok as the xtime.tv_sec value is a long and can be
atomically read safely.
The problem occurs the next call to update_xtime_cache() if xtime has
not increased:
/* This sets xtime_cache back to 1sec, 999msec */
xtime_cache = xtime;
/* get_seconds, calls here, and sees a 1second inconsistency */
timespec_add_ns(&xtime_cache, nsec);
In order to resolve this, we could add locking to get_seconds(), but it
needs to be lock free, as it is called from the machine check handler,
opening a possible deadlock.
So instead, this patch introduces an intermediate value for the
calculations, so that we only assign xtime_cache once with the correct
time, using ACCESS_ONCE to make sure the compiler doesn't optimize out
any intermediate values.
The xtime_cache manipulations were removed with 2.6.35, so that kernel
and later do not need this change.
In 2.6.33 and 2.6.34 the logarithmic accumulation should make it so
xtime is updated each tick, so it is unlikely that two updates to
xtime_cache could occur while the difference between xtime and
xtime_cache crosses the second boundary. However, the paranoid might
want to pull this into 2.6.33/34-longterm just to be sure.
Thanks to Stephen for helping finally narrow down the root cause and
many hours of help with testing and validation. Also thanks to Max,
Andi, Eric and Paul for review of earlier attempts and helping clarify
what is possible with regard to out of order execution.
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit c78193e9c7bcbf25b8237ad0dec82f805c4ea69b upstream.
next_pidmap() just quietly accepted whatever 'last' pid that was passed
in, which is not all that safe when one of the users is /proc.
Admittedly the proc code should do some sanity checking on the range
(and that will be the next commit), but that doesn't mean that the
helper functions should just do that pidmap pointer arithmetic without
checking the range of its arguments.
So clamp 'last' to PID_MAX_LIMIT. The fact that we then do "last+1"
doesn't really matter, the for-loop does check against the end of the
pidmap array properly (it's only the actual pointer arithmetic overflow
case we need to worry about, and going one bit beyond isn't going to
overflow).
[ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ]
Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com>
Analyzed-by: Robert Święcki <robert@swiecki.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit fb82c0ff27b2c40c6f7a3d1a94cafb154591fa80 upstream.
The gdbserial protocol handler should return an empty packet instead
of an error string when ever it responds to a command it does not
implement.
The problem cases come from a debugger client sending
qTBuffer, qTStatus, qSearch, qSupported.
The incorrect response from the gdbstub leads the debugger clients to
not function correctly. Recent versions of gdb will not detach correctly as a result of this behavior.
Backport-request-by: Frank Pan <frankpzh@gmail.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 243b422af9ea9af4ead07a8ad54c90d4f9b6081a upstream.
Commit da48524eb206 ("Prevent rt_sigqueueinfo and rt_tgsigqueueinfo
from spoofing the signal code") made the check on si_code too strict.
There are several legitimate places where glibc wants to queue a
negative si_code different from SI_QUEUE:
- This was first noticed with glibc's aio implementation, which wants
to queue a signal with si_code SI_ASYNCIO; the current kernel
causes glibc's tst-aio4 test to fail because rt_sigqueueinfo()
fails with EPERM.
- Further examination of the glibc source shows that getaddrinfo_a()
wants to use SI_ASYNCNL (which the kernel does not even define).
The timer_create() fallback code wants to queue signals with SI_TIMER.
As suggested by Oleg Nesterov <oleg@redhat.com>, loosen the check to
forbid only the problematic SI_TKILL case.
Reported-by: Klaus Dittrich <kladit@arcor.de>
Acked-by: Julien Tinnes <jln@google.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 880f57318450dbead6a03f9e31a1468924d6dd88 upstream.
The maximum kilobytes of locked memory that an unprivileged user
can reserve is of 512 kB = 128 pages by default, scaled to the
number of onlined CPUs, which fits well with the tools that use
128 data pages by default.
However tools actually use 129 pages, because they need one more
for the user control page. Thus the default mlock threshold is
not sufficient for the default tools needs and we always end up
to evaluate the constant mlock rlimit policy, which doesn't have
this scaling with the number of online CPUs.
Hence, on systems that have more than 16 CPUs, we overlap the
rlimit threshold and fail to mmap:
$ perf record ls
Error: failed to mmap with 1 (Operation not permitted)
Just increase the max unprivileged mlock threshold by one page
so that it supports well perf tools even after 16 CPUs.
Reported-by: Han Pingtian <phan@redhat.com>
Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
LKML-Reference: <1300904979-5508-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit da48524eb20662618854bb3df2db01fc65f3070c upstream.
Userland should be able to trust the pid and uid of the sender of a
signal if the si_code is SI_TKILL.
Unfortunately, the kernel has historically allowed sigqueueinfo() to
send any si_code at all (as long as it was negative - to distinguish it
from kernel-generated signals like SIGILL etc), so it could spoof a
SI_TKILL with incorrect siginfo values.
Happily, it looks like glibc has always set si_code to the appropriate
SI_QUEUE, so there are probably no actual user code that ever uses
anything but the appropriate SI_QUEUE flag.
So just tighten the check for si_code (we used to allow any negative
value), and add a (one-time) warning in case there are binaries out
there that might depend on using other si_code values.
Signed-off-by: Julien Tinnes <jln@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
This reverts commit 6f197b73304b3bd3d5a43b931383a5331d6b2987, which was
originally commit a0f7d0f7fc02465bb9758501f611f63381792996 upstream.
This breaks the build, thanks to Jiri Slaby for pointing this out.
Reported-by: Jiri Slaby <jslaby@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 723aae25d5cdb09962901d36d526b44d4be1051c upstream.
Mike Galbraith reported finding a lockup ("perma-spin bug") where the
cpumask passed to smp_call_function_many was cleared by other cpu(s)
while a cpu was preparing its call_data block, resulting in no cpu to
clear the last ref and unlock the block.
Having cpus clear their bit asynchronously could be useful on a mask of
cpus that might have a translation context, or cpus that need a push to
complete an rcu window.
Instead of adding a BUG_ON and requiring yet another cpumask copy, just
detect the race and handle it.
Note: arch_send_call_function_ipi_mask must still handle an empty
cpumask because the data block is globally visible before the that arch
callback is made. And (obviously) there are no guarantees to which cpus
are notified if the mask is changed during the call; only cpus that were
online and had their mask bit set during the whole call are guaranteed
to be called.
Reported-by: Mike Galbraith <efault@gmx.de>
Reported-by: Jan Beulich <JBeulich@novell.com>
Acked-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 45a5791920ae643eafc02e2eedef1a58e341b736 upstream.
Paul McKenney's review pointed out two problems with the barriers in the
2.6.38 update to the smp call function many code.
First, a barrier that would force the func and info members of data to
be visible before their consumption in the interrupt handler was
missing. This can be solved by adding a smp_wmb between setting the
func and info members and setting setting the cpumask; this will pair
with the existing and required smp_rmb ordering the cpumask read before
the read of refs. This placement avoids the need a second smp_rmb in
the interrupt handler which would be executed on each of the N cpus
executing the call request. (I was thinking this barrier was present
but was not).
Second, the previous write to refs (establishing the zero that we the
interrupt handler was testing from all cpus) was performed by a third
party cpu. This would invoke transitivity which, as a recient or
concurrent addition to memory-barriers.txt now explicitly states, would
require a full smp_mb().
However, we know the cpumask will only be set by one cpu (the data
owner) and any preivous iteration of the mask would have cleared by the
reading cpu. By redundantly writing refs to 0 on the owning cpu before
the smp_wmb, the write to refs will follow the same path as the writes
that set the cpumask, which in turn allows us to keep the barrier in the
interrupt handler a smp_rmb instead of promoting it to a smp_mb (which
will be be executed by N cpus for each of the possible M elements on the
list).
I moved and expanded the comment about our (ab)use of the rcu list
primitives for the concurrent walk earlier into this function. I
considered moving the first two paragraphs to the queue list head and
lock, but felt it would have been too disconected from the code.
Cc: Paul McKinney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit e6cd1e07a185d5f9b0aa75e020df02d3c1c44940 upstream.
Peter pointed out there was nothing preventing the list_del_rcu in
smp_call_function_interrupt from running before the list_add_rcu in
smp_call_function_many.
Fix this by not setting refs until we have gotten the lock for the list.
Take advantage of the wmb in list_add_rcu to save an explicit additional
one.
I tried to force this race with a udelay before the lock & list_add and
by mixing all 64 online cpus with just 3 random cpus in the mask, but
was unsuccessful. Still, inspection shows a valid race, and the fix is
a extension of the existing protection window in the current code.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit a0f7d0f7fc02465bb9758501f611f63381792996 upstream.
We toggle the state from start and stop callbacks but actually
don't check it when the event triggers. Do it so that
these callbacks actually work.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1299529629-18280-2-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo |