Age | Commit message (Collapse) | Author |
|
commit 6575820221f7a4dd6eadecf7bf83cdd154335eda upstream.
Currently, all workqueue cpu hotplug operations run off
CPU_PRI_WORKQUEUE which is higher than normal notifiers. This is to
ensure that workqueue is up and running while bringing up a CPU before
other notifiers try to use workqueue on the CPU.
Per-cpu workqueues are supposed to remain working and bound to the CPU
for normal CPU_DOWN_PREPARE notifiers. This holds mostly true even
with workqueue offlining running with higher priority because
workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which
runs the per-cpu workqueue without concurrency management without
explicitly detaching the existing workers.
However, if the trustee needs to create new workers, it creates
unbound workers which may wander off to other CPUs while
CPU_DOWN_PREPARE notifiers are in progress. Furthermore, if the CPU
down is cancelled, the per-CPU workqueue may end up with workers which
aren't bound to the CPU.
While reliably reproducible with a convoluted artificial test-case
involving scheduling and flushing CPU burning work items from CPU down
notifiers, this isn't very likely to happen in the wild, and, even
when it happens, the effects are likely to be hidden by the following
successful CPU down.
Fix it by using different priorities for up and down notifiers - high
priority for up operations and low priority for down operations.
Workqueue cpu hotplug operations will soon go through further cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cc9a6c8776615f9c194ccf0b63a0aa5628235545 upstream.
Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
expensive and severely impacted page allocator performance. This
is part of a series of patches that reduce page allocator overhead.
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
3.3.0-rc3 3.3.0-rc3
rc3-vanilla nobarrier-v2r1
Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%)
Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%)
Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%)
Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%)
Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%)
Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%)
Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%)
Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%)
Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%)
Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%)
Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%)
Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%)
Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%)
Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%)
Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds) 135.68 132.17
User+Sys Time Running Test (seconds) 164.2 160.13
Total Elapsed Time (seconds) 123.46 120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a6bc32b899223a877f595ef9ddc1e89ead5072b8 upstream.
Stable note: Not tracked in Buzilla. This was part of a series that
reduced interactivity stalls experienced when THP was enabled.
These stalls were particularly noticable when copying data
to a USB stick but the experiences for users varied a lot.
This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage. Async compaction
maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
used.
This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be a
large number of dirty pages backed by a filesystem that does not support
->writepages.
[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c82449352854ff09e43062246af86bdeb628f0c3 upstream.
Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
information by reducing LRU list churning had the side-effect of
reducing THP allocation success rates. This was part of a series
to restore the success rates while preserving the reclaim fix.
Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
noted that compaction does not migrate dirty or writeback pages and that
is was meaningless to pick the page and re-add it to the LRU list. This
had to be partially reverted because some dirty pages can be migrated by
compaction without blocking.
This patch updates "mm: compaction: make isolate_lru_page" by skipping
over pages that migration has no possibility of migrating to minimise LRU
disruption.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel<riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
within ->migratepage
commit b969c4ab9f182a6e1b2a0848be349f99714947b0 upstream.
Stable note: Not tracked in Bugzilla. A fix aimed at preserving page
aging information by reducing LRU list churning had the side-effect
of reducing THP allocation success rates. This was part of a series
to restore the success rates while preserving the reclaim fix.
Asynchronous compaction is used when allocating transparent hugepages to
avoid blocking for long periods of time. Due to reports of stalling,
there was a debate on disabling synchronous compaction but this severely
impacted allocation success rates. Part of the reason was that many dirty
pages are skipped in asynchronous compaction by the following check;
if (PageDirty(page) && !sync &&
mapping->a_ops->migratepage != migrate_page)
rc = -EBUSY;
This skips over all mapping aops using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking. This
patch updates the ->migratepage callback with a "sync" parameter. It is
the responsibility of the callback to fail gracefully if migration would
block.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f80c0673610e36ae29d63e3297175e22f70dde5f upstream.
Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU list
leading to poor reclaim decisions which has a variable
performance impact.
In __zone_reclaim case, we don't want to shrink mapped page. Nonetheless,
we have isolated mapped page and re-add it into LRU's head. It's
unnecessary CPU overhead and makes LRU churning.
Of course, when we isolate the page, the page might be mapped but when we
try to migrate the page, the page would be not mapped. So it could be
migrated. But race is rare and although it happens, it's no big deal.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 39deaf8585152f1a35c1676d3d7dc6ae0fb65967 upstream.
Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU
list leading to poor reclaim decisions which has a variable
performance impact.
In async mode, compaction doesn't migrate dirty or writeback pages. So,
it's meaningless to pick the page and re-add it to lru list.
Of course, when we isolate the page in compaction, the page might be dirty
or writeback but when we try to migrate the page, the page would be not
dirty, writeback. So it could be migrated. But it's very unlikely as
isolate and migration cycle is much faster than writeout.
So, this patch helps cpu overhead and prevent unnecessary LRU churning.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4356f21d09283dc6d39a6f7287a65ddab61e2808 upstream.
Stable note: Not tracked in Bugzilla. This patch makes later patches
easier to apply but has no other impact.
Change ISOLATE_XXX macro with bitwise isolate_mode_t type. Normally,
macro isn't recommended as it's type-unsafe and making debugging harder as
symbol cannot be passed throught to the debugger.
Quote from Johannes
" Hmm, it would probably be cleaner to fully convert the isolation mode
into independent flags. INACTIVE, ACTIVE, BOTH is currently a
tri-state among flags, which is a bit ugly."
This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 095760730c1047c69159ce88021a7fa3833502c8 upstream.
Stable note: This patch makes later patches easier to apply but otherwise
has little to justify it. It is a diagnostic patch that was part
of a series addressing excessive slab shrinking after GFP_NOFS
failures. There is detailed information on the series' motivation
at https://lkml.org/lkml/2011/6/2/42 .
It is impossible to understand what the shrinkers are actually doing
without instrumenting the code, so add a some tracepoints to allow
insight to be gained.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
This is a backport of f6c06abfb3972ad4914cef57d8348fcb2932bc3b
To finally fix the infamous leap second issue and other race windows
caused by functions which change the offsets between the various time
bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a
function which atomically gets the current monotonic time and updates
the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic
overhead. The previous patch which provides ktime_t offsets allows us
to make this function almost as cheap as ktime_get() which is going to
be replaced in hrtimer_interrupt().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/1341960205-56738-7-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This is a backport of f55a6faa384304c89cfef162768e88374d3312cb
clock_was_set() cannot be called from hard interrupt context because
it calls on_each_cpu().
For fixing the widely reported leap seconds issue it is necessary to
call it from hard interrupt context, i.e. the timer tick code, which
does the timekeeping updates.
Provide a new function which denotes it in the hrtimer cpu base
structure of the cpu on which it is called and raise the hrtimer
softirq. We then execute the clock_was_set() notificiation from
softirq context in run_hrtimer_softirq(). The hrtimer softirq is
rarely used, so polling the flag there is not a performance issue.
[ tglx: Made it depend on CONFIG_HIGH_RES_TIMERS. We really should get
rid of all this ifdeffery ASAP ]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Reported-by: Jan Engelhardt <jengelh@inai.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1341960205-56738-2-git-send-email-johnstul@us.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This is a backport of 6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
This should have been backported when it was commited, but I
mistook the problem as requiring the ntp_lock changes
that landed in 3.4 in order for it to occur.
Unfortunately the same issue can happen (with only one cpu)
as follows:
do_adjtimex()
write_seqlock_irq(&xtime_lock);
process_adjtimex_modes()
process_adj_status()
ntp_start_leap_timer()
hrtimer_start()
hrtimer_reprogram()
tick_program_event()
clockevents_program_event()
ktime_get()
seq = req_seqbegin(xtime_lock); [DEADLOCK]
This deadlock will no always occur, as it requires the
leap_timer to force a hrtimer_reprogram which only happens
if its set and there's no sooner timer to expire.
NOTE: This patch, being faithful to the original commit,
introduces a bug (we don't update wall_to_monotonic),
which will be resovled by backporting a following fix.
Original commit message below:
Since commit 7dffa3c673fbcf835cd7be80bb4aec8ad3f51168 the ntp
subsystem has used an hrtimer for triggering the leapsecond
adjustment. However, this can cause a potential livelock.
Thomas diagnosed this as the following pattern:
CPU 0 CPU 1
do_adjtimex()
spin_lock_irq(&ntp_lock);
process_adjtimex_modes(); timer_interrupt()
process_adj_status(); do_timer()
ntp_start_leap_timer(); write_lock(&xtime_lock);
hrtimer_start(); update_wall_time();
hrtimer_reprogram(); ntp_tick_length()
tick_program_event() spin_lock(&ntp_lock);
clockevents_program_event()
ktime_get()
seq = req_seqbegin(xtime_lock);
This patch tries to avoid the problem by reverting back to not using
an hrtimer to inject leapseconds, and instead we handle the leapsecond
processing in the second_overflow() function.
The downside to this change is that on systems that support highres
timers, the leap second processing will occur on a HZ tick boundary,
(ie: ~1-10ms, depending on HZ) after the leap second instead of
possibly sooner (~34us in my tests w/ x86_64 lapic).
This patch applies on top of tip/timers/core.
CC: Sasha Levin <levinsasha928@gmail.com>
CC: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Diagnoised-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d8adde17e5f858427504725218c56aef90e90fc7 upstream.
kswapd_stop() is called to destroy the kswapd work thread when all memory
of a NUMA node has been offlined. But kswapd_stop() only terminates the
work thread without resetting NODE_DATA(nid)->kswapd to NULL. The stale
pointer will prevent kswapd_run() from creating a new work thread when
adding memory to the memory-less NUMA node again. Eventually the stale
pointer may cause invalid memory access.
An example stack dump as below. It's reproduced with 2.6.32, but latest
kernel has the same issue.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81051a94>] exit_creds+0x12/0x78
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/memory/memory391/state
CPU 11
Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
RIP: 0010:exit_creds+0x12/0x78
RSP: 0018:ffff8806044f1d78 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
FS: 00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
Stack:
ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
Call Trace:
__put_task_struct+0x5d/0x97
kthread_stop+0x50/0x58
offline_pages+0x324/0x3da
memory_block_change_state+0x179/0x1db
store_mem_state+0x9e/0xbb
sysfs_write_file+0xd0/0x107
vfs_write+0xad/0x169
sys_write+0x45/0x6e
system_call_fastpath+0x16/0x1b
Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
RIP exit_creds+0x12/0x78
RSP <ffff8806044f1d78>
CR2: 0000000000000000
[akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dbf0e4c7257f8d684ec1a3c919853464293de66e upstream.
Quite a few ASUS computers experience a nasty problem, related to the
EHCI controllers, when going into system suspend. It was observed
that the problem didn't occur if the controllers were not put into the
D3 power state before starting the suspend, and commit
151b61284776be2d6f02d48c23c3625678960b97 (USB: EHCI: fix crash during
suspend on ASUS computers) was created to do this.
It turned out this approach messed up other computers that didn't have
the problem -- it prevented USB wakeup from working. Consequently
commit c2fb8a3fa25513de8fedb38509b1f15a5bbee47b (USB: add
NO_D3_DURING_SLEEP flag and revert 151b61284776be2) was merged; it
reverted the earlier commit and added a whitelist of known good board
names.
Now we know the actual cause of the problem. Thanks to AceLan Kao for
tracking it down.
According to him, an engineer at ASUS explained that some of their
BIOSes contain a bug that was added in an attempt to work around a
problem in early versions of Windows. When the computer goes into S3
suspend, the BIOS tries to verify that the EHCI controllers were first
quiesced by the OS. Nothing's wrong with this, but the BIOS does it
by checking that the PCI COMMAND registers contain 0 without checking
the controllers' power state. If the register isn't 0, the BIOS
assumes the controller needs to be quiesced and tries to do so. This
involves making various MMIO accesses to the controller, which don't
work very well if the controller is already in D3. The end result is
a system hang or memory corruption.
Since the value in the PCI COMMAND register doesn't matter once the
controller has been suspended, and since the value will be restored
anyway when the controller is resumed, we can work around the BIOS bug
simply by setting the register to 0 during system suspend. This patch
(as1590) does so and also reverts the second commit mentioned above,
which is now unnecessary.
In theory we could do this for every PCI device. However to avoid
introducing new problems, the patch restricts itself to EHCI host
controllers.
Finally the affected systems can suspend with USB wakeup working
properly.
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=37632
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=42728
Based-on-patch-by: AceLan Kao <acelan.kao@canonical.com>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Tested-by: Dâniel Fraga <fragabr@gmail.com>
Tested-by: Javier Marcet <jmarcet@gmail.com>
Tested-by: Andrey Rahmatullin <wrar@wrar.name>
Tested-by: Oleksij Rempel <bug-track@fisher-privat.net>
Tested-by: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 5ee31c6898ea5537fcea160999d60dc63bc0c305 ]
In the transmit path of the bonding driver, skb->cb is used to
stash the skb->queue_mapping so that the bonding device can set its
own queue mapping. This value becomes corrupted since the skb->cb is
also used in __dev_xmit_skb.
When transmitting through bonding driver, bond_select_queue is
called from dev_queue_xmit. In bond_select_queue the original
skb->queue_mapping is copied into skb->cb (via bond_queue_mapping)
and skb->queue_mapping is overwritten with the bond driver queue.
Subsequently in dev_queue_xmit, __dev_xmit_skb is called which writes
the packet length into skb->cb, thereby overwriting the stashed
queue mappping. In bond_dev_queue_xmit (called from hard_start_xmit),
the queue mapping for the skb is set to the stashed value which is now
the skb length and hence is an invalid queue for the slave device.
If we want to save skb->queue_mapping into skb->cb[], best place is to
add a field in struct qdisc_skb_cb, to make sure it wont conflict with
other layers (eg : Qdiscc, Infiniband...)
This patchs also makes sure (struct qdisc_skb_cb)->data is aligned on 8
bytes :
netem qdisc for example assumes it can store an u64 in it, without
misalignment penalty.
Note : we only have 20 bytes left in (struct qdisc_skb_cb)->data[].
The largest user is CHOKe and it fills it.
Based on a previous patch from Tom Herbert.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Tom Herbert <therbert@google.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Roland Dreier <roland@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 20e2a86485967c385d7c7befc1646e4d1d39362e ]
When NetLabel is not enabled, e.g. CONFIG_NETLABEL=n, and the system
receives a CIPSO tagged packet it is dropped (cipso_v4_validate()
returns non-zero). In most cases this is the correct and desired
behavior, however, in the case where we are simply forwarding the
traffic, e.g. acting as a network bridge, this becomes a problem.
This patch fixes the forwarding problem by providing the basic CIPSO
validation code directly in ip_options_compile() without the need for
the NetLabel or CIPSO code. The new validation code can not perform
any of the CIPSO option label/value verification that
cipso_v4_validate() does, but it can verify the basic CIPSO option
format.
The behavior when NetLabel is enabled is unchanged.
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c2fb8a3fa25513de8fedb38509b1f15a5bbee47b upstream.
This patch (as1558) fixes a problem affecting several ASUS computers:
The machine crashes or corrupts memory when going into suspend if the
ehci-hcd driver is bound to any controllers. Users have been forced
to unbind or unload ehci-hcd before putting their systems to sleep.
After extensive testing, it was determined that the machines don't
like going into suspend when any EHCI controllers are in the PCI D3
power state. Presumably this is a firmware bug, but there's nothing
we can do about it except to avoid putting the controllers in D3
during system sleep.
The patch adds a new flag to indicate whether the problem is present,
and avoids changing the controller's power state if the flag is set.
Runtime suspend is unaffected; this matters only for system suspend.
However as a side effect, the controller will not respond to remote
wakeup requests while the system is asleep. Hence USB wakeup is not
functional -- but of course, this is already true in the current state
of affairs.
A similar patch has already been applied as commit
151b61284776be2d6f02d48c23c3625678960b97 (USB: EHCI: fix crash during
suspend on ASUS computers). The patch supersedes that one and reverts
it. There are two differences:
The old patch added the flag at the USB level; this patch
adds it at the PCI level.
The old patch applied to all chipsets with the same vendor,
subsystem vendor, and product IDs; this patch makes an
exception for a known-good system (based on DMI information).
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Tested-by: Dâniel Fraga <fragabr@gmail.com>
Tested-by: Andrey Rahmatullin <wrar@wrar.name>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a2bef8ce826dd1e787fd8ad9b6e0566ba59dab43 upstream.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 617c8c11236716dcbda877e764b7bf37c6fd8063 ]
At the beginning of __skb_cow, headroom gets set to a minimum of
NET_SKB_PAD. This causes unnecessary reallocations if the buffer was not
cloned and the headroom is just below NET_SKB_PAD, but still more than the
amount requested by the caller.
This was showing up frequently in my tests on VLAN tx, where
vlan_insert_tag calls skb_cow_head(skb, VLAN_HLEN).
Locally generated packets should have enough headroom, and for forward
paths, we already have NET_SKB_PAD bytes of headroom, so we don't need to
add any extra space here.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e0268868ba064980488fc8c194db3d8e9fb2959c ]
dst_check() will take care of SA (and obsolete field), hence
IPsec rekeying scenario is taken into account.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Vlad Yaseivch <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 59b9997baba5242997ddc7bd96b1391f5275a5a4 ]
This reverts commit 8a83a00b0735190384a348156837918271034144.
It causes regressions for S390 devices, because it does an
unconditional DST drop on SKBs for vlans and the QETH device
needs the neighbour entry hung off the DST for certain things
on transmit.
Arnd can't remember exactly why he even needed this change.
Conflicts:
drivers/net/macvlan.c
net/8021q/vlan_dev.c
net/core/dev.c
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 0c1833797a5a6ec23ea9261d979aa18078720b74 ]
Since commit ad0081e43a
"ipv6: Fragment locally generated tunnel-mode IPSec6 packets as needed"
the fragment of packets is incorrect.
because tunnel mode needs IPsec headers and trailer for all fragments,
while on transport mode it is sufficient to add the headers to the
first fragment and the trailer to the last.
so modify mtu and maxfraglen base on ipsec mode and if fragment is first
or last.
with my test,it work well(every fragment's size is the mtu)
and does not trigger slow fragment path.
Changes from v1:
though optimization, mtu_prev and maxfraglen_prev can be delete.
replace xfrm mode codes with dst_entry's new frag DST_XFRM_TUNNEL.
add fuction ip6_append_data_mtu to make codes clearer.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 26c191788f18129af0eb32a358cdaea0c7479626 upstream.
When holding the mmap_sem for reading, pmd_offset_map_lock should only
run on a pmd_t that has been read atomically from the pmdp pointer,
otherwise we may read only half of it leading to this crash.
PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic"
#0 [f06a9dd8] crash_kexec at c049b5ec
#1 [f06a9e2c] oops_end at c083d1c2
#2 [f06a9e40] no_context at c0433ded
#3 [f06a9e64] bad_area_nosemaphore at c043401a
#4 [f06a9e6c] __do_page_fault at c0434493
#5 [f06a9eec] do_page_fault at c083eb45
#6 [f06a9f04] error_code (via page_fault) at c083c5d5
EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
00000000
DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0
CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
#7 [f06a9f38] _spin_lock at c083bc14
#8 [f06a9f44] sys_mincore at c0507b7d
#9 [f06a9fb0] system_call at c083becd
start len
EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f
DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00
SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033
CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286
This should be a longstanding bug affecting x86 32bit PAE without THP.
Only archs with 64bit large pmd_t and 32bit unsigned long should be
affected.
With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
would partly hide the bug when the pmd transition from none to stable,
by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
enabled a new set of problem arises by the fact could then transition
freely in any of the none, pmd_trans_huge or pmd_trans_stable states.
So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
unconditional isn't good idea and it would be a flakey solution.
This should be fully fixed by introducing a pmd_read_atomic that reads
the pmd in order with THP disabled, or by reading the pmd atomically
with cmpxchg8b with THP enabled.
Luckily this new race condition only triggers in the places that must
already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
is localized there but this bug is not related to THP.
NOTE: this can trigger on x86 32bit systems with PAE enabled with more
than 4G of ram, otherwise the high part of the pmd will never risk to be
truncated because it would be zero at all times, in turn so hiding the
SMP race.
This bug was discovered and fully debugged by Ulrich, quote:
----
[..]
pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
eax.
496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
*pmd)
497 {
498 /* depend on compiler for an atomic pmd read */
499 pmd_t pmdval = *pmd;
// edi = pmd pointer
0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi
...
// edx = PTE page table high address
0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx
...
// eax = PTE page table low address
0xc0507a8e <sys_mincore+574>: mov (%edi),%eax
[..]
Please note that the PMD is not read atomically. These are two "mov"
instructions where the high order bits of the PMD entry are fetched
first. Hence, the above machine code is prone to the following race.
- The PMD entry {high|low} is 0x0000000000000000.
The "mov" at 0xc0507a84 loads 0x00000000 into edx.
- A page fault (on another CPU) sneaks in between the two "mov"
instructions and instantiates the PMD.
- The PMD entry {high|low} is now 0x00000003fda38067.
The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
----
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit bbbc4c4d8c5face097d695f9bf3a39647ba6b7e7 upstream.
Commit 06e8935feb ("optimized SDIO IRQ handling for single irq")
introduced some spurious calls to SDIO function interrupt handlers,
such as when the SDIO IRQ thread is started, or the safety check
performed upon a system resume. Let's add a flag to perform the
optimization only when a real interrupt is signaled by the host
driver and we know there is no point confirming it.
Reported-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 080399aaaf3531f5b8761ec0ac30ff98891e8686 upstream.
Hi,
We have a bug report open where a squashfs image mounted on ppc64 would
exhibit errors due to trying to read beyond the end of the disk. It can
easily be reproduced by doing the following:
[root@ibm-p750e-02-lp3 ~]# ls -l install.img
-rw-r--r-- 1 root root 142032896 Apr 30 16:46 install.img
[root@ibm-p750e-02-lp3 ~]# mount -o loop ./install.img /mnt/test
[root@ibm-p750e-02-lp3 ~]# dd if=/dev/loop0 of=/dev/null
dd: reading `/dev/loop0': Input/output error
277376+0 records in
277376+0 records out
142016512 bytes (142 MB) copied, 0.9465 s, 150 MB/s
In dmesg, you'll find the following:
squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 43.106012] attempt to access beyond end of device
[ 43.106029] loop0: rw=0, want=277410, limit=277408
[ 43.106039] Buffer I/O error on device loop0, logical block 138704
[ 43.106053] attempt to access beyond end of device
[ 43.106057] loop0: rw=0, want=277412, limit=277408
[ 43.106061] Buffer I/O error on device loop0, logical block 138705
[ 43.106066] attempt to access beyond end of device
[ 43.106070] loop0: rw=0, want=277414, limit=277408
[ 43.106073] Buffer I/O error on device loop0, logical block 138706
[ 43.106078] attempt to access beyond end of device
[ 43.106081] loop0: rw=0, want=277416, limit=277408
[ 43.106085] Buffer I/O error on device loop0, logical block 138707
[ 43.106089] attempt to access beyond end of device
[ 43.106093] loop0: rw=0, want=277418, limit=277408
[ 43.106096] Buffer I/O error on device loop0, logical block 138708
[ 43.106101] attempt to access beyond end of device
[ 43.106104] loop0: rw=0, want=277420, limit=277408
[ 43.106108] Buffer I/O error on device loop0, logical block 138709
[ 43.106112] attempt to access beyond end of device
[ 43.106116] loop0: rw=0, want=277422, limit=277408
[ 43.106120] Buffer I/O error on device loop0, logical block 138710
[ 43.106124] attempt to access beyond end of device
[ 43.106128] loop0: rw=0, want=277424, limit=277408
[ 43.106131] Buffer I/O error on device loop0, logical block 138711
[ 43.106135] attempt to access beyond end of device
[ 43.106139] loop0: rw=0, want=277426, limit=277408
[ 43.106143] Buffer I/O error on device loop0, logical block 138712
[ 43.106147] attempt to access beyond end of device
[ 43.106151] loop0: rw=0, want=277428, limit=277408
[ 43.106154] Buffer I/O error on device loop0, logical block 138713
[ 43.106158] attempt to access beyond end of device
[ 43.106162] loop0: rw=0, want=277430, limit=277408
[ 43.106166] attempt to access beyond end of device
[ 43.106169] loop0: rw=0, want=277432, limit=277408
...
[ 43.106307] attempt to access beyond end of device
[ 43.106311] loop0: rw=0, want=277470, limit=2774
Squashfs manages to read in the end block(s) of the disk during the
mount operation. Then, when dd reads the block device, it leads to
block_read_full_page being called with buffers that are beyond end of
disk, but are marked as mapped. Thus, it would end up submitting read
I/O against them, resulting in the errors mentioned above. I fixed the
problem by modifying init_page_buffers to only set the buffer mapped if
it fell inside of i_size.
Cheers,
Jeff
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Nick Piggin <npiggin@kernel.dk>
--
Changes from v1->v2: re-used max_block, as suggested by Nick Piggin.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 05c69d298c96703741cac9a5cbbf6c53bd55a6e2 upstream.
6d1d8050b4bc8 "block, partition: add partition_meta_info to hd_struct"
added part_unpack_uuid() which assumes that the passed in buffer has
enough space for sprintfing "%pU" - 37 characters including '\0'.
Unfortunately, b5af921ec0233 "init: add support for root devices
specified by partition UUID" supplied 33 bytes buffer to the function
leading to the following panic with stackprotector enabled.
Kernel panic - not syncing: stack-protector: Kernel stack corrupted in: ffffffff81b14c7e
[<ffffffff815e226b>] panic+0xba/0x1c6
[<ffffffff81b14c7e>] ? printk_all_partitions+0x259/0x26xb
[<ffffffff810566bb>] __stack_chk_fail+0x1b/0x20
[<ffffffff81b15c7e>] printk_all_paritions+0x259/0x26xb
[<ffffffff81aedfe0>] mount_block_root+0x1bc/0x27f
[<ffffffff81aee0fa>] mount_root+0x57/0x5b
[<ffffffff81aee23b>] prepare_namespace+0x13d/0x176
[<ffffffff8107eec0>] ? release_tgcred.isra.4+0x330/0x30
[<ffffffff81aedd60>] kernel_init+0x155/0x15a
[<ffffffff81087b97>] ? schedule_tail+0x27/0xb0
[<ffffffff815f4d24>] kernel_thread_helper+0x5/0x10
[<ffffffff81aedc0b>] ? start_kernel+0x3c5/0x3c5
[<ffffffff815f4d20>] ? gs_change+0x13/0x13
Increase the buffer size, remove the dangerous part_unpack_uuid() and
use snprintf() directly from printk_all_partitions().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Szymon Gruszczynski <sz.gruszczynski@googlemail.com>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5b6e9bcdeb65634b4ad604eb4536404bbfc62cfa upstream.
Commit 4231d47e6fe69f061f96c98c30eaf9fb4c14b96d(net/usbnet: avoid
recursive locking in usbnet_stop()) fixes the recursive locking
problem by releasing the skb queue lock before unlink, but may
cause skb traversing races:
- after URB is unlinked and the queue lock is released,
the refered skb and skb->next may be moved to done queue,
even be released
- in skb_queue_walk_safe, the next skb is still obtained
by next pointer of the last skb
- so maybe trigger oops or other problems
This patch extends the usage of entry->state to describe 'start_unlink'
state, so always holding the queue(rx/tx) lock to change the state if
the referd skb is in rx or tx queue because we need to know if the
refered urb has been started unlinking in unlink_urbs.
The other part of this patch is based on Huajun's patch:
always traverse from head of the tx/rx queue to get skb which is
to be unlinked but not been started unlinking.
Signed-off-by: Huajun Li <huajun.li.lee@gmail.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Cc: Oliver Neukum <oneukum@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2f624278626677bfaf73fef97f86b37981621f5c upstream.
We really need to use a ACCESS_ONCE() on the sequence value read in
__read_seqcount_begin(), because otherwise the compiler might end up
reloading the value in between the test and the return of it. As a
result, it might end up returning an odd value (which means that a write
is in progress).
If the reader is then fast enough that that odd value is still the
current one when the read_seqcount_retry() is done, we might end up with
a "successful" read sequence, even despite the concurrent write being
active.
In practice this probably never really happens - there just isn't
anything else going on around the read of the sequence count, and the
common case is that we end up having a read barrier immediately
afterwards.
So the code sequence in which gcc might decide to reaload from memory is
small, and there's no reason to believe it would ever actually do the
reload. But if the compiler ever were to decide to do so, it would be
incredibly annoying to debug. Let's just make sure.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f5c2347ee20a8d6964d6a6b1ad04f200f8d4dfa7 upstream.
<asm-generic/statfs.h> is exported to userspace, so using
BITS_PER_LONG is invalid. We need to use __BITS_PER_LONG instead.
This is kernel bugzilla 43165.
Reported-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1335465916-16965-1-git-send-email-hpa@linux.intel.com
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 41b3254c93acc56adc3c4477fef7c9512d47659e upstream.
More recent versions of the UEFI spec have added new attributes for
variables. Add them.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9883035ae7edef3ec62ad215611cb8e17d6a1a5d upstream.
The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.
When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own. The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).
End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time. You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.
NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops. Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.
The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes). But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with an EINVAL on kernels that do not support this interface.
Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Ian Kent <raven@themaw.net>
Cc: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 151b61284776be2d6f02d48c23c3625678960b97 upstream.
This patch (as1545) fixes a problem affecting several ASUS computers:
The machine crashes or corrupts memory when going into suspend if the
ehci-hcd driver is bound to any controllers. Users have been forced
to unbind or unload ehci-hcd before putting their systems to sleep.
After extensive testing, it was determined that the machines don't
like going into suspend when any EHCI controllers are in the PCI D3
power state. Presumably this is a firmware bug, but there's nothing
we can do about it except to avoid putting the controllers in D3
during system sleep.
The patch adds a new flag to indicate whether the problem is present,
and avoids changing the controller's power state if the flag is set.
Runtime suspend is unaffected; this matters only for system suspend.
However as a side effect, the controller will not respond to remote
wakeup requests while the system is asleep. Hence USB wakeup is not
functional -- but of course, this is already true in the current state
of affairs.
This fixes Bugzilla #42728.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Andrey Rahmatullin <wrar@wrar.name>
Tested-by: Oleksij Rempel (fishor) <bug-track@fisher-privat.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 32f6daad4651a748a58a3ab6da0611862175722f upstream.
We've been adding new mappings, but not destroying old mappings.
This can lead to a page leak as pages are pinned using
get_user_pages, but only unpinned with put_page if they still
exist in the memslots list on vm shutdown. A memslot that is
destroyed while an iommu domain is enabled for the guest will
therefore result in an elevated page reference count that is
never cleared.
Additionally, without this fix, the iommu is only programmed
with the first translation for a gpa. This can result in
peer-to-peer errors if a mapping is destroyed and replaced by a
new mapping at the same gpa as the iommu will still be pointing
to the original, pinned memory address.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ This combines upstream commit
2f53384424251c06038ae612e56231b96ab610ee and the follow-on bug fix
commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 ]
vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a
time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096)
The call to tcp_push() at the end of do_tcp_sendpages() forces an
immediate xmit when pipe is not already filled, and tso_fragment() try
to split these skb to MSS multiples.
4096 bytes are usually split in a skb with 2 MSS, and a remaining
sub-mss skb (assuming MTU=1500)
This makes slow start suboptimal because many small frames are sent to
qdisc/driver layers instead of big ones (constrained by cwnd and packets
in flight of course)
In fact, applications using sendmsg() (adding an additional memory copy)
instead of vmsplice()/splice()/sendfile() are a bit faster because of
this anomaly, especially if serving small files in environments with
large initial [c]wnd.
Call tcp_push() only if MSG_MORE is not set in the flags parameter.
This bit is automatically provided by splice() internals but for the
last page, or on all pages if user specified SPLICE_F_MORE splice()
flag.
In some workloads, this can reduce number of sent logical packets by an
order of magnitude, making zero-copy TCP actually faster than
one-copy :)
Reported-by: Tom Herbert <therbert@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 94324962066231a938564bebad0f941cd2d06bb2 upstream.
Make sure hci_dev_open returns immediately if hci_dev_unregister has
been called.
This fixes a race between hci_dev_open and hci_dev_unregister which can
lead to a NULL-pointer dereference.
Bug is 100% reproducible using hciattach and a disconnected serial port:
0. # hciattach -n /dev/ttyO1 any noflow
1. hci_dev_open called from hci_power_on grabs req lock
2. hci_init_req executes but device fails to initialise (times out
eventually)
3. hci_dev_open is called from hci_sock_ioctl and sleeps on req lock
4. hci_uart_tty_close calls hci_dev_unregister and sleeps on req lock in
hci_dev_do_close
5. hci_dev_open (1) releases req lock
6. hci_dev_do_close grabs req lock and returns as device is not up
7. hci_dev_unregister sleeps in destroy_workqueue
8. hci_dev_open (3) grabs req lock, calls hci_init_req and eventually sleeps
9. hci_dev_unregister finishes, while hci_dev_open is still running...
[ 79.627136] INFO: trying to register non-static key.
[ 79.632354] the code is fine but needs lockdep annotation.
[ 79.638122] turning off the locking correctness validator.
[ 79.643920] [<c00188bc>] (unwind_backtrace+0x0/0xf8) from [<c00729c4>] (__lock_acquire+0x1590/0x1ab0)
[ 79.653594] [<c00729c4>] (__lock_acquire+0x1590/0x1ab0) from [<c00733f8>] (lock_acquire+0x9c/0x128)
[ 79.663085] [<c00733f8>] (lock_acquire+0x9c/0x128) from [<c0040a88>] (run_timer_softirq+0x150/0x3ac)
[ 79.672668] [<c0040a88>] (run_timer_softirq+0x150/0x3ac) from [<c003a3b8>] (__do_softirq+0xd4/0x22c)
[ 79.682281] [<c003a3b8>] (__do_softirq+0xd4/0x22c) from [<c003a924>] (irq_exit+0x8c/0x94)
[ 79.690856] [<c003a924>] (irq_exit+0x8c/0x94) from [<c0013a50>] (handle_IRQ+0x34/0x84)
[ 79.699157] [<c0013a50>] (handle_IRQ+0x34/0x84) from [<c0008530>] (omap3_intc_handle_irq+0x48/0x4c)
[ 79.708648] [<c0008530>] (omap3_intc_handle_irq+0x48/0x4c) from [<c037499c>] (__irq_usr+0x3c/0x60)
[ 79.718048] Exception stack(0xcf281fb0 to 0xcf281ff8)
[ 79.723358] 1fa0: 0001e6a0 be8dab00 0001e698 00036698
[ 79.731933] 1fc0: 0002df98 0002df38 0000001f 00000000 b6f234d0 00000000 00000004 00000000
[ 79.740509] 1fe0: 0001e6f8 be8d6aa0 be8dac50 0000aab8 80000010 ffffffff
[ 79.747497] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[ 79.756011] pgd = cf3b4000
[ 79.758850] [00000000] *pgd=8f0c7831, *pte=00000000, *ppte=00000000
[ 79.765502] Internal error: Oops: 80000007 [#1]
[ 79.770294] Modules linked in:
[ 79.773529] CPU: 0 Tainted: G W (3.3.0-rc6-00002-gb5d5c87 #421)
[ 79.781066] PC is at 0x0
[ 79.783721] LR is at run_timer_softirq+0x16c/0x3ac
[ 79.788787] pc : [<00000000>] lr : [<c0040aa4>] psr: 60000113
[ 79.788787] sp : cf281ee0 ip : 00000000 fp : cf280000
[ 79.800903] r10: 00000004 r9 : 00000100 r8 : b6f234d0
[ 79.806427] r7 : c0519c28 r6 : cf093488 r5 : c0561a00 r4 : 00000000
[ 79.813323] r3 : 00000000 r2 : c054eee0 r1 : 00000001 r0 : 00000000
[ 79.820190] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 79.827728] Control: 10c5387d Table: 8f3b4019 DAC: 00000015
[ 79.833801] Process gpsd (pid: 1265, stack limit = 0xcf2802e8)
[ 79.839965] Stack: (0xcf281ee0 to 0xcf282000)
[ 79.844573] 1ee0: 00000002 00000000 c0040a24 00000000 00000002 cf281f08 00200200 00000000
[ 79.853210] 1f00: 00000000 cf281f18 cf281f08 00000000 00000000 00000000 cf281f18 cf281f18
[ 79.861816] 1f20: 00000000 00000001 c056184c 00000000 00000001 b6f234d0 c0561848 00000004
[ 79.870452] 1f40: cf280000 c003a3b8 c051e79c 00000001 00000000 00000100 3fa9e7b8 0000000a
[ 79.879089] 1f60: 00000025 cf280000 00000025 00000000 00000000 b6f234d0 00000000 00000004
[ 79.887756] 1f80: 00000000 c003a924 c053ad38 c0013a50 fa200000 cf281fb0 ffffffff c0008530
[ 79.896362] 1fa0: 0001e6a0 0000aab8 80000010 c037499c 0001e6a0 be8dab00 0001e698 00036698
[ 79.904998] 1fc0: 0002df98 0002df38 0000001f 00000000 b6f234d0 00000000 00000004 00000000
[ 79.913665] 1fe0: 0001e6f8 be8d6aa0 be8dac50 0000aab8 80000010 ffffffff 00fbf700 04ffff00
[ 79.922302] [<c0040aa4>] (run_timer_softirq+0x16c/0x3ac) from [<c003a3b8>] (__do_softirq+0xd4/0x22c)
[ 79.931945] [<c003a3b8>] (__do_softirq+0xd4/0x22c) from [<c003a924>] (irq_exit+0x8c/0x94)
[ 79.940582] [<c003a924>] (irq_exit+0x8c/0x94) from [<c0013a50>] (handle_IRQ+0x34/0x84)
[ 79.948913] [<c0013a50>] (handle_IRQ+0x34/0x84) from [<c0008530>] (omap3_intc_handle_irq+0x48/0x4c)
[ 79.958404] [<c0008530>] (omap3_intc_handle_irq+0x48/0x4c) from [<c037499c>] (__irq_usr+0x3c/0x60)
[ 79.967773] Exception stack(0xcf281fb0 to 0xcf281ff8)
[ 79.973083] 1fa0: 0001e6a0 be8dab00 0001e698 00036698
[ 79.981658] 1fc0: 0002df98 0002df38 0000001f 00000000 b6f234d0 00000000 00000004 00000000
[ 79.990234] 1fe0: 0001e6f8 be8d6aa0 be8dac50 0000aab8 80000010 ffffffff
[ 79.997161] Code: bad PC value
[ 80.000396] ---[ end trace 6f6739840475f9ee ]---
[ 80.005279] Kernel panic - not syncing: Fatal exception in interrupt
Signed-off-by: Johan Hovold <jhovold@gmail.com>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9993bc635d01a6ee7f6b833b4ee65ce7c06350b1 upstream.
When a machine boots up, the TSC generally gets reset. However,
when kexec is used to boot into a kernel, the TSC value would be
carried over from the previous kernel. The computation of
cycns_offset in set_cyc2ns_scale is prone to an overflow, if the
machine has been up more than 208 days prior to the kexec. The
overflow happens when we multiply *scale, even though there is
enough room to store the final answer.
We fix this issue by decomposing tsc_now into the quotient and
remainder of division by CYC2NS_SCALE_FACTOR and then performing
the multiplication separately on the two components.
Refactor code to share the calculation with the previous
fix in __cycles_2_ns().
Signed-off-by: Salman Qazi <sqazi@google.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: john stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/20120310004027.19291.88460.stgit@dungbeetle.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3751d3e85cf693e10e2c47c03c8caa65e171099b upstream.
There has long been a limitation using software breakpoints with a
kernel compiled with CONFIG_DEBUG_RODATA going back to 2.6.26. For
this particular patch, it will apply cleanly and has been tested all
the way back to 2.6.36.
The kprobes code uses the text_poke() function which accommodates
writing a breakpoint into a read-only page. The x86 kgdb code can
solve the problem similarly by overriding the default breakpoint
set/remove routines and using text_poke() directly.
The x86 kgdb code will first attempt to use the traditional
probe_kernel_write(), and next try using a the text_poke() function.
The break point install method is tracked such that the correct break
point removal routine will get called later on.
Cc: x86@kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Inspried-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 98b54aa1a2241b59372468bd1e9c2d207bdba54b upstream.
There is extra state information that needs to be exposed in the
kgdb_bpt structure for tracking how a breakpoint was installed. The
debug_core only uses the the probe_kernel_write() to install
breakpoints, but this is not enough for all the archs. Some arch such
as x86 need to use text_poke() in order to install a breakpoint into a
read only page.
Passing the kgdb_bpt structure to kgdb_arch_set_breakpoint() and
kgdb_arch_remove_breakpoint() allows other archs to set the type
variable which indicates how the breakpoint was installed.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1631fcea8399da5e80a80084b3b8c5bfd99d21e7 upstream.
<asm-generic/unistd.h> was set up to use sys_sendfile() for the 32-bit
compat API instead of sys_sendfile64(), but in fact the right thing to
do is to use sys_sendfile64() in all cases. The 32-bit sendfile64() API
in glibc uses the sendfile64 syscall, so it has to be capable of doing
full 64-bit operations. But the sys_sendfile() kernel implementation
has a MAX_NON_LFS test in it which explicitly limits the offset to 2^32.
So, we need to use the sys_sendfile64() implementation in the kernel
for this case.
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ce880cb860f36694d2cdebfac9e6ae18176fe4c4 upstream.
The USB graphics card driver delays the unregistering of the framebuffer
device to a workqueue, which breaks the userspace visible remove uevent
sequence. Recent userspace tools started to support USB graphics card
hotplug out-of-the-box and rely on proper events sent by the kernel.
The framebuffer device is a direct child of the USB interface which is
removed immediately after the USB .disconnect() callback. But the fb device
in /sys stays around until its final cleanup, at a time where all the parent
devices have been removed already.
To work around that, we remove the sysfs fb device directly in the USB
.disconnect() callback and leave only the cleanup of the internal fb
data to the delayed work.
Before:
add /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2 (usb)
add /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0 (usb)
add /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0/graphics/fb0 (graphics)
remove /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0 (usb)
remove /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2 (usb)
remove /2-1.2:1.0/graphics/fb0 (graphics)
After:
add /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2 (usb)
add /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0 (usb)
add /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0/graphics/fb1 (graphics)
remove /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0/graphics/fb1 (graphics)
remove /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0 (usb)
remove /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2 (usb)
Tested-by: Bernie Thompson <bernie@plugable.com>
Acked-by: Bernie Thompson <bernie@plugable.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1a5a9906d4e8d1976b701f889d8f35d54b928f25 upstream.
In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem hold in read mode. In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.
It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds). The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().
Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously. This is
probably why it wasn't common to run into this. For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.
Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).
The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value. Even if the real pmd is changing under the
value we hold on the stack, we don't care. If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).
All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd. The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds). I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).
if (pmd_trans_huge(*pmd)) {
if (next-addr != HPAGE_PMD_SIZE) {
VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
split_huge_page_pmd(vma->vm_mm, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
continue;
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))
Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.
The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.
====== start quote =======
mapcount 0 page_mapcount 1
kernel BUG at mm/huge_memory.c:1384!
At some point prior to the panic, a "bad pmd ..." message similar to the
following is logged on the console:
mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
the page's PMD table entry.
143 void pmd_clear_bad(pmd_t *pmd)
144 {
-> 145 pmd_ERROR(*pmd);
146 pmd_clear(pmd);
147 }
After the PMD table entry has been cleared, there is an inconsistency
between the actual number of PMD table entries that are mapping the page
and the page's map count (_mapcount field in struct page). When the page
is subsequently reclaimed, __split_huge_page() detects this inconsistency.
1381 if (mapcount != page_mapcount(page))
1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
1383 mapcount, page_mapcount(page));
-> 1384 BUG_ON(mapcount != page_mapcount(page));
The root cause of the problem is a race of two threads in a multithreaded
process. Thread B incurs a page fault on a virtual address that has never
been accessed (PMD entry is zero) while Thread A is executing an madvise()
system call on a virtual address within the same 2 MB (huge page) range.
virtual address space
.---------------------.
| |
| |
.-|---------------------|
| | |
| | |<-- B(fault)
| | |
2 MB | |/////////////////////|-.
huge < |/////////////////////| > A(range)
page | |/////////////////////|-'
| | |
| | |
'-|---------------------|
| |
| |
'---------------------'
- Thread A is executing an madvise(..., MADV_DONTNEED) system call
on the virtual address range "A(range)" shown in the picture.
sys_madvise
// Acquire the semaphore in shared mode.
down_read(¤t->mm->mmap_sem)
...
madvise_vma
switch (behavior)
case MADV_DONTNEED:
madvise_dontneed
zap_page_range
unmap_vmas
unmap_page_range
zap_pud_range
zap_pmd_range
//
// Assume that this huge page has never been accessed.
// I.e. content of the PMD entry is zero (not mapped).
//
if (pmd_trans_huge(*pmd)) {
// We don't get here due to the above assumption.
}
//
// Assume that Thread B incurred a page fault and
.---------> // sneaks in here as shown below.
| //
| if (pmd_none_or_clear_bad(pmd))
| {
| if (unlikely(pmd_bad(*pmd)))
| pmd_clear_bad
| {
| pmd_ERROR
| // Log "bad pmd ..." message here.
| pmd_clear
| // Clear the page's PMD entry.
| // Thread B incremented the map count
| // in page_add_new_anon_rmap(), but
| // now the page is no longer mapped
| // by a PMD entry (-> inconsistency).
| }
| }
|
v
- Thread B is handling a page fault on virtual address "B(fault)" shown
in the picture.
...
do_page_fault
__do_page_fault
// Acquire the semaphore in shared mode.
down_read_trylock(&mm->mmap_sem)
...
handle_mm_fault
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
// We get here due to the above assumption (PMD entry is zero).
do_huge_pmd_anonymous_page
alloc_hugepage_vma
// Allocate a new transparent huge page here.
...
__do_huge_pmd_anonymous_page
...
spin_lock(&mm->page_table_lock)
...
page_add_new_anon_rmap
// Here we increment the page's map count (starts at -1).
atomic_set(&page->_mapcount, 0)
set_pmd_at
// Here we set the page's PMD entry which will be cleared
// when Thread A calls pmd_clear_bad().
...
spin_unlock(&mm->page_table_lock)
The mmap_sem does not prevent the race because both threads are acquiring
it in shared mode (down_read). Thread B holds the page_table_lock while
the page's map count and PMD table entry are updated. However, Thread A
does not synchronize on that lock.
====== end quote =======
[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Jones <davej@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f910381a55cdaa097030291f272f6e6e4380c39a upstream.
Add a div64_long macro which is used to devide a 64bit number by a long (which
can be 4 bytes on 32bit systems and 8 bytes on 64bit systems).
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Cc: johnstul@us.ibm.com
Link: http://lkml.kernel.org/r/1331829374-31543-1-git-send-email-levinsasha928@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 62d3c5439c534b0e6c653fc63e6d8c67be3a57b1 upstream.
This patch (as1519) fixes a bug in the block layer's disk-events
polling. The polling is done by a work routine queued on the
system_nrt_wq workqueue. Since that workqueue isn't freezable, the
polling continues even in the middle of a system sleep transition.
Obviously, polling a suspended drive for media changes and such isn't
a good thing to do; in the case of USB mass-storage devices it can
lead to real problems requiring device resets and even re-enumeration.
The patch fixes things by creating a new system-wide, non-reentrant,
freezable workqueue and using it for disk-events polling.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fe316bf2d5847bc5dd975668671a7b1067603bc7 upstream.
Since 2.6.39 (1196f8b), when a driver returns -ENOMEDIUM for open(),
__blkdev_get() calls rescan_partitions() to remove
in-kernel partition structures and raise KOBJ_CHANGE uevent.
However it ends up calling driver's revalidate_disk without open
and could cause oops.
In the case of SCSI:
process A process B
----------------------------------------------
sys_open
__blkdev_get
sd_open
returns -ENOMEDIUM
scsi_remove_device
<scsi_device torn down>
rescan_partitions
sd_revalidate_disk
<oops>
Oopses are reported here:
http://marc.info/?l=linux-scsi&m=132388619710052
This patch separates the partition invalidation from rescan_partitions()
and use it for -ENOMEDIUM case.
Reported-by: Huajun Li <huajun.li.lee@gmail.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 03606895cd98c0a628b17324fd7b5ff15db7e3cd ]
Niccolo Belli reported ipsec crashes in case we handle a frame without
mac header (atm in his case)
Before copying mac header, better make sure it is present.
Bugzilla reference: https://bugzilla.kernel.org/show_bug.cgi?id=42809
Reported-by: Niccolò Belli <darkbasic@linuxsystems.it>
Tested-by: Niccolò Belli <darkbasic@linuxsystems.it>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c49d005b6cc8491fad5b24f82805be2d6bcbd3dd upstream.
A hardware bug in the OMAP4 HDMI PHY causes physical damage to the board
if the HDMI PHY is kept powered on when the cable is not connected.
This patch solves the problem by adding hot-plug-detection into the HDMI
IP driver. This is not a real HPD support in the sense that nobody else
than the IP driver gets to know about the HPD events, but is only meant
to fix the HW bug.
The strategy is simple: If the display device is turned off by the user,
the PHY power is set to OFF. When the display device is turned on by the
user, the PHY power is set either to LDOON or TXON, depending on whether
the HDMI cable is connected.
The reason to avoid PHY OFF when the display device is on, but the cable
is disconnected, is that when the PHY is turned OFF, the HDMI IP is not
"ticking" and thus the DISPC does not receive pixel clock from the HDMI
IP. This would, for example, prevent any VSYNCs from happening, and
would thus affect the users of omapdss. By using LDOON when the cable is
disconnected we'll avoid the HW bug, but keep the HDMI working as usual
from the user's point of view.
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5189fa19a4b2b4c3bec37c3a019d446148827717 upstream.
There is only one error code to return for a bad user-space buffer
pointer passed to a system call in the same address space as the
system call is executed, and that is EFAULT. Furthermore, the
low-level access routines, which catch most of the faults, return
EFAULT already.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c8e252586f8d5de906385d8cf6385fee289a825e upstream.
The regset common infrastructure assumed that regsets would always
have .get and .set methods, but not necessarily .active methods.
Unfortunately people have since written regsets without .set methods.
Rather than putting in stub functions everywhere, handle regsets with
null .get or .set methods explicitly.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3c761ea05a8900a907f32b628611873f6bef24b2 upstream.
The autofs compat handling fix caused a compile failure when
CONFIG_COMPAT isn't defined.
Instead of adding random #ifdef'fery in autofs, let's just make the
compat helpers earlier to use: without CONFIG_COMPAT, is_compat_task()
just hardcodes to zero.
We could probably do something similar for a number of other cases where
we have #ifdef's in code, but this is the low-hanging fruit.
Reported-and-tested-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 28d82dc1c4edbc352129f97f4ca22624d1fe61de upstream.
The current epoll code can be tickled to run basically indefinitely in
both loop detection path check (on ep_insert()), and in the wakeup paths.
The programs that tickle this behavior set up deeply linked networks of
epoll file descriptors that cause the epoll algorithms to traverse them
indefinitely. A couple of these sample programs have been previously
posted in this thread: https://lkml.org/lkml/2011/2/25/297.
To fix the loop detection path check algorithms, I simply keep track of
the epoll nodes that have been already visited. Thus, the loop detection
becomes proportional to the number of epoll file descriptor and links.
This dramatically decreases the run-time of the loop check algorithm. In
one diabolical case I tried it reduced the run-time from 15 mintues (all
in kernel time) to .3 seconds.
Fixing the wakeup paths could be done at wakeup time in a similar manner
by keeping track of nodes that have already been visited, but the
complexity is harder, since there can be multiple wakeups on different
cpus...Thus, I've opted to limit the number of possible wakeup paths when
the paths are created.
This is accomplished, by noting that the end file descriptor points that
are found during the loop detection pass (from the newly added link), are
actually the sources for wakeup events. I keep a list of these file
descriptors and limit the number and length of these paths that emanate
from these 'source file descriptors'. In the current implemetation I
allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
length 4 and 10 of length 5. Note that it is sufficient to check the
'source file descriptors' reachable from the newly added link, since no
other 'source file descriptors' will have newly added links. This allows
us to check only the wakeup paths that may have gotten too long, and not
re-check all possible wakeup paths on the system.
In terms of the path limit selection, I think its first worth noting that
the most common case for epoll, is probably the model where you have 1
epoll file descriptor that is monitoring n number of 'source file
descriptors'. In this case, each 'source file descriptor' has a 1 path of
length 1. Thus, I believe that the limits I'm proposing are quite
reasonable and in fact may be too generous. Thus, I'm hoping that the
proposed limits will not prevent any workloads that currently work to
fail.
In terms of locking, I have extended the use of the 'epmutex' to all
epoll_ctl add and remove operations. Currently its only used in a subset
of the add paths. I need to hold the epmutex, so that we can correctly
traverse a coherent graph, to check the number of paths. I believe that
this additional locking is probably ok, since its in the setup/teardown
paths, and doesn't affect the running paths, but it certainly is going to
add some extra overhead. Also, worth noting is that the epmuex was
recently added to the ep_ctl add operations in the initial path loop
detection code using the argument that it was not on a critical path.
Another thing to note here, is the length of epoll chains that is allowed.
Currently, eventpoll.c defines:
/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4
This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
+ 1). However, this limit is currently only enforced during the loop
check detection code, and only when the epoll file descriptors are added
in a certain order. Thus, this limit is currently easily bypassed. The
newly added check for wakeup paths, stricly limits the wakeup paths to a
length of 5, regardless of the order in which ep's are linked together.
Thus, a side-effect of the new code is a more consistent enforcement of
the graph depth.
Thus far, I've tested this, using the sample programs previously
mentioned, which now either return quickly or return -EINVAL. I've also
testing using the piptest.c epoll tester, which showed no difference in
performance. I've also created a number of different epoll networks and
tested that they behave as expectded.
I believe this solves the original diabolical test cases, while still
preserving the sane epoll nesting.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Nelson Elhage <nelhage@ksplice.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|