commit 635697c663f38106063d5659f0cf2e45afcd4bb5 upstream.
Stable note: The commit [acf92b48: vmscan: shrinker->nr updates race and
go wrong] aimed to reduce excessive reclaim of slab objects but
had a bug in how it treated shrinker functions that returned -1.
A shrinker function can return -1, which means that it cannot do anything
without a risk of deadlock. For example, prune_super() does this if it
cannot grab a superblock reference, even if nr_to_scan=0. Currently we
interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan'
according to this. So the next time around this shrinker can cause
really big pressure. Let's skip such shrinkers instead.
Also make total_scan signed, otherwise the check (total_scan < 0) below
never works.
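As a rough illustration of the two changes described above (not the upstream
diff), the sketch below skips a shrinker that reports -1 and keeps the scan
count signed so the negative check can trigger. The names shrink_fn,
stuck_shrinker, cache_of_100 and compute_total_scan are hypothetical, and the
cap at twice the cache size is an assumption of this sketch.

```c
/*
 * Hypothetical stand-in for a shrinker callback. With nr_to_scan == 0 it
 * only reports how many objects could be freed; -1 means "cannot do
 * anything without risking deadlock".
 */
typedef int (*shrink_fn)(unsigned long nr_to_scan);

/* Example callbacks used to exercise the logic (illustrative only). */
int stuck_shrinker(unsigned long nr_to_scan)
{
	(void)nr_to_scan;
	return -1;
}

int cache_of_100(unsigned long nr_to_scan)
{
	(void)nr_to_scan;
	return 100;
}

/*
 * Decide how many objects to scan for one shrinker. Returns 0 when the
 * shrinker should be skipped. The sum of the two inputs is assumed to fit
 * in a long.
 */
long compute_total_scan(shrink_fn shrink, long carried_over, long delta)
{
	long max_pass = shrink(0);
	long total_scan;	/* signed, so the check below can actually fire */

	if (max_pass <= 0)
		return 0;	/* includes the -1 "would deadlock" case */

	total_scan = carried_over + delta;
	if (total_scan < 0)
		total_scan = max_pass;

	/* Cap the work at twice the number of freeable objects. */
	if (total_scan > 2 * max_pass)
		total_scan = 2 * max_pass;

	return total_scan;
}
```

With an unsigned total_scan the (total_scan < 0) branch would be dead code,
which is the second problem the patch addresses.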
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b1c12cbcd0a02527c180a862e8971e249d3b347d upstream.
Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
expensive and severely impacted page allocator performance. This
is part of a series of patches that reduce page allocator overhead.
Fix a gcc warning (and bug?) introduced in cc9a6c877 ("cpuset: mm: reduce
large amounts of memory barrier related damage v3")
Local variable "page" can be uninitialized if the nodemask from vma policy
does not intersect with the nodemask from cpuset. Even if that never happens,
it is better to initialize this variable explicitly than to introduce
a kernel oops in a weird corner case.
mm/hugetlb.c: In function `alloc_huge_page':
mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cc9a6c8776615f9c194ccf0b63a0aa5628235545 upstream.
Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
expensive and severely impacted page allocator performance. This
is part of a series of patches that reduce page allocator overhead.
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating a large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
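The reader/writer pattern described above can be sketched in portable C11 as
follows. This is only a userspace model under stated assumptions: the kernel
uses its own seqcount primitives, and the names mems_seq, mems_read_begin,
mems_read_retry, mems_write and alloc_on_node are hypothetical. A single
writer is assumed, and node must be smaller than the bit width of
unsigned long.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct mems_seq {
	atomic_uint seq;		/* odd while an update is in flight */
	_Atomic unsigned long nodemask;	/* stand-in for the allowed nodes */
};

/* Reader side: wait out an in-flight update, then remember the count. */
unsigned int mems_read_begin(struct mems_seq *m)
{
	unsigned int s;

	do {
		s = atomic_load_explicit(&m->seq, memory_order_acquire);
	} while (s & 1);
	return s;
}

/* Reader side: true if the nodemask may have changed since read_begin. */
bool mems_read_retry(struct mems_seq *m, unsigned int start)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&m->seq, memory_order_relaxed) != start;
}

/* Writer side (single writer assumed): bump, update, bump again. */
void mems_write(struct mems_seq *m, unsigned long new_mask)
{
	unsigned int s = atomic_load_explicit(&m->seq, memory_order_relaxed);

	atomic_store_explicit(&m->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&m->nodemask, new_mask, memory_order_relaxed);
	atomic_store_explicit(&m->seq, s + 2, memory_order_release);
}

/*
 * Model of an allocation that succeeds only if the node is allowed.
 * A failure is retried only when the mask changed underneath us, which
 * is the "false failure" case; a success is never retried.
 */
bool alloc_on_node(struct mems_seq *m, unsigned int node)
{
	unsigned int cookie;
	bool ok;

	do {
		cookie = mems_read_begin(m);
		ok = (atomic_load_explicit(&m->nodemask,
					   memory_order_relaxed) >> node) & 1UL;
	} while (!ok && mems_read_retry(m, cookie));
	return ok;
}
```

The reader never writes shared state, so the fast path needs only acquire
ordering rather than a full barrier.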
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
                                  3.3.0-rc3             3.3.0-rc3
                                rc3-vanilla        nobarrier-v2r1
Clients 1 UserTime           0.07 (  0.00%)        0.08 (-14.19%)
Clients 2 UserTime           0.07 (  0.00%)        0.07 (  2.72%)
Clients 4 UserTime           0.08 (  0.00%)        0.07 (  3.29%)
Clients 1 SysTime            0.70 (  0.00%)        0.65 (  6.65%)
Clients 2 SysTime            0.85 (  0.00%)        0.82 (  3.65%)
Clients 4 SysTime            1.41 (  0.00%)        1.41 (  0.32%)
Clients 1 WallTime           0.77 (  0.00%)        0.74 (  4.19%)
Clients 2 WallTime           0.47 (  0.00%)        0.45 (  3.73%)
Clients 4 WallTime           0.38 (  0.00%)        0.37 (  1.58%)
Clients 1 Flt/sec/cpu   497620.28 (  0.00%)   520294.53 (  4.56%)
Clients 2 Flt/sec/cpu   414639.05 (  0.00%)   429882.01 (  3.68%)
Clients 4 Flt/sec/cpu   257959.16 (  0.00%)   258761.48 (  0.31%)
Clients 1 Flt/sec       495161.39 (  0.00%)   517292.87 (  4.47%)
Clients 2 Flt/sec       820325.95 (  0.00%)   850289.77 (  3.65%)
Clients 4 Flt/sec      1020068.93 (  0.00%)  1022674.06 (  0.26%)
MMTests Statistics: duration
Sys Time Running Test (seconds)         135.68    132.17
User+Sys Time Running Test (seconds)    164.2     160.13
Total Elapsed Time (seconds)            123.46    120.87
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b95a2f2d486d0d768a92879c023a03757b9c7e58 upstream - WARNING: this is a substitute patch.
Stable note: Not tracked in Bugzilla. This is a partial backport of an
upstream commit addressing a completely different issue
that accidentally contained an important fix. The workload
this patch helps was memcached when IO is started in the
background. memcached should stay resident but without this patch
it gets swapped. Sometimes this manifests as a drop in throughput
but mostly it was observed through /proc/vmstat.
Commit [246e87a9: memcg: fix get_scan_count() for small targets] was meant
to fix a problem whereby small scan targets on memcg were ignored, causing
priority to rise too sharply. It forced scanning to take place if the
target was small, memcg or kswapd.
From the time it was introduced it caused excessive reclaim by kswapd
with workloads being pushed to swap that previously would have stayed
resident. This was accidentally fixed in commit [b95a2f2d: mm: vmscan:
convert global reclaim to per-memcg LRU lists] by making it harder for
kswapd to force scan small targets but that patchset is not suitable for
backporting. This was later changed again by commit [90126375: mm/vmscan:
push lruvec pointer into get_scan_count()] into a format that looks
like it would be a straight-forward backport but there is a subtle
difference due to the use of lruvecs.
The impact of the accidental fix is to make it harder for kswapd to force
scan small targets by taking zone->all_unreclaimable into account. This
patch is the closest equivalent available based on what is backported.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 043bcbe5ec51e0478ef2b44acef17193e01d7f70 upstream.
Stable note: Not tracked in Bugzilla. There were reports of shared
mapped pages being unfairly reclaimed in comparison to older kernels.
This is being addressed over time. Even though the subject
refers to lumpy reclaim, it impacts compaction as well.
Lumpy reclaim does well to stop at a PageAnon when there's no swap, but
better is to stop at any PageSwapBacked, which includes shmem/tmpfs too.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
commit 86cfd3a45042ab242d47f3935a02811a402beab6 upstream.
Stable note: Not tracked in Bugzilla. This patch reduces kswapd CPU
usage on swapless systems with high anonymous memory usage.
It's pointless to continue reclaiming when we have no swap space and lots
of anon pages in the inactive list.
Without this patch, it is possible when swap is disabled to continue
trying to reclaim when there are only anonymous pages in the system even
though that will not make any progress.
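A simplified sketch of the idea, assuming a helper that decides whether
reclaim should continue on behalf of compaction: inactive anon pages are only
counted as reclaimable when swap space is available. The function name
keep_reclaiming and its parameters are illustrative, not the kernel's
signature.

```c
#include <stdbool.h>

/*
 * Continue reclaiming only while there is something left that can
 * actually be reclaimed. Without swap, anonymous pages cannot be
 * evicted, so they are left out of the count.
 */
bool keep_reclaiming(unsigned long nr_reclaimed,
		     unsigned long pages_for_compaction,
		     unsigned long inactive_file,
		     unsigned long inactive_anon,
		     long nr_swap_pages)
{
	unsigned long reclaimable = inactive_file;

	if (nr_swap_pages > 0)
		reclaimable += inactive_anon;

	return nr_reclaimed < pages_for_compaction &&
	       reclaimable > pages_for_compaction;
}
```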
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c909e99364c8b6ca07864d752950b6b4ecf6bef4 upstream.
Stable note: Not tracked in Bugzilla. There were reports of shared
mapped pages being unfairly reclaimed in comparison to older kernels.
This is being addressed over time.
Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages
the first class citizen") was noticeably weakened in commit
645747462435d84 ("vmscan: detect mapped file pages used only once").
Currently these pages can become "first class citizens" only after second
usage. After this patch page_check_references() will activate them after
first usage, and executable code gets an even better chance to stay in memory.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 34dbc67a644f11ab3475d822d72e25409911e760 upstream.
Stable note: Not tracked in Bugzilla. There were reports of shared
mapped pages being unfairly reclaimed in comparison to older kernels.
This is being addressed over time. The specific workload being
addressed here is described in paragraph four and while paragraph
five says it did not help performance as such, it made a difference
to major page faults. I'm aware of at least one bug for a large
vendor that was due to increased major faults.
Commit 645747462435 ("vmscan: detect mapped file pages used only once")
greatly decreases the lifetime of single-use mapped file pages.
Unfortunately it also decreases the lifetime of all shared mapped file
pages, because after commit bf3f3bc5e7347 ("mm: don't mark_page_accessed
in fault path") the page-fault handler does not mark a page active or even
referenced.
Thus page_check_references() activates a file page only if it was used twice
while it stays in the inactive list, whereas it activates anon pages after
the first access. The inactive list can be small enough that the reclaimer
accidentally throws away any widely used page if it wasn't used twice in a
short period.
After this patch page_check_references() also activates a file mapped page at
the first inactive list scan if this page is already used multiple times via
several ptes.
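The decision described above can be modelled roughly as below. This is a
simplified sketch, not the kernel function: the struct page_state and the
name check_references are hypothetical, and the real code distinguishes more
cases than the three outcomes shown here.

```c
#include <stdbool.h>

enum page_ref { PAGEREF_RECLAIM, PAGEREF_KEEP, PAGEREF_ACTIVATE };

/* Illustrative per-page state; not the kernel's struct page. */
struct page_state {
	bool anon;		/* anonymous rather than file-backed */
	bool referenced_flag;	/* seen referenced on an earlier scan */
	int referenced_ptes;	/* ptes that referenced it since last scan */
};

enum page_ref check_references(struct page_state *p)
{
	bool seen_before;

	if (p->referenced_ptes == 0)
		return PAGEREF_RECLAIM;

	/* Anon pages are activated after the first access. */
	if (p->anon)
		return PAGEREF_ACTIVATE;

	seen_before = p->referenced_flag;
	p->referenced_flag = true;

	/*
	 * File pages: activate on second use, or right away when several
	 * ptes reference the page, which indicates a shared mapping.
	 */
	if (seen_before || p->referenced_ptes > 1)
		return PAGEREF_ACTIVATE;

	return PAGEREF_KEEP;
}
```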
I found this while trying to fix a degradation in rhel6 (~2.6.32) from rhel5
(~2.6.18). There is a complete mess with >100 web/mail/spam/ftp containers;
they share all their files but there are a lot of anonymous pages: ~500mb
shared file mapped memory and 15-20Gb non-shared anonymous memory. In
this situation major-pagefaults are very costly, because all containers
share the same page. In my load the kernel created a disproportionate
pressure on the file memory, compared with the anonymous; they equaled
only if I raised swappiness up to 150 =)
These patches actually did not help a lot with my problem, but I saw a
noticeable (10-20 times) reduction in the count and average time of
major-pagefaults in file-mapped areas.
Actually both patches are fixes for commit v2.6.33-5448-g6457474, because
it was aimed at one scenario (singly used pages), but it breaks the logic
in other scenarios (shared and/or executable pages)
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
is true for one zone
commit 0cee34fd72c582b4f8ad8ce00645b75fb4168199 upstream.
Stable note: Not tracked on Bugzilla. THP and compaction were found to
aggressively reclaim pages and stall systems under different
situations that were addressed piecemeal over time.
If compaction can proceed for a given zone, shrink_zones() does not
reclaim any more pages from it. After commit [e0c2327: vmscan: abort
reclaim/compaction if compaction can proceed], do_try_to_free_pages()
tries to finish as soon as possible once one zone can compact.
This was intended to prevent slabs being shrunk unnecessarily but there
are side-effects. One is that a small zone that is ready for compaction
will abort reclaim even if the chances of successfully allocating a THP
from that zone are small. It also means that reclaim can return too early
even though sc->nr_to_reclaim pages were not reclaimed.
This partially reverts the commit until it is proven that slabs are really
being shrunk unnecessarily but preserves the check to return 1 to avoid
OOM if reclaim was aborted prematurely.
[aarcange@redhat.com: This patch replaces a revert from Andrea]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7335084d446b83cbcb15da80497d03f0c1dc9e21 upstream.
Stable note: Not tracked in Bugzilla. This patch makes later patches
easier to apply but otherwise has little to justify it. The
problem it fixes was never observed but the source of the
theoretical problem did not exist for very long.
During direct reclaim it is possible that reclaim will be aborted so that
compaction can be attempted to satisfy a high-order allocation. If this
decision is made before any pages are reclaimed, it is possible that 0 is
returned to the page allocator potentially triggering an OOM. This has
not been observed but it is a possibility so this patch addresses it.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
pages available
commit fe4b1b244bdb96136855f2c694071cb09d140766 upstream.
Stable note: Not tracked on Bugzilla. THP and compaction were found to
aggressively reclaim pages and stall systems under different
situations that were addressed piecemeal over time. This patch
addresses a problem where the fix regressed THP allocation
success rates.
In commit e0887c19 ("vmscan: limit direct reclaim for higher order
allocations"), Rik noted that reclaim was too aggressive when THP was
enabled. In his initial patch he used the number of free pages to decide
if reclaim should abort for compaction. My feedback was that reclaim and
compaction should be using the same logic when deciding if reclaim should
be aborted.
Unfortunately, this had the effect of reducing THP success rates when the
workload included something like streaming reads that continually
allocated pages. The window during which compaction could run and return
a THP was too small.
This patch combines Rik's two patches together. compaction_suitable() is
still used to decide if reclaim should be aborted to allow compaction to
run. However, it will also ensure that there is a reasonable buffer of
free pages available. This improves upon the THP allocation success rates
but bounds the number of pages that are freed for compaction.
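One way to picture the combined check is the sketch below. It is a model
under assumptions: the struct zone_model, the name compaction_ready and the
exact buffer (high watermark plus a balance gap plus twice the allocation
size) are choices of this sketch, and a plain comparison stands in for the
kernel's watermark check.

```c
#include <stdbool.h>

/* Illustrative zone summary; not the kernel's struct zone. */
struct zone_model {
	unsigned long free_pages;
	unsigned long high_wmark;
	unsigned long balance_gap;
	bool compaction_suitable;	/* result of the suitability test */
};

/*
 * Reclaim may be aborted in favour of compaction only when compaction
 * looks suitable AND a buffer of free pages exists. The buffer scales
 * with the zone's high watermark, which bounds how much is freed.
 */
bool compaction_ready(const struct zone_model *z, unsigned int order)
{
	unsigned long target = z->high_wmark + z->balance_gap + (2UL << order);

	if (!z->compaction_suitable)
		return false;

	return z->free_pages >= target;
}
```

For an order-9 request with high_wmark = 1000 and balance_gap = 500 the
target in this model is 1000 + 500 + 1024 = 2524 free pages.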
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a6bc32b899223a877f595ef9ddc1e89ead5072b8 upstream.
Stable note: Not tracked in Bugzilla. This was part of a series that
reduced interactivity stalls experienced when THP was enabled.
These stalls were particularly noticeable when copying data
to a USB stick but the experiences for users varied a lot.
This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage. Async compaction
maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
used.
This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be a
large number of dirty pages backed by a filesystem that does not support
->writepages.
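The mapping described above can be sketched as follows. The three mode names
come from this changelog; the migrate_caller enum and the helpers mode_for,
may_block and may_writeback are hypothetical and exist only to make the
mapping explicit.

```c
#include <stdbool.h>

enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC };

/* Hypothetical classification of who is asking for migration. */
enum migrate_caller {
	CALLER_ASYNC_COMPACTION,
	CALLER_SYNC_COMPACTION,
	CALLER_MEMORY_HOTPLUG
};

enum migrate_mode mode_for(enum migrate_caller caller)
{
	switch (caller) {
	case CALLER_ASYNC_COMPACTION:
		return MIGRATE_ASYNC;
	case CALLER_SYNC_COMPACTION:
		return MIGRATE_SYNC_LIGHT;
	default:
		return MIGRATE_SYNC;	/* e.g. memory hotplug */
	}
}

/* Only the async mode must never block. */
bool may_block(enum migrate_mode mode)
{
	return mode != MIGRATE_ASYNC;
}

/* Only the full sync mode may write pages back to backing storage. */
bool may_writeback(enum migrate_mode mode)
{
	return mode == MIGRATE_SYNC;
}
```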
[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f0dfcde099453aa4c0dc42473828d15a6d492936 upstream.
Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019. This
patch reduces kswapd CPU usage.
There are 2 places to read pgdat in kswapd. One is the return from a successful
balance, the other is the wakeup from kswapd sleeping. The new_order and
new_classzone_idx represent the balance input order and classzone_idx.
But currently new_order and new_classzone_idx are not assigned after
kswapd_try_to_sleep(), which causes a bug in the following scenario.
1: after a successful balance, kswapd goes to sleep, and new_order = 0;
new_classzone_idx = __MAX_NR_ZONES - 1;
2: kswapd is woken up with order = 3 and classzone_idx = ZONE_NORMAL
3: while balance_pgdat() is running, a new balance wakeup happens with
order = 5, and classzone_idx = ZONE_NORMAL
4: the first wakeup (order = 3) finishes successfully and returns order = 3,
but new_order is still 0, so this balancing will be treated as a
failed balance. And then the second tighter balancing will be missed.
So, to avoid the above problem, the new_order and new_classzone_idx need
to be assigned for later successful comparison.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d2ebd0f6b89567eb93ead4e2ca0cbe03021f344b upstream.
Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019. This
patch reduces kswapd CPU usage.
In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully"), Mel Gorman said it is better for kswapd to sleep
after an unsuccessful balancing if there is a tighter reclaim request pending
in the balancing. But in the following scenario, kswapd does something that
does not match our expectation. The patch fixes this issue.
1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
4, balance_pgdat() returns but failed since returned order = 0
5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
While the expected behavior of kswapd is to try to sleep.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Pádraig Brady <P@draigBrady.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c82449352854ff09e43062246af86bdeb628f0c3 upstream.
Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
information by reducing LRU list churning had the side-effect of
reducing THP allocation success rates. This was part of a series
to restore the success rates while preserving the reclaim fix.
Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
noted that compaction does not migrate dirty or writeback pages and that
it was meaningless to pick the page and re-add it to the LRU list. This
had to be partially reverted because some dirty pages can be migrated by
compaction without blocking.
This patch updates "mm: compaction: make isolate_lru_page" by skipping
over pages that migration has no possibility of migrating to minimise LRU
disruption.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
compaction is deferred
commit 66199712e9eef5aede09dbcd9dfff87798a66917 upstream.
Stable note: Not tracked in Bugzilla. This was part of a series that
reduced interactivity stalls experienced when THP was enabled.
If compaction is deferred, direct reclaim is used to try to free enough
pages for the allocation to succeed. For small high-orders, this has a
reasonable chance of success. However, if the caller has specified
__GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense
to fail the allocation rather than stall the caller in direct reclaim.
This patch skips direct reclaim if compaction is deferred and the caller
specifies __GFP_NO_KSWAPD.
Async compaction only considers a subset of pages so it is possible for
compaction to be deferred prematurely and not enter direct reclaim even in
cases where it should. To compensate for this, this patch also defers
compaction only if sync compaction failed.
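The two rules above can be sketched as a pair of predicates. This is an
illustration only: the flag value SKETCH_GFP_NO_KSWAPD and both function
names are hypothetical and do not match the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical bit standing in for __GFP_NO_KSWAPD. */
#define SKETCH_GFP_NO_KSWAPD 0x1u

/*
 * Fail the allocation instead of entering direct reclaim when compaction
 * was deferred and the caller asked to limit disruption.
 */
bool skip_direct_reclaim(bool compaction_deferred, unsigned int gfp_mask)
{
	return compaction_deferred && (gfp_mask & SKETCH_GFP_NO_KSWAPD);
}

/*
 * Async compaction only looks at a subset of pages, so its failure is
 * weak evidence. Defer future compaction only after a sync failure.
 */
bool should_defer_compaction(bool compaction_failed, bool sync_migration)
{
	return compaction_failed && sync_migration;
}
```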
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
within ->migratepage
commit b969c4ab9f182a6e1b2a0848be349f99714947b0 upstream.
Stable note: Not tracked in Bugzilla. A fix aimed at preserving page
aging information by reducing LRU list churning had the side-effect
of reducing THP allocation success rates. This was part of a series
to restore the success rates while preserving the reclaim fix.
Asynchronous compaction is used when allocating transparent hugepages to
avoid blocking for long periods of time. Due to reports of stalling,
there was a debate on disabling synchronous compaction but this severely
impacted allocation success rates. Part of the reason was that many dirty
pages are skipped in asynchronous compaction by the following check;
	if (PageDirty(page) && !sync &&
	    mapping->a_ops->migratepage != migrate_page)
		rc = -EBUSY;
This skips over all mapping aops using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking. This
patch updates the ->migratepage callback with a "sync" parameter. It is
the responsibility of the callback to fail gracefully if migration would
block.
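A minimal model of a callback that takes the "sync" parameter and fails
gracefully is sketched below. The struct sketch_page and the function
sketch_migratepage are hypothetical, and the choice of -EAGAIN as the
"would block" result is an assumption of this sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative page state; not the kernel's struct page. */
struct sketch_page {
	bool dirty;		/* not consulted: dirtiness alone no longer
				 * disqualifies a page */
	bool buffers_locked;	/* someone else holds the buffer locks */
};

/*
 * Migrate the page if that can be done within the caller's constraints.
 * With sync == false the callback must not block, so it backs out as
 * soon as it would have to wait.
 */
int sketch_migratepage(struct sketch_page *page, bool sync)
{
	if (page->buffers_locked) {
		if (!sync)
			return -EAGAIN;	/* would block: let the caller skip */
		page->buffers_locked = false;	/* models waiting for the lock */
	}

	return 0;	/* migrated */
}
```

In this model a dirty page with uncontended buffers is migrated even in the
async case, which is the class of page the old check rejected outright.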
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a77ebd333cd810d7b680d544be88c875131c2bd3 upstream.
Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
information by reducing LRU list churning had the side-effect of
reducing THP allocation success rates. This was part of a series
to restore the success rates while preserving the reclaim fix.
Short summary: There are severe stalls when a USB stick using VFAT is
used with THP enabled that are reduced by this series. If you are
experiencing this problem, please test and report back. Considering I
have seen complaints from openSUSE and Fedora users on this as well as a
few private mails, I'm guessing it's a widespread issue. This is a new
type of USB-related stall because it is due to synchronous compaction
writing, whereas in the past the big problem was dirty pages reaching
the end of the LRU and being written by reclaim.
Am cc'ing Andrew this time and this series would replace
mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
for wider testing and ideally it would be reverted and replaced by this
series.
That said, the later patches could really do with some review. If this
series is not the answer then a new direction needs to be discussed
because as it is, the stalls are unacceptable as the results in this
leader show.
For testers that try backporting this to 3.1, it won't work because
there is a non-obvious dependency on not writing back pages in direct
reclaim so you need those patches too.
Changelog since V5
o Rebase to 3.2-rc5
o Tidy up the changelogs a bit
Changelog since V4
o Added reviewed-bys, credited Andrea properly for sync-light
o Allow dirty pages without mappings to be considered for migration
o Bound the number of pages freed for compaction
o Isolate PageReclaim pages on their own LRU list
This is against 3.2-rc5 and follows on from discussions on "mm: Do
not stall in synchronous compaction for THP allocations" and "[RFC
PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed
patch eliminated stalls due to compaction which sometimes resulted in
user-visible interactivity problems on browsers by simply never using
sync compaction. The downside was that THP allocation success rates
were lower because dirty pages were not being migrated as reported by
Andrea. His approach at fixing this was nacked on the grounds that
it reverted fixes from Rik that reduced the number of pages
reclaimed, as it severely impacted his workload's performance.
This series attempts to reconcile the requirements of maximising THP
usage, without stalling in a user-visible fashion due to compaction
or cheating by reclaiming an excessive number of pages.
Patch 1 partially reverts commit 39deaf85 to allow migration to isolate
dirty pages. This is because migration can move some dirty
pages without blocking.
Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using
synchronous compaction when it should be. This is unrelated
to the reported stalls but is worth fixing.
Patch 3 checks if we isolated a compound page during lumpy scan and
accounts for it properly. For the most part, this affects
tracing so it's unrelated to the stalls but worth fixing.
Patch 4 notes that it is possible to abort reclaim early for compaction
and return 0 to the page allocator potentially entering the
"may oom" path. This has not been observed in practice but
the rest of the series potentially makes it easier to happen.
Patch 5 adds a sync parameter to the migratepage callback and gives
the callback responsibility for migrating the page without
blocking if sync==false. For example, fallback_migrate_page
will not call writepage if sync==false. This increases the
number of pages that can be handled by asynchronous compaction
thereby reducing stalls.
Patch 6 restores filter-awareness to isolate_lru_page for migration.
In practice, it means that pages under writeback and pages
without a ->migratepage callback will not be isolated
for migration.
Patch 7 avoids calling direct reclaim if compaction is deferred but
makes sure that compaction is only deferred if sync
compaction was used.
Patch 8 introduces a sync-light migration mechanism that sync compaction
uses. The objective is to allow some stalls but to not call
->writepage which can lead to significant user-visible stalls.
Patch 9 notes that while we want to abort reclaim ASAP to allow
compaction to go ahead, we leave a very small window of
opportunity for compaction to run. This patch allows more pages
to be freed by reclaim but bounds the number to a reasonable
level based on the high watermark on each zone.
Patch 10 allows slabs to be shrunk even after compaction_ready() is
true for one zone. This is to avoid a problem whereby a single
small zone can abort reclaim even though no pages have been
reclaimed and no suitably large zone is in a usable state.
Patch 11 fixes a problem with the rate of page scanning. As reclaim is
rarely stalling on pages under writeback it means that scan
rates are very high. This is particularly true for direct
reclaim which is not calling writepage. The vmstat figures
implied that much of this was busy work with PageReclaim pages
marked for immediate reclaim. This patch is a prototype that
moves these pages to their own LRU list.
This has been tested and other than 2 USB keys getting trashed,
nothing horrible fell out. That said, I am a bit unhappy with the
rescue logic in patch 11 but did not find a better way around it. It
does significantly reduce scan rates and System CPU time indicating
it is the right direction to take.
What is of critical importance is that stalls due to compaction
are massively reduced even though sync compaction was still
allowed. Testing from people complaining about stalls copying to USBs
with THP enabled is particularly welcome.
The following tests all involve THP usage and USB keys in some
way. Each test follows this type of pattern:
1. Read from some fast storage, be it raw device or file. Each time
the copy finishes, start again until the test ends
2. Write a large file to a filesystem on a USB stick. Each time the copy
finishes, start again until the test ends
3. When memory is low, start an alloc process that creates a mapping
the size of physical memory to stress THP allocation. This is the
"real" part of the test and the part that is meant to trigger
stalls when THP is enabled. Copying continues in the background.
4. Record the CPU usage and time to execute of the alloc process
5. Record the number of THP allocs and fallbacks as well as the number of THP
pages in use at the end of the test just before alloc exited
6. Run the test 5 times to get an idea of variability
7. Between each run, sync is run and caches dropped and the test
waits until nr_dirty is a small number to avoid interference
or caching between iterations that would skew the figures.
The individual tests were then
writebackCPDeviceBasevfat
Disable THP, read from a raw device (sda), vfat on USB stick
writebackCPDeviceBaseext4
Disable THP, read from a raw device (sda), ext4 on USB stick
writebackCPDevicevfat
THP enabled, read from a raw device (sda), vfat on USB stick
writebackCPDeviceext4
THP enabled, read from a raw device (sda), ext4 on USB stick
writebackCPFilevfat
THP enabled, read from a file on fast storage and USB, both vfat
writebackCPFileext4
THP enabled, read from a file on fast storage and USB, both ext4
The kernels tested were
3.1 3.1
vanilla 3.2-rc5
freemore Patches 1-10
immediate Patches 1-11
andrea The 8 patches Andrea posted as a basis of comparison
The results are very long unfortunately. I'll start with the case
where we are not using THP at all
writebackCPDeviceBasevfat
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
System Time 1.28 ( 0.00%) 54.49 (-4143.46%) 48.63 (-3687.69%) 4.69 ( -265.11%) 51.88 (-3940.81%)
+/- 0.06 ( 0.00%) 2.45 (-4305.55%) 4.75 (-8430.57%) 7.46 (-13282.76%) 4.76 (-8440.70%)
User Time 0.09 ( 0.00%) 0.05 ( 40.91%) 0.06 ( 29.55%) 0.07 ( 15.91%) 0.06 ( 27.27%)
+/- 0.02 ( 0.00%) 0.01 ( 45.39%) 0.02 ( 25.07%) 0.00 ( 77.06%) 0.01 ( 52.24%)
Elapsed Time 110.27 ( 0.00%) 56.38 ( 48.87%) 49.95 ( 54.70%) 11.77 ( 89.33%) 53.43 ( 51.54%)
+/- 7.33 ( 0.00%) 3.77 ( 48.61%) 4.94 ( 32.63%) 6.71 ( 8.50%) 4.76 ( 35.03%)
THP Active 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
+/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Fault Alloc 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
+/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Fault Fallback 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
+/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
The THP figures are obviously all 0 because THP was disabled. The
main thing to watch is the elapsed times and how they compare to
times when THP is enabled later. It's also important to note that
elapsed time is improved by this series as System CPU time is much
reduced.
writebackCPDevicevfat
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
System Time 1.22 ( 0.00%) 13.89 (-1040.72%) 46.40 (-3709.20%) 4.44 ( -264.37%) 47.37 (-3789.33%)
+/- 0.06 ( 0.00%) 22.82 (-37635.56%) 3.84 (-6249.44%) 6.48 (-10618.92%) 6.60 (-10818.53%)
User Time 0.06 ( 0.00%) 0.06 ( -6.90%) 0.05 ( 17.24%) 0.05 ( 13.79%) 0.04 ( 31.03%)
+/- 0.01 ( 0.00%) 0.01 ( 33.33%) 0.01 ( 33.33%) 0.01 ( 39.14%) 0.01 ( 25.46%)
Elapsed Time 10445.54 ( 0.00%) 2249.92 ( 78.46%) 70.06 ( 99.33%) 16.59 ( 99.84%) 472.43 ( 95.48%)
+/- 643.98 ( 0.00%) 811.62 ( -26.03%) 10.02 ( 98.44%) 7.03 ( 98.91%) 59.99 ( 90.68%)
THP Active 15.60 ( 0.00%) 35.20 ( 225.64%) 65.00 ( 416.67%) 70.80 ( 453.85%) 62.20 ( 398.72%)
+/- 18.48 ( 0.00%) 51.29 ( 277.59%) 15.99 ( 86.52%) 37.91 ( 205.18%) 22.02 ( 119.18%)
Fault Alloc 121.80 ( 0.00%) 76.60 ( 62.89%) 155.40 ( 127.59%) 181.20 ( 148.77%) 286.60 ( 235.30%)
+/- 73.51 ( 0.00%) 61.11 ( 83.12%) 34.89 ( 47.46%) 31.88 ( 43.36%) 68.13 ( 92.68%)
Fault Fallback 881.20 ( 0.00%) 926.60 ( -5.15%) 847.60 ( 3.81%) 822.00 ( 6.72%) 716.60 ( 18.68%)
+/- 73.51 ( 0.00%) 61.26 ( 16.67%) 34.89 ( 52.54%) 31.65 ( 56.94%) 67.75 ( 7.84%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3540.88 1945.37 716.04 64.97 1937.03
Total Elapsed Time (seconds) 52417.33 11425.90 501.02 230.95 2520.28
The first thing to note is the "Elapsed Time" for the vanilla kernels
of 2249 seconds versus 56 with THP disabled which might explain the
reports of USB stalls with THP enabled. Applying the patches brings
performance in line with THP-disabled performance while isolating
pages for immediate reclaim from the LRU cuts down System CPU time.
The "Fault Alloc" success rate figures are also improved. The vanilla
kernel only managed to allocate 76.6 pages on average over the course
of 5 iterations whereas applying the series allocated 181.20 on
average, albeit well within variance. It's worth noting that
applying the series at least decreases the amount of variance, which
implies an improvement.
Andrea's series had a higher success rate for THP allocations but
at a severe cost to elapsed time which is still better than vanilla
but still much worse than disabling THP altogether. One can bring my
series close to Andrea's by removing this check
/*
* If compaction is deferred for high-order allocations, it is because
* sync compaction recently failed. In this is the case and the caller
* has requested the system not be heavily disrupted, fail the
* allocation now instead of entering direct reclaim
*/
if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
goto nopage;
I didn't include a patch that removed the above check because hurting
overall performance to improve the THP figure is not what the average
user wants. It's something to consider though if someone really wants
to maximise THP usage no matter what it does to the workload initially.
This is a summary of vmstat figures from the same test.
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
Page Ins 3257266139 1111844061 17263623 10901575 161423219
Page Outs 81054922 30364312 3626530 3657687 8753730
Swap Ins 3294 2851 6560 4964 4592
Swap Outs 390073 528094 620197 790912 698285
Direct pages scanned 1077581700 3024951463 1764930052 115140570 5901188831
Kswapd pages scanned 34826043 7112868 2131265 1686942 1893966
Kswapd pages reclaimed 28950067 4911036 1246044 966475 1497726
Direct pages reclaimed 805148398 280167837 3623473 2215044 40809360
Kswapd efficiency 83% 69% 58% 57% 79%
Kswapd velocity 664.399 622.521 4253.852 7304.360 751.490
Direct efficiency 74% 9% 0% 1% 0%
Direct velocity 20557.737 264745.137 3522673.849 498551.938 2341481.435
Percentage direct scans 96% 99% 99% 98% 99%
Page writes by reclaim 722646 529174 620319 791018 699198
Page writes file 332573 1080 122 106 913
Page writes anon 390073 528094 620197 790912 698285
Page reclaim immediate 0 2552514720 1635858848 111281140 5478375032
Page rescued immediate 0 0 0 87848 0
Slabs scanned 23552 23552 9216 8192 9216
Direct inode steals 231 0 0 0 0
Kswapd inode steals 0 0 0 0 0
Kswapd skipped wait 28076 786 0 61 6
THP fault alloc 609 383 753 906 1433
THP collapse alloc 12 6 0 0 6
THP splits 536 211 456 593 1136
THP fault fallback 4406 4633 4263 4110 3583
THP collapse fail 120 127 0 0 4
Compaction stalls 1810 728 623 779 3200
Compaction success 196 53 60 80 123
Compaction failures 1614 675 563 699 3077
Compaction pages moved 193158 53545 243185 333457 226688
Compaction move failure 9952 9396 16424 23676 45070
The main things to look at are
1. Page In/out figures are much reduced by the series.
2. Direct page scanning is incredibly high (264745.137 pages scanned
per second on the vanilla kernel) but isolating PageReclaim pages
on their own list reduces the number of pages scanned significantly.
3. The fact that "Page rescued immediate" is a positive number implies
that we sometimes race removing pages from the LRU_IMMEDIATE list
that need to be put back on a normal LRU but it happens only for
0.07% of the pages marked for immediate reclaim.
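The derived rows in the table above appear to follow from the raw
counters. The sketch below is an assumption about how they are computed,
not the MMTests implementation; the function names are hypothetical. It
reproduces the rc5-vanilla "Kswapd efficiency" (69%) and "Direct velocity"
(264745.137) figures, and evaluates the rescued share for isolate-v6r1
to roughly 0.079%, which the text quotes as 0.07%.

```c
#include <assert.h>

/* Percentage of scanned pages that were reclaimed, truncated to an integer
 * (assumed to be how the "efficiency" rows are derived). */
static unsigned long reclaim_efficiency_pct(unsigned long long reclaimed,
                                            unsigned long long scanned)
{
	assert(scanned > 0);
	return (unsigned long)(reclaimed * 100ULL / scanned);
}

/* Pages scanned per second of total elapsed test time
 * (assumed to be how the "velocity" rows are derived). */
static double scan_velocity(unsigned long long scanned, double elapsed_secs)
{
	assert(elapsed_secs > 0.0);
	return (double)scanned / elapsed_secs;
}

/* Share of pages marked for immediate reclaim that had to be rescued,
 * expressed in percent. */
static double rescued_pct(unsigned long long rescued,
                          unsigned long long immediate)
{
	assert(immediate > 0);
	return (double)rescued * 100.0 / (double)immediate;
}
```

For example, reclaim_efficiency_pct(4911036, 7112868) uses the rc5-vanilla
kswapd counters and scan_velocity(3024951463, 11425.90) uses its direct
scan count and total elapsed time.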
writebackCPDeviceext4
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
+/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
+/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
+/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
+/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
+/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
+/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26
Similar test but the USB stick is using ext4 instead of vfat. As
ext4 does not use writepage for migration, the large stalls due to
compaction when THP is enabled are not observed. Still, isolating
PageReclaim pages on their own list helped completion time largely
by reducing the number of pages scanned by direct reclaim although
time spent in congestion_wait could also be a factor.
Again, Andrea's series had far higher success rates for THP allocation
at the cost of elapsed time. I didn't look too closely but a quick
look at the vmstat figures tells me kswapd reclaimed 8 times more pages
than the patch series and direct reclaim reclaimed roughly three times
as many pages. It follows that if memory is aggressively reclaimed,
there will be more available for THP.
writebackCPFilevfat
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
System Time 1.76 ( 0.00%) 29.10 (-1555.52%) 46.01 (-2517.18%) 4.79 ( -172.35%) 54.89 (-3022.53%)
+/- 0.14 ( 0.00%) 25.61 (-18185.17%) 2.15 (-1434.83%) 6.60 (-4610.03%) 9.75 (-6863.76%)
User Time 0.05 ( 0.00%) 0.07 ( -45.83%) 0.05 ( -4.17%) 0.06 ( -29.17%) 0.06 ( -16.67%)
+/- 0.02 ( 0.00%) 0.02 ( 20.11%) 0.02 ( -3.14%) 0.01 ( 31.58%) 0.01 ( 47.41%)
Elapsed Time 22520.79 ( 0.00%) 1082.85 ( 95.19%) 73.30 ( 99.67%) 32.43 ( 99.86%) 291.84 ( 98.70%)
+/- 7277.23 ( 0.00%) 706.29 ( 90.29%) 19.05 ( 99.74%) 17.05 ( 99.77%) 125.55 ( 98.27%)
THP Active 83.80 ( 0.00%) 12.80 ( 15.27%) 15.60 ( 18.62%) 13.00 ( 15.51%) 0.80 ( 0.95%)
+/- 66.81 ( 0.00%) 20.19 ( 30.22%) 5.92 ( 8.86%) 15.06 ( 22.54%) 1.17 ( 1.75%)
Fault Alloc 171.00 ( 0.00%) 67.80 ( 39.65%) 97.40 ( 56.96%) 125.60 ( 73.45%) 133.00 ( 77.78%)
+/- 82.91 ( 0.00%) 30.69 ( 37.02%) 53.91 ( 65.02%) 55.05 ( 66.40%) 21.19 ( 25.56%)
Fault Fallback 832.00 ( 0.00%) 935.20 ( -12.40%) 906.00 ( -8.89%) 877.40 ( -5.46%) 870.20 ( -4.59%)
+/- 82.91 ( 0.00%) 30.69 ( 62.98%) 54.01 ( 34.86%) 55.05 ( 33.60%) 20.91 ( 74.78%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 7229.81 928.42 704.52 80.68 1330.76
Total Elapsed Time (seconds) 112849.04 5618.69 571.11 360.54 1664.28
In this case, the test is reading/writing only from filesystems but as
it's vfat, it's slow due to calling writepage during compaction. Little
to observe really - the time to complete the test goes way down
with the series applied and THP allocation success rates go up in
comparison to 3.2-rc5. The success rates are lower than 3.1.0 but
the elapsed time for that kernel is abysmal so it is not really a
sensible comparison.
As before, Andrea's series allocates more THPs at the cost of overall
performance.
writebackCPFileext4
3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
+/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
+/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
+/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
+/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
+/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
+/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26
Same type of story - elapsed times go down. In this case, allocation
success rates are roughly the same. As before, Andrea's has higher
success rates but takes a lot longer.
Overall the series does reduce latencies and while the tests are
inherently racy as alloc competes with the cp processes, the variability
was included. The THP allocation rates are not as high as they could
be but that is because we would have to be more aggressive about
reclaim and compaction impacting overall performance.
This patch:
Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
noted that compaction does not migrate dirty or writeback pages and that
it was meaningless to pick the page and re-add it to the LRU list.
What was missed during review is that asynchronous migration moves dirty
pages if their ->migratepage callback is migrate_page() because these can
be moved without blocking. This potentially impacted hugepage allocation
success rates by a factor depending on how many dirty pages are in the
system.
This patch partially reverts 39deaf85 to allow migration to isolate dirty
pages again. This increases how much compaction disrupts the LRU but that
is addressed later in the series.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0dabec93de633a87adfbbe1d800a4c56cd19d73b upstream.
Stable note: Not tracked in Bugzilla. This patch makes later patches
easier to apply but has no other impact.
unmap_and_move() is one big messy function. Clean it up.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f80c0673610e36ae29d63e3297175e22f70dde5f upstream.
Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU list
leading to poor reclaim decisions which has a variable
performance impact.
In the __zone_reclaim case, we don't want to shrink mapped pages. Nonetheless,
we have isolated mapped pages and re-added them to the head of the LRU. This is
unnecessary CPU overhead and causes LRU churning.
Of course, a page might be mapped when we isolate it but no longer mapped when
we try to migrate it, so it could have been migrated. But the race is rare and,
even if it happens, it's no big deal.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 39deaf8585152f1a35c1676d3d7dc6ae0fb65967 upstream.
Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU
list leading to poor reclaim decisions which has a variable
performance impact.
In async mode, compaction doesn't migrate dirty or writeback pages. So,
it's meaningless to pick the page and re-add it to lru list.
Of course, a page might be dirty or under writeback when we isolate it in
compaction but be neither by the time we try to migrate it, so it could
have been migrated. But that is very unlikely as the isolate and migration
cycle is much faster than writeout.
So this patch reduces CPU overhead and prevents unnecessary LRU churning.
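A minimal sketch of the filtering idea described above. The flag name
SKETCH_ISOLATE_CLEAN, its value, struct page_sketch and may_isolate() are
all hypothetical names used for illustration; this is not the kernel's
isolation code.

```c
#include <assert.h>

typedef unsigned int isolate_mode_t;

/* Hypothetical flag: the caller only wants pages it can move without
 * waiting on I/O, i.e. neither dirty nor under writeback. */
#define SKETCH_ISOLATE_CLEAN ((isolate_mode_t)0x4)

/* Stand-in for the two page state bits the filter cares about. */
struct page_sketch {
	int dirty;
	int writeback;
};

/* Return 1 if the page may be isolated under the given mode, 0 if it
 * should stay on the LRU so it is not picked up and put straight back. */
static int may_isolate(const struct page_sketch *page, isolate_mode_t mode)
{
	assert(page);
	if ((mode & SKETCH_ISOLATE_CLEAN) && (page->dirty || page->writeback))
		return 0;
	return 1;
}
```

Rejecting the page at isolation time is what saves the isolate and re-add
round trip that the changelog describes as LRU churning.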
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4356f21d09283dc6d39a6f7287a65ddab61e2808 upstream.
Stable note: Not tracked in Bugzilla. This patch makes later patches
easier to apply but has no other impact.
Replace the ISOLATE_XXX macros with a bitwise isolate_mode_t type. Normally,
macros aren't recommended as they are type-unsafe and make debugging harder as
the symbols cannot be passed through to the debugger.
Quote from Johannes
" Hmm, it would probably be cleaner to fully convert the isolation mode
into independent flags. INACTIVE, ACTIVE, BOTH is currently a
tri-state among flags, which is a bit ugly."
This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h
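As an illustration of the conversion Johannes suggests, the tri-state can
be expressed as two independent flags, with "both" being their bitwise OR.
The numeric values and the helper mode_accepts() below are assumptions made
for this sketch, not taken from the patch.

```c
#include <assert.h>

/* A dedicated type makes the intent visible in prototypes. */
typedef unsigned int isolate_mode_t;

/* Illustrative values: one bit per LRU state. */
#define ISOLATE_INACTIVE ((isolate_mode_t)0x1)
#define ISOLATE_ACTIVE   ((isolate_mode_t)0x2)

/* Hypothetical helper: does the requested mode accept a page in the
 * given state? "Both" is simply ISOLATE_INACTIVE | ISOLATE_ACTIVE. */
static int mode_accepts(isolate_mode_t mode, int page_is_active)
{
	isolate_mode_t wanted = page_is_active ? ISOLATE_ACTIVE
					       : ISOLATE_INACTIVE;

	assert(mode & (ISOLATE_INACTIVE | ISOLATE_ACTIVE));
	return (mode & wanted) != 0;
}
```

With independent bits, further filters can be added as extra flags without
touching the existing callers.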
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b9e84ac1536d35aee03b2601f19694949f0bd506 upstream.
Stable note: Not tracked in Bugzilla. This patch makes later patches
easier to apply but has no other impact.
acct_isolated of compaction uses page_lru_base_type which returns only
base type of LRU list so it never returns LRU_ACTIVE_ANON or
LRU_ACTIVE_FILE. In addition, cc->nr_[anon|file] is used only in
acct_isolated so it doesn't need fields in compact_control.
This patch removes the fields from compact_control and clarifies the function
of acct_isolated, which counts the number of anon|file pages isolated.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e0c23279c9f800c403f37511484d9014ac83adec upstream.
Stable note: Not tracked on Bugzilla. THP and compaction were found to
aggressively reclaim pages and stall systems under different
situations that were addressed piecemeal over time.
If compaction can proceed, shrink_zones() stops doing any work but its
callers still call shrink_slab() which raises the priority and potentially
sleeps. This is unnecessary and wasteful so this patch aborts direct
reclaim/compaction entirely if compaction can proceed.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e0887c19b2daa140f20ca8104bdc5740f39dbb86 upstream.
Stable note: Not tracked on Bugzilla. THP and compaction were found to
aggressively reclaim pages and stall systems under different
situations that were addressed piecemeal over time. Paragraph
3 of this changelog is the motivation for this patch.
When suffering from memory fragmentation due to unfreeable pages, THP page
faults will repeatedly try to compact memory. Due to the unfreeable
pages, compaction fails.
Needless to say, at that point page reclaim also fails to create free
contiguous 2MB areas. However, that doesn't stop the current code from
trying, over and over again, and freeing a minimum of 4MB (2UL <<
sc->order pages) at every single invocation.
This resulted in my 12GB system having 2-3GB free memory, a corresponding
amount of used swap and very sluggish response times.
This can be avoided by having the direct reclaim code not reclaim from
zones that already have plenty of free memory available for compaction.
If compaction still fails due to unmovable memory, doing additional
reclaim will only hurt the system, not help.
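The 4MB figure can be checked with a little arithmetic, assuming a THP
order of 9 and 4KB base pages (the usual x86-64 values, not stated in the
changelog). The "enough free memory" test below is a simplified,
hypothetical stand-in for the check the patch adds, not the kernel code.

```c
#include <assert.h>

/* Minimum number of pages reclaim aims to free for a request of the
 * given order, as quoted in the changelog: 2UL << order. */
static unsigned long reclaim_target_pages(unsigned int order)
{
	assert(order < 30);
	return 2UL << order;
}

/* Convert a page count to KB for a given base page size in KB. */
static unsigned long pages_to_kb(unsigned long pages, unsigned long page_kb)
{
	return pages * page_kb;
}

/* Hypothetical check: a zone already has room for compaction if its free
 * pages cover the watermark plus the reclaim target, so direct reclaim
 * can skip it. */
static int zone_has_room_for_compaction(unsigned long free_pages,
					unsigned long watermark_pages,
					unsigned int order)
{
	return free_pages >= watermark_pages + reclaim_target_pages(order);
}
```

With order 9, reclaim_target_pages() gives 1024 pages, which is 4096KB with
4KB pages and matches the 4MB quoted above.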
[jweiner@redhat.com: change comment to explain the order check]
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3567b59aa80ac4417002bf58e35dce5c777d4164 upstream.
Stable note: Not tracked in Bugzilla. This patch reduces excessive
reclaim of slab objects reducing the amount of information that
has to be brought back in from disk. The third and fourth paragraphs
of this changelog describe the impact.
When a shrinker returns -1 to shrink_slab() to indicate it cannot do
any work given the current memory reclaim requirements, it adds the
entire total_scan count to shrinker->nr. The idea behind this is that
when the shrinker is next called and can do work, it will do the work
of the previously aborted shrinker call as well.
However, if a filesystem is doing lots of allocation with GFP_NOFS
set, then we get many, many more aborts from the shrinkers than we
do successful calls. The result is that shrinker->nr winds up to
its maximum permissible value (twice the current cache size) and
then when the next shrinker call that can do work is issued, it
has enough scan count built up to free the entire cache twice over.
This manifests itself in the cache going from full to empty in a
matter of seconds, even when only a small part of the cache is
needed to be emptied to free sufficient memory.
Under metadata intensive workloads on ext4 and XFS, I'm seeing the
VFS caches increase memory consumption up to 75% of memory (no page
cache pressure) over a period of 30-60s, and then the shrinker
empties them down to zero in the space of 2-3s. This cycle repeats
over and over again, with the shrinker completely trashing the inode
and dentry caches every minute or so the workload continues.
This behaviour was made obvious by the shrink_slab tracepoints added
earlier in the series, and made worse by the patch that corrected
the concurrent accounting of shrinker->nr.
To avoid this problem, stop repeated small increments of the total
scan value from winding shrinker->nr up to a value that can cause
the entire cache to be freed. We still need to allow it to wind up,
so use the delta as the "large scan" threshold check - if the delta
is more than a quarter of the entire cache size, then it is a large
scan and allowed to cause lots of windup because we are clearly
needing to free lots of memory.
If it isn't a large scan then limit the total scan to half the size
of the cache so that windup never increases to consume the whole
cache. Reducing the total scan limit further does not allow enough
wind-up to maintain the current levels of performance, whilst a
higher threshold does not prevent the windup from freeing the entire
cache under sustained workloads.
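The thresholds described above can be sketched as a small clamping
function. The name limit_total_scan() is hypothetical; the parameter
names total_scan, delta and max_pass (the current cache size) are chosen
for this sketch. This illustrates the rule as the changelog states it, not
the code in shrink_slab().

```c
#include <assert.h>

/* Clamp the accumulated scan count.
 *   total_scan: deferred work plus this call's delta
 *   delta:      scan count requested by this call alone
 *   max_pass:   current number of objects in the cache
 */
static long limit_total_scan(long total_scan, long delta, long max_pass)
{
	assert(max_pass >= 0);

	/* Not a large scan: never let windup exceed half of the cache. */
	if (delta < max_pass / 4 && total_scan > max_pass / 2)
		total_scan = max_pass / 2;

	/* In all cases, cap at twice the cache size. */
	if (total_scan > max_pass * 2)
		total_scan = max_pass * 2;

	return total_scan;
}
```

For a cache of 1000 objects, a small delta of 100 with 1500 accumulated is
cut to 500, while a large delta of 300 keeps the full 1500.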
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit acf92b485cccf028177f46918e045c0c4e80ee10 upstream.
Stable note: Not tracked in Bugzilla. This patch reduces excessive
reclaim of slab objects reducing the amount of information
that has to be brought back in from disk.
shrink_slab() allows shrinkers to be called in parallel so the
struct shrinker can be updated concurrently. It does not provide any
exclusion for such updates, so we can get the shrinker->nr value
increasing or decreasing incorrectly.
As a result, when a shrinker repeatedly returns a value of -1 (e.g.
a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
sometimes updating with the scan count that wasn't used, sometimes
losing it altogether. Worse is when a shrinker does work and that
update is lost due to racy updates, which means the shrinker will do
the work again!
Fix this by making the total_scan calculations independent of
shrinker->nr, and making the shrinker->nr updates atomic w.r.t.
other updates via cmpxchg loops.
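A userspace sketch of the cmpxchg-loop pattern, written with C11 atomics
rather than the kernel's cmpxchg(). The struct and function names are
hypothetical and only illustrate the idea of taking the deferred count and
returning unused work without losing concurrent updates.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for the deferred-work counter kept in struct shrinker. */
struct shrinker_sketch {
	atomic_long nr;
};

/* Atomically take the whole deferred count, leaving zero behind. */
static long take_deferred(struct shrinker_sketch *s)
{
	long nr;

	assert(s);
	nr = atomic_load(&s->nr);
	/* On failure, nr is reloaded with the current value and we retry. */
	while (!atomic_compare_exchange_weak(&s->nr, &nr, 0))
		;
	return nr;
}

/* Atomically add unused scan count back; returns the resulting value. */
static long return_unused(struct shrinker_sketch *s, long unused)
{
	long nr, new_nr;

	assert(s);
	if (unused <= 0)
		return atomic_load(&s->nr);

	nr = atomic_load(&s->nr);
	do {
		new_nr = nr + unused;
	} while (!atomic_compare_exchange_weak(&s->nr, &nr, new_nr));
	return new_nr;
}
```

Because each caller works on a private copy of the count between the two
operations, a concurrent caller can neither see a half-updated value nor
overwrite work that was added in the meantime.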
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 095760730c1047c69159ce88021a7fa3833502c8 upstream.
Stable note: This patch makes later patches easier to apply but otherwise
has little to justify it. It is a diagnostic patch that was part
of a series addressing excessive slab shrinking after GFP_NOFS
failures. There is detailed information on the series' motivation
at https://lkml.org/lkml/2011/6/2/42 .
It is impossible to understand what the shrinkers are actually doing
without instrumenting the code, so add some tracepoints to allow
insight to be gained.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
commit 439423f6894aa0dec22187526827456f5004baed upstream.
Stable note: Not tracked in Bugzilla. kswapd is responsible for clearing
ZONE_CONGESTED after it balances a zone and this patch fixes a bug
where that was failing to happen. Without this patch, processes
can stall in wait_iff_congested unnecessarily. For users, this can
look like an interactivity stall but some workloads would see it
as a sudden drop in throughput.
ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any
task. It's possible ZONE_CONGESTED isn't cleared in some cases:
1. the zone is already balanced just entering balance_pgdat() for
order-0 because concurrent tasks free memory. In this case, later
check will skip the zone as it's balanced so the flag isn't cleared.
2. high order balance falls back to order-0. quote from Mel: At the
end of balance_pgdat(), kswapd uses the following logic;
If reclaiming at high order {
for each zone {
if all_unreclaimable
skip
if watermark is not met
order = 0
loop again
/* watermark is met */
clear congested
}
}
i.e. it clears ZONE_CONGESTED if the zone is balanced. if not,
it restarts balancing at order-0. However, if the higher zones are
balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as
that only happens after a zone is shrunk. This can mean that
wait_iff_congested() stalls unnecessarily.
This patch makes kswapd clear ZONE_CONGESTED during its initial
highmem->dma scan for zones that are already balanced.
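The fix can be pictured with a simplified model of the initial scan. The
zone structure, the watermark test and the function names below are
hypothetical stand-ins for illustration only; the real balance_pgdat()
logic is considerably more involved.

```c
#include <assert.h>

/* Minimal stand-in for the zone state that matters here. */
struct zone_sketch {
	unsigned long free_pages;
	unsigned long high_wmark;
	int congested;		/* models the ZONE_CONGESTED flag */
};

static int zone_balanced(const struct zone_sketch *z)
{
	return z->free_pages >= z->high_wmark;
}

/* Walk from the highest zone down, as in the highmem->dma scan. Zones
 * that are already balanced get their congested flag cleared right away.
 * Returns the index of the first unbalanced zone found, or -1 if every
 * zone is balanced. */
static int initial_scan(struct zone_sketch *zones, int nr_zones)
{
	int i;

	assert(zones && nr_zones > 0);
	for (i = nr_zones - 1; i >= 0; i--) {
		if (!zone_balanced(&zones[i]))
			return i;
		zones[i].congested = 0;
	}
	return -1;
}
```

In this model the balanced higher zones no longer depend on being shrunk
later to have the flag cleared, which is the case the changelog says was
being missed.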
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a4d3e9e76337059406fcf3ead288c0df22a790e9 upstream.
Stable note: Not tracked in Bugzilla. This patch augments an earlier commit
that avoids scanning priority being artificially raised. The older
fix was particularly important for small memcgs to avoid calling
wait_iff_congested() unnecessarily.
Without swap, anonymous pages are not scanned. As such, they should not
count when considering force-scanning a small target if there is no swap.
Otherwise, targets are not force-scanned even when their effective scan
number is zero and the other conditions--kswapd/memcg--apply.
This fixes 246e87a93934 ("memcg: fix get_scan_count() for small
targets").
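A simplified sketch of the decision, assuming a model in which the scan target is the scannable page count shifted right by the priority. The function names and the single kswapd_or_memcg flag are hypothetical; the real logic lives in get_scan_count() and differs in detail.

```c
#include <stdbool.h>

/* Anonymous pages only count as scannable when there is swap to put them on. */
static unsigned long scannable_pages(unsigned long anon, unsigned long file,
                                     bool have_swap)
{
    return have_swap ? anon + file : file;
}

/*
 * Force a scan when the effective target rounds down to zero and the
 * caller is kswapd or a memcg reclaimer (collapsed into one flag here).
 */
static bool want_force_scan(unsigned long anon, unsigned long file,
                            bool have_swap, int priority,
                            bool kswapd_or_memcg)
{
    unsigned long target = scannable_pages(anon, file, have_swap) >> priority;

    return kswapd_or_memcg && target == 0;
}
```

With many anonymous pages, few file pages and no swap, counting the anonymous pages would hide the fact that nothing can actually be scanned.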
[akpm@linux-foundation.org: fix comment]
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 938929f14cb595f43cd1a4e63e22d36cab1e4a1f upstream.
Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=726210 .
Large machines with 1TB or more of RAM take a long time to boot
without this patch and may spew out soft lockup warnings.
When min_free_kbytes is updated, some pageblocks are marked
MIGRATE_RESERVE. Ordinarily, this work is unnoticeable as it happens early
in boot but on large machines with 1TB of memory, this has been reported
to delay boot times, probably due to the NUMA distances involved.
The bulk of the work is due to calling pageblock_is_reserved() an
unnecessary number of times and accessing far more struct page metadata
than is necessary. This patch significantly reduces the amount of work
done by setup_zone_migrate_reserve() improving boot times on 1TB machines.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a1cb2c60ddc98ff4e5246f410558805401ceee67 upstream.
Stable note: Not tracked on Bugzilla. This patch is known to make a big
difference to tmpfs performance on larger machines.
This was found to adversely affect tmpfs I/O performance.
Tests run on a 640 cpu UV system.
With 120 threads doing parallel writes, each to different tmpfs mounts:
No patch: ~300 MB/sec
With vm_stat alignment: ~430 MB/sec
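For readers unfamiliar with the technique, a small userspace sketch of cache-line aligning a hot counter array with C11 alignas. The 64-byte line size, the array length and the names are assumptions for illustration only; the kernel uses its own alignment annotations.

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE_MODEL 64     /* assumed cache line size in bytes */
#define NR_ITEMS_MODEL   32     /* hypothetical number of counters */

/* The array starts on its own cache line instead of sharing one with
 * whatever unrelated data the linker happens to place before it. */
static alignas(CACHE_LINE_MODEL) long vm_stat_model[NR_ITEMS_MODEL];

static int starts_on_cache_line(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE_MODEL) == 0;
}
```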
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
commit 1c7e7f6c0703d03af6bcd5ccc11fc15d23e5ecbe upstream.
Offlining memory may block forever, waiting for kswapd() to wake up
because kswapd() does not check the event kthread->should_stop before
sleeping.
The proper pattern, from Documentation/memory-barriers.txt, is:
--- waker ---
event_indicated = 1;
wake_up_process(event_daemon);
--- sleeper ---
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (event_indicated)
break;
schedule();
}
set_current_state() may be wrapped by:
prepare_to_wait();
In the kswapd() case, event_indicated is kthread->should_stop.
=== offlining memory (waker) ===
kswapd_stop()
kthread_stop()
kthread->should_stop = 1
wake_up_process()
wait_for_completion()
=== kswapd_try_to_sleep (sleeper) ===
kswapd_try_to_sleep()
prepare_to_wait()
.
.
schedule()
.
.
finish_wait()
The schedule() needs to be protected by a test of kthread->should_stop,
which is wrapped by kthread_should_stop().
Reproducer:
Do heavy file I/O in background.
Do a memory offline/online in a tight loop
Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9ab4233dd08036fe34a89c7dc6f47a8bf2eb29eb upstream.
Otherwise the code races with munmap (causing a use-after-free
of the vma) or with close (causing a use-after-free of the struct
file).
The bug was introduced by commit 90ed52ebe481 ("[PATCH] holepunch: fix
mmap_sem i_mutex deadlock")
[bwh: Backported to 3.2:
- Adjust context
- madvise_remove() calls vmtruncate_range(), not do_fallocate()]
[luto: Backported to 3.0: Adjust context]
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4bf2bba3750f10aa9e62e6949bc7e8329990f01b upstream.
If page migration cannot charge the temporary page to the memcg,
migrate_pages() will return -ENOMEM. This isn't considered in memory
compaction however, and the loop continues to iterate over all
pageblocks trying to isolate and migrate pages. If a small number of
very large memcgs happen to be oom, however, these attempts will mostly
be futile leading to an enormous amount of cpu consumption due to the
page migration failures.
This patch will short circuit and fail memory compaction if
migrate_pages() returns -ENOMEM. COMPACT_PARTIAL is returned in case
some migrations were successful so that the page allocator will retry.
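The control flow can be sketched in userspace as follows. The enum, the callback type and the two sample callbacks are hypothetical; only the idea of stopping on -ENOMEM and reporting a partial result comes from the description above.

```c
#include <errno.h>

enum compact_result_model { COMPACT_PARTIAL, COMPACT_COMPLETE };

/* Hypothetical stand-in for migrating the pages of one pageblock. */
typedef int (*migrate_fn)(int block);

static int migrate_always_ok(int block)
{
    (void)block;
    return 0;
}

static int migrate_enomem_at_2(int block)
{
    return block == 2 ? -ENOMEM : 0;
}

/*
 * Iterate over pageblocks, but give up as soon as migration reports
 * -ENOMEM instead of burning CPU on attempts that will mostly fail.
 */
static enum compact_result_model compact_blocks(int nr_blocks,
                                                migrate_fn migrate,
                                                int *attempted)
{
    *attempted = 0;
    for (int block = 0; block < nr_blocks; block++) {
        int err = migrate(block);

        (*attempted)++;
        if (err == -ENOMEM)
            return COMPACT_PARTIAL;     /* let the allocator retry */
    }
    return COMPACT_COMPLETE;
}
```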
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d8adde17e5f858427504725218c56aef90e90fc7 upstream.
kswapd_stop() is called to destroy the kswapd work thread when all memory
of a NUMA node has been offlined. But kswapd_stop() only terminates the
work thread without resetting NODE_DATA(nid)->kswapd to NULL. The stale
pointer will prevent kswapd_run() from creating a new work thread when
adding memory to the memory-less NUMA node again. Eventually the stale
pointer may cause invalid memory access.
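A minimal userspace model of the stale-pointer problem, with a heap allocation standing in for the kswapd task. struct node_model and both functions are hypothetical names; the point is only that the stop path must reset the pointer so the run path can create a new thread later.

```c
#include <stdlib.h>

struct node_model {
    int *kswapd;    /* stands in for NODE_DATA(nid)->kswapd */
    int starts;     /* how many times a "thread" was created */
};

static int kswapd_run_model(struct node_model *n)
{
    if (n->kswapd)
        return 0;   /* a non-NULL pointer means "already running" */
    n->kswapd = malloc(sizeof(*n->kswapd));
    if (!n->kswapd)
        return -1;
    *n->kswapd = 1;
    n->starts++;
    return 0;
}

static void kswapd_stop_model(struct node_model *n)
{
    if (n->kswapd) {
        free(n->kswapd);
        n->kswapd = NULL;   /* without this, run would see a stale pointer */
    }
}
```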
An example stack dump is below. It's reproduced with 2.6.32, but latest
kernel has the same issue.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81051a94>] exit_creds+0x12/0x78
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/memory/memory391/state
CPU 11
Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
RIP: 0010:exit_creds+0x12/0x78
RSP: 0018:ffff8806044f1d78 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
FS: 00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
Stack:
ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
Call Trace:
__put_task_struct+0x5d/0x97
kthread_stop+0x50/0x58
offline_pages+0x324/0x3da
memory_block_change_state+0x179/0x1db
store_mem_state+0x9e/0xbb
sysfs_write_file+0xd0/0x107
vfs_write+0xad/0x169
sys_write+0x45/0x6e
system_call_fastpath+0x16/0x1b
Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
RIP exit_creds+0x12/0x78
RSP <ffff8806044f1d78>
CR2: 0000000000000000
[akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c50ac050811d6485616a193eb0f37bfbd191cc89 and
4523e1458566a0e8ecfaff90f380dd23acc44d27 upstream.
When called for anonymous (non-shared) mappings, hugetlb_reserve_pages()
does a resv_map_alloc(). It depends on code in hugetlbfs's
vm_ops->close() to release that allocation.
However, in the mmap() failure path, we do a plain unmap_region() without
the remove_vma() which actually calls vm_ops->close().
This is a decent fix. This leak could get reintroduced if new code (say,
after hugetlb_reserve_pages() in hugetlbfs_file_mmap()) decides to return
an error. But, I think it would have to unroll the reservation anyway.
Christoph's test case:
http://marc.info/?l=linux-mm&m=133728900729735
This patch applies to 3.4 and later. A version for earlier kernels is at
https://lkml.org/lkml/2012/5/22/418.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Christoph Lameter <cl@linux.com>
Tested-by: Christoph Lameter <cl@linux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dbda591d920b4c7692725b13e3f68ecb251e9080 upstream.
The transfer of ->flags causes some of the static mapping virtual
addresses to be prematurely freed (before the mapping is removed) because
VM_LAZY_FREE gets "set" if tmp->flags has VM_IOREMAP set. This might
cause subsequent vmalloc/ioremap calls to fail because it might allocate
one of the freed virtual address ranges that aren't unmapped.
va->flags has different types of flags from tmp->flags. If a region with
VM_IOREMAP set is registered with vm_area_add_early(), it will be removed
by __purge_vmap_area_lazy().
Fix vmalloc_init() to correctly initialize vmap_area for the given
vm_struct.
Also initialise va->vm. If it is not set, find_vm_area() for the early
vm regions will always fail.
Signed-off-by: KyongHo Cho <pullip.cho@samsung.com>
Cc: "Olav Haugan" <ohaugan@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit db1aecafef58b5dda39c4228debe2c845e4a27ab upstream.
vmap_area->private is a void *, but the field is not used for various
purposes; it only ever holds a vm_struct. So change it to a vm_struct *
with a matching name to improve readability and type checking.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e48982734ea0500d1eba4f9d96195acc5406cad6 upstream.
Commit 645747462435 ("vmscan: detect mapped file pages used only once")
made mapped pages have another round in inactive list because they might
be just short lived and so we could consider them again next time. This
heuristic helps to reduce pressure on the active list with streaming
IO workloads.
This patch fixes a regression introduced by this commit for heavy shmem
based workloads because unlike Anon pages, which are excluded from this
heuristic because they are usually long lived, shmem pages are handled
as a regular page cache.
This doesn't work quite well, unfortunately, if the workload is mostly
backed by shmem (in memory database sitting on 80% of memory) with a
streaming IO in the background (backup - up to 20% of memory). Anon
inactive list is full of (dirty) shmem pages when watermarks are hit.
Shmem pages are kept in the inactive list (they are referenced) in the
first round and it is hard to reclaim anything else so we reach lower
scanning priorities very quickly which leads to an excessive swap out.
Let's fix this by excluding all swap backed pages (they tend to be long
lived wrt. the regular page cache anyway) from used-once heuristic and
rather activate them if they are referenced.
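A simplified model of the resulting decision, for illustration. The enum values mirror the names used in vmscan, but check_references_model() and its parameters are a hypothetical reduction; the real page_check_references() looks at more state.

```c
#include <stdbool.h>

enum pageref_model { PAGEREF_RECLAIM, PAGEREF_KEEP, PAGEREF_ACTIVATE };

static enum pageref_model check_references_model(int referenced_ptes,
                                                 bool referenced_page,
                                                 bool swap_backed)
{
    if (referenced_ptes) {
        /* Swap backed pages skip the used-once heuristic entirely. */
        if (swap_backed)
            return PAGEREF_ACTIVATE;
        /* Referenced before, or mapped by several ptes: likely hot. */
        if (referenced_page || referenced_ptes > 1)
            return PAGEREF_ACTIVATE;
        /* Possibly used once: another round on the inactive list. */
        return PAGEREF_KEEP;
    }
    return PAGEREF_RECLAIM;
}
```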
The customer's workload is shmem backed database (80% of RAM) and they
are measuring transactions/s with an IO in the background (20%).
Transactions touch more or less random rows in the table. The
transaction rate fell by a factor of 3 (in the worst case) because of
commit 64574746. This patch restores the previous numbers.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 05f144a0d5c2207a0349348127f996e104ad7404 upstream.
Dave Jones' system call fuzz testing tool "trinity" triggered the
following bug error with slab debugging enabled
=============================================================================
BUG numa_policy (Not tainted): Poison overwritten
-----------------------------------------------------------------------------
INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
__slab_alloc+0x3d3/0x445
kmem_cache_alloc+0x29d/0x2b0
mpol_new+0xa3/0x140
sys_mbind+0x142/0x620
system_call_fastpath+0x16/0x1b
INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
__slab_free+0x2e/0x1de
kmem_cache_free+0x25a/0x260
__mpol_put+0x27/0x30
remove_vma+0x68/0x90
exit_mmap+0x118/0x140
mmput+0x73/0x110
exit_mm+0x108/0x130
do_exit+0x162/0xb90
do_group_exit+0x4f/0xc0
sys_exit_group+0x17/0x20
system_call_fastpath+0x16/0x1b
INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x (null) flags=0x20000000004080
INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0
This implied a reference counting bug and the problem happened during
mbind().
mbind() applies a new memory policy to a range and uses mbind_range() to
merge existing VMAs or split them as necessary. In the event of splits,
mpol_dup() will allocate a new struct mempolicy and maintain existing
reference counts whose rules are documented in
Documentation/vm/numa_memory_policy.txt .
The problem occurs with shared memory policies. The vm_op->set_policy
increments the reference count if necessary and split_vma() and
vma_merge() have already handled the existing reference counts.
However, policy_vma() screws it up by replacing an existing
vma->vm_policy with one that potentially has the wrong reference count
leading to a premature free. This patch removes the damage caused by
policy_vma().
With this patch applied Dave's trinity tool runs an mbind test for 5
minutes without error. /proc/slabinfo reported that there are no
numa_policy or shared_policy_node objects allocated after the test
completed and the shared memory region was deleted.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Jones <davej@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Stephen Wilson <wilsons@start.ca>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8c7577637ca31385e92769a77e2ab5b428e8b99c upstream.
When the last event is unregistered, there is no need to keep the spare
array anymore. So free it to avoid memory leak.
Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6bc2e853c6b46a6041980d58200ad9b0a73a60ff upstream.
Systems with 8 TBytes of memory or greater can hit a problem where only
the first 8 TB of memory shows up. This is due to "int i" being
smaller than "unsigned long start_aligned", causing the high bits to be
dropped.
The fix is to change `i' to unsigned long to match start_aligned
and end_aligned.
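The 8 TB boundary follows from simple arithmetic, sketched here under the assumptions of 4 KiB pages and a 32-bit int (both typical, neither stated above). The helper names are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT_MODEL 12     /* assumes 4 KiB pages */

static uint64_t bytes_to_pfn(uint64_t bytes)
{
    return bytes >> PAGE_SHIFT_MODEL;
}

/* A page frame number can only be held in "int i" if it is <= INT_MAX. */
static bool pfn_fits_in_int(uint64_t pfn)
{
    return pfn <= (uint64_t)INT_MAX;
}
```

Under those assumptions the page frame number at exactly 8 TB is 2^31, one more than the largest value a 32-bit int can hold.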
Thanks to Jack Steiner for assistance tracking this down.
Signed-off-by: Russ Anderson <rja@sgi.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4998a6c0edce7fae9c0a5463f6ec3fa585258ee7 upstream.
Commit 66aebce747eaf ("hugetlb: fix race condition in hugetlb_fault()")
added code to avoid a race condition by elevating the page refcount in
hugetlb_fault() while calling hugetlb_cow().
However, one code path in hugetlb_cow() includes an assertion that the
page count is 1, whereas it may now also have the value 2 in this path.
The consensus is that this BUG_ON has served its purpose, so rather than
extending it to cover both cases, we just remove it.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Hillf Danton <dhillf@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 42b64281453249dac52861f9b97d18552a7ec62b upstream.
pcpu_embed_first_chunk() allocates memory for each node, copies percpu
data and frees unused portions of it before proceeding to the next
group. This assumes that allocations for different nodes don't
overlap; however, depending on memory topology, the bootmem allocator
may end up allocating memory from a different node than the requested
one which may overlap with the portion freed from one of the previous
percpu areas. This leads to percpu groups for different nodes
overlapping which is a serious bug.
This patch separates out copy & partial free from the allocation loop
such that all allocations are complete before partial frees happen.
This also fixes overlapping frees which could happen on allocation
failure path - out_free_areas path frees whole groups but the groups
could have portions freed at that point.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Pavel V. Panteleev" <pp_84@mail.ru>
Tested-by: "Pavel V. Panteleev" <pp_84@mail.ru>
LKML-Reference: <E1SNhwY-0007ui-V7.pp_84-mail-ru@f220.mail.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit aca50bd3b4c4bb5528a1878158ba7abce41de534 upstream.
Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390
3.0.13: called from __set_page_dirty_nobuffers() when page_remove_rmap()
tries to transfer dirty flag from s390 storage key to struct page and
radix_tree.
That would be because of reclaim's shrink_page_list() calling
add_to_swap() on this page at the same time: first PageSwapCache is set
(causing page_mapping(page) to appear as &swapper_space), then
page->private set, then tree_lock taken, then page inserted into
radix_tree - so there's an interval before taking the lock when the
radix_tree slot is empty.
We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
before the SetPageSwapCache. But a better fix is simply to do what's
five years overdue: Ken Chen introduced __set_page_dirty_no_writeback()
(if !PageDirty TestSetPageDirty) for tmpfs to skip all the radix_tree
overhead, and swap is just the same - it ignores the radix_tree tag, and
does not participate in dirty page accounting, so should be using
__set_page_dirty_no_writeback() too.
s390 testing now confirms that this does indeed fix the problem.
Reported-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ken Chen <kenchen@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 66aebce747eaf9bc456bf1f1b217d8db843031d0 upstream.
The race is as follows:
Suppose a multi-threaded task forks a new process (on cpu A), thus
bumping up the ref count on all the pages. While the fork is occurring
(and thus we have marked all the PTEs as read-only), another thread in
the original process (on cpu B) tries to write to a huge page, taking an
access violation from the write-protect and calling hugetlb_cow(). Now,
suppose the fork() fails. It will undo the COW and decrement the ref
count on the pages, so the ref count on the huge page drops back to 1.
Meanwhile hugetlb_cow() also decrements the ref count by one on the
original page, since the original address space doesn't need it any
more, having copied a new page to replace the original page. This
leaves the ref count at zero, and when we call unlock_page(), we panic.
fork on CPU A                           fault on CPU B
=============                           ==============
...
down_write(&parent->mmap_sem);
down_write_nested(&child->mmap_sem);
...
while duplicating vmas
        if error
                break;
...
up_write(&child->mmap_sem);
up_write(&parent->mmap_sem);            ...
                                        down_read(&parent->mmap_sem);
                                        ...
                                        lock_page(page);
                                        handle COW
                                        page_mapcount(old_page) == 2
                                        alloc and prepare new_page
...
handle error
page_remove_rmap(page);
put_page(page);
...
                                        fold new_page into pte
                                        page_remove_rmap(page);
                                        put_page(page);
                                        ...
                        oops ==>        unlock_page(page);
                                        up_read(&parent->mmap_sem);
The solution is to take an extra reference to the page while we are
holding the lock on it.
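The reference counting in the interleaving above can be replayed deterministically. This is a plain counter model with a hypothetical function name, not the kernel's page refcount API.

```c
#include <stdbool.h>

/*
 * Replay the race and return the reference count at the moment
 * unlock_page() runs on CPU B. Zero means the page was already freed.
 */
static int refcount_at_unlock(bool extra_ref)
{
    int refcount = 2;   /* fork bumped the count: parent and child */

    if (extra_ref)
        refcount++;     /* fault path pins the page while it is locked */
    refcount--;         /* CPU A: fork error path undoes the COW */
    refcount--;         /* CPU B: hugetlb_cow() drops the old page */
    return refcount;    /* CPU B: unlock_page() happens here */
}
```

With the extra reference the count is still 1 at unlock time, and the fault path drops that last reference only after the page has been unlocked.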
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 66c4c35c6bc5a1a452b024cf0364635b28fd94e4 upstream.
sysfs_slab_add() calls various sysfs functions that actually may
end up in userspace doing all sorts of things.
Release the slub_lock after adding the kmem_cache structure to the list.
At that point the address of the kmem_cache is not known so we are
guaranteed exclusive access to the following modifications to the
kmem_cache structure.
If the sysfs_slab_add fails then reacquire the slub_lock to
remove the kmem_cache structure from the list.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f5bf18fa22f8c41a13eb8762c7373eb3a93a7333 upstream.
While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
Overcommit) on powerpc, we tripped the following:
kernel BUG at mm/bootmem.c:483!
cpu 0x0: Vector: 700 (Program Check) at [c000000000c03940]
pc: c000000000a62bd8: .alloc_bootmem_core+0x90/0x39c
lr: c000000000a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
sp: c000000000c03bc0
msr: 8000000000021032
current = 0xc000000000b0cce0
paca = 0xc000000001d80000
pid = 0, comm = swapper
kernel BUG at mm/bootmem.c:483!
enter ? for help
[c000000000c03c80] c000000000a64bcc
.sparse_early_usemaps_alloc_node+0x84/0x29c
[c000000000c03d50] c000000000a64f10 .sparse_init+0x12c/0x28c
[c000000000c03e20] c000000000a474f4 .setup_arch+0x20c/0x294
[c000000000c03ee0] c000000000a4079c .start_kernel+0xb4/0x460
[c000000000c03f90] c000000000009670 .start_here_common+0x1c/0x2c
This is
BUG_ON(limit && goal + size > limit);
and after some debugging, it seems that
goal = 0x7ffff000000
limit = 0x80000000000
and sparse_early_usemaps_alloc_node ->
sparse_early_usemaps_alloc_pgdat_section calls
return alloc_bootmem_section(usemap_size() * count, section_nr);
This is on a system with 8TB available via the AMS pool, and as a quirk
of AMS in firmware, all of that memory shows up in node 0. So, we end
up with an allocation that will fail the goal/limit constraints.
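To make the numbers concrete: with the goal and limit quoted above there are 0x1000000 bytes (16 MiB) of room below the limit, so any larger request trips the check. The helper names are hypothetical, and the actual size of the failing request is not given in this message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same condition as the BUG_ON() quoted above. */
static bool trips_bug_on(uint64_t goal, uint64_t size, uint64_t limit)
{
    return limit && goal + size > limit;
}

/* Bytes that can still be allocated at goal without crossing limit. */
static uint64_t room_below_limit(uint64_t goal, uint64_t limit)
{
    return limit - goal;
}
```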
In theory, we could "fall-back" to alloc_bootmem_node() in
sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
defined, we'll BUG_ON() instead. A simple solution appears to be to
unconditionally remove the limit condition in alloc_bootmem_section,
meaning allocations are allowed to cross section boundaries (necessary
for systems of this size).
Johannes Weiner pointed out that if alloc_bootmem_section() no longer
guarantees section-locality, we need check_usemap_section_nr() to print
possible cross-dependencies between node descriptors and the usemaps
allocated through it. That makes the two loops in
sparse_early_usemaps_alloc_node() identical, so re-factor the code a
bit.
[akpm@linux-foundation.org: code simplification]
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1a5a9906d4e8d1976b701f889d8f35d54b928f25 upstream.
In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem held in read mode. In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.
It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds). The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().
Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously. This is
probably why it wasn't common to run into this. For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.
Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).
The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value. Even if the real pmd is changing under the
value we hold on the stack, we don't care. If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).
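The snapshot idea can be sketched in userspace like this. The type, the bit layout and the function are hypothetical, a volatile read is used here where the patch uses a compiler barrier, and the "bad pmd" case is left out for brevity.

```c
typedef unsigned long pmd_model_t;

#define PMD_MODEL_PRESENT 0x1UL     /* hypothetical bit layout */
#define PMD_MODEL_HUGE    0x2UL

enum pmd_action { PMD_SKIP, PMD_WALK_PTES };

static enum pmd_action classify_pmd(const volatile pmd_model_t *pmdp)
{
    /* Read the entry exactly once; every check below uses this copy,
     * so a concurrent update cannot make the checks disagree. */
    pmd_model_t val = *pmdp;

    if (val == 0)
        return PMD_SKIP;            /* none: nothing mapped */
    if (val & PMD_MODEL_HUGE)
        return PMD_SKIP;            /* huge: not a page table, and not bad */
    return PMD_WALK_PTES;           /* regular page table underneath */
}
```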
All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd. The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds). I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).
if (pmd_trans_huge(*pmd)) {
if (next-addr != HPAGE_PMD_SIZE) {
VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
split_huge_page_pmd(vma->vm_mm, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
continue;
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))
Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.
The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.
====== start quote =======
mapcount 0 page_mapcount 1
kernel BUG at mm/huge_memory.c:1384!
At some point prior to the panic, a "bad pmd ..." message similar to the
following is logged on the console:
mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
the page's PMD table entry.
143 void pmd_clear_bad(pmd_t *pmd)
144 {
-> 145 pmd_ERROR(*pmd);
146 pmd_clear(pmd);
147 }
After the PMD table entry has been cleared, there is an inconsistency
between the actual number of PMD table entries that are mapping the page
and the page's map count (_mapcount field in struct page). When the page
is subsequently reclaimed, __split_huge_page() detects this inconsistency.
1381 if (mapcount != page_mapcount(page))
1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
1383 mapcount, page_mapcount(page));
-> 1384 BUG_ON(mapcount != page_mapcount(page));
The root cause of the problem is a race of two threads in a multithreaded
process. Thread B incurs a page fault on a virtual address that has never
been accessed (PMD entry is zero) while Thread A is executing an madvise()
system call on a virtual address within the same 2 MB (huge page) range.
                    virtual address space
                   .---------------------.
                   |                     |
                   |                     |
                 .-|---------------------|
                 | |                     |
                 | |                     |<-- B(fault)
                 | |                     |
           2 MB  | |/////////////////////|-.
           huge <  |/////////////////////|  > A(range)
           page  | |/////////////////////|-'
                 | |                     |
                 | |                     |
                 '-|---------------------|
                   |                     |
                   |                     |
                   '---------------------'
- Thread A is executing an madvise(..., MADV_DONTNEED) system call
on the virtual address range "A(range)" shown in the picture.
sys_madvise
// Acquire the semaphore in shared mode.
down_read(&current->mm->mmap_sem)
...
madvise_vma
switch (behavior)
case MADV_DONTNEED:
madvise_dontneed
zap_page_range
unmap_vmas
unmap_page_range
zap_pud_range
zap_pmd_range
//
// Assume that this huge page has never been accessed.
// I.e. content of the PMD entry is zero (not mapped).
//
if (pmd_trans_huge(*pmd)) {
// We don't get here due to the above assumption.
}
//
// Assume that Thread B incurred a page fault and
.---------> // sneaks in here as shown below.
| //
| if (pmd_none_or_clear_bad(pmd))
| {
| if (unlikely(pmd_bad(*pmd)))
| pmd_clear_bad
| {
| pmd_ERROR
| // Log "bad pmd ..." message here.
| pmd_clear
| // Clear the page's PMD entry.
| // Thread B incremented the map count
| // in page_add_new_anon_rmap(), but
| // now the page is no longer mapped
| // by a PMD entry (-> inconsistency).
| }
| }
|
v
- Thread B is handling a page fault on virtual address "B(fault)" shown
in the picture.
...
do_page_fault
__do_page_fault
// Acquire the semaphore in shared mode.
down_read_trylock(&mm->mmap_sem)
...
handle_mm_fault
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
// We get here due to the above assumption (PMD entry is zero).
do_huge_pmd_anonymous_page
alloc_hugepage_vma
// Allocate a new transparent huge page here.
...
__do_huge_pmd_anonymous_page
...
spin_lock(&mm->page_table_lock)
...
page_add_new_anon_rmap
// Here we increment the page's map count (starts at -1).
atomic_set(&page->_mapcount, 0)
set_pmd_at
// Here we set the page's PMD entry which will be cleared
// when Thread A calls pmd_clear_bad().
...
spin_unlock(&mm->page_table_lock)
The mmap_sem does not prevent the race because both threads are acquiring
it in shared mode (down_read). Thread B holds the page_table_lock while
the page's map count and PMD table entry are updated. However, Thread A
does not synchronize on that lock.
====== end quote =======
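The interleaving described in the quote can be replayed deterministically
with a small userspace model. This is only a sketch under stated
assumptions, not kernel code: struct model_state, the model_* helpers and
model_race() are hypothetical names, and model_pmd_bad() simply treats any
entry with the huge-page flag (bit 7) set as "bad", which is a
simplification of the real pmd_bad() check. The PMD value is the one from
the "bad pmd" log line above.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for kernel state; names are illustrative only. */
#define MODEL_PSE_BIT  UINT64_C(0x80)               /* huge-page flag, bit 7 */
#define MODEL_HUGE_PMD UINT64_C(0x80000000314000e7) /* from the log line */

struct model_state {
	uint64_t pmd;       /* the single PMD entry covering the 2 MB range */
	int mapcount_field; /* models page->_mapcount, which starts at -1 */
};

static int model_pmd_none(uint64_t pmd) { return pmd == 0; }

/* Simplification: a huge PMD does not look like a page table pointer. */
static int model_pmd_bad(uint64_t pmd) { return (pmd & MODEL_PSE_BIT) != 0; }

/* Thread B: fault path, done under page_table_lock in the kernel. */
static void thread_b_fault(struct model_state *s)
{
	s->mapcount_field = 0;   /* page_add_new_anon_rmap() */
	s->pmd = MODEL_HUGE_PMD; /* set_pmd_at() */
}

/* Thread A: pmd_none_or_clear_bad(), run without page_table_lock. */
static void thread_a_none_or_clear_bad(struct model_state *s)
{
	if (model_pmd_none(s->pmd))
		return;
	if (model_pmd_bad(s->pmd)) {
		printf("bad pmd (%016" PRIx64 ")\n", s->pmd);
		s->pmd = 0; /* pmd_clear() */
	}
}

/*
 * Replay one ordering. If b_sneaks_in is nonzero, Thread B faults after
 * Thread A's pmd_trans_huge() check but before pmd_none_or_clear_bad();
 * otherwise Thread B faults only after Thread A has finished.
 * Returns 1 if the mapping count and page_mapcount disagree, else 0.
 */
int model_race(int b_sneaks_in)
{
	struct model_state s = { .pmd = 0, .mapcount_field = -1 };
	int mapcount, page_mapcount;

	/* Thread A: pmd_trans_huge(*pmd) is false here, the entry is zero. */
	if (b_sneaks_in)
		thread_b_fault(&s);
	thread_a_none_or_clear_bad(&s);
	if (!b_sneaks_in)
		thread_b_fault(&s);

	mapcount = s.pmd != 0 ? 1 : 0;         /* PMD entries mapping the page */
	page_mapcount = s.mapcount_field + 1;  /* what page_mapcount() reports */
	if (mapcount != page_mapcount)
		printf("mapcount %d page_mapcount %d\n", mapcount, page_mapcount);
	return mapcount != page_mapcount;
}
```

In this model, model_race(1) reproduces the inconsistency that
__split_huge_page() trips over, while model_race(0) leaves both counts in
agreement.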
[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Jones <davej@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|