Age | Commit message (Collapse) | Author |
|
ZONE_WRITEBACK
After the patch "mm: vmscan: Flatten kswapd priority loop" was merged
the scanning priority of kswapd changed.
The priority now rises until it is scanning enough pages to meet the
high watermark. shrink_inactive_list sets ZONE_WRITEBACK if a number of
pages were encountered under writeback but this value is scaled based on
the priority. As kswapd frequently scans with a higher priority now it
is relatively easy to set ZONE_WRITEBACK. This patch removes the
scaling and treates writeback pages similar to how it treats unqueued
dirty pages and congested pages. The user-visible effect should be that
kswapd will writeback fewer pages from reclaim context.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Direct reclaim is not aborting to allow compaction to go ahead properly.
do_try_to_free_pages is told to abort reclaim which is happily ignores
and instead increases priority instead until it reaches 0 and starts
shrinking file/anon equally. This patch corrects the situation by
aborting reclaim when requested instead of raising priority.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Similar to __pagevec_lru_add, this patch removes the LRU parameter from
__lru_cache_add and lru_cache_add_lru as the caller does not control the
exact LRU the page gets added to. lru_cache_add_lru gets renamed to
lru_cache_add the name is silly without the lru parameter. With the
parameter removed, it is required that the caller indicate if they want
the page added to the active or inactive list by setting or clearing
PageActive respectively.
[akpm@linux-foundation.org: Suggested the patch]
[gang.chen@asianux.com: fix used-unintialized warning]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
Cc: Andrew Perepechko <anserper@ya.ru>
Cc: Robin Dong <sanbai@taobao.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
Cc: David Howells <dhowells@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Page reclaim keeps track of dirty and under writeback pages and uses it
to determine if wait_iff_congested() should stall or if kswapd should
begin writing back pages. This fails to account for buffer pages that
can be under writeback but not PageWriteback which is the case for
filesystems like ext3 ordered mode. Furthermore, PageDirty buffer pages
can have all the buffers clean and writepage does no IO so it should not
be accounted as congested.
This patch adds an address_space operation that filesystems may
optionally use to check if a page is really dirty or really under
writeback. An implementation is provided for for buffer_heads is added
and used for block operations and ext3 in ordered mode. By default the
page flags are obeyed.
Credit goes to Jan Kara for identifying that the page flags alone are
not sufficient for ext3 and sanity checking a number of ideas on how the
problem could be addressed.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently a zone will only be marked congested if the underlying BDI is
congested but if dirty pages are spread across zones it is possible that
an individual zone is full of dirty pages without being congested. The
impact is that zone gets scanned very quickly potentially reclaiming
really clean pages. This patch treats pages marked for immediate
reclaim as congested for the purposes of marking a zone ZONE_CONGESTED
and stalling in wait_iff_congested.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
shrink_inactive_list makes decisions on whether to stall based on the
number of dirty pages encountered. The wait_iff_congested() call in
shrink_page_list does no such thing and it's arbitrary.
This patch moves the decision on whether to set ZONE_CONGESTED and the
wait_iff_congested call into shrink_page_list. This keeps all the
decisions on whether to stall or not in the one place.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In shrink_page_list a decision may be made to stall and flag a zone as
ZONE_WRITEBACK so that if a large number of unqueued dirty pages are
encountered later then the reclaimer will stall. Set ZONE_WRITEBACK
before potentially going to sleep so it is noticed sooner.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit "mm: vmscan: Block kswapd if it is encountering pages under
writeback" blocks page reclaim if it encounters pages under writeback
marked for immediate reclaim. It blocks while pages are still isolated
from the LRU which is unnecessary. This patch defers the blocking until
after the isolated pages have been processed and tidies up some of the
comments.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
pages encountered
Further testing of the "Reduce system disruption due to kswapd"
discovered a few problems. First and foremost, it's possible for pages
under writeback to be freed which will lead to badness. Second, as
pages were not being swapped the file LRU was being scanned faster and
clean file pages were being reclaimed. In some cases this results in
increased read IO to re-read data from disk. Third, more pages were
being written from kswapd context which can adversly affect IO
performance. Lastly, it was observed that PageDirty pages are not
necessarily dirty on all filesystems (buffers can be clean while
PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g.
ext3). This disconnect confuses the reclaim stalling logic. This
follow-up series is aimed at these problems.
The tests were based on three kernels
vanilla: kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522 is mmotm as of 22nd May with "Reduce system disruption due to
kswapd" applied on top as per what should be in Andrew's tree
right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
The first test used memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests implement in MM
Tests. memcachetest benchmarks how many operations/second memcached can
service. It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress. The
expectation is that the IO should have little or no impact on
memcachetest which is running entirely in memory.
parallelio
3.9.0 3.9.0 3.9.0
vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v7r10
Ops memcachetest-0M 23117.00 ( 0.00%) 22780.00 ( -1.46%) 22763.00 ( -1.53%)
Ops memcachetest-715M 23774.00 ( 0.00%) 23299.00 ( -2.00%) 22934.00 ( -3.53%)
Ops memcachetest-2385M 4208.00 ( 0.00%) 24154.00 (474.00%) 23765.00 (464.76%)
Ops memcachetest-4055M 4104.00 ( 0.00%) 25130.00 (512.33%) 24614.00 (499.76%)
Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%) 6.00 ( 50.00%)
Ops io-duration-2385M 116.00 ( 0.00%) 21.00 ( 81.90%) 21.00 ( 81.90%)
Ops io-duration-4055M 160.00 ( 0.00%) 36.00 ( 77.50%) 35.00 ( 78.12%)
Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-715M 140138.00 ( 0.00%) 18.00 ( 99.99%) 18.00 ( 99.99%)
Ops swaptotal-2385M 385682.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-4055M 418029.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-715M 144.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-2385M 134227.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-4055M 125618.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops minorfaults-0M 1536429.00 ( 0.00%) 1531632.00 ( 0.31%) 1533541.00 ( 0.19%)
Ops minorfaults-715M 1786996.00 ( 0.00%) 1612148.00 ( 9.78%) 1608832.00 ( 9.97%)
Ops minorfaults-2385M 1757952.00 ( 0.00%) 1614874.00 ( 8.14%) 1613541.00 ( 8.21%)
Ops minorfaults-4055M 1774460.00 ( 0.00%) 1633400.00 ( 7.95%) 1630881.00 ( 8.09%)
Ops majorfaults-0M 1.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops majorfaults-715M 184.00 ( 0.00%) 167.00 ( 9.24%) 166.00 ( 9.78%)
Ops majorfaults-2385M 24444.00 ( 0.00%) 155.00 ( 99.37%) 93.00 ( 99.62%)
Ops majorfaults-4055M 21357.00 ( 0.00%) 147.00 ( 99.31%) 134.00 ( 99.37%)
memcachetest is the transactions/second reported by memcachetest. In
the vanilla kernel note that performance drops from around
23K/sec to just over 4K/second when there is 2385M of IO going
on in the background. With current mmotm, there is no collapse
in performance and with this follow-up series there is little
change.
swaptotal is the total amount of swap traffic. With mmotm and the follow-up
series, the total amount of swapping is much reduced.
3.9.0 3.9.0 3.9.0
vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults 11160152 10706748 10622316
Major Faults 46305 755 678
Swap Ins 260249 0 0
Swap Outs 683860 18 18
Direct pages scanned 0 678 2520
Kswapd pages scanned 6046108 8814900 1639279
Kswapd pages reclaimed 1081954 1172267 1094635
Direct pages reclaimed 0 566 2304
Kswapd efficiency 17% 13% 66%
Kswapd velocity 5217.560 7618.953 1414.879
Direct efficiency 100% 83% 91%
Direct velocity 0.000 0.586 2.175
Percentage direct scans 0% 0% 0%
Zone normal velocity 5105.086 6824.681 671.158
Zone dma32 velocity 112.473 794.858 745.896
Zone dma velocity 0.000 0.000 0.000
Page writes by reclaim 1929612.000 6861768.000 32821.000
Page writes file 1245752 6861750 32803
Page writes anon 683860 18 18
Page reclaim immediate 7484 40 239
Sector Reads 1130320 93996 86900
Sector Writes 13508052 10823500 11804436
Page rescued immediate 0 0 0
Slabs scanned 33536 27136 18560
Direct inode steals 0 0 0
Kswapd inode steals 8641 1035 0
Kswapd skipped wait 0 0 0
THP fault alloc 8 37 33
THP collapse alloc 508 552 515
THP splits 24 1 1
THP fault fallback 0 0 0
THP collapse fail 0 0 0
There are a number of observations to make here
1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
pages swapped were really unused anonymous pages. Related to that,
major faults are much reduced.
2. kswapd efficiency was impacted by the initial series but with these
follow-up patches, the efficiency is now at 66% indicating that far
fewer pages were skipped during scanning due to dirty or writeback
pages.
3. kswapd velocity is reduced indicating that fewer pages are being scanned
with the follow-up series as kswapd now stalls when the tail of the
LRU queue is full of unqueued dirty pages. The stall gives flushers a
chance to catch-up so kswapd can reclaim clean pages when it wakes
4. In light of Zlatko's recent reports about zone scanning imbalances,
mmtests now reports scanning velocity on a per-zone basis. With mainline,
you can see that the scanning activity is dominated by the Normal
zone with over 45 times more scanning in Normal than the DMA32 zone.
With the series currently in mmotm, the ratio is slightly better but it
is still the case that the bulk of scanning is in the highest zone. With
this follow-up series, the ratio of scanning between the Normal and
DMA32 zone is roughly equal.
5. As Dave Chinner observed, the current patches in mmotm increased the
number of pages written from kswapd context which is expected to adversly
impact IO performance. With the follow-up patches, far fewer pages are
written from kswapd context than the mainline kernel
6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
the follow-up series, there is less slab shrinking activity and no inodes
were reclaimed.
7. Note that "Sectors Read" is drastically reduced implying that the source
data being used for the IO is not being aggressively discarded due to
page reclaim skipping over dirty pages and reclaiming clean pages. Note
that the reducion in reads could also be due to inode data not being
re-read from disk after a slab shrink.
3.9.0 3.9.0 3.9.0
vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz 166.99 32.09 33.44
Mean sda-await 853.64 192.76 185.43
Mean sda-r_await 6.31 9.24 5.97
Mean sda-w_await 2992.81 202.65 192.43
Max sda-avgqz 1409.91 718.75 698.98
Max sda-await 6665.74 3538.00 3124.23
Max sda-r_await 58.96 111.95 58.00
Max sda-w_await 28458.94 3977.29 3148.61
In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance I took a closer
look at the IO stats for the test disk. Few observations
1. The average queue size is reduced by the initial series and roughly
the same with this follow up.
2. Average wait times for writes are reduced and as the IO
is completing faster it at least implies that the gain is because
flushers are writing the files efficiently instead of page reclaim
getting in the way.
3. The reduction in maximum write latency is staggering. 28 seconds down
to 3 seconds.
Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account as one of the patches in the series shows but it
is still the case that filesystems with unusual handling of dirty or
writeback could still be treated better.
Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.
I ran a longer-lived memcached test with IO going to NFS instead of a local disk
parallelio
3.9.0 3.9.0 3.9.0
vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v7r10
Ops memcachetest-0M 23323.00 ( 0.00%) 23241.00 ( -0.35%) 23321.00 ( -0.01%)
Ops memcachetest-715M 25526.00 ( 0.00%) 24763.00 ( -2.99%) 23242.00 ( -8.95%)
Ops memcachetest-2385M 8814.00 ( 0.00%) 26924.00 (205.47%) 23521.00 (166.86%)
Ops memcachetest-4055M 5835.00 ( 0.00%) 26827.00 (359.76%) 25560.00 (338.05%)
Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops io-duration-715M 65.00 ( 0.00%) 71.00 ( -9.23%) 11.00 ( 83.08%)
Ops io-duration-2385M 129.00 ( 0.00%) 94.00 ( 27.13%) 53.00 ( 58.91%)
Ops io-duration-4055M 301.00 ( 0.00%) 100.00 ( 66.78%) 108.00 ( 64.12%)
Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-715M 14394.00 ( 0.00%) 949.00 ( 93.41%) 63.00 ( 99.56%)
Ops swaptotal-2385M 401483.00 ( 0.00%) 24437.00 ( 93.91%) 30118.00 ( 92.50%)
Ops swaptotal-4055M 554123.00 ( 0.00%) 35688.00 ( 93.56%) 63082.00 ( 88.62%)
Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-715M 4522.00 ( 0.00%) 560.00 ( 87.62%) 63.00 ( 98.61%)
Ops swapin-2385M 169861.00 ( 0.00%) 5026.00 ( 97.04%) 13917.00 ( 91.81%)
Ops swapin-4055M 192374.00 ( 0.00%) 10056.00 ( 94.77%) 25729.00 ( 86.63%)
Ops minorfaults-0M 1445969.00 ( 0.00%) 1520878.00 ( -5.18%) 1454024.00 ( -0.56%)
Ops minorfaults-715M 1557288.00 ( 0.00%) 1528482.00 ( 1.85%) 1535776.00 ( 1.38%)
Ops minorfaults-2385M 1692896.00 ( 0.00%) 1570523.00 ( 7.23%) 1559622.00 ( 7.87%)
Ops minorfaults-4055M 1654985.00 ( 0.00%) 1581456.00 ( 4.44%) 1596713.00 ( 3.52%)
Ops majorfaults-0M 0.00 ( 0.00%) 1.00 (-99.00%) 0.00 ( 0.00%)
Ops majorfaults-715M 763.00 ( 0.00%) 265.00 ( 65.27%) 75.00 ( 90.17%)
Ops majorfaults-2385M 23861.00 ( 0.00%) 894.00 ( 96.25%) 2189.00 ( 90.83%)
Ops majorfaults-4055M 27210.00 ( 0.00%) 1569.00 ( 94.23%) 4088.00 ( 84.98%)
1. Performance does not collapse due to IO which is good. IO is also completing
faster. Note with mmotm, IO completes in a third of the time and faster again
with this series applied
2. Swapping is reduced, although not eliminated. The figures for the follow-up
look bad but it does vary a bit as the stalling is not perfect for nfs
or filesystems like ext3 with unusual handling of dirty and writeback
pages
3. There are swapins, particularly with larger amounts of IO indicating
that active pages are being reclaimed. However, the number of much
reduced.
3.9.0 3.9.0 3.9.0
vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults 36339175 35025445 35219699
Major Faults 310964 27108 51887
Swap Ins 2176399 173069 333316
Swap Outs 3344050 357228 504824
Direct pages scanned 8972 77283 43242
Kswapd pages scanned 20899983 8939566 14772851
Kswapd pages reclaimed 6193156 5172605 5231026
Direct pages reclaimed 8450 73802 39514
Kswapd efficiency 29% 57% 35%
Kswapd velocity 3929.743 1847.499 3058.840
Direct efficiency 94% 95% 91%
Direct velocity 1.687 15.972 8.954
Percentage direct scans 0% 0% 0%
Zone normal velocity 3721.907 939.103 2185.142
Zone dma32 velocity 209.522 924.368 882.651
Zone dma velocity 0.000 0.000 0.000
Page writes by reclaim 4082185.000 526319.000 537114.000
Page writes file 738135 169091 32290
Page writes anon 3344050 357228 504824
Page reclaim immediate 9524 170 5595843
Sector Reads 8909900 861192 1483680
Sector Writes 13428980 1488744 2076800
Page rescued immediate 0 0 0
Slabs scanned 38016 31744 28672
Direct inode steals 0 0 0
Kswapd inode steals 424 0 0
Kswapd skipped wait 0 0 0
THP fault alloc 14 15 119
THP collapse alloc 1767 1569 1618
THP splits 30 29 25
THP fault fallback 0 0 0
THP collapse fail 8 5 0
Compaction stalls 17 41 100
Compaction success 7 31 95
Compaction failures 10 10 5
Page migrate success 7083 22157 62217
Page migrate failure 0 0 0
Compaction pages isolated 14847 48758 135830
Compaction migrate scanned 18328 48398 138929
Compaction free scanned 2000255 355827 1720269
Compaction cost 7 24 68
I guess the main takeaway again is the much reduced page writes
from reclaim context and reduced reads.
3.9.0 3.9.0 3.9.0
vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz 23.58 0.35 0.44
Mean sda-await 133.47 15.72 15.46
Mean sda-r_await 4.72 4.69 3.95
Mean sda-w_await 507.69 28.40 33.68
Max sda-avgqz 680.60 12.25 23.14
Max sda-await 3958.89 221.83 286.22
Max sda-r_await 63.86 61.23 67.29
Max sda-w_await 11710.38 883.57 1767.28
And as before, write wait times are much reduced.
This patch:
The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered. This situation is
flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance. The check for PageWriteback is also misplaced
as it happens within a PageDirty check which is nonsense as the dirty may
have been cleared for IO. The accounting is updated very late and pages
that are already under writeback, were reactivated, could not unmapped or
could not be released are all missed. Similarly, a page is considered
congested for reasons other than being congested and pages that cannot be
written out in the correct context are skipped. Finally, it considers
stalling and writing back filesystem pages due to encountering dirty
anonymous pages at the tail of the LRU which is dumb.
This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail
of the LRU were unqueued dirty pages. Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up. The decision
on whether wait_iff_congested is also now determined by dirty filesystem
pages only. Congested pages are based on whether the underlying BDI is
congested regardless of the context of the reclaiming process.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
balance_pgdat() is very long and some of the logic can and should be
internal to kswapd_shrink_zone(). Move it so the flow of
balance_pgdat() is marginally easier to follow.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently kswapd checks if it should start writepage as it shrinks each
zone without taking into consideration if the zone is balanced or not.
This is not wrong as such but it does not make much sense either. This
patch checks once per pgdat scan if kswapd should be writing pages.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Historically, kswapd used to congestion_wait() at higher priorities if
it was not making forward progress. This made no sense as the failure
to make progress could be completely independent of IO. It was later
replaced by wait_iff_congested() and removed entirely by commit 258401a6
(mm: don't wait on congested zones in balance_pgdat()) as it was
duplicating logic in shrink_inactive_list().
This is problematic. If kswapd encounters many pages under writeback
and it continues to scan until it reaches the high watermark then it
will quickly skip over the pages under writeback and reclaim clean young
pages or push applications out to swap.
The use of wait_iff_congested() is not suited to kswapd as it will only
stall if the underlying BDI is really congested or a direct reclaimer
was unable to write to the underlying BDI. kswapd bypasses the BDI
congestion as it sets PF_SWAPWRITE but even if this was taken into
account then it would cause direct reclaimers to stall on writeback
which is not desirable.
This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
encountering too many pages under writeback. If this flag is set and
kswapd encounters a PageReclaim page under writeback then it'll assume
that the LRU lists are being recycled too quickly before IO can complete
and block waiting for some IO to complete.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
not priority
Currently kswapd queues dirty pages for writeback if scanning at an
elevated priority but the priority kswapd scans at is not related to the
number of unqueued dirty encountered. Since commit "mm: vmscan: Flatten
kswapd priority loop", the priority is related to the size of the LRU
and the zone watermark which is no indication as to whether kswapd
should write pages or not.
This patch tracks if an excessive number of unqueued dirty pages are
being encountered at the end of the LRU. If so, it indicates that dirty
pages are being recycled before flusher threads can clean them and flags
the zone so that kswapd will start writing pages until the zone is
balanced.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Page reclaim at priority 0 will scan the entire LRU as priority 0 is
considered to be a near OOM condition. Kswapd can reach priority 0
quite easily if it is encountering a large number of pages it cannot
reclaim such as pages under writeback. When this happens, kswapd
reclaims very aggressively even though there may be no real risk of
allocation failure or OOM.
This patch prevents kswapd reaching priority 0 and trying to reclaim the
world. Direct reclaimers will still reach priority 0 in the event of an
OOM situation.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In the past, kswapd makes a decision on whether to compact memory after
the pgdat was considered balanced. This more or less worked but it is
late to make such a decision and does not fit well now that kswapd makes
a decision whether to exit the zone scanning loop depending on reclaim
progress.
This patch will compact a pgdat if at least the requested number of
pages were reclaimed from unbalanced zones for a given priority. If any
zone is currently balanced, kswapd will not call compaction as it is
expected the necessary pages are already available.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
kswapd stops raising the scanning priority when at least
SWAP_CLUSTER_MAX pages have been reclaimed or the pgdat is considered
balanced. It then rechecks if it needs to restart at DEF_PRIORITY and
whether high-order reclaim needs to be reset. This is not wrong per-se
but it is confusing to follow and forcing kswapd to stay at DEF_PRIORITY
may require several restarts before it has scanned enough pages to meet
the high watermark even at 100% efficiency. This patch irons out the
logic a bit by controlling when priority is raised and removing the
"goto loop_again".
This patch has kswapd raise the scanning priority until it is scanning
enough pages that it could meet the high watermark in one shrink of the
LRU lists if it is able to reclaim at 100% efficiency. It will not
raise the scanning prioirty higher unless it is failing to reclaim any
pages.
To avoid infinite looping for high-order allocation requests kswapd will
not reclaim for high-order allocations when it has reclaimed at least
twice the number of pages as the allocation request.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Simplistically, the anon and file LRU lists are scanned proportionally
depending on the value of vm.swappiness although there are other factors
taken into account by get_scan_count(). The patch "mm: vmscan: Limit
the number of pages kswapd reclaims" limits the number of pages kswapd
reclaims but it breaks this proportional scanning and may evenly shrink
anon/file LRUs regardless of vm.swappiness.
This patch preserves the proportional scanning and reclaim. It does
mean that kswapd will reclaim more than requested but the number of
pages will be related to the high watermark.
[mhocko@suse.cz: Correct proportional reclaim for memcg and simplify]
[kamezawa.hiroyu@jp.fujitsu.com: Recalculate scan based on target]
[hannes@cmpxchg.org: Account for already scanned pages properly]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This series does not fix all the current known problems with reclaim but
it addresses one important swapping bug when there is background IO.
Changelog since V3
- Drop the slab shrink changes in light of Glaubers series and
discussions highlighted that there were a number of potential
problems with the patch. (mel)
- Rebased to 3.10-rc1
Changelog since V2
- Preserve ratio properly for proportional scanning (kamezawa)
Changelog since V1
- Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY (andi)
- Reformat comment in shrink_page_list (andi)
- Clarify some comments (dhillf)
- Rework how the proportional scanning is preserved
- Add PageReclaim check before kswapd starts writeback
- Reset sc.nr_reclaimed on every full zone scan
Kswapd and page reclaim behaviour has been screwy in one way or the
other for a long time. Very broadly speaking it worked in the far past
because machines were limited in memory so it did not have that many
pages to scan and it stalled congestion_wait() frequently to prevent it
going completely nuts. In recent times it has behaved very
unsatisfactorily with some of the problems compounded by the removal of
stall logic and the introduction of transparent hugepage support with
high-order reclaims.
There are many variations of bugs that are rooted in this area. One
example is reports of a large copy operations or backup causing the
machine to grind to a halt or applications pushed to swap. Sometimes in
low memory situations a large percentage of memory suddenly gets
reclaimed. In other cases an application starts and kswapd hits 100%
CPU usage for prolonged periods of time and so on. There is now talk of
introducing features like an extra free kbytes tunable to work around
aspects of the problem instead of trying to deal with it. It's
compounded by the problem that it can be very workload and machine
specific.
This series aims at addressing some of the worst of these problems
without attempting to fundmentally alter how page reclaim works.
Patches 1-2 limits the number of pages kswapd reclaims while still obeying
the anon/file proportion of the LRUs it should be scanning.
Patches 3-4 control how and when kswapd raises its scanning priority and
deletes the scanning restart logic which is tricky to follow.
Patch 5 notes that it is too easy for kswapd to reach priority 0 when
scanning and then reclaim the world. Down with that sort of thing.
Patch 6 notes that kswapd starts writeback based on scanning priority which
is not necessarily related to dirty pages. It will have kswapd
writeback pages if a number of unqueued dirty pages have been
recently encountered at the tail of the LRU.
Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
to reduce LRU churn and the likelihood that it'll reclaim young
clean pages or push applications to swap. It will cause kswapd
to block on IO if it detects that pages being reclaimed under
writeback are recycling through the LRU before the IO completes.
Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they
are applied.
This was tested using memcached+memcachetest while some background IO
was in progress as implemented by the parallel IO tests implement in MM
Tests.
memcachetest benchmarks how many operations/second memcached can service
and it is run multiple times. It starts with no background IO and then
re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress. The expectation is that the IO
should have little or no impact on memcachetest which is running
entirely in memory.
3.10.0-rc1 3.10.0-rc1
vanilla lessdisrupt-v4
Ops memcachetest-0M 22155.00 ( 0.00%) 22180.00 ( 0.11%)
Ops memcachetest-715M 22720.00 ( 0.00%) 22355.00 ( -1.61%)
Ops memcachetest-2385M 3939.00 ( 0.00%) 23450.00 (495.33%)
Ops memcachetest-4055M 3628.00 ( 0.00%) 24341.00 (570.92%)
Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%)
Ops io-duration-2385M 118.00 ( 0.00%) 21.00 ( 82.20%)
Ops io-duration-4055M 162.00 ( 0.00%) 36.00 ( 77.78%)
Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-715M 140134.00 ( 0.00%) 18.00 ( 99.99%)
Ops swaptotal-2385M 392438.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-4055M 449037.00 ( 0.00%) 27864.00 ( 93.79%)
Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-715M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-2385M 148031.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-4055M 135109.00 ( 0.00%) 0.00 ( 0.00%)
Ops minorfaults-0M 1529984.00 ( 0.00%) 1530235.00 ( -0.02%)
Ops minorfaults-715M 1794168.00 ( 0.00%) 1613750.00 ( 10.06%)
Ops minorfaults-2385M 1739813.00 ( 0.00%) 1609396.00 ( 7.50%)
Ops minorfaults-4055M 1754460.00 ( 0.00%) 1614810.00 ( 7.96%)
Ops majorfaults-0M 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops majorfaults-715M 185.00 ( 0.00%) 180.00 ( 2.70%)
Ops majorfaults-2385M 24472.00 ( 0.00%) 101.00 ( 99.59%)
Ops majorfaults-4055M 22302.00 ( 0.00%) 229.00 ( 98.97%)
Note how the vanilla kernels performance collapses when there is enough
IO taking place in the background. This drop in performance is part of
what users complain of when they start backups. Note how the swapin and
major fault figures indicate that processes were being pushed to swap
prematurely. With the series applied, there is no noticable performance
drop and while there is still some swap activity, it's tiny.
20 iterations of this test were run in total and averaged. Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted. The 0M, 715M, 2385M and 4055M
subblock refer to the amount of IO going on in the background at each
iteration. So memcachetest-2385M is reporting how many
transactions/second memcachetest recorded on average over 5 iterations
while there was 2385M of IO going on in the ground. There are six
blocks of information reported here
memcachetest is the transactions/second reported by memcachetest. In
the vanilla kernel note that performance drops from around
22K/sec to just under 4K/second when there is 2385M of IO going
on in the background. This is one type of performance collapse
users complain about if a large cp or backup starts in the
background
io-duration refers to how long it takes for the background IO to
complete. It's showing that with the patched kernel that the IO
completes faster while not interfering with the memcache
workload
swaptotal is the total amount of swap traffic. With the patched kernel,
the total amount of swapping is much reduced although it is
still not zero.
swapin in this case is an indication as to whether we are swap trashing.
The closer the swapin/swapout ratio is to 1, the worse the
trashing is. Note with the patched kernel that there is no swapin
activity indicating that all the pages swapped were really inactive
unused pages.
minorfaults are just minor faults. An increased number of minor faults
can indicate that page reclaim is unmapping the pages but not
swapping them out before they are faulted back in. With the
patched kernel, there is only a small change in minor faults
majorfaults are just major faults in the target workload and a high
number can indicate that a workload is being prematurely
swapped. With the patched kernel, major faults are much reduced. As
there are no swapin's recorded so it's not being swapped. The likely
explanation is that that libraries or configuration files used by
the workload during startup get paged out by the background IO.
Overall with the series applied, there is no noticable performance drop
due to background IO and while there is still some swap activity, it's
tiny and the lack of swapins imply that the swapped pages were inactive
and unused.
3.10.0-rc1 3.10.0-rc1
vanilla lessdisrupt-v4
Page Ins 1234608 101892
Page Outs 12446272 11810468
Swap Ins 283406 0
Swap Outs 698469 27882
Direct pages scanned 0 136480
Kswapd pages scanned 6266537 5369364
Kswapd pages reclaimed 1088989 930832
Direct pages reclaimed 0 120901
Kswapd efficiency 17% 17%
Kswapd velocity 5398.371 4635.115
Direct efficiency 100% 88%
Direct velocity 0.000 117.817
Percentage direct scans 0% 2%
Page writes by reclaim 1655843 4009929
Page writes file 957374 3982047
Page writes anon 698469 27882
Page reclaim immediate 5245 1745
Page rescued immediate 0 0
Slabs scanned 33664 25216
Direct inode steals 0 0
Kswapd inode steals 19409 778
Kswapd skipped wait 0 0
THP fault alloc 35 30
THP collapse alloc 472 401
THP splits 27 22
THP fault fallback 0 0
THP collapse fail 0 1
Compaction stalls 0 4
Compaction success 0 0
Compaction failures 0 4
Page migrate success 0 0
Page migrate failure 0 0
Compaction pages isolated 0 0
Compaction migrate scanned 0 0
Compaction free scanned 0 0
Compaction cost 0 0
NUMA PTE updates 0 0
NUMA hint faults 0 0
NUMA hint local faults 0 0
NUMA pages migrated 0 0
AutoNUMA cost 0 0
Unfortunately, note that there is a small amount of direct reclaim due to
kswapd no longer reclaiming the world. ftrace indicates that the direct
reclaim stalls are mostly harmless with the vast bulk of the stalls
incurred by dd
23 tclsh-3367
38 memcachetest-13733
49 memcachetest-12443
57 tee-3368
1541 dd-13826
1981 dd-12539
A consequence of the direct reclaim for dd is that the processes for the
IO workload may show a higher system CPU usage. There is also a risk that
kswapd not reclaiming the world may mean that it stays awake balancing
zones, does not stall on the appropriate events and continually scans
pages it cannot reclaim consuming CPU. This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.
This patch:
The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority.
In many cases the priority remains low because it's reset every
SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
number of pages it cannot reclaim, it will raise the priority and
potentially discard a large percentage of the zone as sc->nr_to_reclaim is
ULONG_MAX. The user-visible effect is a reclaim "spike" where a large
percentage of memory is suddenly freed. It would be bad enough if this
was just unused memory but because of how anon/file pages are balanced it
is possible that applications get pushed to swap unnecessarily.
This patch limits the number of pages kswapd will reclaim to the high
watermark. Reclaim will still overshoot due to it not being a hard limit
as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities. The number of
pages it reclaims is not adjusted for high-order allocations as kswapd
will reclaim excessively if it is to balance zones for high-order
allocations.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In page reclaim, huge page is split. split_huge_page() adds tail pages
to LRU list. Since we are reclaiming a huge page, it's better we
reclaim all subpages of the huge page instead of just the head page.
This patch adds split tail pages to shrink page list so the tail pages
can be reclaimed soon.
Before this patch, run a swap workload:
thp_fault_alloc 3492
thp_fault_fallback 608
thp_collapse_alloc 6
thp_collapse_alloc_failed 0
thp_split 916
With this patch:
thp_fault_alloc 4085
thp_fault_fallback 16
thp_collapse_alloc 90
thp_collapse_alloc_failed 0
thp_split 1272
fallback allocation is reduced a lot.
[akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
With this patch userland applications that want to maintain the
interactivity/memory allocation cost can use the pressure level
notifications. The levels are defined like this:
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).
The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file
caches, etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.
The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.
The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: for example you
have three cgroups: A->B->C. Now you set up an event listener on
cgroups A, B and C, and suppose group C experiences some pressure. In
this situation, only group C will receive the notification, i.e. groups
A and B will not receive it. This is done to avoid excessive
"broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing. So, organize the
cgroups wisely, or propagate the events manually (or, ask us to
implement the pass-through events, explaining why would you need them.)
Performance wise, the memory pressure notifications feature itself is
lightweight and does not require much of bookkeeping, in contrast to the
rest of memcg features. Unfortunately, as of curren |