Age | Commit message (Collapse) | Author |
|
commit f982f91516fa4cfd9d20518833cd04ad714585be upstream.
Commit db64fe02258f ("mm: rewrite vmap layer") introduced code that does
address calculations under the assumption that VMAP_BLOCK_SIZE is a
power of two. However, this might not be true if CONFIG_NR_CPUS is not
set to a power of two.
Wrong vmap_block index/offset values could lead to memory corruption.
However, this has never been observed in practice (or never been
diagnosed correctly); what caught this was the BUG_ON in vb_alloc() that
checks for inconsistent vmap_block indices.
To fix this, ensure that VMAP_BLOCK_SIZE always is a power of two.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=31572
Reported-by: Pavel Kysilka <goldenfish@linuxsoft.cz>
Reported-by: Matias A. Fonzo <selk@dragora.org>
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit c027a474a68065391c8773f6e83ed5412657e369 upstream.
exit_mm() sets ->mm == NULL then it does mmput()->exit_mmap() which
frees the memory.
However select_bad_process() checks ->mm != NULL before TIF_MEMDIE,
so it continues to kill other tasks even if we have the oom-killed
task freeing its memory.
Change select_bad_process() to check ->mm after TIF_MEMDIE, but skip
the tasks which have already passed exit_notify() to ensure a zombie
with TIF_MEMDIE set can't block oom-killer. Alternatively we could
probably clear TIF_MEMDIE after exit_mmap().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 108b6a78463bb8c7163e4f9779f36ad8bbade334 upstream.
Commit 22a668d7c3ef ("memcg: fix behavior under memory.limit equals to
memsw.limit") introduced "memsw_is_minimum" flag, which becomes true
when mem_limit == memsw_limit. The flag is checked at the beginning of
reclaim, and "noswap" is set if the flag is true, because using swap is
meaningless in this case.
This works well in most cases, but when we try to shrink mem_limit,
which is the same as memsw_limit now, we might fail to shrink mem_limit
because swap doesn't used.
This patch fixes this behavior by:
- check MEM_CGROUP_RECLAIM_SHRINK at the begining of reclaim
- If it is set, don't set "noswap" flag even if memsw_is_minimum is true.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit ccb6108f5b0b541d3eb332c3a73e645c0f84278e upstream.
Vito said:
: The system has many usb disks coming and going day to day, with their
: respective bdi's having min_ratio set to 1 when inserted. It works for
: some time until eventually min_ratio can no longer be set, even when the
: active set of bdi's seen in /sys/class/bdi/*/min_ratio doesn't add up to
: anywhere near 100.
:
: This then leads to an unrelated starvation problem caused by write-heavy
: fuse mounts being used atop the usb disks, a problem the min_ratio setting
: at the underlying devices bdi effectively prevents.
Fix this leakage by resetting the bdi min_ratio when unregistering the
BDI.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Vito Caputo <lkml@pengaru.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit 2efaca927f5cd7ecd0f1554b8f9b6a9a2c329c03 upstream.
I haven't reproduced it myself but the fail scenario is that on such
machines (notably ARM and some embedded powerpc), if you manage to hit
that futex path on a writable page whose dirty bit has gone from the PTE,
you'll livelock inside the kernel from what I can tell.
It will go in a loop of trying the atomic access, failing, trying gup to
"fix it up", getting succcess from gup, go back to the atomic access,
failing again because dirty wasn't fixed etc...
So I think you essentially hang in the kernel.
The scenario is probably rare'ish because affected architecture are
embedded and tend to not swap much (if at all) so we probably rarely hit
the case where dirty is missing or young is missing, but I think Shan has
a piece of SW that can reliably reproduce it using a shared writable
mapping & fork or something like that.
On archs who use SW tracking of dirty & young, a page without dirty is
effectively mapped read-only and a page without young unaccessible in the
PTE.
Additionally, some architectures might lazily flush the TLB when relaxing
write protection (by doing only a local flush), and expect a fault to
invalidate the stale entry if it's still present on another processor.
The futex code assumes that if the "in_atomic()" access -EFAULT's, it can
"fix it up" by causing get_user_pages() which would then be equivalent to
taking the fault.
However that isn't the case. get_user_pages() will not call
handle_mm_fault() in the case where the PTE seems to have the right
permissions, regardless of the dirty and young state. It will eventually
update those bits ... in the struct page, but not in the PTE.
Additionally, it will not handle the lazy TLB flushing that can be
required by some architectures in the fault case.
Basically, gup is the wrong interface for the job. The patch provides a
more appropriate one which boils down to just calling handle_mm_fault()
since what we are trying to do is simulate a real page fault.
The futex code currently attempts to write to user memory within a
pagefault disabled section, and if that fails, tries to fix it up using
get_user_pages().
This doesn't work on archs where the dirty and young bits are maintained
by software, since they will gate access permission in the TLB, and will
not be updated by gup().
In addition, there's an expectation on some archs that a spurious write
fault triggers a local TLB flush, and that is missing from the picture as
well.
I decided that adding those "features" to gup() would be too much for this
already too complex function, and instead added a new simpler
fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
which the futex code can call.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reported-by: Shan Hai <haishan.bai@gmail.com>
Tested-by: Shan Hai <haishan.bai@gmail.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Darren Hart <darren.hart@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
I'm running a workload which triggers a lot of swap in a machine with 4
nodes. After I kill the workload, I found a kswapd livelock. Sometimes
kswapd3 or kswapd2 are keeping running and I can't access filesystem,
but most memory is free.
This looks like a regression since commit 08951e545918c159 ("mm: vmscan:
correct check for kswapd sleeping in sleeping_prematurely").
Node 2 and 3 have only ZONE_NORMAL, but balance_pgdat() will return 0
for classzone_idx. The reason is end_zone in balance_pgdat() is 0 by
default, if all zones have watermark ok, end_zone will keep 0.
Later sleeping_prematurely() always returns true. Because this is an
order 3 wakeup, and if classzone_idx is 0, both balanced_pages and
present_pages in pgdat_balanced() are 0. We add a special case here.
If a zone has no page, we think it's balanced. This fixes the livelock.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
remap_pfn_range() means map physical address pfn<<PAGE_SHIFT to user addr.
For nommu arch it's implemented by vma->vm_start = pfn << PAGE_SHIFT which
is wrong acroding the original meaning of this function. And some driver
developer using remap_pfn_range() with correct parameter will get
unexpected result because vm_start is changed. It should be implementd
like addr = pfn << PAGE_SHIFT but which is meanless on nommu arch, this
patch just make it simply return.
Parameter name and setting of vma->vm_flags also be fixed.
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
commit 889976dbcb12 ("memcg: reclaim memory from nodes in round-robin
order") adds an numa node round-robin for memcg. But the information is
updated once per 10sec.
This patch changes the update trigger from jiffies to memcg's event count.
After this patch, numa scan information will be updated when we see 1024
events of pagein/pageout under a memcg.
[akpm@linux-foundation.org: attempt to repair code layout]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Now, in mem_cgroup_hierarchical_reclaim(), mem_cgroup_local_usage() is
used for checking whether the memcg contains reclaimable pages or not. If
no pages in it, the routine skips it.
But, mem_cgroup_local_usage() contains Unevictable pages and cannot handle
"noswap" condition correctly. This doesn't work on a swapless system.
This patch adds test_mem_cgroup_reclaimable() and replaces
mem_cgroup_local_usage(). test_mem_cgroup_reclaimable() see LRU counter
and returns correct answer to the caller. And this new function has
"noswap" argument and can see only FILE LRU if necessary.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix kerneldoc layout]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
__tlb_remove_page() switches to a new batch page, but still checks space
in the old batch. This check always fails, and causes a forced tlb flush.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark. This
is expected behaviour. Unfortunately, if the highest zone is small, a
problem occurs.
When balance_pgdat() returns, it may be at a lower classzone_idx than it
started because the highest zone was unreclaimable. Before checking if it
should go to sleep though, it checks pgdat->classzone_idx which when there
is no other activity will be MAX_NR_ZONES-1. It interprets this as it has
been woken up while reclaiming, skips scheduling and reclaims again. As
there is no useful reclaim work to do, it enters into a loop of shrinking
slab consuming loads of CPU until the highest zone becomes reclaimable for
a long period of time.
There are two problems here. 1) If the returned classzone or order is
lower, it'll continue reclaiming without scheduling. 2) if the highest
zone was marked unreclaimable but balance_pgdat() returns immediately at
DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
for sleeping.
This patch does two things that are related. If the end_zone is
unreclaimable, this information is communicated back. Second, if the
classzone or order was reduced due to failing to reclaim, new information
is not read from pgdat and instead an attempt is made to go to sleep. Due
to this, it is also necessary that pgdat->classzone_idx be initialised
each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as
wakeups.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When deciding if kswapd is sleeping prematurely, the classzone is taken
into account but this is different to what balance_pgdat() and the
allocator are doing. Specifically, the DMA zone will be checked based on
the classzone used when waking kswapd which could be for a GFP_KERNEL or
GFP_HIGHMEM request. The lowmem reserve limit kicks in, the watermark is
not met and kswapd thinks it's sleeping prematurely keeping kswapd awake in
error.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
zone
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark. This
is expected behaviour.
When kswapd applies pressure to zones during node balancing, it checks if
the zone is above a high+balance_gap threshold. If it is, it does not
apply pressure but it unconditionally shrinks slab on a global basis which
is excessive. In the event kswapd is being kept awake due to a high small
unreclaimable zone, it skips zone shrinking but still calls shrink_slab().
Once pressure has been applied, the check for zone being unreclaimable is
being made before the check is made if all_unreclaimable should be set.
This miss of unreclaimable can cause has_under_min_watermark_zone to be
set due to an unreclaimable zone preventing kswapd backing off on
congestion_wait().
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark. This
is expected behaviour. Unfortunately, if the highest zone is small, a
problem occurs.
This seems to happen most with recent sandybridge laptops but it's
probably a co-incidence as some of these laptops just happen to have a
small Normal zone. The reproduction case is almost always during copying
large files that kswapd pegs at 100% CPU until the file is deleted or
cache is dropped.
The problem is mostly down to sleeping_prematurely() keeping kswapd awake
when the highest zone is small and unreclaimable and compounded by the
fact we shrink slabs even when not shrinking zones causing a lot of time
to be spent in shrinkers and a lot of memory to be reclaimed.
Patch 1 corrects sleeping_prematurely to check the zones matching
the classzone_idx instead of all zones.
Patch 2 avoids shrinking slab when we are not shrinking a zone.
Patch 3 notes that sleeping_prematurely is checking lower zones against
a high classzone which is not what allocators or balance_pgdat()
is doing leading to an artifical belief that kswapd should be
still awake.
Patch 4 notes that when balance_pgdat() gives up on a high zone that the
decision is not communicated to sleeping_prematurely()
This problem affects 2.6.38.8 for certain and is expected to affect 2.6.39
and 3.0-rc4 as well. If accepted, they need to go to -stable to be picked
up by distros and this series is against 3.0-rc4. I've cc'd people that
reported similar problems recently to see if they still suffer from the
problem and if this fixes it.
This patch: correct the check for kswapd sleeping in sleeping_prematurely()
During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark. This
is expected behaviour.
A problem occurs if the highest zone is small. balance_pgdat() only
considers unreclaimable zones when priority is DEF_PRIORITY but
sleeping_prematurely considers all zones. It's possible for this sequence
to occur
1. kswapd wakes up and enters balance_pgdat()
2. At DEF_PRIORITY, marks highest zone unreclaimable
3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
highest zone, clearing all_unreclaimable. Highest zone
is still unbalanced
5. kswapd returns and calls sleeping_prematurely
6. sleeping_prematurely looks at *all* zones, not just the ones
being considered by balance_pgdat. The highest small zone
has all_unreclaimable cleared but the zone is not
balanced. all_zones_ok is false so kswapd stays awake
This patch corrects the behaviour of sleeping_prematurely to check the
zones balance_pgdat() checked.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Pádraig Brady <P@draigBrady.com>
Tested-by: Andrew Lutomirski <luto@mit.edu>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit d149e3b25d7c ("memcg: add the soft_limit reclaim in global direct
reclaim") adds a softlimit hook to shrink_zones(). By this, soft limit
is called as
try_to_free_pages()
do_try_to_free_pages()
shrink_zones()
mem_cgroup_soft_limit_reclaim()
Then, direct reclaim is memcg softlimit hint aware, now.
But, the memory cgroup's "limit" path can call softlimit shrinker.
try_to_free_mem_cgroup_pages()
do_try_to_free_pages()
shrink_zones()
mem_cgroup_soft_limit_reclaim()
This will cause a global reclaim when a memcg hits limit.
This is bug. soft_limit_reclaim() should be called when
scanning_global_lru(sc) == true.
And the commit adds a variable "total_scanned" for counting softlimit
scanned pages....it's not "total". This patch removes the variable and
update sc->nr_scanned instead of it. This will affect shrink_slab()'s
scan condition but, global LRU is scanned by softlimit and I think this
change makes sense.
TODO: avoid too much scanning of a zone when softlimit did enough work.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Ying Han <yinghan@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Under heavy memory and filesystem load, users observe the assertion
mapping->nrpages == 0 in end_writeback() trigger. This can be caused by
page reclaim reclaiming the last page from a mapping in the following
race:
CPU0 CPU1
...
shrink_page_list()
__remove_mapping()
__delete_from_page_cache()
radix_tree_delete()
evict_inode()
truncate_inode_pages()
truncate_inode_pages_range()
pagevec_lookup() - finds nothing
end_writeback()
mapping->nrpages != 0 -> BUG
page->mapping = NULL
mapping->nrpages--
Fix the problem by doing a reliable check of mapping->nrpages under
mapping->tree_lock in end_writeback().
Analyzed by Jay <jinshan.xiong@whamcloud.com>, lost in LKML, and dug out
by Miklos Szeredi <mszeredi@suse.de>.
Cc: Jay <jinshan.xiong@whamcloud.com>
Cc: Miklos Szeredi <mszeredi@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We cannot take a mutex while holding a spinlock, so flip the order and
fix the locking documentation.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Although it is used (by i915) on nothing but tmpfs, read_cache_page_gfp()
is unsuited to tmpfs, because it inserts a page into pagecache before
calling the filesystem's ->readpage: tmpfs may have pages in swapcache
which only it knows how to locate and switch to filecache.
At present tmpfs provides a ->readpage method, and copes with this by
copying pages; but soon we can simplify it by removing its ->readpage.
Provide shmem_read_mapping_page_gfp() now, ready for that transition,
Export shmem_read_mapping_page_gfp() and add it to list in shmem_fs.h,
with shmem_read_mapping_page() inline for the common mapping_gfp case.
(shmem_read_mapping_page_gfp or shmem_read_cache_page_gfp? Generally the
read_mapping_page functions use the mapping's ->readpage, and the
read_cache_page functions use the supplied filler, so I think
read_cache_page_gfp was slightly misnamed.)
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
2.6.35's new truncate convention gave tmpfs the opportunity to control
its file truncation, no longer enforced from outside by vmtruncate().
We shall want to build upon that, to handle pagecache and swap together.
Slightly redefine the ->truncate_range interface: let it now be called
between the unmap_mapping_range()s, with the filesystem responsible for
doing the truncate_inode_pages_range() from it - just as the filesystem
is nowadays responsible for doing that from its ->setattr.
Let's rename shmem_notify_change() to shmem_setattr(). Instead of
calling the generic truncate_setsize(), bring that code in so we can
call shmem_truncate_range() - which will later be updated to perform its
own variant of truncate_inode_pages_range().
Remove the punch_hole unmap_mapping_range() from shmem_truncate_range():
now that the COW's unmap_mapping_range() comes after ->truncate_range,
there is no need to call it a third time.
Export shmem_truncate_range() and add it to the list in shmem_fs.h, so
that i915_gem_object_truncate() can call it explicitly in future; get
this patch in first, then update drm/i915 once this is available (until
then, i915 will just be doing the truncate_inode_pages() twice).
Though introduced five years ago, no other filesystem is implementing
->truncate_range, and its only other user is madvise(,,MADV_REMOVE): we
expect to convert it to fallocate(,FALLOC_FL_PUNCH_HOLE,,) shortly,
whereupon ->truncate_range can be removed from inode_operations -
shmem_truncate_range() will help i915 across that transition too.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Before adding any more global entry points into shmem.c, gather such
prototypes into shmem_fs.h. Remove mm's own declarations from swap.h,
but for now leave the ones in mm.h: because shmem_file_setup() and
shmem_zero_setup() are called from various places, and we should not
force other subsystems to update immediately.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
You would expect to find vmtruncate_range() next to vmtruncate() in
mm/truncate.c: move it there.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 959ecc48fc75 ("mm/memory_hotplug.c: fix building of node hotplug
zonelist") does not protect the build_all_zonelists() call with
zonelists_mutex as needed. This can lead to races in constructing
zonelist ordering if a concurrent build is underway. Protecting this
with lock_memory_hotplug() is insufficient since zonelists can be
rebuild though sysfs as well.
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The error handling in mem_online_node() is incorrect: hotadd_new_pgdat()
returns NULL if the new pgdat could not have been allocated and a pointer
to it otherwise.
mem_online_node() should fail if hotadd_new_pgdat() fails, not the
inverse. This fixes an issue when memoryless nodes are not onlined and
their sysfs interface is not registered when their first cpu is brought
up.
The bug was introduced by commit cf23422b9d76 ("cpu/mem hotplug: enable
CPUs online before local memory online") iow v2.6.35.
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Hugh Dickins points out that lockdep (correctly) spots a potential
deadlock on the anon_vma lock, because we now do a GFP_KERNEL allocation
of anon_vma_chain while doing anon_vma_clone(). The problem is that
page reclaim will want to take the anon_vma lock of any anonymous pages
that it will try to reclaim.
So re-organize the code in anon_vma_clone() slightly: first do just a
GFP_NOWAIT allocation, which will usually work fine. But if that fails,
let's just drop the lock and re-do the allocation, now with GFP_KERNEL.
End result: not only do we avoid the locking problem, this also ends up
getting better concurrency in case the allocation does need to block.
Tim Chen reports that with all these anon_vma locking tweaks, we're now
almost back up to the spinlock performance.
Reported-and-tested-by: Hugh Dickins <hughd@google.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This matches the anon_vma_clone() case, and uses the same lock helper
functions. Because of the need to potentially release the anon_vma's,
it's a bit more complex, though.
We traverse the 'vma->anon_vma_chain' in two phases: the first loop gets
the anon_vma lock (with the helper function that only takes the lock
once for the whole loop), and removes any entries that don't need any
more processing.
The second phase just traverses the remaining list entries (without
holding the anon_vma lock), and does any actual freeing of the
anon_vma's that is required.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Hugh Dickins <hughd@google.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In anon_vma_clone() we traverse the vma->anon_vma_chain of the source
vma, locking the anon_vma for each entry.
But they are all going to have the same root entry, which means that
we're locking and unlocking the same lock over and over again. Which is
expensive in locked operations, but can get _really_ expensive when that
root entry sees any kind of lock contention.
In fact, Tim Chen reports a big performance regression due to this: when
we switched to use a mutex instead of a spinlock, the contention case
gets much worse.
So to alleviate this all, this commit creates a small helper function
(lock_anon_vma_root()) that can be used to take the lock just once
rather than taking and releasing it over and over again.
We still have the same "take the lock and release" it behavior in the
exit path (in unlink_anon_vmas()), but that one is a bit harder to fix
since we're actually freeing the anon_vma entries as we go, and that
will touch the lock too.
Reported-and-tested-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Hugh Dickins <hughd@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
swapcache will reach the below code path in migrate_page_move_mapping,
and swapcache is accounted as NR_FILE_PAGES but it's not accounted as
NR_SHMEM.
Hugh pointed out we must use PageSwapCache instead of comparing
mapping to &swapper_space, to avoid build failure with CONFIG_SWAP=n.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We have some users of this function that date back to before the vma
list was doubly linked, and just are silly. These days, you can find
the previous vma by just following the vma->vm_prev pointer.
In some cases you don't need any find_vma() lookup at all, and in other
cases you're better off with the regular "find_vma()" that uses the vma
cache front-end lookup.
Some "find_vma_prev()" users are still valid, though. For example, in
the case of a stack that grows up, it can be the case that we don't find
any 'vma' at all (because we're looking up an address that is past the
last vma), and that the stack that we want to grow is the 'prev' vma.
But that kind of special case aside, we generally should prefer to use
'find_vma()'.
Noticed due to a totally unrelated POWER memory corruption bug that just
happened to hit in 'find_vma_prev()' and made me go "Hmm - why are we
using that function here?".
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Andrea Righi reported a case where an exiting task can race against
ksmd::scan_get_next_rmap_item (http://lkml.org/lkml/2011/6/1/742) easily
triggering a NULL pointer dereference in ksmd.
ksm_scan.mm_slot == &ksm_mm_head with only one registered mm
CPU 1 (__ksm_exit) CPU 2 (scan_get_next_rmap_item)
list_empty() is false
lock slot == &ksm_mm_head
list_del(slot->mm_list)
(list now empty)
unlock
lock
slot = list_entry(slot->mm_list.next)
(list is empty, so slot is still ksm_mm_head)
unlock
slot->mm == NULL ... Oops
Close this race by revalidating that the new slot is not simply the list
head again.
Andrea's test case:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#define BUFSIZE getpagesize()
int main(int argc, char **argv)
{
void *ptr;
if (posix_memalign(&ptr, getpagesize(), BUFSIZE) < 0) {
perror("posix_memalign");
exit(1);
}
if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) {
perror("madvise");
exit(1);
}
*(char *)NULL = 0;
return 0;
}
Reported-by: Andrea Righi <andrea@betterlinux.com>
Tested-by: Andrea Righi <andrea@betterlinux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
is asynchronous V2
Asynchronous compaction is used when promoting to huge pages. This is all
very nice but if there are a number of processes in compacting memory, a
large number of pages can be isolated. An "asynchronous" process can
stall for long periods of time as a result with a user reporting that
firefox can stall for 10s of seconds. This patch aborts asynchronous
compaction if too many pages are isolated as it's better to fail a
hugepage promotion than stall a process.
[minchan.kim@gmail.com: return COMPACT_PARTIAL for abort]
Reported-and-tested-by: Ury Stankevich <urykhy@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
It is unsafe to run page_count during the physical pfn scan because
compound_head could trip on a dangling pointer when reading
page->first_page if the compound page is being freed by another CPU.
[mgorman@suse.de: split out patch]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
next zone
Compaction works with two scanners, a migration and a free scanner. When
the scanners crossover, migration within the zone is complete. The
location of the scanner is recorded on each cycle to avoid excesive
scanning.
When a zone is small and mostly reserved, it's very easy for the migration
scanner to be close to the end of the zone. Then the following situation
can occurs
o migration scanner isolates some pages near the end of the zone
o free scanner starts at the end of the zone but finds that the
migration scanner is already there
o free scanner gets reinitialised for the next cycle as
cc->migrate_pfn + pageblock_nr_pages
moving the free scanner into the next zone
o migration scanner moves into the next zone
When this happens, NR_ISOLATED accounting goes haywire because some of the
accounting happens against the wrong zone. One zones counter remains
positive while the other goes negative even though the overall global
count is accurate. This was reported on X86-32 with !SMP because !SMP
allows the negative counters to be visible. The fact that it is the bug
should theoritically be possible there.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
fragmentation_index() returns -1000 when the allocation might succeed
This doesn't match the comment and code in compaction_suitable(). I
thought compaction_suitable should return COMPACT_PARTIAL in -1000
case, because in this case allocation could succeed depending on
watermarks.
The impact of this is that compaction starts and compact_finished() is
called which rechecks the watermarks and the free lists. It should have
the same result in that compaction should not start but is more expensive.
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pages isolated for migration are accounted with the vmstat counters
NR_ISOLATE_[ANON|FILE]. Callers of migrate_pages() are expected to
increment these counters when pages are isolated from the LRU. Once the
pages have been migrated, they are put back on the LRU or freed and the
isolated count is decremented.
Memory failure is not properly accounting for pages it isolates causing
the NR_ISOLATED counters to be negative. On SMP builds, this goes
unnoticed as negative counters are treated as 0 due to expected per-cpu
drift. On UP builds, the counter is treated by too_many_isolated() as a
large value causing processes to enter D state during page reclaim or
compaction. This patch accounts for pages isolated by memory failure
correctly.
[mel@csn.ul.ie: rewrote changelog]
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Based on Michal Hocko's comment.
We are not draining per cpu cached charges during soft limit reclaim
because background reclaim doesn't care about charges. It tries to free
some memory and charges will not give any.
Cached charges might influence only selection of the biggest soft limit
offender but as the call is done only after the selection has been already
done it makes no change.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
For performance, memory cgroup caches some "charge" from res_counter into
per cpu cache. This works well but because it's cache, it needs to be
flushed in some cases. Typical cases are
1. when someone hit limit.
2. when rmdir() is called and need to charges to be 0.
But "1" has problem.
Recently, with large SMP machines, we see many kworker runs because of
flushing memcg's cache. Bad things in implementation are that even if a
cpu contains a cache for memcg not related to a memcg which hits limit,
drain code is called.
This patch does
A) check percpu cache contains a useful data or not.
B) check other asynchronous percpu draining doesn't run.
C) don't call local cpu callback.
(*)This patch avoid changing the calling condition with hard-limit.
When I run "cat 1Gfile > /dev/null" under 300M limit memcg,
[Before]
13767 kamezawa 20 0 98.6m 424 416 D 10.0 0.0 0:00.61 cat
58 root 20 0 0 0 0 S 0.6 0.0 0:00.09 kworker/2:1
60 root 20 0 0 0 0 S 0.6 0.0 0:00.08 kworker/4:1
4 root 20 0 0 0 0 S 0.3 0.0 0:00.02 kworker/0:0
57 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/1:1
61 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/5:1
62 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/6:1
63 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/7:1
[After]
2676 root 20 0 98.6m 416 416 D 9.3 0.0 0:00.87 cat
2626 kamezawa 20 0 15192 1312 920 R 0.3 0.0 0:00.28 top
1 root 20 0 19384 1496 1204 S 0.0 0.0 0:00.66 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0
[akpm@linux-foundation.org: make percpu_charge_mutex static, tweak comments]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Hierarchical reclaim doesn't swap out if memsw and resource limits are
thye same (memsw_is_minimum == true) because we would hit mem+swap limit
anyway (during hard limit reclaim).
If it comes to the soft limit we shouldn't consider memsw_is_minimum at
all because it doesn't make much sense. Either the soft limit is bellow
the hard limit and then we cannot hit mem+swap limit or the direct reclaim
takes a precedence.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 21a3c9646873 ("memcg: allocate memory cgroup structures in local
nodes") makes page_cgroup allocation as NUMA aware. But that caused a
problem https://bugzilla.kernel.org/show_bug.cgi?id=36192.
The problem was getting a NID from invalid struct pages, which was not
initialized because it was out-of-node, out of [node_start_pfn,
node_end_pfn)
Now, with sparsemem, page_cgroup_init scans pfn from 0 to max_pfn. But
this may scan a pfn which is not on any node and can access memmap which
is not initialized.
This makes page_cgroup_init() for SPARSEMEM node aware and remove a code
to get nid from page->flags. (Then, we'll use valid NID always.)
[akpm@linux-foundation.org: try to fix up comments]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 406eb0c9ba76 ("memcg: add memory.numastat api for numa
statistics") adds memory.numa_stat file for memory cgroup. But the file
permissions are wrong.
[kamezawa@bluextal linux-2.6]$ ls -l /cgroup/memory/A/memory.numa_stat
---------- 1 root root 0 Jun 9 18:36 /cgroup/memory/A/memory.numa_stat
This patch fixes the permission as
[root@bluextal kamezawa]# ls -l /cgroup/memory/A/memory.numa_stat
-r--r--r-- 1 root root 0 Jun 10 16:49 /cgroup/memory/A/memory.numa_stat
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When 1GB hugepages are allocated on a system, free(1) reports less
available memory than what really is installed in the box. Also, if the
total size of hugepages allocated on a system is over half of the total
memory size, CommitLimit becomes a negative number.
The problem is that gigantic hugepages (order > MAX_ORDER) can only be
allocated at boot with bootmem, thus its frames are not accounted to
'totalram_pages'. However, they are accounted to hugetlb_total_pages()
What happens to turn CommitLimit into a negative number is this
calculation, in fs/proc/meminfo.c:
allowed = ((totalram_pages - hugetlb_total_pages())
* sysctl_overcommit_ratio / 100) + total_swap_pages;
A similar calculation occurs in __vm_enough_memory() in mm/mmap.c.
Also, every vm statistic which depends on 'totalram_pages' will render
confusing values, as if system were 'missing' some part of its memory.
Impact of this bug:
When gigantic hugepages are allocated and sysctl_overcommit_memory ==
OVERCOMMIT_NEVER. In a such situation, __vm_enough_memory() goes through
the mentioned 'allowed' calculation and might end up mistakenly returning
-ENOMEM, thus forcing the system to start reclaiming pages earlier than it
would be ususal, and this could cause detrimental impact to overall
system's performance, depending on the workload.
Besides the aforementioned scenario, I can only think of this causing
annoyances with memory reports from /proc/meminfo and free(1).
[akpm@linux-foundation.org: standardize comment layout]
Reported-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Rafael Aquini <aquini@linux.com>
Acked-by: Russ Anderson <rja@sgi.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
During memory hotplug we refresh zonelists when we online a page in a new
zone. It means that the node's zonelist is not initialized until pages
are onlined. So for example, "nid" passed by MEM_GOING_ONLINE notifier
will point to NODE_DATA(nid) which has no zone fallback list. Moreover,
if we hot-add cpu-only nodes, alloc_pages() will do no fallback.
This patch makes a zonelist when a new pgdata is available.
Note: in production, at fujitsu, memory should be onlined before cpu
and our server didn't have any memory-less nodes and had no problems.
But recent changes in MEM_GOING_ONLINE+page_cgroup
will access not initialized zonelist of node.
Anyway, there are memory-less node and we need some care.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 56de7263fcf3 ("mm: compaction: direct compact when a high-order
allocation fails") introduced a check for cc->order == -1 in
compact_finished. We should continue compacting in that case because
the request came from userspace and there is no particular order to
compact for. Similar check has been added by 82478fb7 (mm: compaction:
prevent division-by-zero during user-requested compaction) for
compaction_suitable.
The check is, however, done after zone_watermark_ok which uses order as a
right hand argument for shifts. Not only watermark check is pointless if
we can break out without it but it also uses 1 << -1 which is not well
defined (at least from C standard). Let's move the -1 check above
zone_watermark_ok.
[minchan.kim@gmail.com> - caught compaction_suitable]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Running a ktest.pl test, I hit the following bug on x86_32:
------------[ cut here ]------------
WARNING: at arch/x86/mm/highmem_32.c:81 __kunmap_atomic+0x64/0xc1()
Hardware name:
Modules linked in:
Pid: 93, comm: sh Not tainted 2.6.39-test+ #1
Call Trace:
[<c04450da>] warn_slowpath_common+0x7c/0x91
[<c042f5df>] ? __kunmap_atomic+0x64/0xc1
[<c042f5df>] ? __kunmap_atomic+0x64/0xc1^M
[<c0445111>] warn_slowpath_null+0x22/0x24
[<c042f5df>] __kunmap_atomic+0x64/0xc1
[<c04d4a22>] unmap_vmas+0x43a/0x4e0
[<c04d9065>] exit_mmap+0x91/0xd2
[<c0443057>] mmput+0x43/0xad
[<c0448358>] exit_mm+0x111/0x119
[<c044855f>] do_exit+0x1ff/0x5fa
[<c0454ea2>] ? set_current_blocked+0x3c/0x40
[<c0454f24>] ? sigprocmask+0x7e/0x8e
[<c0448b55>] do_group_exit+0x65/0x88
[<c0448b90>] sys_exit_group+0x18/0x1c
[<c0c3915f>] sysenter_do_call+0x12/0x38
---[ end trace 8055f74ea3c0eb62 ]---
Running a ktest.pl git bisect, found the culprit: commit e303297e6c3a
("mm: extended batches for generic mmu_gather")
But although this was the commit triggering the bug, it was not the one
originally responsible for the bug. That was commit d16dfc550f53 ("mm:
mmu_gather rework").
The code in zap_pte_range() has something that looks like the following:
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
do {
[...]
} while (pte++, addr += PAGE_SIZE, addr != end);
pte_unmap_unlock(pte - 1, ptl);
The pte starts off pointing at the first element in the page table
directory that was returned by the pte_offset_map_lock(). When it's done
with the page, pte will be pointing to anything between the next entry and
the first entry of the next page inclusive. By doing a pte - 1, this puts
the pte back onto the original page, which is all that pte_unmap_unlock()
needs.
In most archs (64 bit), this is not an issue as the pte is ignored in the
pte_unmap_unlock(). But on 32 bit archs, where things may be kmapped, it
is essential that the pte passed to pte_unmap_unlock() resides on the same
page that was given by pte_offest_map_lock().
The problem came in d16dfc55 ("mm: mmu_gather rework") where it introduced
a "break;" from the while loop. This alone did not seem to easily trigger
the bug. But the modifications made by e303297e6 caused that "break;" to
be hit on the first iteration, before the pte++.
The pte not being incremented will now cause pte_unmap_unlock(pte - 1) to
be pointing to the previous page. This will cause the wrong page to be
unmapped, and also trigger the warning above.
The simple solution is to just save the pointer given by
pte_offset_map_lock() and use it in the unlock.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
While testing for memcg aware swap token, I observed a swap token was
often grabbed an intermittent running process (eg init, auditd) and they
never release a token.
Why?
Some processes (eg init, auditd, audispd) wake up when a process exiting.
And swap token can be get first page-in process when a process exiting
makes no swap token owner. Thus such above intermittent running process
often get a token.
And currently, swap token priority is only decreased at page fault path.
Then, if the process sleep immediately after to grab swap token, the swap
token priority never be decreased. That's obviously undesirable.
This patch implement very poor (and lightweight) priority aging. It only
be affect to the above corner case and doesn't change swap tendency
workload performance (eg multi process qsbench load)
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is useful for observing swap token activity.
example output:
zsh-1845 [000] 598.962716: update_swap_token_priority:
mm=ffff88015eaf7700 old_prio=1 new_prio=0
memtoy-1830 [001] 602.033900: update_swap_token_priority:
mm=ffff880037a45880 old_prio=947 new_prio=949
memtoy-1830 [000] 602.041509: update_swap_token_priority:
mm=ffff880037a45880 old_prio=949 new_prio=951
memtoy-1830 [000] 602.051959: update_swap_token_priority:
mm=ffff880037a45880 old_prio=951 new_prio=953
memtoy-1830 [000] 602.052188: update_swap_token_priority:
mm=ffff880037a45880 old_prio=953 new_prio=955
memtoy-1830 [001] 602.427184: put_swap_token:
token_mm=ffff880037a45880
zsh-1789 [000] 602.427281: replace_swap_token:
old_token_mm= (null) old_prio=0 new_token_mm=ffff88015eaf7018
new_prio=2
zsh-1789 [001] 602.433456: update_swap_token_priority:
mm=ffff88015eaf7018 old_prio=2 new_prio=4
zsh-1789 [000] 602.437613: update_swap_token_priority:
mm=ffff88015eaf7018 old_prio=4 new_prio=6
zsh-1789 [000] 602.443924: update_swap_token_priority:
mm=ffff88015eaf7018 old_prio=6 new_prio=8
zsh-1789 [000] 602.451873: update_swap_token_priority:
mm=ffff88015eaf7018 old_prio=8 new_prio=10
zsh-1789 [001] 602.462639: update_swap_token_priority:
mm=ffff88015eaf7018 old_prio=10 new_prio=12
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel<riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently, memcg reclaim can disable swap token even if the swap token mm
doesn't belong in its memory cgroup. It's slightly risky. If an admin
creates very small mem-cgroup and silly guy runs contentious heavy memory
pressure workload, every tasks are going to lose swap token and then
system may become unresponsive. That's bad.
This patch adds 'memcg' parameter into disable_swap_token(). and if the
parameter doesn't match swap token, VM doesn't disable it.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Rik van Riel<riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix new kernel-doc warnings in mm/memory.c:
Warning(mm/memory.c:1327): No description found for parameter 'tlb'
Warning(mm/memory.c:1327): Excess function parameter 'tlbp' description in 'unmap_vmas'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Johannes noticed the vmstat update is already taken care of by
khugepaged_alloc_hugepage() internally. The only places that are required
to update the vmstat are the callers of alloc_hugepage (callers of
khugepaged_alloc_hugepage aren't).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
SLAB: Record actual last user of freed objects.
slub: always align cpu_slab to honor cmpxchg_double requirement
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
vfs: make unlink() and rmdir() return ENOENT in preference to EROFS
lmLogOpen() broken failure exit
usb: remove bad dput after dentry_unhash
more conservative S_NOSEC handling
|