linux/mm, branch v3.14

mm: fix swapops.h:131 bug if remap_file_pages raced migration

2014-03-21T05:09:09Z

Add remove_linear_migration_ptes_from_nonlinear(), to fix an interesting little include/linux/swapops.h:131 BUG_ON(!PageLocked) found by trinity: indicating that remove_migration_ptes() failed to find one of the migration entries that was temporarily inserted. The problem comes from remap_file_pages()'s switch from vma_interval_tree (good for inserting the migration entry) to i_mmap_nonlinear list (no good for locating it again); but can only be a problem if the remap_file_pages() range does not cover the whole of the vma (zap_pte() clears the range). remove_migration_ptes() needs a file_nonlinear method to go down the i_mmap_nonlinear list, applying linear location to look for migration entries in those vmas too, just in case there was this race. The file_nonlinear method does need rmap_walk_control.arg to do this; but it never needed vma passed in - vma comes from its own iteration. Reported-and-tested-by: Dave Jones Reported-and-tested-by: Sasha Levin Signed-off-by: Hugh Dickins Signed-off-by: Linus Torvalds

mm: fix bad rss-counter if remap_file_pages raced migration

2014-03-19T23:21:49Z

Fix some "Bad rss-counter state" reports on exit, arising from the interaction between page migration and remap_file_pages(): zap_pte() must count a migration entry when zapping it. And yes, it is possible (though very unusual) to find an anon page or swap entry in a VM_SHARED nonlinear mapping: coming from that horrid get_user_pages(write, force) case which COWs even in a shared mapping. Signed-off-by: Hugh Dickins Tested-by: Sasha Levin sasha.levin@oracle.com> Tested-by: Dave Jones davej@redhat.com> Cc: Cyrill Gorcunov Cc: Andrew Morton Signed-off-by: Linus Torvalds

mm/Kconfig: fix URL for zsmalloc benchmark

2014-03-11T00:26:20Z

The help text for CONFIG_PGTABLE_MAPPING has an incorrect URL. While we're at it, remove the unnecessary footnote notation. Signed-off-by: Ben Hutchings Acked-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm/compaction: break out of loop on !PageBuddy in isolate_freepages_block

2014-03-11T00:26:20Z

We received several reports of bad page state when freeing CMA pages previously allocated with alloc_contig_range: BUG: Bad page state in process Binder_A pfn:63202 page:d21130b0 count:0 mapcount:1 mapping: (null) index:0x7dfbf page flags: 0x40080068(uptodate|lru|active|swapbacked) Based on the page state, it looks like the page was still in use. The page flags do not make sense for the use case though. Further debugging showed that despite alloc_contig_range returning success, at least one page in the range still remained in the buddy allocator. There is an issue with isolate_freepages_block. In strict mode (which CMA uses), if any pages in the range cannot be isolated, isolate_freepages_block should return failure 0. The current check keeps track of the total number of isolated pages and compares against the size of the range: if (strict && nr_strict_required > total_isolated) total_isolated = 0; After taking the zone lock, if one of the pages in the range is not in the buddy allocator, we continue through the loop and do not increment total_isolated. If in the last iteration of the loop we isolate more than one page (e.g. last page needed is a higher order page), the check for total_isolated may pass and we fail to detect that a page was skipped. The fix is to bail out if the loop immediately if we are in strict mode. There's no benfit to continuing anyway since we need all pages to be isolated. Additionally, drop the error checking based on nr_strict_required and just check the pfn ranges. This matches with what isolate_freepages_range does. Signed-off-by: Laura Abbott Acked-by: Minchan Kim Cc: Mel Gorman Acked-by: Vlastimil Babka Cc: Joonsoo Kim Acked-by: Bartlomiej Zolnierkiewicz Acked-by: Michal Nazarewicz Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: fix GFP_THISNODE callers and clarify

2014-03-11T00:26:19Z

GFP_THISNODE is for callers that implement their own clever fallback to remote nodes. It restricts the allocation to the specified node and does not invoke reclaim, assuming that the caller will take care of it when the fallback fails, e.g. through a subsequent allocation request without GFP_THISNODE set. However, many current GFP_THISNODE users only want the node exclusive aspect of the flag, without actually implementing their own fallback or triggering reclaim if necessary. This results in things like page migration failing prematurely even when there is easily reclaimable memory available, unless kswapd happens to be running already or a concurrent allocation attempt triggers the necessary reclaim. Convert all callsites that don't implement their own fallback strategy to __GFP_THISNODE. This restricts the allocation a single node too, but at the same time allows the allocator to enter the slowpath, wake kswapd, and invoke direct reclaim if necessary, to make the allocation happen when memory is full. Signed-off-by: Johannes Weiner Acked-by: Rik van Riel Cc: Jan Stancek Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: page_alloc: exempt GFP_THISNODE allocations from zone fairness

2014-03-04T15:55:50Z

Jan Stancek reports manual page migration encountering allocation failures after some pages when there is still plenty of memory free, and bisected the problem down to commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy"). The problem is that GFP_THISNODE obeys the zone fairness allocation batches on one hand, but doesn't reset them and wake kswapd on the other hand. After a few of those allocations, the batches are exhausted and the allocations fail. Fixing this means either having GFP_THISNODE wake up kswapd, or GFP_THISNODE not participating in zone fairness at all. The latter seems safer as an acute bugfix, we can clean up later. Reported-by: Jan Stancek Signed-off-by: Johannes Weiner Acked-by: Rik van Riel Acked-by: Mel Gorman Cc: [3.12+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid m(un)locking

2014-03-04T15:55:48Z

Daniel Borkmann reported a VM_BUG_ON assertion failing: ------------[ cut here ]------------ kernel BUG at mm/mlock.c:528! invalid opcode: 0000 [#1] SMP Modules linked in: ccm arc4 iwldvm [...] video CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8 Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012 task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000 RIP: 0010:[] [] munlock_vma_pages_range+0x2e0/0x2f0 Call Trace: do_munmap+0x18f/0x3b0 vm_munmap+0x41/0x60 SyS_munmap+0x22/0x30 system_call_fastpath+0x1a/0x1f RIP munlock_vma_pages_range+0x2e0/0x2f0 ---[ end trace a0088dcf07ae10f2 ]--- because munlock_vma_pages_range() thinks it's unexpectedly in the middle of a THP page. This can be reproduced with default config since 3.11 kernels. A reproducer can be found in the kernel's selftest directory for networking by running ./psock_tpacket. The problem is that an order=2 compound page (allocated by alloc_one_pg_vec_page() is part of the munlocked VM_MIXEDMAP vma (mapped by packet_mmap()) and mistaken for a THP page and assumed to be order=9. The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages"), i.e. since 3.9, but did not trigger a bug. It just makes munlock_vma_pages_range() skip such compound pages until the next 512-pages-aligned page, when it encounters a head page. This is however not a problem for vma's where mlocking has no effect anyway, but it can distort the accounting. Since commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and munlock+putback using pagevec") this can trigger a VM_BUG_ON in PageTransHuge() check. This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a list of flags that make vma's non-mlockable and non-mergeable. The reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is already on the VM_SPECIAL list, and both are intended for non-LRU pages where mlocking makes no sense anyway. Related Lkml discussion can be found in [2]. [1] tools/testing/selftests/net/psock_tpacket [2] https://lkml.org/lkml/2014/1/10/427 Signed-off-by: Vlastimil Babka Signed-off-by: Daniel Borkmann Reported-by: Daniel Borkmann Tested-by: Daniel Borkmann Cc: Thomas Hellstrom Cc: John David Anglin Cc: HATAYAMA Daisuke Cc: Konstantin Khlebnikov Cc: Carsten Otte Cc: Jared Hulbert Tested-by: Hannes Frederic Sowa Cc: Kirill A. Shutemov Acked-by: Rik van Riel Cc: Andrea Arcangeli Cc: [3.11.x+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

memcg: reparent charges of children before processing parent

2014-03-04T15:55:48Z

Sometimes the cleanup after memcg hierarchy testing gets stuck in mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0. There may turn out to be several causes, but a major cause is this: the workitem to offline parent can get run before workitem to offline child; parent's mem_cgroup_reparent_charges() circles around waiting for the child's pages to be reparented to its lrus, but it's holding cgroup_mutex which prevents the child from reaching its mem_cgroup_reparent_charges(). Further testing showed that an ordered workqueue for cgroup_destroy_wq is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched stage on the way can mess up the order before reaching the workqueue. Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on all its children (and grandchildren, in the correct order) to have their charges reparented first. Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction") Signed-off-by: Filipe Brandenburger Signed-off-by: Hugh Dickins Reviewed-by: Tejun Heo Acked-by: Michal Hocko Cc: Johannes Weiner Cc: [v3.10+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

memcg: fix endless loop in __mem_cgroup_iter_next()

2014-03-04T15:55:47Z

Commit 0eef615665ed ("memcg: fix css reference leak and endless loop in mem_cgroup_iter") got the interaction with the commit a few before it d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully initialized") slightly wrong, and we didn't notice at the time. It's elusive, and harder to get than the original, but for a couple of days before rc1, I several times saw a endless loop similar to that supposedly being fixed. This time it was a tighter loop in __mem_cgroup_iter_next(): because we can get here when our root has already been offlined, and the ordering of conditions was such that we then just cycled around forever. Fixes: 0eef615665ed ("memcg: fix css reference leak and endless loop in mem_cgroup_iter"). Signed-off-by: Hugh Dickins Acked-by: Michal Hocko Cc: Johannes Weiner Cc: Greg Thelen Cc: [3.12+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: close PageTail race

2014-03-04T15:55:47Z

Commit bf6bddf1924e ("mm: introduce compaction and migration for ballooned pages") introduces page_count(page) into memory compaction which dereferences page->first_page if PageTail(page). This results in a very rare NULL pointer dereference on the aforementioned page_count(page). Indeed, anything that does compound_head(), including page_count() is susceptible to racing with prep_compound_page() and seeing a NULL or dangling page->first_page pointer. This patch uses Andrea's implementation of compound_trans_head() that deals with such a race and makes it the default compound_head() implementation. This includes a read memory barrier that ensures that if PageTail(head) is true that we return a head page that is neither NULL nor dangling. The patch then adds a store memory barrier to prep_compound_page() to ensure page->first_page is set. This is the safest way to ensure we see the head page that we are expecting, PageTail(page) is already in the unlikely() path and the memory barriers are unfortunately required. Hugetlbfs is the exception, we don't enforce a store memory barrier during init since no race is possible. Signed-off-by: David Rientjes Cc: Holger Kiehl Cc: Christoph Lameter Cc: Rafael Aquini Cc: Vlastimil Babka Cc: Michal Hocko Cc: Mel Gorman Cc: Andrea Arcangeli Cc: Rik van Riel Cc: "Kirill A. Shutemov" Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds