linux/mm, branch v3.2.38

mm: use aligned zone start for pfn_to_bitidx calculation

2013-02-06T04:33:31Z

commit c060f943d0929f3e429c5d9522290584f6281d6e upstream. The current calculation in pfn_to_bitidx assumes that (pfn - zone->zone_start_pfn) >> pageblock_order will return the same bit for all pfn in a pageblock. If zone_start_pfn is not aligned to pageblock_nr_pages, this may not always be correct. Consider the following with pageblock order = 10, zone start 2MB: pfn | pfn - zone start | (pfn - zone start) >> page block order ---------------------------------------------------------------- 0x26000 | 0x25e00 | 0x97 0x26100 | 0x25f00 | 0x97 0x26200 | 0x26000 | 0x98 0x26300 | 0x26100 | 0x98 This means that calling {get,set}_pageblock_migratetype on a single page will not set the migratetype for the full block. Fix this by rounding down zone_start_pfn when doing the bitidx calculation. For our use case, the effects of this bug were mostly tied to the fact that CMA allocations would either take a long time or fail to happen. Depending on the driver using CMA, this could result in anything from visual glitches to application failures. Signed-off-by: Laura Abbott Acked-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings

mm: compaction: fix echo 1 > compact_memory return error issue

2013-02-06T04:33:31Z

commit 7964c06d66c76507d8b6b662bffea770c29ef0ce upstream. when run the folloing command under shell, it will return error sh/$ echo 1 > /proc/sys/vm/compact_memory sh/$ sh: write error: Bad address After strace, I found the following log: ... write(1, "1\n", 2) = 3 write(1, "", 4294967295) = -1 EFAULT (Bad address) write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address ) = 31 This tells system return 3(COMPACT_COMPLETE) after write data to compact_memory. The fix is to make the system just return 0 instead 3(COMPACT_COMPLETE) from sysctl_compaction_handler after compaction_nodes finished. Signed-off-by: Jason Liu Suggested-by: David Rientjes Acked-by: Mel Gorman Cc: Rik van Riel Cc: Minchan Kim Cc: KAMEZAWA Hiroyuki Acked-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings

mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT

2013-01-16T01:13:20Z

commit 53a59fc67f97374758e63a9c785891ec62324c81 upstream. Since commit e303297e6c3a ("mm: extended batches for generic mmu_gather") we are batching pages to be freed until either tlb_next_batch cannot allocate a new batch or we are done. This works just fine most of the time but we can get in troubles with non-preemptible kernel (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY) on large machines where too aggressive batching might lead to soft lockups during process exit path (exit_mmap) because there are no scheduling points down the free_pages_and_swap_cache path and so the freeing can take long enough to trigger the soft lockup. The lockup is harmless except when the system is setup to panic on softlockup which is not that unusual. The simplest way to work around this issue is to limit the maximum number of batches in a single mmu_gather. 10k of collected pages should be safe to prevent from soft lockups (we would have 2ms for one) even if they are all freed without an explicit scheduling point. This patch doesn't add any new explicit scheduling points because it relies on zap_pmd_range during page tables zapping which calls cond_resched per PMD. The following lockup has been reported for 3.0 kernel with a huge process (in order of hundreds gigs but I do know any more details). BUG: soft lockup - CPU#56 stuck for 22s! [kernel:31053] Modules linked in: af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc mptctl mptbase autofs4 binfmt_misc dm_round_robin dm_multipath bonding cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode fuse loop osst sg sd_mod crc_t10dif st qla2xxx scsi_transport_fc scsi_tgt netxen_nic i7core_edac iTCO_wdt joydev e1000e serio_raw pcspkr edac_core iTCO_vendor_support acpi_power_meter rtc_cmos hpwdt hpilo button container usbhid hid dm_mirror dm_region_hash dm_log linear uhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh dm_snapshot pcnet32 mii edd dm_mod raid1 ext3 mbcache jbd fan thermal processor thermal_sys hwmon cciss scsi_mod Supported: Yes CPU 56 Pid: 31053, comm: kernel Not tainted 3.0.31-0.9-default #1 HP ProLiant DL580 G7 RIP: 0010: _raw_spin_unlock_irqrestore+0x8/0x10 RSP: 0018:ffff883ec1037af0 EFLAGS: 00000206 RAX: 0000000000000e00 RBX: ffffea01a0817e28 RCX: ffff88803ffd9e80 RDX: 0000000000000200 RSI: 0000000000000206 RDI: 0000000000000206 RBP: 0000000000000002 R08: 0000000000000001 R09: ffff887ec724a400 R10: 0000000000000000 R11: dead000000200200 R12: ffffffff8144c26e R13: 0000000000000030 R14: 0000000000000297 R15: 000000000000000e FS: 00007ed834282700(0000) GS:ffff88c03f200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 000000000068b240 CR3: 0000003ec13c5000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kernel (pid: 31053, threadinfo ffff883ec1036000, task ffff883ebd5d4100) Call Trace: release_pages+0xc5/0x260 free_pages_and_swap_cache+0x9d/0xc0 tlb_flush_mmu+0x5c/0x80 tlb_finish_mmu+0xe/0x50 exit_mmap+0xbd/0x120 mmput+0x49/0x120 exit_mm+0x122/0x160 do_exit+0x17a/0x430 do_group_exit+0x3d/0xb0 get_signal_to_deliver+0x247/0x480 do_signal+0x71/0x1b0 do_notify_resume+0x98/0xb0 int_signal+0x12/0x17 DWARF2 unwinder stuck at int_signal+0x12/0x17 Signed-off-by: Michal Hocko Cc: Mel Gorman Cc: Rik van Riel Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings

tmpfs mempolicy: fix /proc/mounts corrupting memory

2013-01-16T01:13:14Z

commit f2a07f40dbc603c15f8b06e6ec7f768af67b424f upstream. Recently I suggested using "mount -o remount,mpol=local /tmp" in NUMA mempolicy testing. Very nasty. Reading /proc/mounts, /proc/pid/mounts or /proc/pid/mountinfo may then corrupt one bit of kernel memory, often in a page table (causing "Bad swap" or "Bad page map" warning or "Bad pagetable" oops), sometimes in a vm_area_struct or rbnode or somewhere worse. "mpol=prefer" and "mpol=prefer:Node" are equally toxic. Recent NUMA enhancements are not to blame: this dates back to 2.6.35, when commit e17f74af351c "mempolicy: don't call mpol_set_nodemask() when no_context" skipped mpol_parse_str()'s call to mpol_set_nodemask(), which used to initialize v.preferred_node, or set MPOL_F_LOCAL in flags. With slab poisoning, you can then rely on mpol_to_str() to set the bit for node 0x6b6b, probably in the next page above the caller's stack. mpol_parse_str() is only called from shmem_parse_options(): no_context is always true, so call it unused for now, and remove !no_context code. Set v.nodes or v.preferred_node or MPOL_F_LOCAL as mpol_to_str() might expect. Then mpol_to_str() can ignore its no_context argument also, the mpol being appropriately initialized whether contextualized or not. Rename its no_context unused too, and let subsequent patch remove them (that's not needed for stable backporting, which would involve rejects). I don't understand why MPOL_LOCAL is described as a pseudo-policy: it's a reasonable policy which suffers from a confusing implementation in terms of MPOL_PREFERRED with MPOL_F_LOCAL. I believe this would be much more robust if MPOL_LOCAL were recognized in switch statements throughout, MPOL_F_LOCAL deleted, and MPOL_PREFERRED use the (possibly empty) nodes mask like everyone else, instead of its preferred_node variant (I presume an optimization from the days before MPOL_LOCAL). But that would take me too long to get right and fully tested. Signed-off-by: Hugh Dickins Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings

thp, memcg: split hugepage for memcg oom on cow

2013-01-03T03:33:55Z

commit 1f1d06c34f7675026326cd9f39ff91e4555cf355 upstream. On COW, a new hugepage is allocated and charged to the memcg. If the system is oom or the charge to the memcg fails, however, the fault handler will return VM_FAULT_OOM which results in an oom kill. Instead, it's possible to fallback to splitting the hugepage so that the COW results only in an order-0 page being allocated and charged to the memcg which has a higher liklihood to succeed. This is expensive because the hugepage must be split in the page fault handler, but it is much better than unnecessarily oom killing a process. Signed-off-by: David Rientjes Cc: Andrea Arcangeli Cc: Johannes Weiner Acked-by: KAMEZAWA Hiroyuki Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Ben Hutchings

Revert "mm: vmscan: fix endless loop in kswapd balancing"

2013-01-03T03:33:55Z

This reverts commit 39d18dc4b8b0c000fa681cbae10ac3f8a132814b which was commit 60cefed485a02bd99b6299dad70666fe49245da7 upstream. This was not needed and is not suitable for 3.2.y. Reported-by: Johannes Weiner Signed-off-by: Ben Hutchings

mm: dmapool: use provided gfp flags for all dma_alloc_coherent() calls

2013-01-03T03:33:36Z

commit 387870f2d6d679746020fa8e25ef786ff338dc98 upstream. dmapool always calls dma_alloc_coherent() with GFP_ATOMIC flag, regardless the flags provided by the caller. This causes excessive pruning of emergency memory pools without any good reason. Additionaly, on ARM architecture any driver which is using dmapools will sooner or later trigger the following error: "ERROR: 256 KiB atomic DMA coherent pool is too small! Please increase it with coherent_pool= kernel parameter!". Increasing the coherent pool size usually doesn't help much and only delays such error, because all GFP_ATOMIC DMA allocations are always served from the special, very limited memory pool. This patch changes the dmapool code to correctly use gfp flags provided by the dmapool caller. Reported-by: Soeren Moch Reported-by: Thomas Petazzoni Signed-off-by: Marek Szyprowski Tested-by: Andrew Lunn Tested-by: Soeren Moch Signed-off-by: Ben Hutchings

mm: highmem: export kmap_to_page for modules

2013-01-03T03:32:57Z

commit f0263d2d222e9e25f2587e51a9dc58c6fb2a9352 upstream. Some virtio device drivers (9p) need to translate high virtual addresses to physical addresses, which are inserted into the virtqueue for processing by userspace. This patch exports the kmap_to_page symbol, so that the affected drivers can be compiled as modules. Signed-off-by: Will Deacon Signed-off-by: Rusty Russell Signed-off-by: Ben Hutchings

mm: add kmap_to_page()

2013-01-03T03:32:57Z

This is extracted from Mel Gorman's commit 5a178119b0fb ('mm: add support for direct_IO to highmem pages') upstream. Required to backport commit b9cdc88df8e6 ('virtio: 9p: correctly pass physical address to userspace for high pages'). Cc: Mel Gorman Cc: Rik van Riel Cc: Andrew Morton Signed-off-by: Ben Hutchings

tmpfs: fix shared mempolicy leak

2013-01-03T03:32:54Z

commit 18a2f371f5edf41810f6469cb9be39931ef9deb9 upstream. This fixes a regression in 3.7-rc, which has since gone into stable. Commit 00442ad04a5e ("mempolicy: fix a memory corruption by refcount imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went on expecting alloc_page_vma() to drop the refcount it had acquired. This deserves a rework: but for now fix the leak in shmem_alloc_page(). Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use the same refcounting there as in shmem_alloc_page(), delete its onstack mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() - those were invented to let swapin_readahead() make an unknown number of calls to alloc_pages_vma() with one mempolicy; but since 00442ad04a5e, alloc_pages_vma() has kept refcount in balance, so now no problem. Reported-and-tested-by: Tommi Rantala Signed-off-by: Mel Gorman Signed-off-by: Hugh Dickins Signed-off-by: Linus Torvalds [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings