Age  Commit message  Author
2012-12-11  mm: migrate: Add a tracepoint for migrate_pages  Mel Gorman
The pgmigrate_success and pgmigrate_fail vmstat counters tell the user about migration activity but not the type or the reason. This patch adds a tracepoint to identify the type of page migration and why the page is being migrated. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11  mm: compaction: Move migration fail/success stats to migrate.c  Mel Gorman
The compact_pages_moved and compact_pagemigrate_failed events are convenient for determining if compaction is active and to what degree migration is succeeding but they are at the wrong level. Other users of migration may also want to know if migration is working properly and this will be particularly true for any automated NUMA migration. This patch moves the counters down to migration with the new events called pgmigrate_success and pgmigrate_fail. The compact_blocks_moved counter is removed because while it was useful for debugging initially, it's worthless now as no meaningful conclusions can be drawn from its value. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11  mm: Optimize the TLB flush of sys_mprotect() and change_protection() users  Ingo Molnar
Reuse the NUMA code's 'modified page protections' count that change_protection() computes and skip the TLB flush if there are no changes to a range that sys_mprotect() modifies.

Given that mprotect() already optimizes the same-flags case I expected this optimization to dominantly trigger on CONFIG_NUMA_BALANCING=y kernels - but even with that feature disabled it triggers rather often.

There are two reasons for that:

1) sys_mprotect() already optimizes the same-flag case:

        if (newflags == oldflags) {
                *pprev = vma;
                return 0;
        }

   This test works in many cases, but it is too sharp in some others, where it differentiates between protection values that the underlying PTE format makes no distinction about, such as PROT_EXEC == PROT_READ on x86.

2) Even where a vma flag change does necessitate a modification of the pagetables, there might be no pagetables yet to modify: they might not be instantiated yet.

During a regular desktop bootup this optimization hits a couple of hundred times. During a Java test I measured thousands of hits. So this optimization improves sys_mprotect() in general, not just CONFIG_NUMA_BALANCING=y kernels.

[ We could further increase the efficiency of this optimization if change_pte_range() and change_huge_pmd() were a bit smarter about recognizing exact-same-value protection masks - when the hardware can do that safely. This would probably further speed up mprotect(). ]

Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
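[ Illustration only, not part of the patch: a minimal userspace sketch of the "count what changed, flush only if the count is non-zero" idea described above. All toy_* names are hypothetical stand-ins, and the "flush" is just a counter. As in the message, entries that are not instantiated are skipped, while entries rewritten with an identical value are still counted. ]

```c
#include <stddef.h>

/* Hypothetical stand-in for a page table entry. */
struct toy_pte {
    int present;        /* 0 when the entry is not instantiated yet */
    unsigned int prot;  /* protection bits as the hardware would see them */
};

/* Counts simulated TLB flushes so the effect can be observed. */
static unsigned long toy_tlb_flushes;

/* Rewrites protections over a range and returns how many entries it touched. */
static unsigned long toy_change_protection(struct toy_pte *ptes, size_t n,
                                           unsigned int newprot)
{
    unsigned long pages = 0;

    for (size_t i = 0; i < n; i++) {
        if (!ptes[i].present)
            continue;           /* no page table entry to modify */
        ptes[i].prot = newprot;
        pages++;
    }
    return pages;
}

/* Flushes only when at least one entry was actually modified. */
static void toy_mprotect_range(struct toy_pte *ptes, size_t n,
                               unsigned int newprot)
{
    if (toy_change_protection(ptes, n, newprot))
        toy_tlb_flushes++;
}
```

A range with no instantiated entries therefore costs no flush at all, which is the second case listed in the message.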
2012-12-11  mm: Count the number of pages affected in change_protection()  Peter Zijlstra
This will be used for three kinds of purposes: - to optimize mprotect() - to speed up working set scanning for working set areas that have not been touched - to more accurately scan per real working set No change in functionality from this patch. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-12-11  mm: Check if PTE is already allocated during page fault  Mel Gorman
With transparent hugepage support, handle_mm_fault() has to be careful that a normal PMD has been established before handling a PTE fault. To achieve this, it used __pte_alloc() directly instead of pte_alloc_map as pte_alloc_map is unsafe to run against a huge PMD. pte_offset_map() is called once it is known the PMD is safe.

pte_alloc_map() is smart enough to check if a PTE is already present before calling __pte_alloc but this check was lost. As a consequence, PTEs may be allocated unnecessarily and the page table lock taken. This useless PTE does get cleaned up but it's a performance hit which is visible in page_test from aim9.

This patch simply re-adds the check normally done by pte_alloc_map to check if the PTE needs to be allocated before taking the page table lock. The effect is noticeable in page_test from aim9.

AIM9
                2.6.38-vanilla    2.6.38-checkptenone
creat-clo       446.10 ( 0.00%)   424.47 (-5.10%)
page_test        38.10 ( 0.00%)    42.04 ( 9.37%)
brk_test         52.45 ( 0.00%)    51.57 (-1.71%)
exec_test       382.00 ( 0.00%)   456.90 (16.39%)
fork_test        60.11 ( 0.00%)    67.79 (11.34%)

MMTests Statistics: duration
Total Elapsed Time (seconds)    611.90    612.22

(While this affects 2.6.38, it is a performance rather than a functional bug and normally outside the rules for -stable. While the big performance differences are to a microbench, the difference in fork and exec performance may be significant enough that -stable wants to consider the patch)

Reported-by: Raz Ben Yehuda <raziebe@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rik van Riel <riel@redhat.com> [ Picked this up from the AutoNUMA tree to help it upstream and to allow apples-to-apples performance comparisons. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
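[ Illustration only, not the kernel code: a small userspace model of "check whether the table already exists before entering the allocating slow path". The toy_* names are hypothetical, the lock is modelled as a counter, and calloc() stands in for the page table allocation. ]

```c
#include <stdlib.h>

/* Hypothetical stand-in for a PMD entry that may point at a PTE table. */
struct toy_pmd {
    void *pte_table;
};

/* Counts how often the simulated page table lock is taken. */
static int toy_lock_acquisitions;

/* Slow path: allocate a table, take the lock, install it unless one exists. */
static int toy_pte_alloc(struct toy_pmd *pmd)
{
    void *new_table = calloc(1, 4096);

    if (!new_table)
        return -1;

    toy_lock_acquisitions++;        /* lock taken here */
    if (!pmd->pte_table) {
        pmd->pte_table = new_table;
        new_table = NULL;
    }
    /* lock released here */

    free(new_table);                /* frees the unused table; free(NULL) is a no-op */
    return 0;
}

/* Fault path with the cheap check in front of the slow path. */
static int toy_handle_fault(struct toy_pmd *pmd)
{
    if (pmd->pte_table == NULL && toy_pte_alloc(pmd) != 0)
        return -1;
    return 0;
}
```

Without the `pmd->pte_table == NULL` test, every fault would allocate a table, take the lock and then throw the table away again, which is the wasted work the message describes.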
2012-12-11  mm: Only flush the TLB when clearing an accessible pte  Rik van Riel
If ptep_clear_flush() is called to clear a page table entry that is not accessible to the CPU anyway, e.g. a _PAGE_PROTNONE page table entry, there is no need to flush the TLB on remote CPUs. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-vm3rkzevahelwhejx5uwm8ex@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-12-11  x86/mm: Introduce pte_accessible()  Rik van Riel
We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that the pte is associated with a page. However, for TLB flushing purposes, we would like to know whether the pte points to an actually accessible page. This allows us to skip remote TLB flushes for pages that are not actually accessible. Fill in this method for x86 and provide a safe (but slower) method on other architectures. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Fixed-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-66p11te4uj23gevgh4j987ip@git.kernel.org [ Added Linus's review fixes. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
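[ Illustration only: a userspace sketch of the distinction drawn above between "present" (associated with a page) and "accessible" (the CPU could actually have used, and cached, the entry). The toy_* names and the bit positions are assumptions made for the example, not the kernel's definitions. ]

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy flag bits; the positions are chosen for illustration only. */
#define TOY_PAGE_PRESENT   (UINT64_C(1) << 0)
#define TOY_PAGE_PROTNONE  (UINT64_C(1) << 8)

typedef struct { uint64_t val; } toy_pte_t;

/* True when the entry is associated with a page, PROT_NONE included. */
static bool toy_pte_present(toy_pte_t pte)
{
    return (pte.val & (TOY_PAGE_PRESENT | TOY_PAGE_PROTNONE)) != 0;
}

/* True only when the hardware could have loaded the entry into a TLB. */
static bool toy_pte_accessible(toy_pte_t pte)
{
    return (pte.val & TOY_PAGE_PRESENT) != 0;
}

/* Safe but slower fallback: treat every entry as accessible, so always flush. */
static bool toy_pte_accessible_generic(toy_pte_t pte)
{
    (void)pte;
    return true;
}
```

A caller that clears an entry would then skip the remote flush whenever `toy_pte_accessible()` returns false for the old value.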
2012-12-11  mm,generic: only flush the local TLB in ptep_set_access_flags  Rik van Riel
The function ptep_set_access_flags is only ever used to upgrade access permissions to a page. That means the only negative side effect of not flushing remote TLBs is that other CPUs may incur spurious page faults, if they happen to access the same address, and still have a PTE with the old permissions cached in their TLB. Having another CPU maybe incur a spurious page fault is faster than always incurring the cost of a remote TLB flush, so replace the remote TLB flush with a purely local one. This should be safe on every architecture that correctly implements flush_tlb_fix_spurious_fault() to actually invalidate the local TLB entry that caused a page fault, as well as on architectures where the hardware invalidates TLB entries that cause page faults. In the unlikely event that you are hitting what appears to be an infinite loop of page faults, and 'git bisect' took you to this changeset, your architecture needs to implement flush_tlb_fix_spurious_fault to actually flush the TLB entry. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Michel Lespinasse <walken@google.com> Cc: Ingo Molnar <mingo@kernel.org>
2012-12-11  x86: mm: drop TLB flush from ptep_set_access_flags  Rik van Riel
Intel has an architectural guarantee that the TLB entry causing a page fault gets invalidated automatically. This means we should be able to drop the local TLB invalidation. Because of the way other areas of the page fault code work, chances are good that all x86 CPUs do this. However, if someone somewhere has an x86 CPU that does not invalidate the TLB entry causing a page fault, this one-liner should be easy to revert. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michel Lespinasse <walken@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com>
2012-12-11  x86: mm: only do a local tlb flush in ptep_set_access_flags()  Rik van Riel
The function ptep_set_access_flags() is only ever invoked to set access flags or add write permission on a PTE. The write bit is only ever set together with the dirty bit. Because we only ever upgrade a PTE, it is safe to skip flushing entries on remote TLBs. The worst that can happen is a spurious page fault on other CPUs, which would flush that TLB entry. Lazily letting another CPU incur a spurious page fault occasionally is (much!) cheaper than aggressively flushing everybody else's TLB. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Michel Lespinasse <walken@google.com> Cc: Ingo Molnar <mingo@kernel.org>
2012-11-16  Linux 3.7-rc6  (tag: v3.7-rc6)  Linus Torvalds
2012-11-16  Merge git://git.kernel.org/pub/scm/virt/kvm/kvm  Linus Torvalds
Pull KVM fix from Marcelo Tosatti: "A correction for oops on module init with older Intel hosts." * git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: Fix invalid secondary exec controls in vmx_cpuid_update()
2012-11-16  Merge branch 'akpm' (Fixes from Andrew)  Linus Torvalds
Merge misc fixes from Andrew Morton. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (12 patches) revert "mm: fix-up zone present pages" tmpfs: change final i_blocks BUG to WARNING tmpfs: fix shmem_getpage_gfp() VM_BUG_ON mm: highmem: don't treat PKMAP_ADDR(LAST_PKMAP) as a highmem address mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" rapidio: fix kernel-doc warnings swapfile: fix name leak in swapoff memcg: fix hotplugged memory zone oops mips, arc: fix build failure memcg: oom: fix totalpages calculation for memory.swappiness==0 mm: fix build warning for uninitialized value mm: add anon_vma_lock to validate_mm()
2012-11-16  revert "mm: fix-up zone present pages"  Andrew Morton
Revert commit 7f1290f2f2a4 ("mm: fix-up zone present pages") That patch tried to fix an issue when calculating zone->present_pages, but it caused a regression on 32bit systems with HIGHMEM. With that change, reset_zone_present_pages() resets all zone->present_pages to zero, and fixup_zone_present_pages() is called to recalculate zone->present_pages when the boot allocator frees core memory pages into buddy allocator. Because highmem pages are not freed by bootmem allocator, all highmem zones' present_pages become zero. Various options for improving the situation are being discussed but for now, let's return to the 3.6 code. Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Petr Tesarik <ptesarik@suse.cz> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Chris Clayton <chris2553@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16  tmpfs: change final i_blocks BUG to WARNING  Hugh Dickins
Under a particular load on one machine, I have hit shmem_evict_inode()'s BUG_ON(inode->i_blocks), enough times to narrow it down to a particular race between swapout and eviction. It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(), and the lack of coherent locking between mapping's nrpages and shmem's swapped count. There's a window in shmem_writepage(), between lowering nrpages in shmem_delete_from_page_cache() and then raising swapped count, when the freed count appears to be +1 when it should be 0, and then the asymmetry stops it from being corrected with -1 before hitting the BUG. One answer is coherent locking: using tree_lock throughout, without info->lock; reasonable, but the raw_spin_lock in percpu_counter_add() on used_blocks makes that messier than expected. Another answer may be a further effort to eliminate the weird shmem_recalc_inode() altogether, but previous attempts at that failed. So far undecided, but for now change the BUG_ON to WARN_ON: in usual circumstances it remains a useful consistency check. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16  tmpfs: fix shmem_getpage_gfp() VM_BUG_ON  Hugh Dickins
Fuzzing with trinity hit the "impossible" VM_BUG_ON(error) (which Fedora has converted to WARNING) in shmem_getpage_gfp(): WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70() Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49 Call Trace: warn_slowpath_common+0x7f/0xc0 warn_slowpath_null+0x1a/0x20 shmem_getpage_gfp+0xa5c/0xa70 shmem_fault+0x4f/0xa0 __do_fault+0x71/0x5c0 handle_pte_fault+0x97/0xae0 handle_mm_fault+0x289/0x350 __do_page_fault+0x18e/0x530 do_page_fault+0x2b/0x50 page_fault+0x28/0x30 tracesys+0xe1/0xe6 Thanks to Johannes for pointing to truncation: free_swap_and_cache() only does a trylock on the page, so the page lock we've held since before confirming swap is not enough to protect against truncation. What cleanup is needed in this case? Just delete_from_swap_cache(), which takes care of the memcg uncharge. Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Dave Jones <davej@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16  mm: highmem: don't treat PKMAP_ADDR(LAST_PKMAP) as a highmem address  Will Deacon
kmap_to_page returns the corresponding struct page for a virtual address of an arbitrary mapping. This works by checking whether the address falls in the pkmap region and using the pkmap page tables instead of the linear mapping if appropriate. Unfortunately, the bounds checking means that PKMAP_ADDR(LAST_PKMAP) is incorrectly treated as a highmem address and we can end up walking off the end of pkmap_page_table and subsequently passing junk to pte_page. This patch fixes the bound check to stay within the pkmap tables. Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
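[ Illustration only: a userspace sketch of the off-by-one described above, assuming the faulty check used an inclusive upper bound. The TOY_* constants are made-up values, not those of any real architecture. ]

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT  12
#define TOY_PKMAP_BASE  UINT64_C(0xff800000)  /* hypothetical start of the pkmap region */
#define TOY_LAST_PKMAP  512                   /* number of slots; valid indices are 0..511 */
#define TOY_PKMAP_ADDR(nr) \
    (TOY_PKMAP_BASE + ((uint64_t)(nr) << TOY_PAGE_SHIFT))

/* Inclusive upper bound: accepts the first address past the table. */
static bool toy_in_pkmap_inclusive(uint64_t addr)
{
    return addr >= TOY_PKMAP_ADDR(0) && addr <= TOY_PKMAP_ADDR(TOY_LAST_PKMAP);
}

/* Exclusive upper bound: stays within the table. */
static bool toy_in_pkmap_exclusive(uint64_t addr)
{
    return addr >= TOY_PKMAP_ADDR(0) && addr < TOY_PKMAP_ADDR(TOY_LAST_PKMAP);
}

/* Index into the (hypothetical) pkmap page table for an address. */
static uint64_t toy_pkmap_nr(uint64_t addr)
{
    return (addr - TOY_PKMAP_BASE) >> TOY_PAGE_SHIFT;
}
```

With the inclusive check, `TOY_PKMAP_ADDR(TOY_LAST_PKMAP)` maps to index 512 of a 512-entry table, one slot past the end.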
2012-11-16  mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"  Mel Gorman
Jiri Slaby reported the following: (It's an effective revert of "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures".) Given kswapd had hours of runtime in ps/top output yesterday in the morning and after the revert it's now 2 minutes in sum for the last 24h, I would say, it's gone. The intention of the patch in question was to compensate for the loss of lumpy reclaim. Part of the reason lumpy reclaim worked is because it aggressively reclaimed pages and this patch was meant to be a sane compromise. When compaction fails, it gets deferred and both compaction and reclaim/compaction are deferred to avoid excessive reclaim. However, since commit c654345924f7 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken up each time and continues reclaiming which was not taken into account when the patch was developed. Attempts to address the problem ended up just changing the shape of the problem instead of fixing it. The release window gets closer and while a THP allocation failing is not a major problem, kswapd chewing up a lot of CPU is. This patch reverts commit 83fde0f22872 ("mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures") and will be revisited in the future. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Zdenek Kabelac <zkabelac@redhat.com> Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Jiri Slaby <jirislaby@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16  rapidio: fix kernel-doc warnings  Randy Dunlap
Fix rapidio kernel-doc warnings: Warning(drivers/rapidio/rio.c:415): No description found for parameter 'local' Warning(drivers/rapidio/rio.c:415): Excess function parameter 'lstart' description in 'rio_map_inb_region' Warning(include/linux/rio.h:290): No description found for parameter 'switches' Warning(include/linux/rio.h:290): No description found for parameter 'destid_table' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Matt Porter <mporter@kernel.crashing.org> Acked-by: Alexandre Bounine <alexandre.bounine@idt.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16  swapfile: fix name leak in swapoff  Xiaotian Feng
There's a name leak introduced by commit 91a27b2a7567 ("vfs: define struct filename and have getname() return it"). Add the missing putname. [akpm@linux-foundation.org: cleanup] Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16  memcg: fix hotplugged memory zone oops  Hugh Dickins
When MEMCG is configured on (even when it's disabled by boot option), when adding or removing a page to/from its lru list, the zone pointer used for stats updates is nowadays taken from the struct lruvec. (On many configurations, calculating zone from page is slower.) But we have no code to update all the lruvecs (per zone, per memcg) when a memory node is hotadded. Here's an extract from the oops which results when running numactl to bind a program to a newly onlined node: BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60 IP: __mod_zone_page_state+0x9/0x60 Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0) Call Trace: __pagevec_lru_add_fn+0xdf/0x140 pagevec_lru_move_fn+0xb1/0x100 __pagevec_lru_add+0x1c/0x30 lru_add_drain_cpu+0xa3/0x130 lru_add_drain+0x2f/0x40 ... The natural solution might be to use a memcg callback whenever memory is hotadded; but that solution has not been scoped out, and it happens that we do have an easy location at which to update lruvec->zone. The lruvec pointer is discovered either by mem_cgroup_zone_lruvec() or by mem_cgroup_page_lruvec(), and both of those do know the right zone. So check and set lruvec->zone in those; and remove the inadequate attempt to set lruvec->zone from lruvec_init(), which is called before NODE_DATA(node) has been allocated in such cases. Ah, there was one exception. For no particularly good reason, mem_cgroup_force_empty_list() has its own code for deciding lruvec. Change it to use the standard mem_cgroup_zone_lruvec() and mem_cgroup_get_lru_size() too. In fact it was already safe against such an oops (the lru lists in danger could only be empty), but we're better proofed against future changes this way.
I've marked this for stable (3.6) since we introduced the problem in 3.5 (now closed to stable); but I have no idea if this is the only fix needed to get memory hotadd working with memcg in 3.6, and received no answer when I enquired twice before. Reported-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16  mips, arc: fix build failure  David Rientjes
Using a cross-compiler to fix another issue, the following build error occurred for mips defconfig: arch/mips/fw/arc/misc.c: In function 'ArcHalt': arch/mips/fw/arc/misc.c:25:2: error: implicit declaration of function 'local_irq_disable' Fix it up by including irqflags.h. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16  memcg: oom: fix totalpages calculation for memory.swappiness==0  Michal Hocko
oom_badness() takes a totalpages argument which says how many pages are available and it uses it as a base for the score calculation. The value is calculated by mem_cgroup_get_limit which considers both limit and total_swap_pages (resp. memsw portion of it). This is usually correct but since fe35004fbf9e ("mm: avoid swapping out with swappiness==0") we do not swap when swappiness is 0 which means that we cannot really use up all the totalpages pages. This in turn confuses oom score calculation if the memcg limit is much smaller than the available swap because the used memory (capped by the limit) is negligible compared to totalpages so the resulting score is too small if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj). A wrong process might be selected as a result. The problem can be worked around by checking mem_cgroup_swappiness==0 and not considering swap at all in such a case. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
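[ Illustration only: a userspace sketch of the workaround described above, where swap is left out of the total when swappiness is 0. The struct, its fields and the toy_* names are hypothetical, and all quantities are plain page counts. ]

```c
#include <stdint.h>

/* Hypothetical view of the values that feed the limit calculation. */
struct toy_memcg {
    uint64_t limit_pages;       /* memory limit */
    uint64_t memsw_pages;       /* memory+swap limit */
    uint64_t total_swap_pages;  /* swap available system-wide */
    unsigned int swappiness;
};

static uint64_t toy_min(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/* Pages the group could really use: swap only counts when it can be used. */
static uint64_t toy_memcg_get_limit(const struct toy_memcg *m)
{
    uint64_t limit = m->limit_pages;

    if (m->swappiness != 0) {
        /* Swap may be consumed, but never beyond the memory+swap limit. */
        limit = toy_min(limit + m->total_swap_pages, m->memsw_pages);
    }
    return limit;
}
```

With a limit of 100 pages, 10000 pages of swap and swappiness 0, the total stays at 100 instead of 10100, so used memory is no longer dwarfed by the base of the score.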
2012-11-16  mm: fix build warning for uninitialized value  David Rientjes
do_wp_page() sets mmun_called if mmun_start and mmun_end were initialized and, if so, may call mmu_notifier_invalidate_range_end() with these values. This doesn't prevent gcc from emitting a build warning though: mm/memory.c: In function `do_wp_page': mm/memory.c:2530: warning: `mmun_start' may be used uninitialized in this function mm/memory.c:2531: warning: `mmun_end' may be used uninitialized in this function It's much easier to initialize the variables to impossible values and do a simple comparison to determine if they were initialized to remove the bool entirely. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
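[ Illustration only: a userspace sketch of replacing a "was it set?" bool with variables initialized to a value that can never be valid. It assumes an empty range (end not greater than start) is the impossible value; the toy_* functions are hypothetical, and only the mmun_start/mmun_end names come from the message above. ]

```c
#include <stdbool.h>

/* Counts calls so the behaviour can be observed. */
static int toy_range_end_calls;

/* Hypothetical stand-in for the "range end" notifier call. */
static void toy_invalidate_range_end(unsigned long start, unsigned long end)
{
    (void)start;
    (void)end;
    toy_range_end_calls++;
}

static void toy_do_work(bool take_notifier_path, unsigned long addr)
{
    /* An empty range can never be a real one, so it doubles as "not set". */
    unsigned long mmun_start = 0;
    unsigned long mmun_end = 0;

    if (take_notifier_path) {
        mmun_start = addr;
        mmun_end = addr + 4096;
    }

    /* One comparison replaces the separate bool and keeps the compiler quiet. */
    if (mmun_end > mmun_start)
        toy_invalidate_range_end(mmun_start, mmun_end);
}
```

Because both variables are always initialized, the compiler has no path on which they could be read uninitialized.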
2012-11-16  mm: add anon_vma_lock to validate_mm()  Michel Lespinasse
Iterating over the vma->anon_vma_chain without anon_vma_lock may cause NULL ptr deref in anon_vma_interval_tree_verify(), because the node in the chain might have been removed. BUG: unable to handle kernel paging request at fffffffffffffff0 IP: [<ffffffff8122c29c>] anon_vma_interval_tree_verify+0xc/0xa0 PGD 4e28067 PUD 4e29067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU 0 Pid: 9050, comm: trinity-child64 Tainted: G W 3.7.0-rc2-next-20121025-sasha-00001-g673f98e-dirty #77 RIP: 0010: anon_vma_interval_tree_verify+0xc/0xa0 Process trinity-child64 (pid: 9050, threadinfo ffff880045f80000, task ffff880048eb0000) Call Trace: validate_mm+0x58/0x1e0 vma_adjust+0x635/0x6b0 __split_vma.isra.22+0x161/0x220 split_vma+0x24/0x30 sys_madvise+0x5da/0x7b0 tracesys+0xe1/0xe6 RIP anon_vma_interval_tree_verify+0xc/0xa0 CR2: fffffffffffffff0 Figured out by Bob Liu. Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Bob Liu <lliubbo@gmail.com> Signed-off-by: Michel Lespinasse <walken@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16  KVM: x86: Fix invalid secondary exec controls in vmx_cpuid_update()  Takashi Iwai
The commit [ad756a16: KVM: VMX: Implement PCID/INVPCID for guests with EPT] introduced the unconditional access to SECONDARY_VM_EXEC_CONTROL, and this triggers kernel warnings like below on old CPUs: vmwrite error: reg 401e value a0568000 (err 12) Pid: 13649, comm: qemu-kvm Not tainted 3.7.0-rc4-test2+ #154 Call Trace: [<ffffffffa0558d86>] vmwrite_error+0x27/0x29 [kvm_intel] [<ffffffffa054e8cb>] vmcs_writel+0x1b/0x20 [kvm_intel] [<ffffffffa054f114>] vmx_cpuid_update+0x74/0x170 [kvm_intel] [<ffffffffa03629b6>] kvm_vcpu_ioctl_set_cpuid2+0x76/0x90 [kvm] [<ffffffffa0341c67>] kvm_arch_vcpu_ioctl+0xc37/0xed0 [kvm] [<ffffffff81143f7c>] ? __vunmap+0x9c/0x110 [<ffffffffa0551489>] ? vmx_vcpu_load+0x39/0x1a0 [kvm_intel] [<ffffffffa0340ee2>] ? kvm_arch_vcpu_load+0x52/0x1a0 [kvm] [<ffffffffa032dcd4>] ? vcpu_load+0x74/0xd0 [kvm] [<ffffffffa032deb0>] kvm_vcpu_ioctl+0x110/0x5e0 [kvm] [<ffffffffa032e93d>] ? kvm_dev_ioctl+0x4d/0x4a0 [kvm] [<ffffffff8117dc6f>] do_vfs_ioctl+0x8f/0x530 [<ffffffff81139d76>] ? remove_vma+0x56/0x60 [<ffffffff8113b708>] ? do_munmap+0x328/0x400 [<ffffffff81187c8c>] ? fget_light+0x4c/0x100 [<ffffffff8117e1a1>] sys_ioctl+0x91/0xb0 [<ffffffff815a942d>] system_call_fastpath+0x1a/0x1f This patch adds a check for the availability of secondary exec control to avoid these warnings. Cc: <stable@vger.kernel.org> [v3.6+] Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-16  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  Linus Torvalds
Pull networking updates from David Miller:
 1) tx_filtered/ps_tx_buf queues need to be accessed with the SKB queue lock, from Arik Nemtsov.
 2) Don't call 802.11 driver's filter configure method until it's actually open, from Felix Fietkau.
 3) Use ieee80211_free_txskb otherwise we leak control information. From Johannes Berg.
 4) Fix memory leak in bluetooth UUID removal, from Johan Hedberg.
 5) The shift mask trick doesn't work properly when 'optname' is out of range in do_ip_setsockopt(). Use a straightforward switch statement instead, the compiler emits essentially the same code but without the missing range check. From Xi Wang.
 6) Fix when we call tcp_replace_ts_recent() otherwise we can erroneously accept a too-high tsval. From Eric Dumazet.
 7) VXLAN bug fixes, mostly to do with VLAN header length handling, from Alexander Duyck.
 8) Missing return value initialization for IPV6_MINHOPCOUNT socket option handling. From Hannes Frederic.
 9) Fix regression in tasklet handling in jme/ksz884x/xilinx drivers, from Xiaotian Feng.
 10) At smsc911x driver init time, we don't know if the chip is in word swap mode or not. However we do need to wait for the control register's ready bit to be set before we program any other part of the chip. Adjust the wait loop to account for this. From Kamlakant Patel.
 11) Revert erroneous MDIO bus unregister change to mdio-bitbang.c
 12) Fix memory leak in /proc/net/sctp/, from Tommi Rantala.
 13) tilegx driver registers IRQ with NULL name, oops, from Simon Marchi.
 14) TCP metrics hash table kzalloc() based allocation can fail, back down to using vmalloc() if it does. From Eric Dumazet.
 15) Fix packet steering out-of-order delivery regression, from Tom Herbert.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (40 commits) net-rps: Fix brokeness causing OOO packets tcp: handle tcp_net_metrics_init() order-5 memory allocation failures batman-adv: process broadcast packets in BLA earlier batman-adv: don't add TEMP clients belonging to other backbone nodes batman-adv: correctly pass the client flag on tt_response batman-adv: fix tt_global_entries flags update tilegx: request_irq with a non-null device name net: correct check in dev_addr_del() tcp: fix retransmission in repair mode sctp: fix /proc/net/sctp/ memory leak Revert "drivers/net/phy/mdio-bitbang.c: Call mdiobus_unregister before mdiobus_free" net/smsc911x: Fix ready check in cases where WORD_SWAP is needed drivers/net: fix tasklet misuse issue ipv4/ip_vti.c: VTI fix post-decryption forwarding brcmfmac: fix typo in CONFIG_BRCMISCAN vxlan: Update hard_header_len based on lowerdev when instantiating VXLAN vxlan: fix a typo. ipv6: setsockopt(IPIPPROTO_IPV6, IPV6_MINHOPCOUNT) forgot to set return value doc/net: Fix typo in netdev-features.txt vxlan: Fix error that was resulting in VXLAN MTU size being 10 bytes too large ...
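[ Illustration only, for item 5 of the networking pull above: a userspace sketch of why a shift-mask membership test needs a range check and how a switch avoids the issue. The option numbers and toy_* names are hypothetical; this is not the do_ip_setsockopt() code. In C, shifting by a count that is negative or at least the width of the type is undefined, and hardware that masks the shift count can make an out-of-range value alias a valid bit. ]

```c
#include <stdbool.h>

/* Hypothetical option numbers. */
enum { TOY_OPT_A = 1, TOY_OPT_B = 4, TOY_OPT_C = 9 };

/* Mask trick: only well defined with the explicit range check in front. */
static bool toy_is_int_opt_mask(int optname)
{
    const unsigned int mask =
        (1U << TOY_OPT_A) | (1U << TOY_OPT_B) | (1U << TOY_OPT_C);

    if (optname < 0 || optname >= 32)
        return false;           /* the check that is easy to forget */
    return ((1U << optname) & mask) != 0;
}

/* Switch form: no shift, so no range check is needed. */
static bool toy_is_int_opt_switch(int optname)
{
    switch (optname) {
    case TOY_OPT_A:
    case TOY_OPT_B:
    case TOY_OPT_C:
        return true;
    default:
        return false;
    }
}
```

Both functions agree for every input; the switch simply cannot be written in a way that omits the bounds test.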
2012-11-16  Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless  David S. Miller
John W. Linville says: ==================== This batch of fixes is intended for the 3.7 stream... This includes a pull of the Bluetooth tree. Gustavo says: "A few important fixes to go into 3.7. There is a new hw support by Marcos Chaparro. Johan added a memory leak fix and hci device index list fix. Also Marcel fixed a race condition in the device set up that was preventing the bt monitor from working properly. Last, Paulo Sérgio added a fix to the error status when pairing for LE fails. This was preventing userspace from handling the failure properly." Regarding the mac80211 pull, Johannes says: "I have a locking fix for some SKB queues, a variable initialization to avoid crashes in a certain failure case, another free_txskb fix from Felix and another fix from him to avoid calling a stopped driver, a fix for a (very unlikely) memory leak and a fix to not send null data packets when resuming while not associated." Regarding the iwlwifi pull, Johannes says: "Two more fixes for iwlwifi ... one to use ieee80211_free_txskb(), and one to check DMA mapping errors, please pull." On top of that, Johannes also included a wireless regulatory fix to allow 40 MHz on channels 12 and 13 in world roaming mode. Also, Hauke Mehrtens fixes a #ifdef typo in brcmfmac. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-16  net-rps: Fix brokeness causing OOO packets  Tom Herbert
In commit c445477d74ab3779 which adds aRFS to the kernel, the CPU selected for RFS is not set correctly when CPU is changing. This is causing OOO packets and probably other issues. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-16  Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge  David S. Miller
Included fixes are:
- update the client entry status flags when using the "early client detection". This makes the Distributed AP isolation correctly work;
- transfer the client entry status flags when recovering the translation table from another node. This makes the Distributed AP isolation correctly work;
- prevent the "early client detection mechanism" from adding clients belonging to other backbone nodes in the same LAN. Adding such clients breaks connectivity when using this mechanism together with the Bridge Loop Avoidance;
- process broadcast packets with the Bridge Loop Avoidance before any other component. BLA can possibly drop the packets based on the source address. This makes the "early client detection mechanism" correctly work when used with BLA.
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-16tcp: handle tcp_net_metrics_init() order-5 memory allocation failuresEric Dumazet
Order-5 allocations can fail with current kernels, so we should try vmalloc() as well. Reported-by: Julien Tinnes <jln@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
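The fallback described above can be sketched in userspace C. This is only an illustration of the pattern, not the kernel code: contig_zalloc(), virt_zalloc(), table_alloc() and struct table are hypothetical stand-ins (calloc() plays the role of both allocators), the 4 KiB page size is an assumption, and the max_order parameter exists only to simulate a fragmented system where large contiguous requests fail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SIZE_BYTES 4096u /* assumption: 4 KiB pages */

/* Bytes covered by an allocation of the given page order (2^order pages). */
static size_t order_to_bytes(unsigned int order)
{
    return (size_t)PAGE_SIZE_BYTES << order;
}

/* Hypothetical stand-in for a physically contiguous allocator.
 * It refuses anything larger than max_order to model fragmentation. */
static void *contig_zalloc(size_t size, unsigned int max_order)
{
    if (size > order_to_bytes(max_order))
        return NULL;
    return calloc(1, size);
}

/* Hypothetical stand-in for a virtually contiguous allocator. */
static void *virt_zalloc(size_t size)
{
    return calloc(1, size);
}

struct table {
    void *mem;
    int used_fallback; /* remembered so the matching free routine can be chosen */
};

/* Try the contiguous allocator first, then fall back. Returns 0 on success. */
static int table_alloc(struct table *t, size_t size, unsigned int max_order)
{
    assert(t != NULL);
    t->used_fallback = 0;
    t->mem = contig_zalloc(size, max_order);
    if (!t->mem) {
        t->mem = virt_zalloc(size);
        t->used_fallback = 1;
    }
    return t->mem ? 0 : -1;
}
```

With the assumed 4 KiB pages, an order-5 request is 32 pages (128 KiB). Whichever allocator succeeded has to be remembered or detected later so the memory is released with the matching free routine.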
2012-11-16Thermal: Add Linux/Thermal subsystem info in MAINTAINER fileZhang Rui
All the changes made to the generic thermal layer, or platform thermal drivers that make use of the thermal layer, should be sent to linux-pm@vger.kernel.org for discussion. And as the maintainer, I will only apply the patches that have been sent to linux-pm@vger.kernel.org. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16mm, oom: reintroduce /proc/pid/oom_adjDavid Rientjes
This is mostly a revert of 01dc52ebdf47 ("oom: remove deprecated oom_adj") from Davidlohr Bueso. It reintroduces /proc/pid/oom_adj for backwards compatibility with earlier kernels. It simply scales the value linearly when /proc/pid/oom_score_adj is written. The major difference is that its scheduled removal is no longer included in Documentation/feature-removal-schedule.txt. We do warn users with a single printk, though, to suggest the more powerful and supported /proc/pid/oom_score_adj interface. Reported-by: Artem S. Tashkinov <t.artem@lycos.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
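A minimal sketch of the linear scaling mentioned above, assuming the historical ranges of -17..15 for oom_adj and -1000..1000 for oom_score_adj. The function names are illustrative and the constants are restated here as assumptions rather than copied from the patch.

```c
#include <assert.h>

/* Assumed historical limits of the two interfaces. */
#define OOM_DISABLE       (-17)
#define OOM_ADJUST_MAX    15
#define OOM_SCORE_ADJ_MAX 1000

/* Scale a legacy oom_adj value onto the oom_score_adj range. */
static int oom_adj_to_score_adj(int oom_adj)
{
    assert(oom_adj >= OOM_DISABLE && oom_adj <= OOM_ADJUST_MAX);
    /* Special-case the maximum so that 1000 is always attainable. */
    if (oom_adj == OOM_ADJUST_MAX)
        return OOM_SCORE_ADJ_MAX;
    return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}

/* Scale an oom_score_adj value back onto the legacy oom_adj range. */
static int score_adj_to_oom_adj(int score_adj)
{
    assert(score_adj >= -OOM_SCORE_ADJ_MAX && score_adj <= OOM_SCORE_ADJ_MAX);
    if (score_adj == OOM_SCORE_ADJ_MAX)
        return OOM_ADJUST_MAX;
    return (score_adj * -OOM_DISABLE) / OOM_SCORE_ADJ_MAX;
}
```

Because both directions use truncating integer division, a value written through one interface does not always read back unchanged through the other.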
2012-11-16Merge tag 'fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM SoC fixes from Olof Johansson: "We've been sitting on this longer than we meant to due to travel and other activities, but the number of patches is luckily not that high. Biggest changes are from a batch of OMAP bugfixes, but there are a few for the broader set of SoCs too (bcm2835, pxa, highbank, tegra, at91 and i.MX). The OMAP patches contain some fixes for MUSB/PHY on omap4 which ends up being a bit on the large side but needed for legacy (non-DT) platforms. Beyond that there are a handful of hwmod/pm changes. So, fairly noncontroversial stuff all in all, and as usual around this time the fixes are well targeted at specific problems." * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: ARM: imx: ehci: fix host power mask bit ARM i.MX: fix error-valued pointer dereference in clk_register_gate2() ARM: at91/usbh: fix overcurrent gpio setup ARM: at91/AT91SAM9G45: fix crypto peripherals irq issue due to sparse irq support ARM: boot: Fix usage of kecho ARM: OMAP: ocp2scp: create omap device for ocp2scp ARM: OMAP4: add _dev_attr_ to ocp2scp for representing usb_phy drivers: bus: ocp2scp: add pdata support irqchip: irq-bcm2835: Add terminating entry for of_device_id table ARM: highbank: retry wfi on reset request ARM: OMAP4: PM: fix regulator name for VDD_MPU ARM: OMAP4: hwmod data: do not enable or reset the McPDM during kernel init ARM: OMAP2+: hwmod: add flag to prevent hwmod code from touching IP block during init ARM: dt: tegra: fix length of pad control and mux registers ARM: OMAP: hwmod: wait for sysreset complete after enabling hwmod ARM: OMAP2+: clockdomain: Fix OMAP4 ISS clk domain to support only SWSUP ARM: pxa/spitz_pm: Fix hang when resuming from STR ARM: pxa: hx4700: Fix backlight PWM device number ARM: OMAP2+: PM: add missing newline to VC warning message
2012-11-16Merge branch 'master' of ↵John W. Linville
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem
2012-11-16Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64 Pull arm64 bugfix from Catalin Marinas: "Arm64 page permission bug fix. Without this fix, the CPU speculatively accesses the interrupt controller memory causing random IRQ acknowledge." * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: arm64: Distinguish between user and kernel XN bits
2012-11-16Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid Pull HID fix from Jiri Kosina: "This has a build fix for architectures where memcmp() is macro, from Jiri Slaby" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: HID: microsoft: do not use compound literal - fix build
2012-11-16arm64: Distinguish between user and kernel XN bitsCatalin Marinas
On AArch64, the meaning of the XN bit has changed to UXN (user). The PXN (privileged) bit must be set to prevent kernel execution. Without the PXN bit set, the CPU may speculatively access device memory. This patch ensures that all the mappings that the kernel must not execute from (including user mappings) have the PXN bit set. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2012-11-16Merge tag 'usb-3.7-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg Kroah-Hartman: "Here are some USB fixes for the 3.7 tree. Nothing huge here, just a number of tiny bugfixes resolving issues that have been found, and two reverts of patches that were found to have caused problems. All of these have been in linux-next already. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" * tag 'usb-3.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: Revert "USB/host: Cleanup unneccessary irq disable code" USB: option: add Alcatel X220/X500D USB IDs USB: option: add Novatel E362 and Dell Wireless 5800 USB IDs USB: keyspan: fix typo causing GPF on open USB: fix build with XEN and EARLY_PRINTK_DBGP enabled but USB_SUPPORT disabled USB: usb_wwan: fix bulk-urb allocation usb: otg: Fix build errors if USB_MUSB_OMAP2PLUS is selected as module usb: musb: ux500: fix 'musbid' undeclared error in ux500_remove() Revert "usb: musb: use DMA mode 1 whenever possible"
2012-11-16Merge tag 'tty-3.7-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull TTY fixes from Greg Kroah-Hartman: "Here are two TTY driver fixes for 3.7-rc5. They resolve a bug in the hvc driver that has been reported, and fix a problem with the list of device ids in the max310x serial driver. Both have been in linux-next for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" * tag 'tty-3.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: tty: serial: max310x: Add terminating entry for spi_device_id table TTY: hvc_console, fix port reference count going to zero prematurely
2012-11-16Merge tag 'staging-3.7-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging tree fix from Greg Kroah-Hartman: "Here is a single patch, a revert of an android driver patch, that resolves a bug that has been reported in the Android alarm driver. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" * tag 'staging-3.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: Revert "Staging: Android alarm: IOCTL command encoding fix"
2012-11-16Merge tag 'at91-fixes' of git://github.com/at91linux/linux-at91 into fixesArnd Bergmann
From Nicolas Ferre <nicolas.ferre@atmel.com>: Two little fixes, one related to the move to sparse irq and another one fixing the check of a GPIO for USB host overcurrent. * tag 'at91-fixes' of git://github.com/at91linux/linux-at91: ARM: at91/usbh: fix overcurrent gpio setup ARM: at91/AT91SAM9G45: fix crypto peripherals irq issue due to sparse irq support Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2012-11-16Merge tag 'imx-fixes-rc' of git://git.pengutronix.de/git/imx/linux-2.6 into ↵Arnd Bergmann
fixes From Sascha Hauer <s.hauer@pengutronix.de>: ARM i.MX fixes for 3.7-rc * tag 'imx-fixes-rc' of git://git.pengutronix.de/git/imx/linux-2.6: ARM: imx: ehci: fix host power mask bit ARM i.MX: fix error-valued pointer dereference in clk_register_gate2() Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2012-11-16Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 patches from Martin Schwidefsky: "Some more bug fixes and a config change. The signal bug is nasty, if the clock_gettime vdso function is interrupted by a signal while in access-register-mode we end up with an endless signal loop until the signal stack is full. The config change is for aligned struct pages, gives us 8% improvement with hackbench." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/3215: fix tty close handling s390/mm: have 16 byte aligned struct pages s390/gup: fix access_ok() usage in __get_user_pages_fast() s390/gup: add missing TASK_SIZE check to get_user_pages_fast() s390/topology: fix core id vs physical package id mix-up s390/signal: set correct address space control
2012-11-16Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds
Pull drm fixes from Dave Airlie: "All pretty normal: one TTM oops fix, one radeon, a few intel and a vmwgfx fix." * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: drm/ttm: remove unneeded preempt_disable/enable ttm: Clear the ttm page allocated from high memory zone correctly vmwgfx: return an -EFAULT if copy_to_user() fails drm/radeon: fix logic error in atombios_encoders.c drm/i915: do not ignore eDP bpc settings from vbt drm/i915/sdvo: clean up connectors on intel_sdvo_init() failures drm/i915/crt: fix DPMS standby and suspend mode handling
2012-11-16Merge tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mturquette/linuxLinus Torvalds
Pull another clk layer fix from Michael Turquette: "GCC 4.7 users get compilation errors from unnecessary use of inline in clk-provider.h. This pull request fixes the regression by removing inline usage from those function declarations." * tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mturquette/linux: clk: remove inline usage from clk-provider.h
2012-11-16ARM: imx: ehci: fix host power mask bitChristoph Fritz
This patch sets HPM (Host power mask bit) to bit 16 according to the i.MX Reference Manual. It was incorrectly set to bit 8, which controls pull-up impedance. Reported-by: Michael Burkey <mdburkey@gmail.com> Cc: Stable <stable@vger.kernel.org> Signed-off-by: Christoph Fritz <chf.fritz@googlemail.com> Acked-by: Eric Bénard <eric@eukrea.com> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
2012-11-16ARM i.MX: fix error-valued pointer dereference in clk_register_gate2()Wei Yongjun
The error-valued pointer clk is passed to kfree(); it should be kfree(gate) when clk_register() returns ERR_PTR(). The dpatch engine was used to auto-generate this patch. (https://github.com/weiyj/dpatch) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
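The bug class can be illustrated with a small userspace sketch. The ERR_PTR()/IS_ERR() helpers below are simplified re-implementations for illustration, and fake_clk_register(), register_gate_sketch(), struct gate and the tracking variables are all hypothetical. The point is only that, on failure, the memory to release is the gate allocated by the caller, never the error-valued return.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Simplified userspace versions of the kernel's error-pointer helpers. */
static void *ERR_PTR(long error) { return (void *)error; }
static int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
static long PTR_ERR(const void *ptr) { return (long)ptr; }

struct clk { int dummy; };
struct gate { int bit_idx; };

/* Bookkeeping so a test can see which allocation was released. */
static uintptr_t last_alloc, last_freed;

/* Hypothetical registration stub: fails on demand with an error pointer. */
static struct clk *fake_clk_register(struct gate *gate, int fail)
{
    static struct clk the_clk;
    (void)gate;
    return fail ? ERR_PTR(-ENOMEM) : &the_clk;
}

static struct clk *register_gate_sketch(int bit_idx, int fail)
{
    struct gate *gate = malloc(sizeof(*gate));
    struct clk *clk;

    if (!gate)
        return ERR_PTR(-ENOMEM);
    last_alloc = (uintptr_t)gate;
    gate->bit_idx = bit_idx;

    clk = fake_clk_register(gate, fail);
    if (IS_ERR(clk)) {
        /* clk encodes an errno, not memory we own: free the gate instead. */
        last_freed = (uintptr_t)gate;
        free(gate);
    }
    /* On success the gate stays allocated for the clock's lifetime
     * (this sketch never releases it). */
    return clk;
}
```

Passing the error-valued pointer to the free routine would hand it an address near the top of the address space that was never allocated.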
2012-11-16ARM: at91/usbh: fix overcurrent gpio setupJohan Hovold
Use gpio_is_valid also for overcurrent pins (which are currently negative in many board files). Signed-off-by: Johan Hovold <jhovold@gmail.com> Acked-by: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com> Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
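A hedged sketch of why the validity check matters for pins that board files leave negative. gpio_is_valid() is re-implemented here for userspace with an assumed ARCH_NR_GPIOS of 256, and count_usable_pins() is a hypothetical helper rather than anything from the at91 code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define ARCH_NR_GPIOS 256 /* assumption: the real limit is platform specific */

/* Userspace re-implementation: a GPIO number is valid only inside [0, limit). */
static bool gpio_is_valid(int number)
{
    return number >= 0 && number < ARCH_NR_GPIOS;
}

/* Count the pins that would actually be requested and configured. */
static size_t count_usable_pins(const int *pins, size_t n)
{
    size_t usable = 0;

    assert(pins != NULL);
    for (size_t i = 0; i < n; i++) {
        if (gpio_is_valid(pins[i]))
            usable++;
    }
    return usable;
}
```

A board that marks unused overcurrent pins with a negative errno value such as -EINVAL would have those entries skipped, while GPIO 0 still counts as a valid pin. A plain non-zero test would get both of those cases wrong.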
2012-11-16ARM: at91/AT91SAM9G45: fix crypto peripherals irq issue due to sparse irq ↵Nicolas Royer
support Sparse irq support introduced by commit 8fe82a5 (ARM: at91: sparse irq support) requires adding the NR_IRQS_LEGACY offset to the irq number. Signed-off-by: Nicolas Royer <nicolas@eukrea.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Acked-by: Eric Bénard <eric@eukrea.com> Tested-by: Eric Bénard <eric@eukrea.com> Cc: stable@vger.kernel.org # 3.6