linux - Linux kernel source tree

Age	Commit message (Collapse)	Author
2013-09-11	mm: use zone_end_pfn() instead of zone_start_pfn+spanned_pages	Xishi Qiu
	Use "zone_end_pfn()" instead of "zone->zone_start_pfn + zone->spanned_pages". Simplify the code, no functional change. [akpm@linux-foundation.org: fix build] Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	lib/genalloc.c: fix overflow of ending address of memory chunk	Joonyoung Shim
	In struct gen_pool_chunk, end_addr means the end address of memory chunk (inclusive), but in the implementation it is treated as address + size of memory chunk (exclusive), so it points to the address plus one instead of correct ending address. The ending address of memory chunk plus one will cause overflow on the memory chunk including the last address of memory map, e.g. when starting address is 0xFFF00000 and size is 0x100000 on 32bit machine, ending address will be 0x100000000. Use correct ending address like starting address + size - 1. [akpm@linux-foundation.org: add comment to struct gen_pool_chunk:end_addr] Signed-off-by: Joonyoung Shim <jy0922.shim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/zbud: fix some trivial typos in comments	Jianguo Wu
	Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/hotplug: remove unnecessary BUG_ON in __offline_pages()	Xishi Qiu
	I think we can remove "BUG_ON(start_pfn >= end_pfn)" in __offline_pages(), because in memory_block_action() "nr_pages = PAGES_PER_SECTION * sections_per_block" is always greater than 0. memory_block_action() offline_pages() __offline_pages() BUG_ON(start_pfn >= end_pfn) In v2.6.32, If info->length==0, this way may hit this BUG_ON(). acpi_memory_disable_device() remove_memory(info->start_addr, info->length) offline_pages() A later Fujitsu patch renamed this function and the BUG_ON() is unnecessary. Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm, vmalloc: use well-defined find_last_bit() func	Joonsoo Kim
	Our intention in here is to find last_bit within the region to flush. There is well-defined function, find_last_bit() for this purpose and its performance may be slightly better than current implementation. So change it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm, vmalloc: remove useless variable in vmap_block	Joonsoo Kim
	vbq in vmap_block isn't used. So remove it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	vmstat: use this_cpu() to avoid irqon/off sequence in refresh_cpu_vm_stats	Christoph Lameter
	Disabling interrupts repeatedly can be avoided in the inner loop if we use a this_cpu operation. Signed-off-by: Christoph Lameter <cl@linux.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	vmstat: create fold_diff	Christoph Lameter
	Both functions that update global counters use the same mechanism. Create a function that contains the common code. Signed-off-by: Christoph Lameter <cl@linux.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	vmstat: create separate function to fold per cpu diffs into local counters	Christoph Lameter
	The main idea behind this patchset is to reduce the vmstat update overhead by avoiding interrupt enable/disable and the use of per cpu atomics. This patch (of 3): It is better to have a separate folding function because refresh_cpu_vm_stats() also does other things like expire pages in the page allocator caches. If we have a separate function then refresh_cpu_vm_stats() is only called from the local cpu which allows additional optimizations. The folding function is only called when a cpu is being downed and therefore no other processor will be accessing the counters. Also simplifies synchronization. [akpm@linux-foundation.org: fix UP build] Signed-off-by: Christoph Lameter <cl@linux.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	swap: clean-up #ifdef in page_mapping()	Joonsoo Kim
	PageSwapCache() is always false when !CONFIG_SWAP, so compiler properly discard related code. Therefore, we don't need #ifdef explicitly. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: move pgtable related functions to right place	Joonsoo Kim
	pgtable related functions are mostly in pgtable-generic.c. So move remaining functions from memory.c to pgtable-generic.c. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm, page_alloc: add unlikely macro to help compiler optimization	Joonsoo Kim
	We rarely allocate a page with ALLOC_NO_WATERMARKS and it is used in slow path. For helping compiler optimization, add unlikely macro to ALLOC_NO_WATERMARKS checking. This patch doesn't have any effect now, because gcc already optimize this properly. But we cannot assume that gcc always does right and nobody re-evaluate if gcc do proper optimization with their change, for example, it is not optimized properly on v3.10. So adding compiler hint here is reasonable. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/mempolicy: return NULL if node is NUMA_NO_NODE in get_task_policy	Jianguo Wu
	If node == NUMA_NO_NODE, pol is NULL, we should return NULL instead of do "if (!pol->mode)" check. [akpm@linux-foundation.org: reorganise code] Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm, hugetlb: decrement reserve count if VM_NORESERVE alloc page cache	Joonsoo Kim
	If a vma with VM_NORESERVE allocate a new page for page cache, we should check whether this area is reserved or not. If this address is already reserved by other process(in case of chg == 0), we should decrement reserve count, because this allocated page will go into page cache and currently, there is no way to know that this page comes from reserved pool or not when releasing inode. This may introduce over-counting problem to reserved count. With following example code, you can easily reproduce this situation. Assume 2MB, nr_hugepages = 100 size = 20 * MB; flag = MAP_SHARED; p = mmap(NULL, size, PROT_READ\|PROT_WRITE, flag, fd, 0); if (p == MAP_FAILED) { fprintf(stderr, "mmap() failed: %s\n", strerror(errno)); return -1; } flag = MAP_SHARED \| MAP_NORESERVE; q = mmap(NULL, size, PROT_READ\|PROT_WRITE, flag, fd, 0); if (q == MAP_FAILED) { fprintf(stderr, "mmap() failed: %s\n", strerror(errno)); } q[0] = 'c'; After finish the program, run 'cat /proc/meminfo'. You can see below result. HugePages_Free: 100 HugePages_Rsvd: 1 To fix this, we should check our mapping type and tracked region. If our mapping is VM_NORESERVE, VM_MAYSHARE and chg is 0, this imply that current allocated page will go into page cache which is already reserved region when mapping is created. In this case, we should decrease reserve count. As implementing above, this patch solve the problem. [akpm@linux-foundation.org: fix spelling in comment] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm, hugetlb: remove decrement_hugepage_resv_vma()	Joonsoo Kim
	Now, Checking condition of decrement_hugepage_resv_vma() and vma_has_reserves() is same, so we can clean-up this function with vma_has_reserves(). Additionally, decrement_hugepage_resv_vma() has only one call site, so we can remove function and embed it into dequeue_huge_page_vma() directly. This patch implement it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm, hugetlb: add VM_NORESERVE check in vma_has_reserves()	Joonsoo Kim
	If we map the region with MAP_NORESERVE and MAP_SHARED, we can skip to check reserve counting and eventually we cannot be ensured to allocate a huge page in fault time. With following example code, you can easily find this situation. Assume 2MB, nr_hugepages = 100 fd = hugetlbfs_unlinked_fd(); if (fd < 0) return 1; size = 200 * MB; flag = MAP_SHARED; p = mmap(NULL, size, PROT_READ\|PROT_WRITE, flag, fd, 0); if (p == MAP_FAILED) { fprintf(stderr, "mmap() failed: %s\n", strerror(errno)); return -1; } size = 2 * MB; flag = MAP_ANONYMOUS \| MAP_SHARED \| MAP_HUGETLB \| MAP_NORESERVE; p = mmap(NULL, size, PROT_READ\|PROT_WRITE, flag, -1, 0); if (p == MAP_FAILED) { fprintf(stderr, "mmap() failed: %s\n", strerror(errno)); } p[0] = '0'; sleep(10); During executing sleep(10), run 'cat /proc/meminfo' on another process. HugePages_Free: 99 HugePages_Rsvd: 100 Number of free should be higher or equal than number of reserve, but this aren't. This represent that non reserved shared mapping steal a reserved page. Non reserved shared mapping should not eat into reserve space. If we consider VM_NORESERVE in vma_has_reserve() and return 0 which mean that we don't have reserved pages, then we check that we have enough free pages in dequeue_huge_page_vma(). This prevent to steal a reserved page. With this change, above test generate a SIGBUG which is correct, because all free pages are reserved and non reserved shared mapping can't get a free page. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm, hugetlb: do not use a page in page cache for cow optimization	Joonsoo Kim
	Currently, we use a page with mapped count 1 in page cache for cow optimization. If we find this condition, we don't allocate a new page and copy contents. Instead, we map this page directly. This may introduce a problem that writting to private mapping overwrite hugetlb file directly. You can find this situation with following code. size = 20 * MB; flag = MAP_SHARED; p = mmap(NULL, size, PROT_READ\|PROT_WRITE, flag, fd, 0); if (p == MAP_FAILED) { fprintf(stderr, "mmap() failed: %s\n", strerror(errno)); return -1; } p[0] = 's'; fprintf(stdout, "BEFORE STEAL PRIVATE WRITE: %c\n", p[0]); munmap(p, size); flag = MAP_PRIVATE; p = mmap(NULL, size, PROT_READ\|PROT_WRITE, flag, fd, 0); if (p == MAP_FAILED) { fprintf(stderr, "mmap() failed: %s\n", strerror(errno)); } p[0] = 'c'; munmap(p, size); flag = MAP_SHARED; p = mmap(NULL, size, PROT_READ\|PROT_WRITE, flag, fd, 0); if (p == MAP_FAILED) { fprintf(stderr, "mmap() failed: %s\n", strerror(errno)); return -1; } fprintf(stdout, "AFTER STEAL PRIVATE WRITE: %c\n", p[0]); munmap(p, size); We can see that "AFTER STEAL PRIVATE WRITE: c", not "AFTER STEAL PRIVATE WRITE: s". If we turn off this optimization to a page in page cache, the problem is disappeared. So, I change the trigger condition of optimization. If this page is not AnonPage, we don't do optimization. This makes this optimization turning off for a page cache. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm, hugetlb: remove redundant list_empty check in gather_surplus_pages()	Joonsoo Kim
	If list is empty, list_for_each_entry_safe() doesn't do anything. So, this check is redundant. Remove it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm, hugetlb: fix and clean-up node iteration code to alloc or free	Joonsoo Kim
	Current node iteration code have a minor problem which do one more node rotation if we can't succeed to allocate. For example, if we start to allocate at node 0, we stop to iterate at node 0. Then we start to allocate at node 1 for next allocation. I introduce new macros "for_each_node_mask_to_[alloc\|free]" and fix and clean-up node iteration code to alloc or free. This makes code more understandable. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm, hugetlb: clean-up alloc_huge_page()	Joonsoo Kim
	Unify successful allocation paths to make the code more readable. There are no functional changes. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm, hugetlb: trivial commenting fix	Joonsoo Kim
	The name of the mutex written in comment is wrong. Fix it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Hillf Danton <dhillf@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm, hugetlb: move up the code which check availability of free huge page	Joonsoo Kim
	In this time we are holding a hugetlb_lock, so hstate values can't be changed. If we don't have any usable free huge page in this time, we don't need to proceed with the processing. So move this code up. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: revert "page-writeback.c: subtract min_free_kbytes from dirtyable memory"	Johannes Weiner
	This reverts commit 75f7ad8e043d. It was the result of a problem observed with a 3.2 kernel and merged in 3.9, while the issue had been resolved upstream in 3.3 (commit ab8fabd46f81: "mm: exclude reserved pages from dirtyable memory"). The "reserved pages" are a superset of min_free_kbytes, thus this change is redundant and confusing. Revert it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Paul Szabo <psz@maths.usyd.edu.au> Cc: Rik van Riel <riel@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: page_alloc: fair zone allocator policy	Johannes Weiner
	Each zone that holds userspace pages of one workload must be aged at a speed proportional to the zone size. Otherwise, the time an individual page gets to stay in memory depends on the zone it happened to be allocated in. Asymmetry in the zone aging creates rather unpredictable aging behavior and results in the wrong pages being reclaimed, activated etc. But exactly this happens right now because of the way the page allocator and kswapd interact. The page allocator uses per-node lists of all zones in the system, ordered by preference, when allocating a new page. When the first iteration does not yield any results, kswapd is woken up and the allocator retries. Due to the way kswapd reclaims zones below the high watermark while a zone can be allocated from when it is above the low watermark, the allocator may keep kswapd running while kswapd reclaim ensures that the page allocator can keep allocating from the first zone in the zonelist for extended periods of time. Meanwhile the other zones rarely see new allocations and thus get aged much slower in comparison. The result is that the occasional page placed in lower zones gets relatively more time in memory, even gets promoted to the active list after its peers have long been evicted. Meanwhile, the bulk of the working set may be thrashing on the preferred zone even though there may be significant amounts of memory available in the lower zones. Even the most basic test -- repeatedly reading a file slightly bigger than memory -- shows how broken the zone aging is. In this scenario, no single page should be able stay in memory long enough to get referenced twice and activated, but activation happens in spades: $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 0 nr_inactive_file 0 nr_active_file 8 nr_inactive_file 1582 nr_active_file 11994 $ cat data data data data >/dev/null $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 70 nr_inactive_file 258753 nr_active_file 443214 nr_inactive_file 149793 nr_active_file 12021 Fix this with a very simple round robin allocator. Each zone is allowed a batch of allocations that is proportional to the zone's size, after which it is treated as full. The batch counters are reset when all zones have been tried and the allocator enters the slowpath and kicks off kswapd reclaim. Allocation and reclaim is now fairly spread out to all available/allowable zones: $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 0 nr_inactive_file 174 nr_active_file 4865 nr_inactive_file 53 nr_active_file 860 $ cat data data data data >/dev/null $ grep active_file /proc/zoneinfo nr_inactive_file 0 nr_active_file 0 nr_inactive_file 666622 nr_active_file 4988 nr_inactive_file 190969 nr_active_file 937 When zone_reclaim_mode is enabled, allocations will now spread out to all zones on the local node, not just the first preferred zone (which on a 4G node might be a tiny Normal zone). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Paul Bolle <paul.bollee@gmail.com> Cc: Zlatko Calusic <zcalusic@bitsync.net> Tested-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: page_alloc: rearrange watermark checking in get_page_from_freelist	Johannes Weiner
	Allocations that do not have to respect the watermarks are rare high-priority events. Reorder the code such that per-zone dirty limits and future checks important only to regular page allocations are ignored in these extraordinary situations. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Paul Bolle <paul.bollee@gmail.com> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: vmscan: fix numa reclaim balance problem in kswapd	Johannes Weiner
	The way the page allocator interacts with kswapd creates aging imbalances, where the amount of time a userspace page gets in memory under reclaim pressure is dependent on which zone, which node the allocator took the page frame from. #1 fixes missed kswapd wakeups on NUMA systems, which lead to some nodes falling behind for a full reclaim cycle relative to the other nodes in the system #3 fixes an interaction where kswapd and a continuous stream of page allocations keep the preferred zone of a task between the high and low watermark (allocations succeed + kswapd does not go to sleep) indefinitely, completely underutilizing the lower zones and thrashing on the preferred zone These patches are the aging fairness part of the thrash-detection based file LRU balancing. Andrea recommended to submit them separately as they are bugfixes in their own right. The following test ran a foreground workload (memcachetest) with background IO of various sizes on a 4 node 8G system (similar results were observed with single-node 4G systems): parallelio BAS FAIRALLO BASE FAIRALLOC Ops memcachetest-0M 5170.00 ( 0.00%) 5283.00 ( 2.19%) Ops memcachetest-791M 4740.00 ( 0.00%) 5293.00 ( 11.67%) Ops memcachetest-2639M 2551.00 ( 0.00%) 4950.00 ( 94.04%) Ops memcachetest-4487M 2606.00 ( 0.00%) 3922.00 ( 50.50%) Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) Ops io-duration-791M 55.00 ( 0.00%) 18.00 ( 67.27%) Ops io-duration-2639M 235.00 ( 0.00%) 103.00 ( 56.17%) Ops io-duration-4487M 278.00 ( 0.00%) 173.00 ( 37.77%) Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swaptotal-791M 245184.00 ( 0.00%) 0.00 ( 0.00%) Ops swaptotal-2639M 468069.00 ( 0.00%) 108778.00 ( 76.76%) Ops swaptotal-4487M 452529.00 ( 0.00%) 76623.00 ( 83.07%) Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-791M 108297.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-2639M 169537.00 ( 0.00%) 50031.00 ( 70.49%) Ops swapin-4487M 167435.00 ( 0.00%) 34178.00 ( 79.59%) Ops minorfaults-0M 1518666.00 ( 0.00%) 1503993.00 ( 0.97%) Ops minorfaults-791M 1676963.00 ( 0.00%) 1520115.00 ( 9.35%) Ops minorfaults-2639M 1606035.00 ( 0.00%) 1799717.00 (-12.06%) Ops minorfaults-4487M 1612118.00 ( 0.00%) 1583825.00 ( 1.76%) Ops majorfaults-0M 6.00 ( 0.00%) 0.00 ( 0.00%) Ops majorfaults-791M 13836.00 ( 0.00%) 10.00 ( 99.93%) Ops majorfaults-2639M 22307.00 ( 0.00%) 6490.00 ( 70.91%) Ops majorfaults-4487M 21631.00 ( 0.00%) 4380.00 ( 79.75%) BAS FAIRALLO BASE FAIRALLOC User 287.78 460.97 System 2151.67 3142.51 Elapsed 9737.00 8879.34 BAS FAIRALLO BASE FAIRALLOC Minor Faults 53721925 57188551 Major Faults 392195 15157 Swap Ins 2994854 112770 Swap Outs 4907092 134982 Direct pages scanned 0 41824 Kswapd pages scanned 32975063 8128269 Kswapd pages reclaimed 6323069 7093495 Direct pages reclaimed 0 41824 Kswapd efficiency 19% 87% Kswapd velocity 3386.573 915.414 Direct efficiency 100% 100% Direct velocity 0.000 4.710 Percentage direct scans 0% 0% Zone normal velocity 2011.338 550.661 Zone dma32 velocity 1365.623 369.221 Zone dma velocity 9.612 0.242 Page writes by reclaim 18732404.000 614807.000 Page writes file 13825312 479825 Page writes anon 4907092 134982 Page reclaim immediate 85490 5647 Sector Reads 12080532 483244 Sector Writes 88740508 65438876 Page rescued immediate 0 0 Slabs scanned 82560 12160 Direct inode steals 0 0 Kswapd inode steals 24401 40013 Kswapd skipped wait 0 0 THP fault alloc 6 8 THP collapse alloc 5481 5812 THP splits 75 22 THP fault fallback 0 0 THP collapse fail 0 0 Compaction stalls 0 54 Compaction success 0 45 Compaction failures 0 9 Page migrate success 881492 82278 Page migrate failure 0 0 Compaction pages isolated 0 60334 Compaction migrate scanned 0 53505 Compaction free scanned 0 1537605 Compaction cost 914 86 NUMA PTE updates 46738231 41988419 NUMA hint faults 31175564 24213387 NUMA hint local faults 10427393 6411593 NUMA pages migrated 881492 55344 AutoNUMA cost 156221 121361 The overall runtime was reduced, throughput for both the foreground workload as well as the background IO improved, major faults, swapping and reclaim activity shrunk significantly, reclaim efficiency more than quadrupled. This patch: When the page allocator fails to get a page from all zones in its given zonelist, it wakes up the per-node kswapds for all zones that are at their low watermark. However, with a system under load the free pages in a zone can fluctuate enough that the allocation fails but the kswapd wakeup is also skipped while the zone is still really close to the low watermark. When one node misses a wakeup like this, it won't be aged before all the other node's zones are down to their low watermarks again. And skipping a full aging cycle is an obvious fairness problem. Kswapd runs until the high watermarks are restored, so it should also be woken when the high watermarks are not met. This ages nodes more equally and creates a safety margin for the page counter fluctuation. By using zone_balanced(), it will now check, in addition to the watermark, if compaction requires more order-0 pages to create a higher order page. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Paul Bolle <paul.bollee@gmail.com> Tested-by: Zlatko Calusic <zcalusic@bitsync.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/huge_memory.c: fix potential NULL pointer dereference	Libin
	In collapse_huge_page() there is a race window between releasing the mmap_sem read lock and taking the mmap_sem write lock, so find_vma() may return NULL. So check the return value to avoid NULL pointer dereference. collapse_huge_page khugepaged_alloc_page up_read(&mm->mmap_sem) down_write(&mm->mmap_sem) vma = find_vma(mm, address) Signed-off-by: Libin <huawei.libin@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> # v3.0+ Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: kill one if loop in __free_pages_bootmem()	Yinghai Lu
	We should not check loop+1 with loop end in loop body. Just duplicate two lines code to avoid it. That will help a bit when we have huge amount of pages on system with 16TiB memory. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/page_alloc.c: fix the value of fallback_migratetype in alloc_extfrag ↵	Srivatsa S. Bhat
	tracepoint() In the current code, the value of fallback_migratetype that is printed using the mm_page_alloc_extfrag tracepoint, is the value of the migratetype after it has been set to the preferred migratetype (if the ownership was changed). Obviously that wouldn't have been the original intent. (We already have a separate 'change_ownership' field to tell whether the ownership of the pageblock was changed from the fallback_migratetype to the preferred type.) The intent of the fallback_migratetype field is to show the migratetype from which we borrowed pages in order to satisfy the allocation request. So fix the code to print that value correctly. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan@kernel.org> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/page_allo.c: restructure free-page stealing code and fix a bug	Srivatsa S. Bhat
	The free-page stealing code in __rmqueue_fallback() is somewhat hard to follow, and has an incredible amount of subtlety hidden inside! First off, there is a minor bug in the reporting of change-of-ownership of pageblocks. Under some conditions, we try to move upto 'pageblock_nr_pages' no. of pages to the preferred allocation list. But we change the ownership of that pageblock to the preferred type only if we manage to successfully move atleast half of that pageblock (or if page_group_by_mobility_disabled is set). However, the current code ignores the latter part and sets the 'migratetype' variable to the preferred type, irrespective of whether we actually changed the pageblock migratetype of that block or not. So, the page_alloc_extfrag tracepoint can end up printing incorrect info (i.e., 'change_ownership' might be shown as 1 when it must have been 0). So fixing this involves moving the update of the 'migratetype' variable to the right place. But looking closer, we observe that the 'migratetype' variable is used subsequently for checks such as "is_migrate_cma()". Obviously the intent there is to check if the fallback type is MIGRATE_CMA, but since we already set the 'migratetype' variable to start_migratetype, we end up checking if the preferred type is MIGRATE_CMA!! To make things more interesting, this actually doesn't cause a bug in practice, because we never change anything if the fallback type is CMA. So, restructure the code in such a way that it is trivial to understand what is going on, and also fix the above mentioned bug. And while at it, also add a comment explaining the subtlety behind the migratetype used in the call to expand(). [akpm@linux-foundation.org: remove unneeded `inline', small coding-style fix] Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan@kernel.org> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/page_alloc.c: fix coding style and spelling	Pintu Kumar
	Fix all errors reported by checkpatch and some small spelling mistakes. Signed-off-by: Pintu Kumar <pintu.k@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	swap: make cluster allocation per-cpu	Shaohua Li
	swap cluster allocation is to get better request merge to improve performance. But the cluster is shared globally, if multiple tasks are doing swap, this will cause interleave disk access. While multiple tasks swap is quite common, for example, each numa node has a kswapd thread doing swap and multiple threads/processes doing direct page reclaim. ioscheduler can't help too much here, because tasks don't send swapout IO down to block layer in the meantime. Block layer does merge some IOs, but a lot not, depending on how many tasks are doing swapout concurrently. In practice, I've seen a lot of small size IO in swapout workloads. We makes the cluster allocation per-cpu here. The interleave disk access issue goes away. All tasks swapout to their own cluster, so swapout will become sequential, which can be easily merged to big size IO. If one CPU can't get its per-cpu cluster (for example, there is no free cluster anymore in the swap), it will fallback to scan swap_map. The CPU can still continue swap. We don't need recycle free swap entries of other CPUs. In my test (swap to a 2-disk raid0 partition), this improves around 10% swapout throughput, and request size is increased significantly. How does this impact swap readahead is uncertain though. On one side, page reclaim always isolates and swaps several adjancent pages, this will make page reclaim write the pages sequentially and benefit readahead. On the other side, several CPU write pages interleave means the pages don't live _sequentially_ but relatively _near_. In the per-cpu allocation case, if adjancent pages are written by different cpus, they will live relatively _far_. So how this impacts swap readahead depends on how many pages page reclaim isolates and swaps one time. If the number is big, this patch will benefit swap readahead. Of course, this is about sequential access pattern. The patch has no impact for random access pattern, because the new cluster allocation algorithm is just for SSD. Alternative solution is organizing swap layout to be per-mm instead of this per-cpu approach. In the per-mm layout, we allocate a disk range for each mm, so pages of one mm live in swap disk adjacently. per-mm layout has potential issues of lock contention if multiple reclaimers are swap pages from one mm. For a sequential workload, per-mm layout is better to implement swap readahead, because pages from the mm are adjacent in disk. But per-cpu layout isn't very bad in this workload, as page reclaim always isolates and swaps several pages one time, such pages will still live in disk sequentially and readahead can utilize this. For a random workload, per-mm layout isn't beneficial of request merge, because it's quite possible pages from different mm are swapout in the meantime and IO can't be merged in per-mm layout. while with per-cpu layout we can merge requests from any mm. Considering random workload is more popular in workloads with swap (and per-cpu approach isn't too bad for sequential workload too), I'm choosing per-cpu layout. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	swap: fix races exposed by swap discard	Shaohua Li
	The previous patch can expose races, according to Hugh: swapoff was sometimes failing with "Cannot allocate memory", coming from try_to_unuse()'s -ENOMEM: it needs to allow for swap_duplicate() failing on a free entry temporarily SWAP_MAP_BAD while being discarded. We should use ACCESS_ONCE() there, and whenever accessing swap_map locklessly; but rather than peppering it throughout try_to_unuse(), just declare swap_map with volatile. try_to_unuse() is accustomed to swap_map going down racily, but not necessarily to it jumping up from 0 to SWAP_MAP_BAD: we'll be safer to prevent that transition once SWP_WRITEOK is switched off, when it's a waste of time to issue discards anyway (swapon can do a whole discard). Another issue is: In swapin_readahead(), read_swap_cache_async() can read a bad swap entry, because we don't check if readahead swap entry is bad. This doesn't break anything but such swapin page is wasteful and can only be freed at page reclaim. We should avoid read such swap entry. And in discard, we mark swap entry SWAP_MAP_BAD and then switch it to normal when discard is finished. If readahead reads such swap entry, we have the same issue, so we much check if swap entry is bad too. Thanks Hugh to inspire swapin_readahead could use bad swap entry. [include Hugh's patch 'swap: fix swapoff ENOMEMs from discard'] Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	swap: make swap discard async	Shaohua Li
	swap can do cluster discard for SSD, which is good, but there are some problems here: 1. swap do the discard just before page reclaim gets a swap entry and writes the disk sectors. This is useless for high end SSD, because an overwrite to a sector implies a discard to original sector too. A discard + overwrite == overwrite. 2. the purpose of doing discard is to improve SSD firmware garbage collection. Idealy we should send discard as early as possible, so firmware can do something smart. Sending discard just after swap entry is freed is considered early compared to sending discard before write. Of course, if workload is already bound to gc speed, sending discard earlier or later doesn't make 3. block discard is a sync API, which will delay scan_swap_map() significantly. 4. Write and discard command can be executed parallel in PCIe SSD. Making swap discard async can make execution more efficiently. This patch makes swap discard async and moves discard to where swap entry is freed. Discard and write have no dependence now, so above issues can be avoided. Idealy we should do discard for any freed sectors, but some SSD discard is very slow. This patch still does discard for a whole cluster. My test does a several round of 'mmap, write, unmap', which will trigger a lot of swap discard. In a fusionio card, with this patch, the test runtime is reduced to 18% of the time without it, so around 5.5x faster. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	swap: change block allocation algorithm for SSD	Shaohua Li
	I'm using a fast SSD to do swap. scan_swap_map() sometimes uses up to 20~30% CPU time (when cluster is hard to find, the CPU time can be up to 80%), which becomes a bottleneck. scan_swap_map() scans a byte array to search a 256 page cluster, which is very slow. Here I introduced a simple algorithm to search cluster. Since we only care about 256 pages cluster, we can just use a counter to track if a cluster is free. Every 256 pages use one int to store the counter. If the counter of a cluster is 0, the cluster is free. All free clusters will be added to a list, so searching cluster is very efficient. With this, scap_swap_map() overhead disappears. This might help low end SD card swap too. Because if the cluster is aligned, SD firmware can do flash erase more efficiently. We only enable the algorithm for SSD. Hard disk swap isn't fast enough and has downside with the algorithm which might introduce regression (see below). The patch slightly changes which cluster is choosen. It always adds free cluster to list tail. This can help wear leveling for low end SSD too. And if no cluster found, the scan_swap_map() will do search from the end of last cluster. So if no cluster found, the scan_swap_map() will do search from the end of last free cluster, which is random. For SSD, this isn't a problem at all. Another downside is the cluster must be aligned to 256 pages, which will reduce the chance to find a cluster. I would expect this isn't a big problem for SSD because of the non-seek penality. (And this is the reason I only enable the algorithm for SSD). Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/page_alloc.c: use '__paginginit' instead of '__init'	Chen Gang
	set_pageblock_order() may be called when memory hotplug, so need use '__paginginit' instead of '__init'. The related warning: The function __meminit .free_area_init_node() references a function __init .set_pageblock_order(). If .set_pageblock_order is only used by .free_area_init_node then annotate .set_pageblock_order with a matching annotation. Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: fix negative left shift count when PAGE_SHIFT > 20	Jerry Zhou
	When PAGE_SHIFT > 20, the result of "20 - PAGE_SHIFT" is negative. The previous calculating here will generate an unexpected result. In addition, if PAGE_SIZE >= 1MB, The memory size of "numentries" was already integral multiple of 1MB. Signed-off-by: Jerry Zhou <uulinux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: replace strict_strtoul() with kstrtoul()	Jingoo Han
	The use of strict_strtoul() is not preferred, because strict_strtoul() is obsolete. Thus, kstrtoul() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: vmstats: track TLB flush stats on UP too	Dave Hansen
	The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled for SMP. UP systems do not do remote TLB flushes, so compile those counters out on UP. arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly. This is probably an optimization since both the mtrr code and __flush_tlb() write cr4. It would probably be safe to make that a flush_tlb_all() (and then get these statistics), but the mtrr code is ancient and I'm hesitant to touch it other than to just stick in the counters. [akpm@linux-foundation.org: tweak comments] Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: vmstats: tlb flush counters	Dave Hansen
	I was investigating some TLB flush scaling issues and realized that we do not have any good methods for figuring out how many TLB flushes we are doing. It would be nice to be able to do these in generic code, but the arch-independent calls don't explicitly specify whether we actually need to do remote flushes or not. In the end, we really need to know if we actually _did_ global vs. local invalidations, so that leaves us with few options other than to muck with the counters from arch-specific code. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/zswap.c: get swapper address_space by using macro	Sunghan Suh
	There is a proper macro to get the corresponding swapper address space from a swap entry. Instead of directly accessing "swapper_spaces" array, use the "swap_address_space" macro. Signed-off-by: Sunghan Suh <sunghan.suh@samsung.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: mmap_region: kill correct_wcount/inode, use allow_write_access()	Oleg Nesterov
	correct_wcount and inode in mmap_region() just complicate the code. This boolean was needed previously, when deny_write_access() was called before vma_merge(), now we can simply check VM_DENYWRITE and do allow_write_access() if it is set. allow_write_access() checks file != NULL, so this is safe even if it was possible to use VM_DENYWRITE && !file. Just we need to ensure we use the same file which was deny_write_access()'ed, so the patch also moves "file = vma->vm_file" down after allow_write_access(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Colin Cross <ccross@android.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: do_mmap_pgoff: cleanup the usage of file_inode()	Oleg Nesterov
	Simple cleanup. Move "struct inode *inode" variable into "if (file)" block to simplify the code and avoid the unnecessary check. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Colin Cross <ccross@android.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: shift VM_GROWS* check from mmap_region() to do_mmap_pgoff()	Oleg Nesterov
	mmap() doesn't allow the non-anonymous mappings with VM_GROWS* bit set. In particular this means that mmap_region()->vma_merge(file, vm_flags) must always fail if "vm_flags & VM_GROWS" is set incorrectly. So it does not make sense to check VM_GROWS* after we already allocated the new vma, the only caller, do_mmap_pgoff(), which can pass this flag can do the check itself. And this looks a bit more correct, mmap_region() already unmapped the old mapping at this stage. But if mmap() is going to fail, it should avoid do_munmap() if possible. Note: we check VM_GROWS at the end to ensure that do_mmap_pgoff() won't return EINVAL in the case when it currently returns another error code. Many thanks to Hugh who nacked the buggy v1. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/swapfile.c: convert to pr_foo()	Andrew Morton
	A few 80-col gymnastics were cleaned up as a result. Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	swap: warn when a swap area overflows the maximum size	Raymond Jennings
	It is possible to swapon a swap area that is too big for the pte width to handle. Presently this failure happens silently. Instead, emit a diagnostic to warn the user. Testing results, root prompt commands and kernel log messages: # lvresize /dev/system/swap --size 16G # mkswap /dev/system/swap # swapon /dev/system/swap Jul 7 04:27:22 warfang kernel: Adding 16777212k swap on /dev/mapper/system-swap. Priority:-1 extents:1 across:16777212k # lvresize /dev/system/swap --size 64G # mkswap /dev/system/swap # swapon /dev/system/swap Jul 7 04:27:22 warfang kernel: Truncating oversized swap area, only using 33554432k out of 67108860k Jul 7 04:27:22 warfang kernel: Adding 33554428k swap on /dev/mapper/system-swap. Priority:-1 extents:1 across:33554428k [akpm@linux-foundation.org: fix warning] Signed-off-by: Raymond Jennings <shentino@gmail.com> Acked-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm/madvise.c: fix coding-style errors	Vladimir Cernov
	This fixes following errors: - ERROR: "(foo)" should be "(foo )" - ERROR: "foo bar" should be "foo bar" Signed-off-by: Vladimir Cernov <gg.kaspersky@gmail.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	mm: mempolicy: turn vma_set_policy() into vma_dup_policy()	Oleg Nesterov
	Simple cleanup. Every user of vma_set_policy() does the same work, this looks a bit annoying imho. And the new trivial helper which does mpol_dup() + vma_set_policy() to simplify the callers. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	drivers/block/swim.c: remove unnecessary platform_set_drvdata()	Jingoo Han
	The driver core clears the driver data to NULL after device_release or on probe failure. Thus, it is not needed to manually clear the device driver data to NULL. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Jean Delvare <khali@linux-fr.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11	cciss: set max scatter gather entries to 32 on P600	Mike Miller
	At one time we used to set the maximum number of scatter gather elements on all Smart Array controllers to 32. At some point in time the firmware began to write the "appropriate" value for each controller into the config table. The cciss driver would then read that and set h->maxsgentries. h->maxsgentries = readl(&(h->cfgtable->MaxSGElements); On the P600 that value is 544. Under some workloads a significant performance reduction may result. This patch forces the P600 to use only 32 scatter gather elements. Other controllers are not affected. Signed-off-by: Mike Miller <mike.miller@hp.com> Signed-off-by: Dwight (Bud) Brown <bubrown@redhat.com> Signed-off-by: Tomas Henzl <thenzl@redhat.com> Acked-by: Stephen M. Cameron <steve.cameron@hp.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>