aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2011-07-11Merge branch 'master' into for-nextJiri Kosina
Sync with Linus' tree to be able to apply pending patches that are based on newer code already present upstream.
2011-07-09writeback: trace global_dirty_stateWu Fengguang
Add trace event balance_dirty_state for showing the global dirty page counts and thresholds at each global_dirty_limits() invocation. This will cover the callers throttle_vm_writeout(), over_bground_thresh() and each balance_dirty_pages() loop. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: introduce max-pause and pass-good dirty limitsWu Fengguang
The max-pause limit helps to keep the sleep time inside balance_dirty_pages() within MAX_PAUSE=200ms. The 200ms max sleep means per task rate limit of 8pages/200ms=160KB/s when dirty exceeded, which normally is enough to stop dirtiers from continue pushing the dirty pages high, unless there are a sufficient large number of slow dirtiers (eg. 500 tasks doing 160KB/s will still sum up to 80MB/s, exceeding the write bandwidth of a slow disk and hence accumulating more and more dirty pages). The pass-good limit helps to let go of the good bdi's in the presence of a blocked bdi (ie. NFS server not responding) or slow USB disk which for some reason build up a large number of initial dirty pages that refuse to go away anytime soon. For example, given two bdi's A and B and the initial state bdi_thresh_A = dirty_thresh / 2 bdi_thresh_B = dirty_thresh / 2 bdi_dirty_A = dirty_thresh / 2 bdi_dirty_B = dirty_thresh / 2 Then A get blocked, after a dozen seconds bdi_thresh_A = 0 bdi_thresh_B = dirty_thresh bdi_dirty_A = dirty_thresh / 2 bdi_dirty_B = dirty_thresh / 2 The (bdi_dirty_B < bdi_thresh_B) test is now useless and the dirty pages will be effectively throttled by condition (nr_dirty < dirty_thresh). This has two problems: (1) we lose the protections for light dirtiers (2) balance_dirty_pages() effectively becomes IO-less because the (bdi_nr_reclaimable > bdi_thresh) test won't be true. This is good for IO, but balance_dirty_pages() loses an important way to break out of the loop which leads to more spread out throttle delays. DIRTY_PASSGOOD_AREA can eliminate the above issues. The only problem is, DIRTY_PASSGOOD_AREA needs to be defined as 2 to fully cover the above example while this patch uses the more conservative value 8 so as not to surprise people with too many dirty pages than expected. The max-pause limit won't noticeably impact the speed dirty pages are knocked down when there is a sudden drop of global/bdi dirty thresholds. Because the heavy dirties will be throttled below 160KB/s which is slow enough. It does help to avoid long dirty throttle delays and especially will make light dirtiers more responsive. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: introduce smoothed global dirty limitWu Fengguang
The start of a heavy weight application (ie. KVM) may instantly knock down determine_dirtyable_memory() if the swap is not enabled or full. global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi dirty thresholds that are _much_ lower than the global/bdi dirty pages. balance_dirty_pages() will then heavily throttle all dirtiers including the light ones, until the dirty pages drop below the new dirty thresholds. During this _deep_ dirty-exceeded state, the system may appear rather unresponsive to the users. About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty threshold to heavy dirtiers than light ones, and the dirty pages will be throttled around the heavy dirtiers' dirty threshold and reasonably below the light dirtiers' dirty threshold. In this state, only the heavy dirtiers will be throttled and the dirty pages are carefully controlled to not exceed the light dirtiers' dirty threshold. However if the threshold itself suddenly drops below the number of dirty pages, the light dirtiers will get heavily throttled. So introduce global_dirty_limit for tracking the global dirty threshold with policies - follow downwards slowly - follow up in one shot global_dirty_limit can effectively mask out the impact of sudden drop of dirtyable memory. It will be used in the next patch for two new type of dirty limits. Note that the new dirty limits are not going to avoid throttling the light dirtiers, but could limit their sleep time to 200ms. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: consolidate variable names in balance_dirty_pages()Wu Fengguang
Introduce nr_dirty = NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS in order to simplify many tests in the following patches. balance_dirty_pages() will eventually care only about the dirty sums besides nr_writeback. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: show bdi write bandwidth in debugfsWu Fengguang
Add a "BdiWriteBandwidth" entry and indent others in /debug/bdi/*/stats. btw, increase digital field width to 10, for keeping the possibly huge BdiWritten number aligned at least for desktop systems. Impact: this could break user space tools if they are dumb enough to depend on the number of white spaces. CC: Theodore Ts'o <tytso@mit.edu> CC: Jan Kara <jack@suse.cz> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: bdi write bandwidth estimationWu Fengguang
The estimation value will start from 100MB/s and adapt to the real bandwidth in seconds. It tries to update the bandwidth only when disk is fully utilized. Any inactive period of more than one second will be skipped. The estimated bandwidth will be reflecting how fast the device can writeout when _fully utilized_, and won't drop to 0 when it goes idle. The value will remain constant at disk idle time. At busy write time, if not considering fluctuations, it will also remain high unless be knocked down by possible concurrent reads that compete for the disk time and bandwidth with async writes. The estimation is not done purely in the flusher because there is no guarantee for write_cache_pages() to return timely to update bandwidth. The bdi->avg_write_bandwidth smoothing is very effective for filtering out sudden spikes, however may be a little biased in long term. The overheads are low because the bdi bandwidth update only occurs at 200ms intervals. The 200ms update interval is suitable, because it's not possible to get the real bandwidth for the instance at all, due to large fluctuations. The NFS commits can be as large as seconds worth of data. One XFS completion may be as large as half second worth of data if we are going to increase the write chunk to half second worth of data. In ext4, fluctuations with time period of around 5 seconds is observed. And there is another pattern of irregular periods of up to 20 seconds on SSD tests. That's why we are not only doing the estimation at 200ms intervals, but also averaging them over a period of 3 seconds and then go further to do another level of smoothing in avg_write_bandwidth. CC: Li Shaohua <shaohua.li@intel.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: account per-bdi accumulated written pagesJan Kara
Introduce the BDI_WRITTEN counter. It will be used for estimating the bdi's write bandwidth. Peter Zijlstra <a.p.zijlstra@chello.nl>: Move BDI_WRITTEN accounting into __bdi_writeout_inc(). This will cover and fix fuse, which only calls bdi_writeout_inc(). CC: Michael Rubin <mrubin@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: make writeback_control.nr_to_write straightWu Fengguang
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(), and initialize the struct writeback_control there. struct writeback_control is basically designed to control writeback of a single file, but we keep abuse it for writing multiple files in writeback_sb_inodes() and its callers. It immediately clean things up, e.g. suddenly wbc.nr_to_write vs work->nr_pages starts to make sense, and instead of saving and restoring pages_skipped in writeback_sb_inodes it can always start with a clean zero value. It also makes a neat IO pattern change: large dirty files are now written in the full 4MB writeback chunk size, rather than whatever remained quota in wbc->nr_to_write. Acked-by: Jan Kara <jack@suse.cz> Proposed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-08mm/nommu.c: fix remap_pfn_range()Bob Liu
remap_pfn_range() means map physical address pfn<<PAGE_SHIFT to user addr. For nommu arch it's implemented by vma->vm_start = pfn << PAGE_SHIFT which is wrong acroding the original meaning of this function. And some driver developer using remap_pfn_range() with correct parameter will get unexpected result because vm_start is changed. It should be implementd like addr = pfn << PAGE_SHIFT but which is meanless on nommu arch, this patch just make it simply return. Parameter name and setting of vma->vm_flags also be fixed. Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: David Howells <dhowells@redhat.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Bob Liu <lliubbo@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08memcg: fix numa scan information update to be triggered by memory eventKAMEZAWA Hiroyuki
commit 889976dbcb12 ("memcg: reclaim memory from nodes in round-robin order") adds an numa node round-robin for memcg. But the information is updated once per 10sec. This patch changes the update trigger from jiffies to memcg's event count. After this patch, numa scan information will be updated when we see 1024 events of pagein/pageout under a memcg. [akpm@linux-foundation.org: attempt to repair code layout] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08memcg: fix reclaimable lru check in memcgKAMEZAWA Hiroyuki
Now, in mem_cgroup_hierarchical_reclaim(), mem_cgroup_local_usage() is used for checking whether the memcg contains reclaimable pages or not. If no pages in it, the routine skips it. But, mem_cgroup_local_usage() contains Unevictable pages and cannot handle "noswap" condition correctly. This doesn't work on a swapless system. This patch adds test_mem_cgroup_reclaimable() and replaces mem_cgroup_local_usage(). test_mem_cgroup_reclaimable() see LRU counter and returns correct answer to the caller. And this new function has "noswap" argument and can see only FILE LRU if necessary. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix kerneldoc layout] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08mm: __tlb_remove_page() check the correct batchShaohua Li
__tlb_remove_page() switches to a new batch page, but still checks space in the old batch. This check always fails, and causes a forced tlb flush. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08mm: vmscan: only read new_classzone_idx from pgdat when reclaiming successfullyMel Gorman
During allocator-intensive workloads, kswapd will be woken frequently causing free memory to oscillate between the high and min watermark. This is expected behaviour. Unfortunately, if the highest zone is small, a problem occurs. When balance_pgdat() returns, it may be at a lower classzone_idx than it started because the highest zone was unreclaimable. Before checking if it should go to sleep though, it checks pgdat->classzone_idx which when there is no other activity will be MAX_NR_ZONES-1. It interprets this as it has been woken up while reclaiming, skips scheduling and reclaims again. As there is no useful reclaim work to do, it enters into a loop of shrinking slab consuming loads of CPU until the highest zone becomes reclaimable for a long period of time. There are two problems here. 1) If the returned classzone or order is lower, it'll continue reclaiming without scheduling. 2) if the highest zone was marked unreclaimable but balance_pgdat() returns immediately at DEF_PRIORITY, the new lower classzone is not communicated back to kswapd() for sleeping. This patch does two things that are related. If the end_zone is unreclaimable, this information is communicated back. Second, if the classzone or order was reduced due to failing to reclaim, new information is not read from pgdat and instead an attempt is made to go to sleep. Due to this, it is also necessary that pgdat->classzone_idx be initialised each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as wakeups. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Pádraig Brady <P@draigBrady.com> Tested-by: Pádraig Brady <P@draigBrady.com> Tested-by: Andrew Lutomirski <luto@mit.edu> Acked-by: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08mm: vmscan: evaluate the watermarks against the correct classzoneMel Gorman
When deciding if kswapd is sleeping prematurely, the classzone is taken into account but this is different to what balance_pgdat() and the allocator are doing. Specifically, the DMA zone will be checked based on the classzone used when waking kswapd which could be for a GFP_KERNEL or GFP_HIGHMEM request. The lowmem reserve limit kicks in, the watermark is not met and kswapd thinks it's sleeping prematurely keeping kswapd awake in error. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Pádraig Brady <P@draigBrady.com> Tested-by: Pádraig Brady <P@draigBrady.com> Tested-by: Andrew Lutomirski <luto@mit.edu> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08mm: vmscan: do not apply pressure to slab if we are not applying pressure to ↵Mel Gorman
zone During allocator-intensive workloads, kswapd will be woken frequently causing free memory to oscillate between the high and min watermark. This is expected behaviour. When kswapd applies pressure to zones during node balancing, it checks if the zone is above a high+balance_gap threshold. If it is, it does not apply pressure but it unconditionally shrinks slab on a global basis which is excessive. In the event kswapd is being kept awake due to a high small unreclaimable zone, it skips zone shrinking but still calls shrink_slab(). Once pressure has been applied, the check for zone being unreclaimable is being made before the check is made if all_unreclaimable should be set. This miss of unreclaimable can cause has_under_min_watermark_zone to be set due to an unreclaimable zone preventing kswapd backing off on congestion_wait(). Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Pádraig Brady <P@draigBrady.com> Tested-by: Pádraig Brady <P@draigBrady.com> Tested-by: Andrew Lutomirski <luto@mit.edu> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-08mm: vmscan: correct check for kswapd sleeping in sleeping_prematurelyMel Gorman
During allocator-intensive workloads, kswapd will be woken frequently causing free memory to oscillate between the high and min watermark. This is expected behaviour. Unfortunately, if the highest zone is small, a problem occurs. This seems to happen most with recent sandybridge laptops but it's probably a co-incidence as some of these laptops just happen to have a small Normal zone. The reproduction case is almost always during copying large files that kswapd pegs at 100% CPU until the file is deleted or cache is dropped. The problem is mostly down to sleeping_prematurely() keeping kswapd awake when the highest zone is small and unreclaimable and compounded by the fact we shrink slabs even when not shrinking zones causing a lot of time to be spent in shrinkers and a lot of memory to be reclaimed. Patch 1 corrects sleeping_prematurely to check the zones matching the classzone_idx instead of all zones. Patch 2 avoids shrinking slab when we are not shrinking a zone. Patch 3 notes that sleeping_prematurely is checking lower zones against a high classzone which is not what allocators or balance_pgdat() is doing leading to an artifical belief that kswapd should be still awake. Patch 4 notes that when balance_pgdat() gives up on a high zone that the decision is not communicated to sleeping_prematurely() This problem affects 2.6.38.8 for certain and is expected to affect 2.6.39 and 3.0-rc4 as well. If accepted, they need to go to -stable to be picked up by distros and this series is against 3.0-rc4. I've cc'd people that reported similar problems recently to see if they still suffer from the problem and if this fixes it. This patch: correct the check for kswapd sleeping in sleeping_prematurely() During allocator-intensive workloads, kswapd will be woken frequently causing free memory to oscillate between the high and min watermark. This is expected behaviour. A problem occurs if the highest zone is small. balance_pgdat() only considers unreclaimable zones when priority is DEF_PRIORITY but sleeping_prematurely considers all zones. It's possible for this sequence to occur 1. kswapd wakes up and enters balance_pgdat() 2. At DEF_PRIORITY, marks highest zone unreclaimable 3. At DEF_PRIORITY-1, ignores highest zone setting end_zone 4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from highest zone, clearing all_unreclaimable. Highest zone is still unbalanced 5. kswapd returns and calls sleeping_prematurely 6. sleeping_prematurely looks at *all* zones, not just the ones being considered by balance_pgdat. The highest small zone has all_unreclaimable cleared but the zone is not balanced. all_zones_ok is false so kswapd stays awake This patch corrects the behaviour of sleeping_prematurely to check the zones balance_pgdat() checked. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Pádraig Brady <P@draigBrady.com> Tested-by: Pádraig Brady <P@draigBrady.com> Tested-by: Andrew Lutomirski <luto@mit.edu> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-07SLUB: Fix missing <linux/stacktrace.h> includePekka Enberg
This fixes the following build breakage commit d6543e3 ("slub: Enable backtrace for create/delete points"): CC mm/slub.o mm/slub.c: In function ‘set_track’: mm/slub.c:428: error: storage size of ‘trace’ isn’t known mm/slub.c:435: error: implicit declaration of function ‘save_stack_trace’ mm/slub.c:428: warning: unused variable ‘trace’ make[1]: *** [mm/slub.o] Error 1 make: *** [mm/slub.o] Error 2 Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-07slub: reduce overhead of slub_debugMarcin Slusarz
slub checks for poison one byte by one, which is highly inefficient and shows up frequently as a highest cpu-eater in perf top. Joining reads gives nice speedup: (Compiling some project with different options) make -j12 make clean slub_debug disabled: 1m 27s 1.2 s slub_debug enabled: 1m 46s 7.6 s slub_debug enabled + this patch: 1m 33s 3.2 s check_bytes still shows up high, but not always at the top. Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Cc: linux-mm@kvack.org Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-07slub: Add method to verify memory is not freedBen Greear
This is for tracking down suspect memory usage. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-07slub: Enable backtrace for create/delete pointsBen Greear
This patch attempts to grab a backtrace for the creation and deletion points of the slub object. When a fault is detected, we can then get a better idea of where the item was deleted. Example output from debugging some funky nfs/rpc behaviour: ============================================================================= BUG kmalloc-64: Object is on free-list ----------------------------------------------------------------------------- INFO: Allocated in rpcb_getport_async+0x39c/0x5a5 [sunrpc] age=381 cpu=3 pid=3750 __slab_alloc+0x348/0x3ba kmem_cache_alloc_trace+0x67/0xe7 rpcb_getport_async+0x39c/0x5a5 [sunrpc] call_bind+0x70/0x75 [sunrpc] __rpc_execute+0x78/0x24b [sunrpc] rpc_execute+0x3d/0x42 [sunrpc] rpc_run_task+0x79/0x81 [sunrpc] rpc_call_sync+0x3f/0x60 [sunrpc] rpc_ping+0x42/0x58 [sunrpc] rpc_create+0x4aa/0x527 [sunrpc] nfs_create_rpc_client+0xb1/0xf6 [nfs] nfs_init_client+0x3b/0x7d [nfs] nfs_get_client+0x453/0x5ab [nfs] nfs_create_server+0x10b/0x437 [nfs] nfs_fs_mount+0x4ca/0x708 [nfs] mount_fs+0x6b/0x152 INFO: Freed in rpcb_map_release+0x3f/0x44 [sunrpc] age=30 cpu=2 pid=29049 __slab_free+0x57/0x150 kfree+0x107/0x13a rpcb_map_release+0x3f/0x44 [sunrpc] rpc_release_calldata+0x12/0x14 [sunrpc] rpc_free_task+0x59/0x61 [sunrpc] rpc_final_put_task+0x82/0x8a [sunrpc] __rpc_execute+0x23c/0x24b [sunrpc] rpc_async_schedule+0x10/0x12 [sunrpc] process_one_work+0x230/0x41d worker_thread+0x133/0x217 kthread+0x7d/0x85 kernel_thread_helper+0x4/0x10 INFO: Slab 0xffffea00029aa470 objects=20 used=9 fp=0xffff8800be7830d8 flags=0x20000000004081 INFO: Object 0xffff8800be7830d8 @offset=4312 fp=0xffff8800be7827a8 Bytes b4 0xffff8800be7830c8: 87 a8 96 00 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a .�......ZZZZZZZZ Object 0xffff8800be7830d8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object 0xffff8800be7830e8: 6b 6b 6b 6b 01 08 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkk..kkkkkkkkkk Object 0xffff8800be7830f8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object 0xffff8800be783108: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk� Redzone 0xffff8800be783118: bb bb bb bb bb bb bb bb ������������� Padding 0xffff8800be783258: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ Pid: 29049, comm: kworker/2:2 Not tainted 3.0.0-rc4+ #8 Call Trace: [<ffffffff811055c3>] print_trailer+0x131/0x13a [<ffffffff81105601>] object_err+0x35/0x3e [<ffffffff8110746f>] verify_mem_not_deleted+0x7a/0xb7 [<ffffffffa02851b5>] rpcb_getport_done+0x23/0x126 [sunrpc] [<ffffffffa027d0ba>] rpc_exit_task+0x3f/0x6d [sunrpc] [<ffffffffa027d4ab>] __rpc_execute+0x78/0x24b [sunrpc] [<ffffffffa027d6c0>] ? rpc_execute+0x42/0x42 [sunrpc] [<ffffffffa027d6d0>] rpc_async_schedule+0x10/0x12 [sunrpc] [<ffffffff810611b7>] process_one_work+0x230/0x41d [<ffffffff81061102>] ? process_one_work+0x17b/0x41d [<ffffffff81063613>] worker_thread+0x133/0x217 [<ffffffff810634e0>] ? manage_workers+0x191/0x191 [<ffffffff81066e10>] kthread+0x7d/0x85 [<ffffffff81485924>] kernel_thread_helper+0x4/0x10 [<ffffffff8147eb18>] ? retint_restore_args+0x13/0x13 [<ffffffff81066d93>] ? __init_kthread_worker+0x56/0x56 [<ffffffff81485920>] ? gs_change+0x13/0x13 Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-02slub: Not necessary to check for empty slab on load_freelistChristoph Lameter
load_freelist is now only branched to only if there are objects available. So no need to check the object variable for NULL. Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-02slub: fast release on full slabChristoph Lameter
Make deactivation occur implicitly while checking out the current freelist. This avoids one cmpxchg operation on a slab that is now fully in use. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-02slub: Add statistics for the case that the current slab does not match the nodeChristoph Lameter
Slub reloads the per cpu slab if the page does not satisfy the NUMA condition. Track those reloads since doing so has a performance impact. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-02slub: Get rid of the another_slab labelChristoph Lameter
We can avoid deactivate slab in special cases if we do the deactivation of slabs in each code flow that leads to new_slab. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-02slub: Avoid disabling interrupts in free slowpathChristoph Lameter
Disabling interrupts can be avoided now. However, list operation still require disabling interrupts since allocations can occur from interrupt contexts and there is no way to perform atomic list operations. The acquition of the list_lock therefore has to disable interrupts as well. Dropping interrupt handling significantly simplifies the slowpath. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-02slub: Disable interrupts in free_debug processingChristoph Lameter
We will be calling free_debug_processing with interrupts disabled in some case when the later patches are applied. Some of the functions called by free_debug_processing expect interrupts to be off. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-02slub: Invert locking and avoid slab lockChristoph Lameter
Locking slabs is no longer necesary if the arch supports cmpxchg operations and if no debuggin features are used on a slab. If the arch does not support cmpxchg then we fallback to use the slab lock to do a cmpxchg like operation. The patch also changes the lock order. Slab locks are subsumed to the node lock now. With that approach slab_trylocking is no longer necessary. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-02slub: Rework allocator fastpathsChristoph Lameter
Rework the allocation paths so that updates of the page freelist, frozen state and number of objects use cmpxchg_double_slab(). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-02slub: Pass kmem_cache struct to lock and freeze slabChristoph Lameter
We need more information about the slab for the cmpxchg implementation. Signed-off-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-02slub: explicit list_lock takingChristoph Lameter
The allocator fastpath rework does change the usage of the list_lock. Remove the list_lock processing from the functions that hide them from the critical sections and move them into those critical sections. This in turn simplifies the support functions (no __ variant needed anymore) and simplifies the lock handling on bootstrap. Inline add_partial since it becomes pretty simple. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-02slub: Add cmpxchg_double_slab()Christoph Lameter
Add a function that operates on the second doubleword in the page struct and manipulates the object counters, the freelist and the frozen attribute. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-02slub: Move page->frozen handling near where the page->freelist handling occursChristoph Lameter
This is necessary because the frozen bit has to be handled in the same cmpxchg_double with the freelist and the counters. Signed-off-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-02slub: Do not use frozen page flag but a bit in the page countersChristoph Lameter
Do not use a page flag for the frozen bit. It needs to be part of the state that is handled with cmpxchg_double(). So use a bit in the counter struct in the page struct for that purpose. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-02slub: Push irq disable into allocate_slab()Christoph Lameter
Do the irq handling in allocate_slab() instead of __slab_alloc(). __slab_alloc() is already cluttered and allocate_slab() is already fiddling around with gfp flags. v6->v7: Only increment ORDER_FALLBACK if we get a page during fallback Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-06-27memcg: fix direct softlimit reclaim to be called in limit pathKAMEZAWA Hiroyuki
Commit d149e3b25d7c ("memcg: add the soft_limit reclaim in global direct reclaim") adds a softlimit hook to shrink_zones(). By this, soft limit is called as try_to_free_pages() do_try_to_free_pages() shrink_zones() mem_cgroup_soft_limit_reclaim() Then, direct reclaim is memcg softlimit hint aware, now. But, the memory cgroup's "limit" path can call softlimit shrinker. try_to_free_mem_cgroup_pages() do_try_to_free_pages() shrink_zones() mem_cgroup_soft_limit_reclaim() This will cause a global reclaim when a memcg hits limit. This is bug. soft_limit_reclaim() should be called when scanning_global_lru(sc) == true. And the commit adds a variable "total_scanned" for counting softlimit scanned pages....it's not "total". This patch removes the variable and update sc->nr_scanned instead of it. This will affect shrink_slab()'s scan condition but, global LRU is scanned by softlimit and I think this change makes sense. TODO: avoid too much scanning of a zone when softlimit did enough work. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Ying Han <yinghan@google.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27mm: fix assertion mapping->nrpages == 0 in end_writeback()Jan Kara
Under heavy memory and filesystem load, users observe the assertion mapping->nrpages == 0 in end_writeback() trigger. This can be caused by page reclaim reclaiming the last page from a mapping in the following race: CPU0 CPU1 ... shrink_page_list() __remove_mapping() __delete_from_page_cache() radix_tree_delete() evict_inode() truncate_inode_pages() truncate_inode_pages_range() pagevec_lookup() - finds nothing end_writeback() mapping->nrpages != 0 -> BUG page->mapping = NULL mapping->nrpages-- Fix the problem by doing a reliable check of mapping->nrpages under mapping->tree_lock in end_writeback(). Analyzed by Jay <jinshan.xiong@whamcloud.com>, lost in LKML, and dug out by Miklos Szeredi <mszeredi@suse.de>. Cc: Jay <jinshan.xiong@whamcloud.com> Cc: Miklos Szeredi <mszeredi@suse.de> Signed-off-by: Jan Kara <jack@suse.cz> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27mm/memory-failure.c: fix spinlock vs mutex orderPeter Zijlstra
We cannot take a mutex while holding a spinlock, so flip the order and fix the locking documentation. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27tmpfs: add shmem_read_mapping_page_gfpHugh Dickins
Although it is used (by i915) on nothing but tmpfs, read_cache_page_gfp() is unsuited to tmpfs, because it inserts a page into pagecache before calling the filesystem's ->readpage: tmpfs may have pages in swapcache which only it knows how to locate and switch to filecache. At present tmpfs provides a ->readpage method, and copes with this by copying pages; but soon we can simplify it by removing its ->readpage. Provide shmem_read_mapping_page_gfp() now, ready for that transition, Export shmem_read_mapping_page_gfp() and add it to list in shmem_fs.h, with shmem_read_mapping_page() inline for the common mapping_gfp case. (shmem_read_mapping_page_gfp or shmem_read_cache_page_gfp? Generally the read_mapping_page functions use the mapping's ->readpage, and the read_cache_page functions use the supplied filler, so I think read_cache_page_gfp was slightly misnamed.) Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27tmpfs: take control of its truncate_rangeHugh Dickins
2.6.35's new truncate convention gave tmpfs the opportunity to control its file truncation, no longer enforced from outside by vmtruncate(). We shall want to build upon that, to handle pagecache and swap together. Slightly redefine the ->truncate_range interface: let it now be called between the unmap_mapping_range()s, with the filesystem responsible for doing the truncate_inode_pages_range() from it - just as the filesystem is nowadays responsible for doing that from its ->setattr. Let's rename shmem_notify_change() to shmem_setattr(). Instead of calling the generic truncate_setsize(), bring that code in so we can call shmem_truncate_range() - which will later be updated to perform its own variant of truncate_inode_pages_range(). Remove the punch_hole unmap_mapping_range() from shmem_truncate_range(): now that the COW's unmap_mapping_range() comes after ->truncate_range, there is no need to call it a third time. Export shmem_truncate_range() and add it to the list in shmem_fs.h, so that i915_gem_object_truncate() can call it explicitly in future; get this patch in first, then update drm/i915 once this is available (until then, i915 will just be doing the truncate_inode_pages() twice). Though introduced five years ago, no other filesystem is implementing ->truncate_range, and its only other user is madvise(,,MADV_REMOVE): we expect to convert it to fallocate(,FALLOC_FL_PUNCH_HOLE,,) shortly, whereupon ->truncate_range can be removed from inode_operations - shmem_truncate_range() will help i915 across that transition too. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27mm: move shmem prototypes to shmem_fs.hHugh Dickins
Before adding any more global entry points into shmem.c, gather such prototypes into shmem_fs.h. Remove mm's own declarations from swap.h, but for now leave the ones in mm.h: because shmem_file_setup() and shmem_zero_setup() are called from various places, and we should not force other subsystems to update immediately. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-27mm: move vmtruncate_range to truncate.cHugh Dickins
You would expect to find vmtruncate_range() next to vmtruncate() in mm/truncate.c: move it there. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-22mm, hotplug: protect zonelist building with zonelists_mutexDavid Rientjes
Commit 959ecc48fc75 ("mm/memory_hotplug.c: fix building of node hotplug zonelist") does not protect the build_all_zonelists() call with zonelists_mutex as needed. This can lead to races in constructing zonelist ordering if a concurrent build is underway. Protecting this with lock_memory_hotplug() is insufficient since zonelists can be rebuild though sysfs as well. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-22mm, hotplug: fix error handling in mem_online_node()David Rientjes
The error handling in mem_online_node() is incorrect: hotadd_new_pgdat() returns NULL if the new pgdat could not have been allocated and a pointer to it otherwise. mem_online_node() should fail if hotadd_new_pgdat() fails, not the inverse. This fixes an issue when memoryless nodes are not onlined and their sysfs interface is not registered when their first cpu is brought up. The bug was introduced by commit cf23422b9d76 ("cpu/mem hotplug: enable CPUs online before local memory online") iow v2.6.35. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-22ptrace: kill trivial tracehooksTejun Heo
At this point, tracehooks aren't useful to mainline kernel and mostly just add an extra layer of obfuscation. Although they have comments, without actual in-kernel users, it is difficult to tell what are their assumptions and they're actually trying to achieve. To mainline kernel, they just aren't worth keeping around. This patch kills the following trivial tracehooks. * Ones testing whether task is ptraced. Replace with ->ptrace test. tracehook_expect_breakpoints() tracehook_consider_ignored_signal() tracehook_consider_fatal_signal() * ptrace_event() wrappers. Call directly. tracehook_report_exec() tracehook_report_exit() tracehook_report_vfork_done() * ptrace_release_task() wrapper. Call directly. tracehook_finish_release_task() * noop tracehook_prepare_release_task() tracehook_report_death() This doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2011-06-22ptrace: kill task_ptrace()Tejun Heo
task_ptrace(task) simply dereferences task->ptrace and isn't even used consistently only adding confusion. Kill it and directly access ->ptrace instead. This doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
2011-06-20writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()Wu Fengguang
This helps prevent tmpfs dirtiers from skewing the per-cpu bdp_ratelimits. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-17mm: avoid anon_vma_chain allocation under anon_vma lockLinus Torvalds
Hugh Dickins points out that lockdep (correctly) spots a potential deadlock on the anon_vma lock, because we now do a GFP_KERNEL allocation of anon_vma_chain while doing anon_vma_clone(). The problem is that page reclaim will want to take the anon_vma lock of any anonymous pages that it will try to reclaim. So re-organize the code in anon_vma_clone() slightly: first do just a GFP_NOWAIT allocation, which will usually work fine. But if that fails, let's just drop the lock and re-do the allocation, now with GFP_KERNEL. End result: not only do we avoid the locking problem, this also ends up getting better concurrency in case the allocation does need to block. Tim Chen reports that with all these anon_vma locking tweaks, we're now almost back up to the spinlock performance. Reported-and-tested-by: Hugh Dickins <hughd@google.com> Tested-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-17mm: avoid repeated anon_vma lock/unlock sequences in unlink_anon_vmas()Peter Zijlstra
This matches the anon_vma_clone() case, and uses the same lock helper functions. Because of the need to potentially release the anon_vma's, it's a bit more complex, though. We traverse the 'vma->anon_vma_chain' in two phases: the first loop gets the anon_vma lock (with the helper function that only takes the lock once for the whole loop), and removes any entries that don't need any more processing. The second phase just traverses the remaining list entries (without holding the anon_vma lock), and does any actual freeing of the anon_vma's that is required. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Hugh Dickins <hughd@google.com> Tested-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-17mm: avoid repeated anon_vma lock/unlock sequences in anon_vma_clone()Linus Torvalds
In anon_vma_clone() we traverse the vma->anon_vma_chain of the source vma, locking the anon_vma for each entry. But they are all going to have the same root entry, which means that we're locking and unlocking the same lock over and over again. Which is expensive in locked operations, but can get _really_ expensive when that root entry sees any kind of lock contention. In fact, Tim Chen reports a big performance regression due to this: when we switched to use a mutex instead of a spinlock, the contention case gets much worse. So to alleviate this all, this commit creates a small helper function (lock_anon_vma_root()) that can be used to take the lock just once rather than taking and releasing it over and over again. We still have the same "take the lock and release" it behavior in the exit path (in unlink_anon_vmas()), but that one is a bit harder to fix since we're actually freeing the anon_vma entries as we go, and that will touch the lock too. Reported-and-tested-by: Tim Chen <tim.c.chen@linux.intel.com> Tested-by: Hugh Dickins <hughd@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>