aboutsummaryrefslogtreecommitdiff
path: root/mm/page-writeback.c
AgeCommit message (Collapse)Author
2011-07-09writeback: trace global_dirty_stateWu Fengguang
Add trace event balance_dirty_state for showing the global dirty page counts and thresholds at each global_dirty_limits() invocation. This will cover the callers throttle_vm_writeout(), over_bground_thresh() and each balance_dirty_pages() loop. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: introduce max-pause and pass-good dirty limitsWu Fengguang
The max-pause limit helps to keep the sleep time inside balance_dirty_pages() within MAX_PAUSE=200ms. The 200ms max sleep means per task rate limit of 8pages/200ms=160KB/s when dirty exceeded, which normally is enough to stop dirtiers from continue pushing the dirty pages high, unless there are a sufficient large number of slow dirtiers (eg. 500 tasks doing 160KB/s will still sum up to 80MB/s, exceeding the write bandwidth of a slow disk and hence accumulating more and more dirty pages). The pass-good limit helps to let go of the good bdi's in the presence of a blocked bdi (ie. NFS server not responding) or slow USB disk which for some reason build up a large number of initial dirty pages that refuse to go away anytime soon. For example, given two bdi's A and B and the initial state bdi_thresh_A = dirty_thresh / 2 bdi_thresh_B = dirty_thresh / 2 bdi_dirty_A = dirty_thresh / 2 bdi_dirty_B = dirty_thresh / 2 Then A get blocked, after a dozen seconds bdi_thresh_A = 0 bdi_thresh_B = dirty_thresh bdi_dirty_A = dirty_thresh / 2 bdi_dirty_B = dirty_thresh / 2 The (bdi_dirty_B < bdi_thresh_B) test is now useless and the dirty pages will be effectively throttled by condition (nr_dirty < dirty_thresh). This has two problems: (1) we lose the protections for light dirtiers (2) balance_dirty_pages() effectively becomes IO-less because the (bdi_nr_reclaimable > bdi_thresh) test won't be true. This is good for IO, but balance_dirty_pages() loses an important way to break out of the loop which leads to more spread out throttle delays. DIRTY_PASSGOOD_AREA can eliminate the above issues. The only problem is, DIRTY_PASSGOOD_AREA needs to be defined as 2 to fully cover the above example while this patch uses the more conservative value 8 so as not to surprise people with too many dirty pages than expected. The max-pause limit won't noticeably impact the speed dirty pages are knocked down when there is a sudden drop of global/bdi dirty thresholds. Because the heavy dirties will be throttled below 160KB/s which is slow enough. It does help to avoid long dirty throttle delays and especially will make light dirtiers more responsive. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: introduce smoothed global dirty limitWu Fengguang
The start of a heavy weight application (ie. KVM) may instantly knock down determine_dirtyable_memory() if the swap is not enabled or full. global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi dirty thresholds that are _much_ lower than the global/bdi dirty pages. balance_dirty_pages() will then heavily throttle all dirtiers including the light ones, until the dirty pages drop below the new dirty thresholds. During this _deep_ dirty-exceeded state, the system may appear rather unresponsive to the users. About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty threshold to heavy dirtiers than light ones, and the dirty pages will be throttled around the heavy dirtiers' dirty threshold and reasonably below the light dirtiers' dirty threshold. In this state, only the heavy dirtiers will be throttled and the dirty pages are carefully controlled to not exceed the light dirtiers' dirty threshold. However if the threshold itself suddenly drops below the number of dirty pages, the light dirtiers will get heavily throttled. So introduce global_dirty_limit for tracking the global dirty threshold with policies - follow downwards slowly - follow up in one shot global_dirty_limit can effectively mask out the impact of sudden drop of dirtyable memory. It will be used in the next patch for two new type of dirty limits. Note that the new dirty limits are not going to avoid throttling the light dirtiers, but could limit their sleep time to 200ms. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: consolidate variable names in balance_dirty_pages()Wu Fengguang
Introduce nr_dirty = NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS in order to simplify many tests in the following patches. balance_dirty_pages() will eventually care only about the dirty sums besides nr_writeback. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: bdi write bandwidth estimationWu Fengguang
The estimation value will start from 100MB/s and adapt to the real bandwidth in seconds. It tries to update the bandwidth only when disk is fully utilized. Any inactive period of more than one second will be skipped. The estimated bandwidth will be reflecting how fast the device can writeout when _fully utilized_, and won't drop to 0 when it goes idle. The value will remain constant at disk idle time. At busy write time, if not considering fluctuations, it will also remain high unless be knocked down by possible concurrent reads that compete for the disk time and bandwidth with async writes. The estimation is not done purely in the flusher because there is no guarantee for write_cache_pages() to return timely to update bandwidth. The bdi->avg_write_bandwidth smoothing is very effective for filtering out sudden spikes, however may be a little biased in long term. The overheads are low because the bdi bandwidth update only occurs at 200ms intervals. The 200ms update interval is suitable, because it's not possible to get the real bandwidth for the instance at all, due to large fluctuations. The NFS commits can be as large as seconds worth of data. One XFS completion may be as large as half second worth of data if we are going to increase the write chunk to half second worth of data. In ext4, fluctuations with time period of around 5 seconds is observed. And there is another pattern of irregular periods of up to 20 seconds on SSD tests. That's why we are not only doing the estimation at 200ms intervals, but also averaging them over a period of 3 seconds and then go further to do another level of smoothing in avg_write_bandwidth. CC: Li Shaohua <shaohua.li@intel.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: account per-bdi accumulated written pagesJan Kara
Introduce the BDI_WRITTEN counter. It will be used for estimating the bdi's write bandwidth. Peter Zijlstra <a.p.zijlstra@chello.nl>: Move BDI_WRITTEN accounting into __bdi_writeout_inc(). This will cover and fix fuse, which only calls bdi_writeout_inc(). CC: Michael Rubin <mrubin@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-07-09writeback: make writeback_control.nr_to_write straightWu Fengguang
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(), and initialize the struct writeback_control there. struct writeback_control is basically designed to control writeback of a single file, but we keep abuse it for writing multiple files in writeback_sb_inodes() and its callers. It immediately clean things up, e.g. suddenly wbc.nr_to_write vs work->nr_pages starts to make sense, and instead of saving and restoring pages_skipped in writeback_sb_inodes it can always start with a clean zero value. It also makes a neat IO pattern change: large dirty files are now written in the full 4MB writeback chunk size, rather than whatever remained quota in wbc->nr_to_write. Acked-by: Jan Kara <jack@suse.cz> Proposed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-20writeback: skip tmpfs early in balance_dirty_pages_ratelimited_nr()Wu Fengguang
This helps prevent tmpfs dirtiers from skewing the per-cpu bdp_ratelimits. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: skip balance_dirty_pages() for in-memory fsWu Fengguang
This avoids unnecessary checks and dirty throttling on tmpfs/ramfs. Notes about the tmpfs/ramfs behavior changes: As for 2.6.36 and older kernels, the tmpfs writes will sleep inside balance_dirty_pages() as long as we are over the (dirty+background)/2 global throttle threshold. This is because both the dirty pages and threshold will be 0 for tmpfs/ramfs. Hence this test will always evaluate to TRUE: dirty_exceeded = (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh) || (nr_reclaimable + nr_writeback >= dirty_thresh); For 2.6.37, someone complained that the current logic does not allow the users to set vm.dirty_ratio=0. So commit 4cbec4c8b9 changed the test to dirty_exceeded = (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh) || (nr_reclaimable + nr_writeback > dirty_thresh); So 2.6.37 will behave differently for tmpfs/ramfs: it will never get throttled unless the global dirty threshold is exceeded (which is very unlikely to happen; once happen, will block many tasks). I'd say that the 2.6.36 behavior is very bad for tmpfs/ramfs. It means for a busy writing server, tmpfs write()s may get livelocked! The "inadvertent" throttling can hardly bring help to any workload because of its "either no throttling, or get throttled to death" property. So based on 2.6.37, this patch won't bring more noticeable changes. CC: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: add bdi_dirty_limit() kernel-docWu Fengguang
Clarify the bdi_dirty_limit() comment. Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-06-08writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stageWu Fengguang
sync(2) is performed in two stages: the WB_SYNC_NONE sync and the WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and do livelock prevention for it, too. Jan's commit f446daaea9 ("mm: implement writeback livelock avoidance using page tagging") is a partial fix in that it only fixed the WB_SYNC_ALL phase livelock. Although ext4 is tested to no longer livelock with commit f446daaea9, it may due to some "redirty_tail() after pages_skipped" effect which is by no means a guarantee for _all_ the file systems. Note that writeback_inodes_sb() is called by not only sync(), they are treated the same because the other callers also need livelock prevention. Impact: It changes the order in which pages/inodes are synced to disk. Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode until finished with the current inode. Acked-by: Jan Kara <jack@suse.cz> CC: Dave Chinner <david@fromorbit.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-03-24Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}
2011-03-22writeback: make mapping->writeback_index to point to the last written pageJun'ichi Nomura
For range-cyclic writeback (e.g. kupdate), the writeback code sets a continuation point of the next writeback to mapping->writeback_index which is set the page after the last written page. This happens so that we evenly write the whole file even if pages in it get continuously redirtied. However, in some cases, sequential writer is writing in the middle of the page and it just redirties the last written page by continuing from that. For example with an application which uses a file as a big ring buffer we see: [1st writeback session] ... flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898514 + 8 flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898522 + 8 flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898530 + 8 flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898538 + 8 flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898546 + 8 kworker/0:1-11 4571: block_rq_issue: 8,0 W 0 () 94898514 + 40 >> flush-8:0-2743 4571: block_bio_queue: 8,0 W 94898554 + 8 >> flush-8:0-2743 4571: block_rq_issue: 8,0 W 0 () 94898554 + 8 [2nd writeback session after 35sec] flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898562 + 8 flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898570 + 8 flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898578 + 8 ... kworker/0:1-11 4606: block_rq_issue: 8,0 W 0 () 94898562 + 640 kworker/0:1-11 4606: block_rq_issue: 8,0 W 0 () 94899202 + 72 ... flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899962 + 8 flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899970 + 8 flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899978 + 8 flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899986 + 8 flush-8:0-2743 4606: block_bio_queue: 8,0 W 94899994 + 8 kworker/0:1-11 4606: block_rq_issue: 8,0 W 0 () 94899962 + 40 >> flush-8:0-2743 4606: block_bio_queue: 8,0 W 94898554 + 8 >> flush-8:0-2743 4606: block_rq_issue: 8,0 W 0 () 94898554 + 8 So we seeked back to 94898554 after we wrote all the pages at the end of the file. This extra seek seems unnecessary. If we continue writeback from the last written page, we can avoid it and do not cause harm to other cases. The original intent of even writeout over the whole file is preserved and if the page does not get redirtied pagevec_lookup_tag() just skips it. As an exceptional case, when I/O error happens, set done_index to the next page as the comment in the code suggests. Tested-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22mm: reclaim invalidated page ASAPMinchan Kim
invalidate_mapping_pages is very big hint to reclaimer. It means user doesn't want to use the page any more. So in order to prevent working set page eviction, this patch move the page into tail of inactive list by PG_reclaim. Please, remember that pages in inactive list are working set as well as active list. If we don't move pages into inactive list's tail, pages near by tail of inactive list can be evicted although we have a big clue about useless pages. It's totally bad. Now PG_readahead/PG_reclaim is shared. fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for preventing fast reclaiming readahead marker page. In this series, PG_reclaim is used by invalidated page, too. If VM find the page is invalidated and it's dirty, it sets PG_reclaim to reclaim asap. Then, when the dirty page will be writeback, clear_page_dirty_for_io will clear PG_reclaim unconditionally. It disturbs this serie's goal. I think it's okay to clear PG_readahead when the page is dirty, not writeback time. So this patch moves ClearPageReadahead. In v4, ClearPageReadahead in set_page_dirty has a problem which is reported by Steven Barrett. It's due to compound page. Some driver(ex, audio) calls set_page_dirty with compound page which isn't on LRU. but my patch does ClearPageRelcaim on compound page. In non-CONFIG_PAGEFLAGS_EXTENDED, it breaks PageTail flag. I think it doesn't affect THP and pass my test with THP enabling but Cced Andrea for double check. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reported-by: Steven Barrett <damentz@liquorix.net> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-17mm: make generic_writepages() use pluggingShaohua Li
This recovers a performance regression caused by the removal of the per-device plugging. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10block: remove per-queue pluggingJens Axboe
Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-01-13writeback: avoid unnecessary determine_dirtyable_memory callMinchan Kim
I think determine_dirtyable_memory() is a rather costly function since it need many atomic reads for gathering zone/global page state. But when we use vm_dirty_bytes && dirty_background_bytes, we don't need that costly calculation. This patch eliminates such unnecessary overhead. NOTE : newly added if condition might add overhead in normal path. But it should be _really_ small because anyway we need the access both vm_dirty_bytes and dirty_background_bytes so it is likely to hit the cache. [akpm@linux-foundation.org: fix used-uninitialised warning] Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13mm/page-writeback.c: fix __set_page_dirty_no_writeback() return valueBob Liu
__set_page_dirty_no_writeback() should return true if it actually transitioned the page from a clean to dirty state although it seems nobody uses its return value at present. Signed-off-by: Bob Liu <lliubbo@gmail.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) Documentation/trace/events.txt: Remove obsolete sched_signal_send. writeback: fix global_dirty_limits comment runtime -> real-time ppc: fix comment typo singal -> signal drivers: fix comment typo diable -> disable. m68k: fix comment typo diable -> disable. wireless: comment typo fix diable -> disable. media: comment typo fix diable -> disable. remove doc for obsolete dynamic-printk kernel-parameter remove extraneous 'is' from Documentation/iostats.txt Fix spelling milisec -> ms in snd_ps3 module parameter description Fix spelling mistakes in comments Revert conflicting V4L changes i7core_edac: fix typos in comments mm/rmap.c: fix comment sound, ca0106: Fix assignment to 'channel'. hrtimer: fix a typo in comment init/Kconfig: fix typo anon_inodes: fix wrong function name in comment fix comment typos concerning "consistent" poll: fix a typo in comment ... Fix up trivial conflicts in: - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c) - fs/ext4/ext4.h Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.
2011-01-04writeback: fix global_dirty_limits comment runtime -> real-timeMinchan Kim
Change runtime with real-time Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-12-22writeback: do uninterruptible sleep in balance_dirty_pages()Wu Fengguang
Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong. If it's going to do that then it must break out if signal_pending(), otherwise it's pretty much guaranteed to degenerate into a busywait loop. Plus we *do* want these processes to appear in D state and to contribute to load average. So it should be TASK_UNINTERRUPTIBLE. -- Andrew Morton Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26writeback: remove the internal 5% low bound on dirty_ratioWu Fengguang
The dirty_ratio was silently limited in global_dirty_limits() to >= 5%. This is not a user expected behavior. And it's inconsistent with calc_period_shift(), which uses the plain vm_dirty_ratio value. Let's remove the internal bound. At the same time, fix balance_dirty_pages() to work with the dirty_thresh=0 case. This allows applications to proceed when dirty+writeback pages are all cleaned. And ">" fits with the name "exceeded" better than ">=" does. Neil thinks it is an aesthetic improvement as well as a functional one :) Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Proposed-by: Con Kolivas <kernel@kolivas.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Neil Brown <neilb@suse.de> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michael Rubin <mrubin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26writeback: add nr_dirtied and nr_written to /proc/vmstatMichael Rubin
To help developers and applications gain visibility into writeback behaviour adding two entries to vm_stat_items and /proc/vmstat. This will allow us to track the "written" and "dirtied" counts. # grep nr_dirtied /proc/vmstat nr_dirtied 3747 # grep nr_written /proc/vmstat nr_written 3618 Signed-off-by: Michael Rubin <mrubin@google.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26mm: add account_page_writeback()Michael Rubin
To help developers and applications gain visibility into writeback behaviour this patch adds two counters to /proc/vmstat. # grep nr_dirtied /proc/vmstat nr_dirtied 3747 # grep nr_written /proc/vmstat nr_written 3618 These entries allow user apps to understand writeback behaviour over time and learn how it is impacting their performance. Currently there is no way to inspect dirty and writeback speed over time. It's not possible for nr_dirty/nr_writeback. These entries are necessary to give visibility into writeback behaviour. We have /proc/diskstats which lets us understand the io in the block layer. We have blktrace for more in depth understanding. We have e2fsprogs and debugsfs to give insight into the file systems behaviour, but we don't offer our users the ability understand what writeback is doing. There is no way to know how active it is over the whole system, if it's falling behind or to quantify it's efforts. With these values exported users can easily see how much data applications are sending through writeback and also at what rates writeback is processing this data. Comparing the rates of change between the two allow developers to see when writeback is not able to keep up with incoming traffic and the rate of dirty memory being sent to the IO back end. This allows folks to understand their io workloads and track kernel issues. Non kernel engineers at Google often use these counters to solve puzzling performance problems. Patch #4 adds a pernode vmstat file with nr_dirtied and nr_written Patch #5 add writeback thresholds to /proc/vmstat Currently these values are in debugfs. But they should be promoted to /proc since they are useful for developers who are writing databases and file servers and are not debugging the kernel. The output is as below: # grep threshold /proc/vmstat nr_pages_dirty_threshold 409111 nr_pages_dirty_background_threshold 818223 This patch: This allows code outside of the mm core to safely manipulate page writeback state and not worry about the other accounting. Not using these routines means that some code will lose track of the accounting and we get bugs. Modify nilfs2 to use interface. Signed-off-by: Michael Rubin <mrubin@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Jiro SEKIBA <jir@unicus.jp> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-28Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: fix get_ticket_handler() error handling ceph: don't BUG on ENOMEM during mds reconnect ceph: ceph_mdsc_build_path() returns an ERR_PTR ceph: Fix warnings ceph: ceph_get_inode() returns an ERR_PTR ceph: initialize fields on new dentry_infos ceph: maintain i_head_snapc when any caps are dirty, not just for data ceph: fix osd request lru adjustment when sending request ceph: don't improperly set dir complete when holding EXCL cap mm: exporting account_page_dirty ceph: direct requests in snapped namespace based on nonsnap parent ceph: queue cap snap writeback for realm children on snap update ceph: include dirty xattrs state in snapped caps ceph: fix xattr cap writeback ceph: fix multiple mds session shutdown
2010-08-24writeback: write_cache_pages doesn't terminate at nr_to_write <= 0Dave Chinner
I noticed XFS writeback in 2.6.36-rc1 was much slower than it should have been. Enabling writeback tracing showed: flush-253:16-8516 [007] 1342952.351608: wbc_writepage: bdi 253:16: towrt=1024 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0 flush-253:16-8516 [007] 1342952.351654: wbc_writepage: bdi 253:16: towrt=1023 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0 flush-253:16-8516 [000] 1342952.369520: wbc_writepage: bdi 253:16: towrt=0 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0 flush-253:16-8516 [000] 1342952.369542: wbc_writepage: bdi 253:16: towrt=-1 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0 flush-253:16-8516 [000] 1342952.369549: wbc_writepage: bdi 253:16: towrt=-2 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0 Writeback is not terminating in background writeback if ->writepage is returning with wbc->nr_to_write == 0, resulting in sub-optimal single page writeback on XFS. Fix the write_cache_pages loop to terminate correctly when this situation occurs and so prevent this sub-optimal background writeback pattern. This improves sustained sequential buffered write performance from around 250MB/s to 750MB/s for a 100GB file on an XFS filesystem on my 8p test VM. Cc:<stable@kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-08-22mm: exporting account_page_dirtyMichael Rubin
This allows code outside of the mm core to safely manipulate page state and not worry about the other accounting. Not using these routines means that some code will lose track of the accounting and we get bugs. This has happened once already. Signed-off-by: Michael Rubin <mrubin@google.com> Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-20lib/radix-tree.c: fix overflow in radix_tree_range_tag_if_tagged()Jan Kara
When radix_tree_maxindex() is ~0UL, it can happen that scanning overflows index and tree traversal code goes astray reading memory until it hits unreadable memory. Check for overflow and exit in that case. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-14mm/page-writeback: fix non-kernel-doc function commentsRandy Dunlap
Remove leading /** from non-kernel-doc function comments to prevent kernel-doc warnings. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12writeback: add comment to the dirty limit functionsWu Fengguang
Document global_dirty_limits() and bdi_dirty_limit(). Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12writeback: avoid unnecessary calculation of bdi dirty thresholdsWu Fengguang
Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so that the latter can be avoided when under global dirty background threshold (which is the normal state for most systems). Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12writeback: balance_dirty_pages(): reduce calls to global_page_stateWu Fengguang
Reducing the number of times balance_dirty_pages calls global_page_state reduces the cache references and so improves write performance on a variety of workloads. 'perf stats' of simple fio write tests shows the reduction in cache access. Where the test is fio 'write,mmap,600Mb,pre_read' on AMD AthlonX2 with 3Gb memory (dirty_threshold approx 600 Mb) running each test 10 times, dropping the fasted & slowest values then taking the average & standard deviation average (s.d.) in millions (10^6) 2.6.31-rc8 648.6 (14.6) +patch 620.1 (16.5) Achieving this reduction is by dropping clip_bdi_dirty_limit as it rereads the counters to apply the dirty_threshold and moving this check up into balance_dirty_pages where it has already read the counters. Also by rearrange the for loop to only contain one copy of the limit tests allows the pdflush test after the loop to use the local copies of the counters rather than rereading them. In the common case with no throttling it now calls global_page_state 5 fewer times and bdi_stat 2 fewer. Fengguang: This patch slightly changes behavior by replacing clip_bdi_dirty_limit() with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh) to avoid exceeding the dirty limit. Since the bdi dirty limit is mostly accurate we don't need to do routinely clip. A simple dirty limit check would be enough. The check is necessary because, in principle we should throttle everything calling balance_dirty_pages() when we're over the total limit, as said by Peter. We now set and clear dirty_exceeded not only based on bdi dirty limits, but also on the global dirty limit. The global limit check is added in place of clip_bdi_dirty_limit() for safety and not intended as a behavior change. The bdi limits should be tight enough to keep all dirty pages under the global limit at most time; occasional small exceeding should be OK though. The change makes the logic more obvious: the global limit is the ultimate goal and shall be always imposed. We may now start background writeback work based on outdated conditions. That's safe because the bdi flush thread will (and have to) double check the states. It reduces overall overheads because the test based on old states still have good chance to be right. [akpm@linux-foundation.org] fix uninitialized dirty_exceeded Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12mm: fix fatal kernel-doc errorRandy Dunlap
Fix a fatal kernel-doc error due to a #define coming between a function's kernel-doc notation and the function signature. (kernel-doc cannot handle this) Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-10Merge branch 'for-2.6.36' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits) block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n xen-blkfront: fix missing out label blkdev: fix blkdev_issue_zeroout return value block: update request stacking methods to support discards block: fix missing export of blk_types.h writeback: fix bad _bh spinlock nesting drbd: revert "delay probes", feature is being re-implemented differently drbd: Initialize all members of sync_conf to their defaults [Bugz 315] drbd: Disable delay probes for the upcomming release writeback: cleanup bdi_register writeback: add new tracepoints writeback: remove unnecessary init_timer call writeback: optimize periodic bdi thread wakeups writeback: prevent unnecessary bdi threads wakeups writeback: move bdi threads exiting logic to the forker thread writeback: restructure bdi forker loop a little writeback: move last_active to bdi writeback: do not remove bdi from bdi_list writeback: simplify bdi code a little writeback: do not lose wake-ups in bdi threads ... Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and drivers/scsi/scsi_error.c as per Jens.
2010-08-09mm: implement writeback livelock avoidance using page taggingJan Kara
We try to avoid livelocks of writeback when some steadily creates dirty pages in a mapping we are writing out. For memory-cleaning writeback, using nr_to_write works reasonably well but we cannot really use it for data integrity writeback. This patch tries to solve the problem. The idea is simple: Tag all pages that should be written back with a special tag (TOWRITE) in the radix tree. This can be done rather quickly and thus livelocks should not happen in practice. Then we start doing the hard work of locking pages and sending them to disk only for those pages that have TOWRITE tag set. Note: Adding new radix tree tag grows radix tree node from 288 to 296 bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs. However, the number of slab/slub items per page remains the same (13 and 7 respectively). Signed-off-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Chris Mason <chris.mason@oracle.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-07writeback: Add tracing to write_cache_pagesDave Chinner
Add a trace event to the ->writepage loop in write_cache_pages to give visibility into how the ->writepage call is changing variables within the writeback control structure. Of most interest is how wbc->nr_to_write changes from call to call, especially with filesystems that write multiple pages in ->writepage. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: Add tracing to balance_dirty_pagesDave Chinner
Tracing high level background writeback events is good, but it doesn't give the entire picture. Add visibility into write throttling to catch IO dispatched by foreground throttling of processing dirtying lots of pages. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-07-06writeback: remove writeback_inodes_wbcChristoph Hellwig
This was just an odd wrapper around writeback_inodes_wb. Removing this also allows to get rid of the bdi member of struct writeback_control which was rather out of place there. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-11writeback: simplify and split bdi_start_writebackChristoph Hellwig
bdi_start_writeback now never gets a superblock passed, so we can just remove that case. And to further untangle the code and flatten the call stack split it into two trivial helpers for it's two callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-08writeback: limit write_cache_pages integrity scanning to current EOFDave Chinner
sync can currently take a really long time if a concurrent writer is extending a file. The problem is that the dirty pages on the address space grow in the same direction as write_cache_pages scans, so if the writer keeps ahead of writeback, the writeback will not terminate until the writer stops adding dirty pages. For a data integrity sync, we only need to write the pages dirty at the time we start the writeback, so we can stop scanning once we get to the page that was at the end of the file at the time the scan started. This will prevent operations like copying a large file preventing sync from completing as it will not write back pages that were dirtied after the sync was started. This does not impact the existing integrity guarantees, as any dirty page (old or new) within the EOF range at the start of the scan will still be captured. This patch will not prevent sync from blocking on large writes into holes. That requires more complex intervention while this patch only addresses the common append-case of this sync holdoff. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-06-08writeback: pay attention to wbc->nr_to_write in write_cache_pagesDave Chinner
If a filesystem writes more than one page in ->writepage, write_cache_pages fails to notice this and continues to attempt writeback when wbc->nr_to_write has gone negative - this trace was captured from XFS: wbc_writeback_start: towrt=1024 wbc_writepage: towrt=1024 wbc_writepage: towrt=0 wbc_writepage: towrt=-1 wbc_writepage: towrt=-5 wbc_writepage: towrt=-21 wbc_writepage: towrt=-85 This has adverse effects on filesystem writeback behaviour. write_cache_pages() needs to terminate after a certain number of pages are written, not after a certain number of calls to ->writepage are made. This is a regression introduced by 17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4 ("vfs: Add no_nrwrite_index_update writeback control flag"), but cannot be reverted directly due to subsequent bug fixes that have gone in on top of it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-06-01Revert "writeback: fix WB_SYNC_NONE writeback from umount"Jens Axboe
This reverts commit e913fc825dc685a444cb4c1d0f9d32f372f59861. We are investigating a hang associated with the WB_SYNC_NONE changes, so revert them for now. Conflicts: fs/fs-writeback.c mm/page-writeback.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-05-21writeback: fix mixed up arguments to bdi_start_writeback()Jens Axboe
The laptop mode timer had the nr_pages and sb_locked arguments mixed up. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-21writeback: fix problem with !CONFIG_BLOCK compilationJens Axboe
When CONFIG_BLOCK isn't enabled: mm/page-writeback.c: In function 'laptop_mode_timer_fn': mm/page-writeback.c:708: error: dereferencing pointer to incomplete type mm/page-writeback.c:709: error: dereferencing pointer to incomplete type Fix this by essentially eliminating the laptop sync handlers when CONFIG_BLOCK isn't set, as most are only used from the block layer code. The exception is laptop_sync_completion() which is used from sys_sync(), make that an empty declaration in that case. Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-21writeback: fixups for !dirty_writeback_centisecsJens Axboe
Commit 69b62d01 fixed up most of the places where we would enter busy schedule() spins when disabling the periodic background writeback. This fixes up the sb timer so that it doesn't get hammered on with the delay disabled, and ensures that it gets rearmed if needed when /proc/sys/vm/dirty_writeback_centisecs gets modified. bdi_forker_task() also needs to check for !dirty_writeback_centisecs and use schedule() appropriately, fix that up too. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-17writeback: fix WB_SYNC_NONE writeback from umountJens Axboe
When umount calls sync_filesystem(), we first do a WB_SYNC_NONE writeback to kick off writeback of pending dirty inodes, then follow that up with a WB_SYNC_ALL to wait for it. Since umount already holds the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all writeback happens as WB_SYNC_ALL. This can greatly slow down umount, since WB_SYNC_ALL writeback is a data integrity operation and thus a bigger hammer than simple WB_SYNC_NONE. For barrier aware file systems it's a lot slower. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-06laptop-mode: Make flushes per-deviceMatthew Garrett
One of the features of laptop-mode is that it forces a writeout of dirty pages if something else triggers a physical read or write from a device. The current implementation flushes pages on all devices, rather than only the one that triggered the flush. This patch alters the behaviour so that only the recently accessed block device is flushed, preventing other disks being spun up for no terribly good reason. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-12-03writeback: remove unused nonblocking and congestion checksWu Fengguang
- no one is calling wb_writeback and write_cache_pages with wbc.nonblocking=1 any more - lumpy pageout will want to do nonblocking writeback without the congestion wait So remove the congestion checks as suggested by Chris. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Evgeniy Polyakov <zbr@ioremap.net> Cc: Alex Elder <aelder@sgi.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-09writeback: account IO throttling wait as iowaitWu Fengguang
It makes sense to do IOWAIT when someone is blocked due to IO throttle, as suggested by Kame and Peter. There is an old comment for not doing IOWAIT on throttle, however it has been mismatching the code for a long time. If we stop accounting IOWAIT for 2.6.32, it could be an undesirable behavior change. So restore the io_schedule. CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-26writeback: pass in super_block to bdi_start_writeback()Jens Axboe
Sometimes we only want to write pages from a specific super_block, so allow that to be passed in. This fixes a problem with commit 56a131dcf7ed36c3c6e36bea448b674ea85ed5bb causing writeback on all super_blocks on a bdi, where we only really want to sync a specific sb from writeback_inodes_sb(). Signed-off-by: Jens Axboe <jens.axboe@oracle.com>