aboutsummaryrefslogtreecommitdiff
path: root/fs/fs-writeback.c
AgeCommit message (Collapse)Author
2010-10-04writeback: always use sb->s_bdi for writeback purposesChristoph Hellwig
We currently use struct backing_dev_info for various different purposes. Originally it was introduced to describe a backing device which includes an unplug and congestion function and various bits of readahead information and VM-relevant flags. We're also using for tracking dirty inodes for writeback. To make writeback properly find all inodes we need to only access the per-filesystem backing_device pointed to by the superblock in ->s_bdi inside the writeback code, and not the instances pointeded to by inode->i_mapping->backing_dev which can be overriden by special devices or might not be set at all by some filesystems. Long term we should split out the writeback-relevant bits of struct backing_device_info (which includes more than the current bdi_writeback) and only point to it from the superblock while leaving the traditional backing device as a separate structure that can be overriden by devices. The one exception for now is the block device filesystem which really wants different writeback contexts for it's different (internal) inodes to handle the writeout more efficiently. For now we do this with a hack in fs-writeback.c because we're so late in the cycle, but in the future I plan to replace this with a superblock method that allows for multiple writeback contexts per filesystem. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-22bdi: Fix warnings in __mark_inode_dirty for /dev/zero and friendsJan Kara
Inodes of devices such as /dev/zero can get dirty for example via utime(2) syscall or due to atime update. Backing device of such inodes (zero_bdi, etc.) is however unable to handle dirty inodes and thus __mark_inode_dirty complains. In fact, inode should be rather dirtied against backing device of the filesystem holding it. This is generally a good rule except for filesystems such as 'bdev' or 'mtd_inodefs'. Inodes in these pseudofilesystems are referenced from ordinary filesystem inodes and carry mapping with real data of the device. Thus for these inodes we have to use inode->i_mapping->backing_dev_info as we did so far. We distinguish these filesystems by checking whether sb->s_bdi points to a non-trivial backing device or not. Example: Assume we have an ext3 filesystem on /dev/sda1 mounted on /. There's a device inode A described by a path "/dev/sdb" on this filesystem. This inode will be dirtied against backing device "8:0" after this patch. bdev filesystem contains block device inode B coupled with our inode A. When someone modifies a page of /dev/sdb, it's B that gets dirtied and the dirtying happens against the backing device "8:16". Thus both inodes get filed to a correct bdi list. Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-28writeback: Fix lost wake-up shutting down writeback threadJ. Bruce Fields
Setting the task state here may cause us to miss the wake up from kthread_stop(), so we need to recheck kthread_should_stop() or risk sleeping forever in the following schedule(). Symptom was an indefinite hang on an NFSv4 mount. (NFSv4 may create multiple mounts in a temporary namespace while traversing the mount path, and since the temporary namespace is immediately destroyed, it may end up destroying a mount very soon after it was created, possibly making this race more likely.) INFO: task mount.nfs4:4314 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. mount.nfs4 D 0000000000000000 2880 4314 4313 0x00000000 ffff88001ed6da28 0000000000000046 ffff88001ed6dfd8 ffff88001ed6dfd8 ffff88001ed6c000 ffff88001ed6c000 ffff88001ed6c000 ffff88001e5003a0 ffff88001ed6dfd8 ffff88001e5003a8 ffff88001ed6c000 ffff88001ed6dfd8 Call Trace: [<ffffffff8196090d>] schedule_timeout+0x1cd/0x2e0 [<ffffffff8106a31c>] ? mark_held_locks+0x6c/0xa0 [<ffffffff819639a0>] ? _raw_spin_unlock_irq+0x30/0x60 [<ffffffff8106a5fd>] ? trace_hardirqs_on_caller+0x14d/0x190 [<ffffffff819671fe>] ? sub_preempt_count+0xe/0xd0 [<ffffffff8195fc80>] wait_for_common+0x120/0x190 [<ffffffff81033c70>] ? default_wake_function+0x0/0x20 [<ffffffff8195fdcd>] wait_for_completion+0x1d/0x20 [<ffffffff810595fa>] kthread_stop+0x4a/0x150 [<ffffffff81061a60>] ? thaw_process+0x70/0x80 [<ffffffff810cc68a>] bdi_unregister+0x10a/0x1a0 [<ffffffff81229dc9>] nfs_put_super+0x19/0x20 [<ffffffff810ee8c4>] generic_shutdown_super+0x54/0xe0 [<ffffffff810ee9b6>] kill_anon_super+0x16/0x60 [<ffffffff8122d3b9>] nfs4_kill_super+0x39/0x90 [<ffffffff810eda45>] deactivate_locked_super+0x45/0x60 [<ffffffff810edfb9>] deactivate_super+0x49/0x70 [<ffffffff81108294>] mntput_no_expire+0x84/0xe0 [<ffffffff811084ef>] release_mounts+0x9f/0xc0 [<ffffffff81108575>] put_mnt_ns+0x65/0x80 [<ffffffff8122cc56>] nfs_follow_remote_path+0x1e6/0x420 [<ffffffff8122cfbf>] nfs4_try_mount+0x6f/0xd0 [<ffffffff8122d0c2>] nfs4_get_sb+0xa2/0x360 [<ffffffff810edcb8>] vfs_kern_mount+0x88/0x1f0 [<ffffffff810ede92>] do_kern_mount+0x52/0x130 [<ffffffff81963d9a>] ? _lock_kernel+0x6a/0x170 [<ffffffff81108e9e>] do_mount+0x26e/0x7f0 [<ffffffff81106b3a>] ? copy_mount_options+0xea/0x190 [<ffffffff811094b8>] sys_mount+0x98/0xf0 [<ffffffff810024d8>] system_call_fastpath+0x16/0x1b 1 lock held by mount.nfs4/4314: #0: (&type->s_umount_key#24){+.+...}, at: [<ffffffff810edfb1>] deactivate_super+0x41/0x70 Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com> Acked-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2010-08-12mm: fix writeback_in_progress()Jan Kara
Commit 83ba7b071f3 ("writeback: simplify the write back thread queue") broke writeback_in_progress() as in that commit we started to remove work items from the list at the moment we start working on them and not at the moment they are finished. Thus if the flusher thread was doing some work but there was no other work queued, writeback_in_progress() returned false. This could in particular cause unnecessary queueing of background writeback from balance_dirty_pages() or writeout work from writeback_sb_if_idle(). This patch fixes the problem by introducing a bit in the bdi state which indicates that the flusher thread is processing some work and uses this bit for writeback_in_progress() test. NOTE: Both callsites of writeback_in_progress() (namely, writeback_inodes_sb_if_idle() and balance_dirty_pages()) would actually need a different information than what writeback_in_progress() provides. They would need to know whether *the kind of writeback they are going to submit* is already queued. But this information isn't that simple to provide so let's fix writeback_in_progress() for the time being. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12writeback: merge for_kupdate and !for_kupdate casesWu Fengguang
Unify the logic for kupdate and non-kupdate cases. There won't be starvation because the inodes requeued into b_more_io will later be spliced _after_ the remaining inodes in b_io, hence won't stand in the way of other inodes in the next run. It avoids unnecessary redirty_tail() calls, hence the update of i_dirtied_when. The timestamp update is undesirable because it could later delay the inode's periodic writeback, or may exclude the inode from the data integrity sync operation (which checks timestamp to avoid extra work and livelock). === How the redirty_tail() comes about: It was a long story.. This redirty_tail() was introduced with wbc.more_io. The initial patch for more_io actually does not have the redirty_tail(), and when it's merged, several 100% iowait bug reports arised: reiserfs: http://lkml.org/lkml/2007/10/23/93 jfs: commit 29a424f28390752a4ca2349633aaacc6be494db5 JFS: clear PAGECACHE_TAG_DIRTY for no-write pages ext2: http://www.spinics.net/linux/lists/linux-ext4/msg04762.html They are all old bugs hidden in various filesystems that become "visible" with the more_io patch. At the time, the ext2 bug is thought to be "trivial", so not fixed. Instead the following updated more_io patch with redirty_tail() is merged: http://www.spinics.net/linux/lists/linux-ext4/msg04507.html This will in general prevent 100% on ext2 and possibly other unknown FS bugs. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Martin Bligh <mbligh@google.com> Cc: Michael Rubin <mrubin@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12writeback: fix queue_io() orderingWu Fengguang
This was not a bug, since b_io is empty for kupdate writeback. The next patch will do requeue_io() for non-kupdate writeback, so let's fix it. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Martin Bligh <mbligh@google.com> Cc: Michael Rubin <mrubin@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12writeback: don't redirty tail an inode with dirty pagesWu Fengguang
Avoid delaying writeback for an expire inode with lots of dirty pages, but no active dirtier at the moment. Previously we only do that for the kupdate case. Any filesystem that does delayed allocation or unwritten extent conversion after IO completion will cause this - for example, XFS. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-12writeback: avoid unnecessary calculation of bdi dirty thresholdsWu Fengguang
Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so that the latter can be avoided when under global dirty background threshold (which is the normal state for most systems). Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-10Merge branch 'for-2.6.36' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits) block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n xen-blkfront: fix missing out label blkdev: fix blkdev_issue_zeroout return value block: update request stacking methods to support discards block: fix missing export of blk_types.h writeback: fix bad _bh spinlock nesting drbd: revert "delay probes", feature is being re-implemented differently drbd: Initialize all members of sync_conf to their defaults [Bugz 315] drbd: Disable delay probes for the upcomming release writeback: cleanup bdi_register writeback: add new tracepoints writeback: remove unnecessary init_timer call writeback: optimize periodic bdi thread wakeups writeback: prevent unnecessary bdi threads wakeups writeback: move bdi threads exiting logic to the forker thread writeback: restructure bdi forker loop a little writeback: move last_active to bdi writeback: do not remove bdi from bdi_list writeback: simplify bdi code a little writeback: do not lose wake-ups in bdi threads ... Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and drivers/scsi/scsi_error.c as per Jens.
2010-08-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits) no need for list_for_each_entry_safe()/resetting with superblock list Fix sget() race with failing mount vfs: don't hold s_umount over close_bdev_exclusive() call sysv: do not mark superblock dirty on remount sysv: do not mark superblock dirty on mount btrfs: remove junk sb_dirt change BFS: clean up the superblock usage AFFS: wait for sb synchronization when needed AFFS: clean up dirty flag usage cifs: truncate fallout mbcache: fix shrinker function return value mbcache: Remove unused features add f_flags to struct statfs(64) pass a struct path to vfs_statfs update VFS documentation for method changes. All filesystems that need invalidate_inode_buffers() are doing that explicitly convert remaining ->clear_inode() to ->evict_inode() Make ->drop_inode() just return whether inode needs to be dropped fs/inode.c:clear_inode() is gone fs/inode.c:evict() doesn't care about delete vs. non-delete paths now ... Fix up trivial conflicts in fs/nilfs2/super.c
2010-08-09mm: avoid resetting wb_start after each writeback roundJan Kara
WB_SYNC_NONE writeback is done in rounds of 1024 pages so that we don't write out some huge inode for too long while starving writeout of other inodes. To avoid livelocks, we record time we started writeback in wbc->wb_start and do not write out inodes which were dirtied after this time. But currently, writeback_inodes_wb() resets wb_start each time it is called thus effectively invalidating this logic and making any WB_SYNC_NONE writeback prone to livelocks. This patch makes sure wb_start is set only once when we start writeback. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Christoph Hellwig <hch@lst.de> Acked-by: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09simplify checks for I_CLEAR/I_FREEINGAl Viro
add I_CLEAR instead of replacing I_FREEING with it. I_CLEAR is equivalent to I_FREEING for almost all code looking at either; it's there to keep track of having called clear_inode() exactly once per inode lifetime, at some point after having set I_FREEING. I_CLEAR and I_FREEING never get set at the same time with the current code, so we can switch to setting i_flags to I_FREEING | I_CLEAR instead of I_CLEAR without loss of information. As the result of such change, checks become simpler and the amount of code that needs to know about I_CLEAR shrinks a lot. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-07writeback: optimize periodic bdi thread wakeupsArtem Bityutskiy
Whe the first inode for a bdi is marked dirty, we wake up the bdi thread which should take care of the periodic background write-out. However, the write-out will actually start only 'dirty_writeback_interval' centisecs later, so we can delay the wake-up. This change was requested by Nick Piggin who pointed out that if we delay the wake-up, we weed out 2 unnecessary contex switches, which matters because '__mark_inode_dirty()' is a hot-path function. This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which sets up a timer to wake-up the bdi thread and returns. So the wake-up is delayed. We also delete the timer in bdi threads just before writing-back. And synchronously delete it when unregistering bdi. At the unregister point the bdi does not have any users, so no one can arm it again. Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes this change as well. This patch also moves the 'bdi_wb_init()' function down in the file to avoid forward-declaration of 'bdi_wakeup_thread_delayed()'. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: prevent unnecessary bdi threads wakeupsArtem Bityutskiy
Finally, we can get rid of unnecessary wake-ups in bdi threads, which are very bad for battery-driven devices. There are two types of activities bdi threads do: 1. process bdi works from the 'bdi->work_list' 2. periodic write-back So there are 2 sources of wake-up events for bdi threads: 1. 'bdi_queue_work()' - submits bdi works 2. '__mark_inode_dirty()' - adds dirty I/O to bdi's The former already has bdi wake-up code. The latter does not, and this patch adds it. '__mark_inode_dirty()' is hot-path function, but this patch adds another 'spin_lock(&bdi->wb_lock)' there. However, it is taken only in rare cases when the bdi has no dirty inodes. So adding this spinlock should be fine and should not affect performance. This patch makes sure bdi threads and the forker thread do not wake-up if there is nothing to do. The forker thread will nevertheless wake up at least every 5 min. to check whether it has to kill a bdi thread. This can also be optimized, but is not worth it. This patch also tidies up the warning about unregistered bid, and turns it from an ugly crocodile to a simple 'WARN()' statement. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: move bdi threads exiting logic to the forker threadArtem Bityutskiy
Currently, bdi threads can decide to exit if there were no useful activities for 5 minutes. However, this causes nasty races: we can easily oops in the 'bdi_queue_work()' if the bdi thread decides to exit while we are waking it up. And even if we do not oops, but the bdi tread exits immediately after we wake it up, we'd lose the wake-up event and have an unnecessary delay (up to 5 secs) in the bdi work processing. This patch makes the forker thread to be the central place which not only creates bdi threads, but also kills them if they were inactive long enough. This better design-wise. Another reason why this change was done is to prepare for the further changes which will prevent the bdi threads from waking up every 5 sec and wasting power. Indeed, when the task does not wake up periodically anymore, it won't be able to exit either. This patch also moves the the 'wake_up_bit()' call from the bdi thread to the forker thread as well. So now the forker thread sets the BDI_pending bit, then forks the task or kills it, then clears the bit and wakes up the waiting process. The only process which may wain on the bit is 'bdi_wb_shutdown()'. This function was changed as well - now it first removes the bdi from the 'bdi_list', then waits on the 'BDI_pending' bit. Once it wakes up, it is guaranteed that the forker thread won't race with it, because the bdi is not visible. Note, the forker thread sets the 'BDI_pending' bit under the 'bdi->wb_lock' which is essential for proper serialization. And additionally, when we change 'bdi->wb.task', we now take the 'bdi->work_lock', to make sure that we do not lose wake-ups which we otherwise would when raced with, say, 'bdi_queue_work()'. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: move last_active to bdiArtem Bityutskiy
Currently bdi threads use local variable 'last_active' which stores last time when the bdi thread did some useful work. Move this local variable to 'struct bdi_writeback'. This is just a preparation for the further patches which will make the forker thread decide when bdi threads should be killed. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: do not remove bdi from bdi_listArtem Bityutskiy
The forker thread removes bdis from 'bdi_list' before forking the bdi thread. But this is wrong for at least 2 reasons. Reason #1: if we temporary remove a bdi from the list, we may miss works which would otherwise be given to us. Reason #2: this is racy; indeed, 'bdi_wb_shutdown()' expects that bdis are always in the 'bdi_list' (see 'bdi_remove_from_list()'), and when it races with the forker thread, it can shut down the bdi thread at the same time as the forker creates it. This patch makes sure the forker thread never removes bdis from 'bdi_list' (which was suggested by Christoph Hellwig). In order to make sure that we do not race with 'bdi_wb_shutdown()', we have to hold the 'bdi_lock' while walking the 'bdi_list' and setting the 'BDI_pending' flag. NOTE! The error path is interesting. Currently, when we fail to create a bdi thread, we move the bdi to the tail of 'bdi_list'. But if we never remove the bdi from the list, we cannot move it to the tail either, because then we can mess up the RCU readers which walk the list. And also, we'll have the race described above in "Reason #2". But I not think that adding to the tail is any important so I just do not do that. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: do not lose wake-ups in bdi threadsArtem Bityutskiy
Currently, bdi threads ('bdi_writeback_thread()') can lose wake-ups. For example, if 'bdi_queue_work()' is executed after the bdi thread have had finished 'wb_do_writeback()' but before it called 'schedule_timeout_interruptible()'. To fix this issue, we have to check whether we have works to process after we have changed the task state to 'TASK_INTERRUPTIBLE'. This patch also clean-ups handling of the cases when 'dirty_writeback_interval' is zero or non-zero. Additionally, this patch also removes unneeded 'list_empty_careful()' call. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: harmonize writeback threads namingArtem Bityutskiy
The write-back code mixes words "thread" and "task" for the same things. This is not a big deal, but still an inconsistency. hch: a convention I tend to use and I've seen in various places is to always use _task for the storage of the task_struct pointer, and thread everywhere else. This especially helps with having foo_thread for the actual thread and foo_task for a global variable keeping the task_struct pointer This patch renames: * 'bdi_add_default_flusher_task()' -> 'bdi_add_default_flusher_thread()' * 'bdi_forker_task()' -> 'bdi_forker_thread()' because bdi threads are 'bdi_writeback_thread()', so these names are more consistent. This patch also amends commentaries and makes them refer the forker and bdi threads as "thread", not "task". Also, while on it, make 'bdi_add_default_flusher_thread()' declaration use 'static void' instead of 'void static' and make checkpatch.pl happy. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: remove wb in get_next_work_itemMinchan Kim
83ba7b07 cleans up the writeback. So we don't use wb any more in get_next_work_item. Let's remove unnecessary argument. CC: Christoph Hellwig <hch@lst.de> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: Add tracing to balance_dirty_pagesDave Chinner
Tracing high level background writeback events is good, but it doesn't give the entire picture. Add visibility into write throttling to catch IO dispatched by foreground throttling of processing dirtying lots of pages. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: Initial tracing supportDave Chinner
Trace queue/sched/exec parts of the writeback loop. This provides insight into when and why flusher threads are scheduled to run. e.g a sync invocation leaves traces like: sync-[...]: writeback_queue: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0 flush-8:0-[...]: writeback_exec: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0 This also lays the foundation for adding more writeback tracing to provide deeper insight into the whole writeback path. The original tracing code is from Jens Axboe, though this version is a rewrite as a result of the code being traced changing significantly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: merge bdi_writeback_task and bdi_start_fnChristoph Hellwig
Move all code for the writeback thread into fs/fs-writeback.c instead of splitting it over two functions in two files. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07writeback: remove wb_listChristoph Hellwig
The wb_list member of struct backing_device_info always has exactly one element. Just use the direct bdi->wb pointer instead and simplify some code. Also remove bdi_task_init which is now trivial to prepare for the next patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-07-06writeback: simplify the write back thread queueChristoph Hellwig
First remove items from work_list as soon as we start working on them. This means we don't have to track any pending or visited state and can get rid of all the RCU magic freeing the work items - we can simply free them once the operation has finished. Second use a real completion for tracking synchronous requests - if the caller sets the completion pointer we complete it, otherwise use it as a boolean indicator that we can free the work item directly. Third unify struct wb_writeback_args and struct bdi_work into a single data structure, wb_writeback_work. Previous we set all parameters into a struct wb_writeback_args, copied it into struct bdi_work, copied it again on the stack to use it there. Instead of just allocate one structure dynamically or on the stack and use it all the way through the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-07-06writeback: split writeback_inodes_wbChristoph Hellwig
The case where we have a superblock doesn't require a loop here as we scan over all inodes in writeback_sb_inodes. Split it out into a separate helper to make the code simpler. This also allows to get rid of the sb member in struct writeback_control, which was rather out of place there. Also update the comments in writeback_sb_inodes that explain the handling of inodes from wrong superblocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-07-06writeback: remove writeback_inodes_wbcChristoph Hellwig
This was just an odd wrapper around writeback_inodes_wb. Removing this also allows to get rid of the bdi member of struct writeback_control which was rather out of place there. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-07-01fs-writeback: fix kernel-doc warningsRandy Dunlap
Fix kernel-doc to match the function's changed args. Warning(fs/fs-writeback.c:190): No description found for parameter 'args' Warning(fs/fs-writeback.c:190): Excess function parameter 'sb' description in 'bdi_queue_work_onstack' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-11writeback: fix pin_sb_for_writebackChristoph Hellwig
We need to check for s_instances to make sure we don't bother working against a filesystem that is beeing unmounted, and we need to call put_super to make sure a superblock is freed when we race against umount. Also no need to keep sb_lock after we got a reference on it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-11writeback: add missing requeue_io in writeback_inodes_wbChristoph Hellwig
In "writeback: fix writeback_inodes_wb from writeback_inodes_sb" I accidentally removed the requeue_io if we need to skip a superblock because we can't pin it. Add it back, otherwise we're getting spurious lockups after multiple xfstests runs. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-11writeback: simplify and split bdi_start_writebackChristoph Hellwig
bdi_start_writeback now never gets a superblock passed, so we can just remove that case. And to further untangle the code and flatten the call stack split it into two trivial helpers for it's two callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-11writeback: simplify wakeup_flusher_threadsChristoph Hellwig
bdi_writeback_all only has one caller, so fold it to simplify the code and flatten the call stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-11writeback: fix writeback_inodes_wb from writeback_inodes_sbChristoph Hellwig
When we call writeback_inodes_wb from writeback_inodes_sb we always have s_umount held, which currently makes the whole operation a no-op. But if we are called to write out inodes for a specific superblock we always have s_umount held, so replace the incorrect logic checking for WB_SYNC_ALL which only worked by coincidence with the proper check for an explicit superblock argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-11writeback: enforce s_umount locking in writeback_inodes_sbChristoph Hellwig
Make sure that not only sync_filesystem but all callers of writeback_inodes_sb have the superblock protected against remount. As-is this disables all functionality for these callers, but the next patch relies on this locking to fix writeback_inodes_sb for sync_filesystem. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-11writeback: queue work on stack in writeback_inodes_sbChristoph Hellwig
If we want to rely on s_umount in the caller we need to wait for completion of the I/O submission before returning to the caller. Refactor bdi_sync_writeback into a bdi_queue_work_onstack helper and use it for this case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-11writeback: fix writeback completion notificationsChristoph Hellwig
The code dealing with bdi_work->state and completion of a bdi_work is a major mess currently. This patch makes sure we directly use one set of flags to deal with it, and use it consistently, which means: - always notify about completion from the rcu callback. We only ever wait for it from on-stack callers, so this simplification does not even cause a theoretical slowdown currently. It also makes sure we don't miss out on the notification if we ever add other callers to wait for it. - make earlier completion notification depending on the on-stack allocation, not the sync mode. If we introduce new callers that want to do WB_SYNC_NONE writeback from on-stack callers this will be nessecary. Also rename bdi_wait_on_work_clear to bdi_wait_on_work_done and inline a few small functions into their only caller to make the code understandable. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-01Merge branch 'master' into for-linusJens Axboe
Conflicts: fs/pipe.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-01Revert "writeback: fix WB_SYNC_NONE writeback from umount"Jens Axboe
This reverts commit e913fc825dc685a444cb4c1d0f9d32f372f59861. We are investigating a hang associated with the WB_SYNC_NONE changes, so revert them for now. Conflicts: fs/fs-writeback.c mm/page-writeback.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-06-01Revert "writeback: ensure that WB_SYNC_NONE writeback with sb pinned is sync"Jens Axboe
This reverts commit 7c8a3554c683f512dbcee26faedb42e4c05f12fa. We are investigating a hang associated with the WB_SYNC_NONE changes, so revert them for now. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-05-25fs-writeback: check sync bit earlier in inode_wait_for_writebackRichard Kennedy
When wb_writeback() hasn't written anything it will re-acquire the inode lock before calling inode_wait_for_writeback. This change tests the sync bit first so that is doesn't need to drop & re-acquire the lock if the inode became available while wb_writeback() was waiting to get the lock. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits) fix handling of offsets in cris eeprom.c, get rid of fake on-stack files get rid of home-grown mutex in cris eeprom.c switch ecryptfs_write() to struct inode *, kill on-stack fake files switch ecryptfs_get_locked_page() to struct inode * simplify access to ecryptfs inodes in ->readpage() and friends AFS: Don't put struct file on the stack Ban ecryptfs over ecryptfs logfs: replace inode uid,gid,mode initialization with helper function ufs: replace inode uid,gid,mode initialization with helper function udf: replace inode uid,gid,mode init with helper ubifs: replace inode uid,gid,mode initialization with helper function sysv: replace inode uid,gid,mode initialization with helper function reiserfs: replace inode uid,gid,mode initialization with helper function ramfs: replace inode uid,gid,mode initialization with helper function omfs: replace inode uid,gid,mode initialization with helper function bfs: replace inode uid,gid,mode initialization with helper function ocfs2: replace inode uid,gid,mode initialization with helper function nilfs2: replace inode uid,gid,mode initialization with helper function minix: replace inode uid,gid,mode init with helper ext4: replace inode uid,gid,mode init with helper ... Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)
2010-05-21fs-writeback.c: bitfields should be unsignedH Hartley Sweeten
This fixes sparse noise: error: dubious one-bit signed bitfield Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-21writeback: bdi_writeback_task() must set task state before calling schedule()Jens Axboe
Calling schedule without setting the task state to non-running will return immediately, so ensure that we set it properly and check our sleep conditions after doing so. This is a fixup for commit 69b62d01. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-21writeback: ensure that WB_SYNC_NONE writeback with sb pinned is syncJens Axboe
Even if the writeout itself isn't a data integrity operation, we need to ensure that the caller doesn't drop the sb umount sem before we have actually done the writeback. This is a fixup for commit e913fc82. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-17writeback: Update dirty flags in two stepsDmitry Monakhov
Filesystems with delalloc support may dirty inode during writepages. As result inode will have dirty metadata flags even after write_inode. In fact we have two dedicated functions for proper data and metadata writeback. It is reasonable to separate flags updates in two stages. https://bugzilla.kernel.org/show_bug.cgi?id=15906 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-17writeback: fix WB_SYNC_NONE writeback from umountJens Axboe
When umount calls sync_filesystem(), we first do a WB_SYNC_NONE writeback to kick off writeback of pending dirty inodes, then follow that up with a WB_SYNC_ALL to wait for it. Since umount already holds the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all writeback happens as WB_SYNC_ALL. This can greatly slow down umount, since WB_SYNC_ALL writeback is a data integrity operation and thus a bigger hammer than simple WB_SYNC_NONE. For barrier aware file systems it's a lot slower. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-17writeback: disable periodic old data writeback for !dirty_writeback_centisecsJens Axboe
Prior to 2.6.32, setting /proc/sys/vm/dirty_writeback_centisecs disabled periodic dirty writeback from kupdate. This got broken and now causes excessive sys CPU usage if set to zero, as we'll keep beating on schedule(). Cc: stable@kernel.org Reported-by: Justin Maggard <jmaggard10@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (34 commits) cfq-iosched: Fix the incorrect timeslice accounting with forced_dispatch loop: Update mtime when writing using aops block: expose the statistics in blkio.time and blkio.sectors for the root cgroup backing-dev: Handle class_create() failure Block: Fix block/elevator.c elevator_get() off-by-one error drbd: lc_element_by_index() never returns NULL cciss: unlock on error path cfq-iosched: Do not merge queues of BE and IDLE classes cfq-iosched: Add additional blktrace log messages in CFQ for easier debugging i2o: Remove the dangerous kobj_to_i2o_device macro block: remove 16 bytes of padding from struct request on 64bits cfq-iosched: fix a kbuild regression block: make CONFIG_BLK_CGROUP visible Remove GENHD_FL_DRIVERFS block: Export max number of segments and max segment size in sysfs block: Finalize conversion of block limits functions block: Fix overrun in lcm() and move it to lib vfs: improve writeback_inodes_wb() paride: fix off-by-one test drbd: fix al-to-on-disk-bitmap for 4k logical_block_size ...
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-12vfs: improve writeback_inodes_wb()Edward Shishkin
Do not pin/unpin superblock for every inode in writeback_inodes_wb(), pin it for the whole group of inodes which belong to the same superblock and call writeback_sb_inodes() handler for them. Signed-off-by: Edward Shishkin <edward.shishkin@gmail.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>