aboutsummaryrefslogtreecommitdiff
path: root/fs/jbd
AgeCommit message (Collapse)Author
2010-10-27Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (24 commits) quota: Fix possible oops in __dquot_initialize() ext3: Update kernel-doc comments jbd/2: fixed typos ext2: fixed typo. ext3: Fix debug messages in ext3_group_extend() jbd: Convert atomic_inc() to get_bh() ext3: Remove misplaced BUFFER_TRACE() in ext3_truncate() jbd: Fix debug message in do_get_write_access() jbd: Check return value of __getblk() ext3: Use DIV_ROUND_UP() on group desc block counting ext3: Return proper error code on ext3_fill_super() ext3: Remove unnecessary casts on bh->b_data ext3: Cleanup ext3_setup_super() quota: Fix issuing of warnings from dquot_transfer quota: fix dquot_disable vs dquot_transfer race v2 jbd: Convert bitops to buffer fns ext3/jbd: Avoid WARN() messages when failing to write the superblock jbd: Use offset_in_page() instead of manual calculation jbd: Remove unnecessary goto statement jbd: Use printk_ratelimited() in journal_alloc_journal_head() ...
2010-10-28jbd/2: fixed typosAndrea Gelmini
"wakup" Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28jbd: Convert atomic_inc() to get_bh()Namhyung Kim
Convert atomic_inc(&bh->b_count) to get_bh(bh) for consistency. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28jbd: Fix debug message in do_get_write_access()Namhyung Kim
'buffer_head' should be 'journal_head'. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28jbd: Check return value of __getblk()Namhyung Kim
Fail journal creation if __getblk() returns NULL. unlikely() is added because it is called in a loop and we've been OK without the check until now. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28jbd: Convert bitops to buffer fnsNamhyung Kim
Convert set/clear_bit(BH_JWrite, ...) to set/clear_buffer_jwrite() for consistency. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28ext3/jbd: Avoid WARN() messages when failing to write the superblockDarrick J. Wong
This fixes a WARN backtrace in mark_buffer_dirty() that occurs during unmount when the underlying block device is removed. This bug has been seen on System Z when removing all paths from a multipath-backed ext3 mount; on System P when injecting enough PCI EEH errors to make the SCSI controller go offline; and similar warnings have been seen (and patched) with ext2/ext4. The super block update from a previous operation has marked the buffer as in error, and the flag has to be cleared before doing the update. Similar changes have been made to ext4 by commit 914258bf2cb22bf4336a1b1d90c551b4b11ca5aa. Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28jbd: Use offset_in_page() instead of manual calculationNamhyung Kim
Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28jbd: Remove unnecessary goto statementNamhyung Kim
Remove goto statement which jumps to very next line. Also remove target label because it is no longer used anywhere. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28jbd: Use printk_ratelimited() in journal_alloc_journal_head()Namhyung Kim
Use printk_ratelimited() instead of doing it manually. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-28jbd: Move debug message into #ifdef areaNamhyung Kim
Move call to jbd_debug() into #ifdef CONFIG_JBD_DEBUG block because 'dropped' is declared there. The code could be compiled without this change anyway, simply because jbd_debug() expands to nothing if !CONFIG_JBD_DEBUG but IMHO it doesn't look good in general. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2010-10-22Merge branch 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits) xen-blkfront: disable barrier/flush write support Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c block: remove BLKDEV_IFL_WAIT aic7xxx_old: removed unused 'req' variable block: remove the BH_Eopnotsupp flag block: remove the BLKDEV_IFL_BARRIER flag block: remove the WRITE_BARRIER flag swap: do not send discards as barriers fat: do not send discards as barriers ext4: do not send discards as barriers jbd2: replace barriers with explicit flush / FUA usage jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier jbd: replace barriers with explicit flush / FUA usage nilfs2: replace barriers with explicit flush / FUA usage reiserfs: replace barriers with explicit flush / FUA usage gfs2: replace barriers with explicit flush / FUA usage btrfs: replace barriers with explicit flush / FUA usage xfs: replace barriers with explicit flush / FUA usage block: pass gfp_mask and flags to sb_issue_discard dm: convey that all flushes are processed as empty ...
2010-09-20cfq: improve fsync performance for small filesCorrado Zoccolo
Fsync performance for small files achieved by cfq on high-end disks is lower than what deadline can achieve, due to idling introduced between the sync write happening in process context and the journal commit. Moreover, when competing with a sequential reader, a process writing small files and fsync-ing them is starved. This patch fixes the two problems by: - marking journal commits as WRITE_SYNC, so that they get the REQ_NOIDLE flag set, - force all queues that have REQ_NOIDLE requests to be put in the noidle tree. Having the queue associated to the fsync-ing process and the one associated to journal commits in the noidle tree allows: - switching between them without idling, - fairness vs. competing idling queues, since they will be serviced only after the noidle tree expires its slice. Acked-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Tested-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10jbd: replace barriers with explicit flush / FUA usageChristoph Hellwig
Switch to the WRITE_FLUSH_FUA flag for journal commits and remove the EOPNOTSUPP detection for barriers. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-18remove SWRITE* I/O typesChristoph Hellwig
These flags aren't real I/O types, but tell ll_rw_block to always lock the buffer instead of giving up on a failed trylock. Instead add a new write_dirty_buffer helper that implements this semantic and use it from the existing SWRITE* callers. Note that the ll_rw_block code had a bug where it didn't promote WRITE_SYNC_PLUG properly, which this patch fixes. In the ufs code clean up the helper that used to call ll_rw_block to mirror sync_dirty_buffer, which is the function it implements for compound buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18kill BH_Ordered flagChristoph Hellwig
Instead of abusing a buffer_head flag just add a variant of sync_dirty_buffer which allows passing the exact type of write flag required. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-07-21ext3: Fix set but unused variablesAndi Kleen
[tytso@mit.edu: Fix compilation with CONFIG_JBD_DEBUG enabled] Acked-by: tytso@mit.edu cc: linux-ext4@vger.kernel.org Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-21ext3: Fix waiting on transaction during fsyncJan Kara
log_start_commit() returns 1 only when it started a transaction commit. Thus in case transaction commit is already running, we fail to wait for the commit to finish. Fix the issue by always waiting for the commit regardless of the log_start_commit return value. Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-21jbd: Provide function to check whether transaction will issue data barrierJan Kara
Provide a function which returns whether a transaction with given tid will send a barrier to the filesystem device. The function will be used by ext3 to detect whether fsync needs to send a separate barrier or not. Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-08Merge branch 'for-next' into for-linusJiri Kosina
Conflicts: Documentation/filesystems/proc.txt arch/arm/mach-u300/include/mach/debug-macro.S drivers/net/qlge/qlge_ethtool.c drivers/net/qlge/qlge_main.c drivers/net/typhoon.c
2010-03-05jbd: Delay discarding buffers in journal_unmap_bufferJan Kara
Delay discarding buffers in journal_unmap_buffer until we know that "add to orphan" operation has definitely been committed, otherwise the log space of committing transation may be freed and reused before truncate get committed, updates may get lost if crash happens. This patch is a backport of JBD2 fix by dingdinghua <dingdinghua@nrchpc.ac.cn>. Signed-off-by: Jan Kara <jack@suse.cz>
2010-02-09tree-wide: Assorted spelling fixesDaniel Mack
In particular, several occurances of funny versions of 'success', 'unknown', 'therefore', 'acknowledge', 'argument', 'achieve', 'address', 'beginning', 'desirable', 'separate' and 'necessary' are fixed. Signed-off-by: Daniel Mack <daniel@caiaq.de> Cc: Joe Perches <joe@perches.com> Cc: Junio C Hamano <gitster@pobox.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-12-23jbd: jbd-debug and jbd2-debug should be writableYin Kangkai
jbd-debug and jbd2-debug is currently read-only (S_IRUGO), which is not correct. Make it writable so that we can start debuging. Signed-off-by: Yin Kangkai <kangkai.yin@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-17Revert "task_struct: make journal_info conditional"Linus Torvalds
This reverts commit e4c570c4cb7a95dbfafa3d016d2739bf3fdfe319, as requested by Alexey: "I think I gave a good enough arguments to not merge it. To iterate: * patch makes impossible to start using ext3 on EXT3_FS=n kernels without reboot. * this is done only for one pointer on task_struct" None of config options which define task_struct are tristate directly or effectively." Requested-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15task_struct: make journal_info conditionalHiroshi Shimamoto
journal_info in task_struct is used in journaling file system only. So introduce CONFIG_FS_JOURNAL_INFO and make it conditional. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-12fs/jbd: Export log_start_commit to fix ext3 build.Stefan Schmidt
This fixes: ERROR: "log_start_commit" [fs/ext3/ext3.ko] undefined! Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2009-11-11JBD/JBD2: free j_wbuf if journal init fails.Tao Ma
If journal init fails, we need to free j_wbuf. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-16jbd: Annotate transaction start also for journal_restart()Jan Kara
lockdep annotation for a transaction start has been at the end of journal_start(). But a transaction is also started from journal_restart(). Move the lockdep annotation to start_this_handle() which covers both cases. Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-16jbd: Journal block numbers can ever be only 32-bit use unsigned int for themJan Kara
It does not make sense to store block number for journal as unsigned long since they can be only 32-bit (because of on-disk format limitation). So change in-memory structures and variables to use unsigned int instead. Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-16JBD: round commit timer up to avoid uncommitted transactionAndreas Dilger
Fix jiffie rounding in jbd commit timer setup code. Rounding down could cause the timer to be fired before the corresponding transaction has expired. That transaction can stay not committed forever if no new transaction is created or explicit sync/umount happens. Signed-off-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Jan Kara <jack@suse.cz>
2009-07-21jbd: fix race between write_metadata_buffer and get_write_accessdingdinghua
The function journal_write_metadata_buffer() calls jbd_unlock_bh_state(bh_in) too early; this could potentially allow another thread to call get_write_access on the buffer head, modify the data, and dirty it, and allowing the wrong data to be written into the journal. Fortunately, if we lose this race, the only time this will actually cause filesystem corruption is if there is a system crash or other unclean shutdown of the system before the next commit can take place. Signed-off-by: dingdinghua <dingdinghua85@gmail.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz>
2009-07-15jbd: Fix a race between checkpointing code and journal_get_write_access()Jan Kara
The following race can happen: CPU1 CPU2 checkpointing code checks the buffer, adds it to an array for writeback do_get_write_access() ... lock_buffer() unlock_buffer() flush_batch() submits the buffer for IO __jbd_journal_file_buffer() So a buffer under writeout is returned from do_get_write_access(). Since the filesystem code relies on the fact that journaled buffers cannot be written out, it does not take the buffer lock and so it can modify buffer while it is under writeout. That can lead to a filesystem corruption if we crash at the right moment. The similar problem can happen with the journal_get_create_access() path. We fix the problem by clearing the buffer dirty bit under buffer_lock even if the buffer is on BJ_None list. Actually, we clear the dirty bit regardless the list the buffer is in and warn about the fact if the buffer is already journalled. Thanks for spotting the problem goes to dingdinghua <dingdinghua85@gmail.com>. Reported-by: dingdinghua <dingdinghua85@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2009-07-15jbd: Fail to load a journal if it is too shortJan Kara
Due to on disk corruption, it can happen that journal is too short. Fail to load it in such case so that we don't oops somewhere later. Reported-by: Nageswara R Sastry <rnsastry@linux.vnet.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz>
2009-06-18jbd: clean up journal_try_to_free_buffers()Hisashi Hifumi
I delete the following patch "commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91 Author: Mingming Cao <cmm@us.ibm.com> Date: Fri Jul 25 01:46:22 2008 -0700 jbd: fix race between free buffer and commit transaction This patch is no longer needed because if race between freeing buffer and committing transaction functionality occurs and dio gets error, currently dio falls back to buffered IO by the following patch. commit 6ccfa806a9cfbbf1cd43d5b6aa47ef2c0eb518fd Author: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Date: Tue Sep 2 14:35:40 2008 -0700 VFS: fix dio write returning EIO when try_to_release_page fails Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Cc: Theodore Tso <tytso@mit.edu> Cc: Mingming Cao <cmm@us.ibm.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-09jbd: fix race in buffer processing in commit codeJan Kara
In commit code, we scan buffers attached to a transaction. During this scan, we sometimes have to drop j_list_lock and then we recheck whether the journal buffer head didn't get freed by journal_try_to_free_buffers(). But checking for buffer_jbd(bh) isn't enough because a new journal head could get attached to our buffer head. So add a check whether the journal head remained the same and whether it's still at the same transaction and list. This is a nasty bug and can cause problems like memory corruption (use after free) or trigger various assertions in JBD code (observed). Signed-off-by: Jan Kara <jack@suse.cz> Cc: <stable@kernel.org> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-24Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Fix potential inode allocation soft lockup in Orlov allocator ext4: Make the extent validity check more paranoid jbd: use SWRITE_SYNC_PLUG when writing synchronous revoke records jbd2: use SWRITE_SYNC_PLUG when writing synchronous revoke records ext4: really print the find_group_flex fallback warning only once
2009-04-14jbd: use SWRITE_SYNC_PLUG when writing synchronous revoke recordsTheodore Ts'o
The revoke records must be written using the same way as the rest of the blocks during the commit process; that is, either marked as synchronous writes or as asynchornous writes. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-13jbd: update locking comentsJan Kara
Update information about locking in JBD revoke code. Reported-by: Lin Tan <tammy000@gmail.com>. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06jbd: use WRITE_SYNC_PLUG instead of WRITE_SYNCJens Axboe
When you are going to be submitting several sync writes, we want to give the IO scheduler a chance to merge some of them. Instead of using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG and rely on sync_buffer() doing the unplug when someone does a wait_on_buffer()/lock_buffer(). Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-03Merge branch 'ext3-latency-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'ext3-latency-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext3: Add replace-on-rename hueristics for data=writeback mode ext3: Add replace-on-truncate hueristics for data=writeback mode ext3: Use WRITE_SYNC for commits which are caused by fsync() block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks
2009-04-02jbd: fix oops in jbd_journal_init_inode() on corrupted fsJan Kara
On 32-bit system with CONFIG_LBD getblk can fail because provided block number is too big. Make JBD gracefully handle that. Signed-off-by: Jan Kara <jack@suse.cz> Cc: <dmaciejak@fortinet.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-27ext3: Use WRITE_SYNC for commits which are caused by fsync()Theodore Ts'o
If a commit is triggered by fsync(), set a flag indicating the journal blocks associated with the transaction should be flushed out using WRITE_SYNC. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz>
2009-02-11jbd: fix return value of journal_start_commit()Jan Kara
journal_start_commit() returns 1 if either a transaction is committing or the function has queued a transaction commit. But it returns 0 if we raced with somebody queueing the transaction commit as well. This resulted in ext3_sync_fs() not functioning correctly (description from Arthur Jones): In the case of a data=ordered umount with pending long symlinks which are delayed due to a long list of other I/O on the backing block device, this causes the buffer associated with the long symlinks to not be moved to the inode dirty list in the second phase of fsync_super. Then, before they can be dirtied again, kjournald exits, seeing the UMOUNT flag and the dirty pages are never written to the backing block device, causing long symlink corruption and exposing new or previously freed block data to userspace. This can be reproduced with a script created by Eric Sandeen <sandeen@redhat.com>: #!/bin/bash umount /mnt/test2 mount /dev/sdb4 /mnt/test2 rm -f /mnt/test2/* dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512 touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename /mnt/test2/link umount /mnt/test2 mount /dev/sdb4 /mnt/test2 ls /mnt/test2/ This patch fixes journal_start_commit() to always return 1 when there's a transaction committing or queued for commit. Cc: Eric Sandeen <sandeen@redhat.com> Cc: Mike Snitzer <snitzer@gmail.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08jbd: remove excess kernel-doc notationRandy Dunlap
Remove excess kernel-doc from fs/jbd/transaction.c: Warning(linux-2.6.28-git5//fs/jbd/transaction.c:764): Excess function parameter 'credits' description in 'journal_get_write_access' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08jbd: improve fsync batchingJosef Bacik
There is a flaw with the way jbd handles fsync batching. If we fsync() a file and we were not the last person to run fsync() on this fs then we automatically sleep for 1 jiffie in order to wait for new writers to join into the transaction before forcing the commit. The problem with this is that with really fast storage (ie a Clariion) the time it takes to commit a transaction to disk is way faster than 1 jiffie in most cases, so sleeping means waiting longer with nothing to do than if we just committed the transaction and kept going. Ric Wheeler noticed this when using fs_mark with more than 1 thread, the throughput would plummet as he added more threads. This patch attempts to fix this problem by recording the average time in nanoseconds that it takes to commit a transaction to disk, and what time we started the transaction. If we run an fsync() and we have been running for less time than it takes to commit the transaction to disk, we sleep for the delta amount of time and then commit to disk. We acheive sub-jiffie sleeping using schedule_hrtimeout. This means that the wait time is auto-tuned to the speed of the underlying disk, instead of having this static timeout. I weighted the average according to somebody's comments (Andreas Dilger I think) in order to help normalize random outliers where we take way longer or way less time to commit than the average. I also have a min() check in there to make sure we don't sleep longer than a jiffie in case our storage is super slow, this was requested by Andrew. I unfortunately do not have access to a Clariion, so I had to use a ramdisk to represent a super fast array. I tested with a SATA drive with barrier=1 to make sure there was no regression with local disks, I tested with a 4 way multipathed Apple Xserve RAID array and of course the ramdisk. I ran the following command fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i where $i was 2, 4, 8, 16 and 32. I mkfs'ed the fs each time. Here are my results type threads with patch without patch sata 2 24.6 26.3 sata 4 49.2 48.1 sata 8 70.1 67.0 sata 16 104.0 94.1 sata 32 153.6 142.7 xserve 2 246.4 222.0 xserve 4 480.0 440.8 xserve 8 829.5 730.8 xserve 16 1172.7 1026.9 xserve 32 1816.3 1650.5 ramdisk 2 2538.3 1745.6 ramdisk 4 2942.3 661.9 ramdisk 8 2882.5 999.8 ramdisk 16 2738.7 1801.9 ramdisk 32 2541.9 2394.0 Signed-off-by: Josef Bacik <jbacik@redhat.com> Cc: Andreas Dilger <adilger@sun.com> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Ric Wheeler <rwheeler@redhat.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-06jbd: don't give up looking for space so easily in __log_wait_for_spaceTheodore Ts'o
Commit be07c4ed introducd a regression because it assumed that if there were no transactions ready to be checkpointed, that no progress could be made on making space available in the journal, and so the journal should be aborted. This assumption is false; it could be the case that simply calling cleanup_journal_tail() will recover the necessary space, or, for small journals, the currently committing transaction could be responsible for chewing up the required space in the log, so we need to wait for the currently committing transaction to finish before trying to force a checkpoint operation. This patch fixes the bug reported by Meelis Roos at: http://bugzilla.kernel.org/show_bug.cgi?id=11937 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Duane Griffin <duaneg@dghda.com> Cc: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
2008-10-30fs: remove excess kernel-docRandy Dunlap
Delete excess kernel-doc notation in fs/ subdirectory: Warning(linux-2.6.27-git10//fs/jbd/transaction.c:886): Excess function parameter or struct member 'credits' description in 'journal_get_undo_access' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-23jbd: abort instead of waiting for nonexistent transactionsDuane Griffin
The __log_wait_for_space function sits in a loop checkpointing transactions until there is sufficient space free in the journal. However, if there are no transactions to be processed (e.g. because the free space calculation is wrong due to a corrupted filesystem) it will never progress. Check for space being required when no transactions are outstanding and abort the journal instead of endlessly looping. This patch fixes the bug reported by Sami Liedes at: http://bugzilla.kernel.org/show_bug.cgi?id=10976 Signed-off-by: Duane Griffin <duaneg@dghda.com> Tested-by: Sami Liedes <sliedes@cc.hut.fi> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-23jbd: test BH_Write_EIO to detect errors on metadata buffersHidehiro Kawai
__try_to_free_cp_buf(), __process_buffer(), and __wait_cp_io() test BH_Uptodate flag to detect write I/O errors on metadata buffers. But by commit 95450f5a7e53d5752ce1a0d0b8282e10fe745ae0 "ext3: don't read inode block if the buffer has a write error"(*), BH_Uptodate flag can be set to inode buffers with BH_Write_EIO in order to avoid reading old inode data. So now, we have to test BH_Write_EIO flag of checkpointing inode buffers instead of BH_Uptodate. This patch does it. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>